<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>code-diesel &#187; scraping</title>
	<atom:link href="http://www.codediesel.com/tag/scraping/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.codediesel.com</link>
	<description>/* PHP &#38; MySQL Journal */</description>
	<lastBuildDate>Thu, 02 Feb 2012 13:19:04 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
<xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
		<item>
		<title>Web scraping tutorial</title>
		<link>http://www.codediesel.com/php/web-scraping-in-php-tutorial/</link>
		<comments>http://www.codediesel.com/php/web-scraping-in-php-tutorial/#comments</comments>
		<pubDate>Sat, 07 Mar 2009 18:02:47 +0000</pubDate>
		<dc:creator>sameer</dc:creator>
				<category><![CDATA[php]]></category>
		<category><![CDATA[tools]]></category>
		<category><![CDATA[scraping]]></category>

		<guid isPermaLink="false">http://www.codediesel.com/?p=347</guid>
		<description><![CDATA[There are three ways to access the data on a website: through a browser, through an API (if the site provides one), or by parsing the web pages through code. The last approach, known as web scraping, is a technique for extracting information from websites using specially coded programs. [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-459" style="border: none;" title="scrape" src="http://www.codediesel.com/wp-content/uploads/2009/03/scrape.jpeg" alt="scrape" width="100" height="79" />There are three ways to access the data on a website: through a browser, through an API (if the site provides one), or by parsing the web pages through code. The last approach, known as <em>web scraping</em>, is a technique for extracting information from websites using specially coded programs.</p>
<p><span id="more-347"></span></p>
<p>In this post we will take a quick look at writing a simple scraper using the <a href="http://simplehtmldom.sourceforge.net/" target="_blank">simplehtmldom</a> library. But before we continue, a word of caution:</p>
<p style="padding: 5px; background-color: #D1B1B1; border: 1px solid #000;">Writing screen scrapers and spiders that consume large amounts of bandwidth, guess passwords, or grab information from a site and use it somewhere else may well be a violation of someone&#8217;s rights and will eventually land you in trouble. Before writing a screen scraper, first see if the website offers an RSS feed or an API for the data you are looking for. If not, and you have to use a scraper, check the website&#8217;s policies regarding automated tools before proceeding.</p>
<p>Now that we have the legalities out of the way, let&#8217;s start with the examples.</p>
<p><strong>1. Installing simplehtmldom.</strong><br />
Simplehtmldom is a PHP library that facilitates the process of creating web scrapers. It is an HTML DOM parser written in PHP5 that lets you manipulate HTML quickly and easily. It does away with the messy details of regular expressions and uses CSS-selector-style DOM access like that found in jQuery.</p>
<p>First download the library from <a title="simplehtmldom download" href="http://sourceforge.net/projects/simplehtmldom/" target="_blank">SourceForge</a>. Unzip the library into your PHP includes directory or a directory where you will be testing the code.</p>
<p><strong>2. Installing FireBug.</strong><br />
I assume you already have FireBug installed; if not, head over <a title="firebug installation" href="https://addons.mozilla.org/en-US/firefox/addon/1843" target="_blank">here</a> to install it. You will soon see why it is required.</p>
<p><strong>3. Writing our first scraper.</strong><br />
Now that we are ready with the tools, let&#8217;s write our first web scraper. As a first exercise, let us see how to grab the sponsored links section from a Google search page.</p>
<p><img class="size-full wp-image-353" style="border: 1px solid #000;" title="google_search1" src="http://www.codediesel.com/wp-content/uploads/2009/02/google_search1.gif" alt="google search" width="530" height="260" /></p>
<p>Before we can retrieve the required data we need to know the HTML structure of the page, so that we know precisely where the required information is located. FireBug is a handy tool for this. Open FireBug and click on the inspect button to examine the DOM structure of the page. A sample image is shown below.</p>
<p><a href="http://www.codediesel.com/wp-content/uploads/2009/02/firebug.gif"><img class="size-medium wp-image-357" style="border: 1px solid #000;" title="firebug" src="http://www.codediesel.com/wp-content/uploads/2009/02/firebug.gif" alt="firebug" width="500" height="262" /></a></p>
<p>As can be seen from the FireBug pane, the whole sponsored links section is inside a &lt;td&gt; tag with the id &#8216;rhsline&#8217;. The following is a DOM representation of the &#8216;rhsline&#8217; node.</p>
<p><a href="http://www.codediesel.com/wp-content/uploads/2009/02/rhsline1.gif"><img class="aligncenter size-full wp-image-369" style="border: 1px solid #000;" title="rhsline1" src="http://www.codediesel.com/wp-content/uploads/2009/02/rhsline1.gif" alt="rhsline1" width="559" height="252" /></a></p>
<p>The &#8216;li&#8217; tags contain the actual data we are after.  With this little information we can use the following code to read all the sponsored links.</p>

<div class="wp_codebox"><table><tr id="p3476"><td class="code" id="p347code6"><pre class="php" style="font-family:monospace;"><span style="color: #000088;">$data</span> <span style="color: #339933;">=</span>  <span style="color: #000088;">$html</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">find</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'td[id=rhsline]'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">echo</span> <span style="color: #000088;">$data</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">children</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></td></tr></table></div>

<p>The complete source is given below.</p>

<div class="wp_codebox"><table><tr id="p3477"><td class="code" id="p347code7"><pre class="php" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">&lt;?php</span>
&nbsp;
<span style="color: #009933; font-style: italic;">/* update your path accordingly */</span>
<span style="color: #000000; font-weight: bold;">include_once</span> <span style="color: #0000ff;">'libs/simplehtmldom/simple_html_dom.php'</span><span style="color: #339933;">;</span>
&nbsp;
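<span style="color: #009933; font-style: italic;">/* urlencode() keeps multi-word search terms valid in the query string */</span>
<span style="color: #000088;">$search_term</span> <span style="color: #339933;">=</span> <span style="color: #990000;">urlencode</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;mobiles&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>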
&nbsp;
<span style="color: #000088;">$url</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;http://www.google.co.in/search?hl=en&amp;q=<span style="color: #006699; font-weight: bold;">{$search_term}</span>&quot;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000088;">$html</span> <span style="color: #339933;">=</span> file_get_html<span style="color: #009900;">&#40;</span><span style="color: #000088;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #009933; font-style: italic;">/*
Get all table cells ('td') whose id attribute is 'rhsline'.
The list of sponsored links is in the 'ol' tag, as can be
seen from the DOM tree above, so we later call children(1)
on the first matching cell to get the sponsored links.
*/</span>
<span style="color: #000088;">$data</span> <span style="color: #339933;">=</span>  <span style="color: #000088;">$html</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">find</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'td[id=rhsline]'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #009933; font-style: italic;">/*
  Make sure that sponsors ads are available,
  Some keywords do not have sponsor ads.
*/</span>
<span style="color: #000000; font-weight: bold;">if</span><span style="color: #009900;">&#40;</span><span style="color: #990000;">isset</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$data</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
    <span style="color: #000000; font-weight: bold;">echo</span> <span style="color: #000088;">$data</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">children</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">?&gt;</span></pre></td></tr></table></div>
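
<p>To experiment with the selector logic without requesting a live page on every run, the same calls can be tried against a small HTML string. The following is only a sketch: the markup and the example.com URLs are made up for illustration and merely imitate the &#8216;rhsline&#8217; structure described above, and the include path is assumed to be the same as in the previous listing. It uses the library&#8217;s str_get_html function, which parses a string instead of a URL.</p>

<div class="wp_codebox"><table><tr><td class="code"><pre class="php" style="font-family:monospace;">&lt;?php
&nbsp;
/* update your path accordingly */
include_once 'libs/simplehtmldom/simple_html_dom.php';
&nbsp;
/* Hypothetical markup imitating the 'rhsline' cell; not Google's real HTML. */
$sample = '&lt;table&gt;&lt;tr&gt;&lt;td id=&quot;rhsline&quot;&gt;'
        . '&lt;h2&gt;Sponsored Links&lt;/h2&gt;'
        . '&lt;ol&gt;'
        . '&lt;li&gt;&lt;a href=&quot;http://example.com/one&quot;&gt;First ad&lt;/a&gt;&lt;/li&gt;'
        . '&lt;li&gt;&lt;a href=&quot;http://example.com/two&quot;&gt;Second ad&lt;/a&gt;&lt;/li&gt;'
        . '&lt;/ol&gt;'
        . '&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;';
&nbsp;
/* Parse a string instead of fetching a URL. */
$html = str_get_html($sample);
&nbsp;
/* With an index, find() returns one element, or null if nothing matched. */
$cell = $html-&gt;find('td[id=rhsline]', 0);
&nbsp;
/* children(1) is the second child of the cell, here the 'ol' list. */
$list = ($cell !== null) ? $cell-&gt;children(1) : null;
&nbsp;
if ($list !== null) {
    /* Walk the 'li' items; plaintext is the link text, href its target. */
    foreach ($list-&gt;find('li') as $item) {
        $link = $item-&gt;find('a', 0);
        if ($link !== null) {
            echo $link-&gt;plaintext . ' =&gt; ' . $link-&gt;href . &quot;\n&quot;;
        }
    }
}
&nbsp;
?&gt;</pre></td></tr></table></div>

<p>Working against a fixed string like this makes it easier to see what each call returns before pointing the scraper at a real page, whose markup may differ and change over time.</p>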

<p>In the next example we will grab the list of contents from the latest <a href="http://www.wired.com/wired/" target="_blank">Wired</a> magazine issue.</p>

<div class="wp_codebox"><table><tr id="p3478"><td class="code" id="p347code8"><pre class="php" style="font-family:monospace;">&nbsp;
<span style="color: #000088;">$ret</span> <span style="color: #339933;">=</span>  <span style="color: #000088;">$html</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">find</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'div[id=this_month] div[class=story]'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">foreach</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ret</span> <span style="color: #000000; font-weight: bold;">as</span> <span style="color: #000088;">$story</span><span style="color: #009900;">&#41;</span>
    <span style="color: #000000; font-weight: bold;">echo</span> <span style="color: #000088;">$story</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">find</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'a'</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">&quot;&lt;br&gt;&quot;</span><span style="color: #339933;">;</span></pre></td></tr></table></div>

<p>This prints the content section links from the page. To print only the link text, we can use the &#8216;plaintext&#8217; property.</p>

<div class="wp_codebox"><table><tr id="p3479"><td class="code" id="p347code9"><pre class="php" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">echo</span> <span style="color: #000088;">$story</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">find</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'a'</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">plaintext</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">&quot;&lt;br&gt;&quot;</span><span style="color: #339933;">;</span></pre></td></tr></table></div>

<p>The complete source is given below.</p>

<div class="wp_codebox"><table><tr id="p34710"><td class="code" id="p347code10"><pre class="php" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">&lt;?php</span>
&nbsp;
<span style="color: #009933; font-style: italic;">/* update your path accordingly */</span>
<span style="color: #000000; font-weight: bold;">include_once</span> <span style="color: #0000ff;">'libs/simplehtmldom/simple_html_dom.php'</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000088;">$url</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;http://www.wired.com/wired/&quot;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000088;">$html</span> <span style="color: #339933;">=</span> file_get_html<span style="color: #009900;">&#40;</span><span style="color: #000088;">$url</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000088;">$ret</span> <span style="color: #339933;">=</span>  <span style="color: #000088;">$html</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">find</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'div[id=this_month] div[class=story]'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">foreach</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ret</span> <span style="color: #000000; font-weight: bold;">as</span> <span style="color: #000088;">$story</span><span style="color: #009900;">&#41;</span>
    <span style="color: #000000; font-weight: bold;">echo</span> <span style="color: #000088;">$story</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">find</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'a'</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">plaintext</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">&quot;&lt;br&gt;&quot;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">?&gt;</span></pre></td></tr></table></div>
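
<p>Page layouts change, so a selector that works today may match nothing tomorrow. A slightly more defensive variant of the listing above might look like the following sketch. The $titles array and the error message are illustrative additions, and whether file_get_html returns false on failure depends on the library version, so treat the first check as an assumption rather than documented behaviour.</p>

<div class="wp_codebox"><table><tr><td class="code"><pre class="php" style="font-family:monospace;">&lt;?php
&nbsp;
/* update your path accordingly */
include_once 'libs/simplehtmldom/simple_html_dom.php';
&nbsp;
$url = &quot;http://www.wired.com/wired/&quot;;
&nbsp;
$html = file_get_html($url);
&nbsp;
/* Stop early if the page could not be loaded (assumed failure value). */
if (!$html) {
    die(&quot;Could not load {$url}\n&quot;);
}
&nbsp;
$titles = array();
&nbsp;
foreach ($html-&gt;find('div[id=this_month] div[class=story]') as $story) {
    $link = $story-&gt;find('a', 0);
&nbsp;
    /* Skip story blocks that contain no link at all. */
    if ($link === null) {
        continue;
    }
&nbsp;
    $titles[] = trim($link-&gt;plaintext);
}
&nbsp;
/* Release the parsed document once we are done with it. */
$html-&gt;clear();
&nbsp;
echo implode(&quot;&lt;br&gt;&quot;, $titles);
&nbsp;
?&gt;</pre></td></tr></table></div>

<p>Collecting the titles in an array before printing keeps the extraction separate from the output, so the same list could just as easily be stored or processed further.</p>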

<p>Now that you have seen how simple it is to scrape a web page, you can read the simplehtmldom <a href="http://simplehtmldom.sourceforge.net/manual.htm" target="_blank">manual</a> for more details on the functions available in the library and start creating your own web scrapers.</p>
<blockquote><p>
I&#8217;m available for various kinds of web scraping projects.<br />
Contact me with details at &#8216;metapix [at] gmail.com&#8217; for a price quote.
</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.codediesel.com/php/web-scraping-in-php-tutorial/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
	</channel>
</rss>

