/* PHP & MySQL Journal */
There are three ways to access a website's data. One is through a browser, another is through an API (if the site provides one), and the last is by parsing the web pages with code. The last one, known as web scraping, is a technique for extracting information from websites using specially written programs.
In this post we will take a quick look at writing a simple scraper using the simplehtmldom library. But before we continue, a word of caution:
Writing screen scrapers and spiders that consume large amounts of bandwidth, guess passwords, or grab information from a site and use it somewhere else may well be a violation of someone’s rights and will eventually land you in trouble. Before writing a screen scraper, first see if the website offers an RSS feed or an API for the data you are looking for. If not, and you have to use a scraper, check the website's policies regarding automated tools before proceeding.
Now that we have the legalities out of the way, let's start with the examples.
1. Installing simplehtmldom.
Simplehtmldom is a PHP library that facilitates the process of creating web scrapers. It is an HTML DOM parser written in PHP5 that lets you manipulate HTML quickly and easily. It does away with the messy details of regular expressions and uses CSS-selector-style DOM access like that found in jQuery.
First download the library from SourceForge. Unzip the library in your PHP includes directory or in a directory where you will be testing the code.
2. Installing FireBug.
I assume you already have FireBug installed; if not, install it before continuing. You will soon see why it is required.
3. Writing our first scraper.
Now that we are ready with the tools, let's write our first web scraper. As a first example, let us see how to grab the sponsored links section from a Google search page.

Before we can retrieve the required data, we need to know the HTML structure of the page so that we know precisely where the information is located. FireBug is a handy tool for this. Open FireBug and click the inspect button to examine the DOM structure of the page. A sample image is shown below.
As can be seen from the FireBug pane, the whole sponsored links section is inside a <td> tag with an id of ‘rhsline’. The following is a DOM representation of the ‘rhsline’ node.
The ‘li’ tags contain the actual data we are after. With this information we can use the following code to read all the sponsored links.
$data = $html->find('td[id=rhsline]');
echo $data[0]->children(1);
The complete source is given below.
<?php
/* update your path accordingly */
include_once 'libs/simplehtmldom/simple_html_dom.php';

$search_term = "mobiles";
/* urlencode() keeps search terms with spaces or symbols from breaking the URL */
$url = "http://www.google.co.in/search?hl=en&q=" . urlencode($search_term);

$html = file_get_html($url);

/* Depending on the library version, file_get_html() may return false
   when the page cannot be loaded, so check before using the result. */
if (!$html) {
    die("Could not load {$url}");
}

/*
  Get the table cell with the id 'rhsline'. The list of sponsored links
  is in the 'ol' tag, as can be seen from the DOM tree above, so we use
  the 'children' function on the $data object to get the sponsored links.
*/
$data = $html->find('td[id=rhsline]');

/* Make sure that sponsored ads are available; some keywords do not have any. */
if (isset($data[0])) {
    echo $data[0]->children(1);
}
?>
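To experiment with the same selection logic without sending requests to Google, here is a minimal sketch that runs against an inline HTML fragment. It uses PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than simplehtmldom, so it has no external dependency. The sample markup, the function name sponsored_links() and the example.com URLs are made up for illustration; they only mimic the structure described above and are not the real Google page.

```php
<?php
// Hypothetical markup standing in for the sponsored-links cell.
// The real page differs and changes over time.
$sample = <<<HTML
<table><tr>
  <td id="rhsline">
    <h2>Sponsored Links</h2>
    <ol>
      <li><a href="http://example.com/a">Example shop A</a></li>
      <li><a href="http://example.com/b">Example shop B</a></li>
    </ol>
  </td>
</tr></table>
HTML;

/**
 * Return the text of every <li> inside the <td> with the given id.
 * An empty array means the section is absent (for example, a keyword
 * with no sponsored ads).
 */
function sponsored_links($html, $id)
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser
    // warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = array();
    // $id is assumed to be a trusted, simple identifier in this sketch.
    foreach ($xpath->query("//td[@id='{$id}']//li") as $li) {
        $links[] = trim($li->textContent);
    }
    return $links;
}

print_r(sponsored_links($sample, 'rhsline'));
print_r(sponsored_links($sample, 'no_such_id'));
```

The second call returns an empty array, which plays the same role as the isset($data[0]) check in the simplehtmldom version: the scraper should cope with the section being missing rather than assume it is always there.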
In the next example we will grab the list of contents from the latest Wired magazine issue.
$ret = $html->find('div[id=this_month] div[class=story]');
foreach($ret as $story)
    echo $story->find('a', 0) . "<br>";
This will print the first link of each story in the contents section, including its HTML markup. To print just the link text, we can use the ‘plaintext’ property.
echo $story->find('a', 0)->plaintext . "<br>";
The complete source is given below.
<?php
/* update your path accordingly */
include_once 'libs/simplehtmldom/simple_html_dom.php';

$url = "http://www.wired.com/wired/";
$html = file_get_html($url);

$ret = $html->find('div[id=this_month] div[class=story]');

foreach($ret as $story)
    echo $story->find('a', 0)->plaintext . "<br>";
?>
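The difference between echoing an element and echoing its plain text can be tried offline with the sketch below. As an assumption for illustration, it again uses PHP's bundled DOMDocument and DOMXPath instead of simplehtmldom, and the fragment, the function name first_story_links() and the link targets are hypothetical; the markup only follows the shape implied by the selector above, not the actual Wired page.

```php
<?php
// Hypothetical fragment shaped like 'div[id=this_month] div[class=story]'.
$sample = <<<HTML
<div id="this_month">
  <div class="story"><a href="/one">First <b>story</b></a><p>Blurb</p></div>
  <div class="story"><a href="/two">Second story</a></div>
</div>
HTML;

/**
 * For each story block, return the first link as markup, as plain
 * text, and as its href attribute.
 */
function first_story_links($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = array();
    $stories = $xpath->query("//div[@id='this_month']//div[@class='story']");
    foreach ($stories as $story) {
        // First <a> below this story, comparable to find('a', 0).
        $a = $xpath->query('.//a', $story)->item(0);
        if ($a === null) {
            continue; // story without a link
        }
        $rows[] = array(
            'html' => $doc->saveHTML($a),    // element with its tags
            'text' => trim($a->textContent), // text only, tags stripped
            'href' => $a->getAttribute('href'),
        );
    }
    return $rows;
}

foreach (first_story_links($sample) as $row) {
    echo $row['text'] . ' -> ' . $row['href'] . "\n";
}
```

Here the 'html' entry corresponds roughly to echoing the element itself, while 'text' corresponds to the ‘plaintext’ property. Collecting the href alongside the text is usually what a scraper needs if it is going to follow the links later.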
Now that you have seen how simple it is to scrape a web page, you can read the simplehtmldom manual for details on the other functions available in the library and start creating your own web scrapers.
This site is a digital habitat of Sameer, a freelance web developer working from Pune.
11 Responses
1
jyf1987
March 8th, 2009 at 7:55 am
Well, I found a class on phpclass named phpquery, which works like jQuery.
2
Joqi
March 9th, 2009 at 9:02 am
Thanks, very useful tool..
3
Sameer Borate’s Blog: Web scraping tutorial : Dragonfly Networks
March 10th, 2009 at 12:31 am
[...] a new tutorial on his blog today, Sameer shows a library that you can use (simplehtmldom) to parse remote sites [...]
4
Hans
April 5th, 2009 at 12:43 pm
Please show the full source !!
sameer
April 5th, 2009 at 11:48 pm
I’ve updated the code above.
6
Making fake programs is fun :) - Page 6 - PreCentral Forums
July 6th, 2009 at 7:30 pm
[...] web scraping tutorial Web scraping tutorial : CodeDiesel [...]
7
Akshay
July 12th, 2009 at 10:13 pm
I have created an easy-to-use web scraper in the form of a WordPress plugin. It uses cURL and phpQuery (for parsing). It also provides some output functions like clear, find and replace, and output (text / html), as well as caching and error handling capabilities. Here’s the link - http://wordpress.org/extend/plugins/wp-web-scrapper/
8
PHP Coder
August 19th, 2009 at 7:48 am
Tip: if you write a web scraper, it's likely that the scraper function will repeat, for example fetching page1.html, page2.html, …, page10.html. In that case, set enough delay between requests or you’ll be kicked off the site, as they will see you as spam.
9
Joe Duggins
August 29th, 2009 at 8:46 pm
This was really helpful. I just started using this set of php classes, and I’m happy to have a good spot to begin, good work.
10
funkuncut
February 14th, 2010 at 3:28 am
Awesome! this is super useful
11
Subramanyam Srikanth
March 3rd, 2010 at 10:13 pm
Nice post. Can anyone tell me how we can scrape using JavaScript? I need it.