Web scraping tutorial


Posted in: php, tools | 7 Mar 2009

There are three ways to access a website's data. One is through a browser, another is through an API (if the site provides one), and the last is by parsing the web pages in code. The last one, known as web scraping, is a technique for extracting information from websites using specially written programs.

In this post we will take a quick look at writing a simple scraper using the simplehtmldom library. But before we continue, a word of caution:

Writing screen scrapers and spiders that consume large amounts of bandwidth, guess passwords, or grab information from a site and use it somewhere else may well be a violation of someone's rights and will eventually land you in trouble. Before writing a screen scraper, first see if the website offers an RSS feed or an API for the data you are looking for. If not, and you have to use a scraper, check the website's policies regarding automated tools before proceeding.

Now that we have got all the legalities out of the way, let's start with the examples.
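One machine-readable place where many sites state their policy for automated tools is a robots.txt file. The following is a minimal sketch of checking a path against such a file, assuming you have already downloaded its text. The function name is_path_allowed() and the sample rules are made up for illustration, and the parser is deliberately simplified: it only honours Disallow rules in "User-agent: *" groups and ignores Allow rules and wildcards.

```php
<?php

/*
  Hypothetical helper: returns false if $path starts with a Disallow
  rule that applies to all user agents ("User-agent: *").
  Simplified on purpose: no Allow rules, no wildcards.
*/
function is_path_allowed(string $robotsTxt, string $path): bool
{
    $inStarGroup  = false;  // are we inside a "User-agent: *" group?
    $lastWasAgent = false;  // consecutive User-agent lines share one group

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false)
            continue;

        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if (!$lastWasAgent)
                $inStarGroup = false;   // a new group starts here
            if ($value === '*')
                $inStarGroup = true;
            $lastWasAgent = true;
            continue;
        }

        $lastWasAgent = false;
        if ($field === 'disallow' && $inStarGroup && $value !== '') {
            if (strpos($path, $value) === 0)
                return false;           // path begins with a disallowed prefix
        }
    }
    return true;
}

/* Made-up rules, not copied from any real site */
$robots = "User-agent: *\nDisallow: /search\n";

var_dump(is_path_allowed($robots, '/search?q=mobiles'));
var_dump(is_path_allowed($robots, '/wired/'));
```

A robots.txt check does not replace reading the site's terms of use; it only covers the part of the policy the site has chosen to publish in that file.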

1. Installing simplehtmldom.
Simplehtmldom is a PHP library that facilitates the process of creating web scrapers. It is an HTML DOM parser written in PHP5 that lets you manipulate HTML in a quick and easy way. It is a wonderful library that does away with the messy details of regular expressions and uses CSS-selector-style DOM access like that found in jQuery.

First download the library from SourceForge. Unzip the library in your PHP includes directory or in a directory where you will be testing the code.

2. Installing FireBug.
I assume you already have FireBug installed; if not, install it before continuing. You will soon see why it is required.

3. Writing our first scraper.
Now that we are ready with the tools, let's write our first web scraper. As a first exercise, let us see how to grab the sponsored links section from a Google search page.

[Image: Google search page]

Before we can retrieve the required data we need to know the HTML structure of the page, so that we know precisely where the required information is located. FireBug is quite a handy tool for this purpose. Open FireBug and click on the inspect button to check the DOM structure of the page. A sample image is shown below.

[Image: FireBug pane]

As can be seen from the FireBug pane, the whole sponsored links section is inside a <td> tag with an id named ‘rhsline’. The following is a DOM representation of the ‘rhsline’ node.

[Image: DOM tree of the ‘rhsline’ node]

The ‘li’ tags contain the actual data we are after. With this little information we can use the following code to read all the sponsored links.

$data =  $html->find('td[id=rhsline]');
echo $data[0]->children(1);
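To see what find() and children() return without hitting a live site, the same selector can be tried on an inline HTML string. This is only a sketch: the markup below is made up to imitate the ‘rhsline’ cell and is not Google's real HTML, and it assumes the library is installed at the path used in this post.

```php
<?php

/* update your path accordingly */
include_once 'libs/simplehtmldom/simple_html_dom.php';

/* Hypothetical markup imitating the 'rhsline' cell; not Google's real HTML */
$sample = '<table><tr><td id="rhsline">'
        . '<h2>Sponsored Links</h2>'
        . '<ol>'
        . '<li><a href="http://example.com/a">Acme Mobiles</a></li>'
        . '<li><a href="http://example.com/b">Example Phones</a></li>'
        . '</ol>'
        . '</td></tr></table>';

/* str_get_html() parses a string instead of fetching a URL */
$html = str_get_html($sample);

/* With an index, find() returns one element, or null if nothing matches */
$cell = $html->find('td[id=rhsline]', 0);

$titles = array();
if ($cell) {
    /* children(1) is the second child of the cell, here the <ol> list */
    $list = $cell->children(1);

    foreach ($list->find('li') as $item)
        $titles[] = trim($item->plaintext);
}

print_r($titles);
```

Note that children(1) depends on the position of the list inside the cell, so the scraper breaks if the site adds or removes a sibling element before it.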

The complete source is given below.

<?php
 
/* update your path accordingly */
include_once 'libs/simplehtmldom/simple_html_dom.php';
 
$search_term = "mobiles";
 
/* urlencode() keeps search terms with spaces or special characters valid */
$url = "http://www.google.co.in/search?hl=en&q=" . urlencode($search_term);
 
$html = file_get_html($url);
 
/* Stop if the page could not be loaded
   (newer versions of the library return false on failure) */
if(!$html)
    die("Could not load {$url}");
 
/*
Get all table cells (td) having the id attribute 'rhsline'.
The list of sponsored links is in the 'ol' tag, as can be
seen from the DOM tree above, so we use the 'children' function
on the first match to get the sponsored links.
*/
$data =  $html->find('td[id=rhsline]');
 
/*
  Make sure that sponsored ads are available;
  some keywords do not have sponsored ads.
*/
if(isset($data[0]))
    echo $data[0]->children(1);
 
?>
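The script above fetches the page with PHP's default settings. A minimal sketch of identifying the scraper and limiting how long it waits is to build a stream context and pass it along with the URL; to my understanding file_get_html() accepts the same leading arguments as file_get_contents(), so the context goes in third position. The helper name build_scraper_context() and the user-agent string are made up for illustration.

```php
<?php

/*
  Hypothetical helper: builds an HTTP stream context that sends an
  identifying User-Agent header and gives up after $timeoutSeconds.
*/
function build_scraper_context(string $userAgent, float $timeoutSeconds)
{
    return stream_context_create(array(
        'http' => array(
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'timeout'    => $timeoutSeconds,
        ),
    ));
}

/* Made-up user agent; use a name and contact address of your own */
$context = build_scraper_context('MyScraper/0.1 (+http://example.com/contact)', 10);

/* Show the options that will be used for the request */
print_r(stream_context_get_options($context));

/*
  With simplehtmldom loaded, the context would be passed like this
  (left commented out so the sketch runs without network access):

  $html = file_get_html($url, false, $context);
*/
```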

In the next example we will grab the list of contents from the latest Wired magazine issue.

 
$ret =  $html->find('div[id=this_month] div[class=story]');
 
foreach($ret as $story)
    echo $story->find('a', 0) . "<br>";

This will return all the content section links from the page, including their HTML markup. To return just the link text we can use the ‘plaintext’ property.

echo $story->find('a', 0)->plaintext . "<br>";

The complete source is given below.

<?php
 
/* update your path accordingly */
include_once 'libs/simplehtmldom/simple_html_dom.php';
 
$url = "http://www.wired.com/wired/";
 
$html = file_get_html($url);
 
$ret =  $html->find('div[id=this_month] div[class=story]');
 
foreach($ret as $story)
    echo $story->find('a', 0)->plaintext . "<br>";
 
?>
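The loop above assumes that every story block contains a link. A slightly more defensive sketch collects both the link text and the ‘href’ attribute and skips blocks without an anchor. The function name extract_story_links() and the sample markup are made up for illustration and do not reflect Wired's real page; the library path is the one assumed throughout this post.

```php
<?php

/* update your path accordingly */
include_once 'libs/simplehtmldom/simple_html_dom.php';

/*
  Hypothetical helper: returns an array of ('title', 'url') pairs,
  one for each story block that contains a link.
*/
function extract_story_links($html)
{
    $links = array();

    foreach ($html->find('div[id=this_month] div[class=story]') as $story) {
        $a = $story->find('a', 0);

        /* find() with an index returns null when nothing matches */
        if (!$a)
            continue;

        $links[] = array(
            'title' => trim($a->plaintext),
            'url'   => $a->href,
        );
    }
    return $links;
}

/* Made-up markup imitating the structure used in the selector */
$sample = '<div id="this_month">'
        . '<div class="story"><a href="/wired/archive/one">First story</a></div>'
        . '<div class="story"><p>No link here</p></div>'
        . '<div class="story"><a href="/wired/archive/two">Second story</a></div>'
        . '</div>';

$links = extract_story_links(str_get_html($sample));

foreach ($links as $link)
    echo $link['title'] . ' -> ' . $link['url'] . "\n";
```

The same function can be given the result of file_get_html($url) instead of the sample string once the selector has been checked against the live page.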

Now that you have seen how simple it is to scrape a web page, you can read the simplehtmldom manual for more details on the various functions available in the library and start creating your own web scrapers.







11 Responses

1

jyf1987

March 8th, 2009 at 7:55 am

well,i found a class on phpclass named phpquery
which use like jQuery

2

Joqi

March 9th, 2009 at 9:02 am

Thanks, very useful tool..

3

Sameer Borate’s Blog: Web scraping tutorial : Dragonfly Networks

March 10th, 2009 at 12:31 am

[...] a new tutorial on his blog today, Sameer shows a library that you can use (simplehtmldom) to parse remote sites [...]

4

Hans

April 5th, 2009 at 12:43 pm

Please show the full source !!

sameer

April 5th, 2009 at 11:48 pm

I’ve updated the code above.

6

Making fake programs is fun :) - Page 6 - PreCentral Forums

July 6th, 2009 at 7:30 pm

[...] web scraping tutorial Web scraping tutorial : CodeDiesel [...]

7

Akshay

July 12th, 2009 at 10:13 pm

I have created an easy to use web scraper in the form of a WordPress plugin. It uses cURL and phpQuery (for parsing). It also provides with some output functions like clear, find and replace, output (text / html) and caching and error handling capabilities. Here’s the link - http://wordpress.org/extend/plugins/wp-web-scrapper/

8

PHP Coder

August 19th, 2009 at 7:48 am

Tips: if you write web scraper, its likely that the scraper function will repeats. for example: fetching page1.html, page2.html, …, page10.html. in that case, set enough delay between request or you’ll be kicked off the site as they see you as spam.

9

Joe Duggins

August 29th, 2009 at 8:46 pm

This was really helpful. I just started using this set of php classes, and I’m happy to have a good spot to begin, good work.

10

funkuncut

February 14th, 2010 at 3:28 am

Awesome! this is super useful

11

Subramanyam Srikanth

March 3rd, 2010 at 10:13 pm

Nice post. Can any one tell me how can we scrap through javascript since i need it.


About this blog

This site is a digital habitat of Sameer, a freelance web developer working from Pune.
