Web scraping tutorial

There are three ways to access a website’s data. One is through a browser, another is through an API (if the site provides one), and the last is by parsing the web pages in code. The last one, known as web scraping, is a technique for extracting information from websites using purpose-built programs.

In this post we will take a quick look at writing a simple scraper using the simplehtmldom library. But before we continue, a word of caution:

Writing screen scrapers and spiders that consume large amounts of bandwidth, guess passwords, or grab information from a site and use it somewhere else may well violate someone’s rights and can eventually land you in trouble. Before writing a screen scraper, first see if the website offers an RSS feed or an API for the data you are looking for. If not, and you have to use a scraper, check the website’s policies regarding automated tools before proceeding.

Now that we have the legalities out of the way, let’s start with the examples.

1. Installing simplehtmldom.
Simplehtmldom is a PHP library that facilitates the process of creating web scrapers. It is an HTML DOM parser written in PHP5 that lets you manipulate HTML in a quick and easy way. It does away with the messy details of regular expressions and uses CSS-selector-style DOM access like that found in jQuery.

First download the library from SourceForge. Unzip the library into your PHP includes directory or a directory where you will be testing the code.

2. Writing our first scraper.
Now that we are ready with the tools, let’s write our first web scraper. For our initial idea, let us see how to grab the sponsored links section from a Google search page.

[Image: Google search results page]

Before we can retrieve the required data we need to know the HTML structure of the page so that we can know precisely where the required information is located.


The sponsored links sit in a ‘td’ element with the id ‘rhsline’, inside an ‘ol’ list; its ‘li’ tags contain the actual data we are after. With this information we can use the following code to read all the sponsored links.

$data =  $html->find('td[id=rhsline]');
echo $data[0]->children(1);
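
To make the selector and the children(1) call concrete, here is a minimal sketch on a hypothetical fragment. It uses PHP’s built-in DOMDocument and DOMXPath classes rather than simplehtmldom, so it runs without the library. The markup, the example.com URLs and the variable names are assumptions for illustration, not Google’s actual page.

```php
<?php
// Hypothetical fragment standing in for the results page; the real markup may differ.
$page = <<<HTML
<table><tr>
  <td id="rhsline">
    <h2>Sponsored Links</h2>
    <ol>
      <li><a href="http://example.com/a">Shop A</a></li>
      <li><a href="http://example.com/b">Shop B</a></li>
    </ol>
  </td>
</tr></table>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Counterpart of find('td[id=rhsline]'): every <td> whose id is "rhsline".
$cells = $xpath->query('//td[@id="rhsline"]');

$ads = [];
if ($cells->length > 0) {
    // Counterpart of children(1): the second element child of the first match.
    $list = $xpath->query('*[2]', $cells->item(0))->item(0);
    if ($list !== null) {
        foreach ($xpath->query('li', $list) as $item) {
            $ads[] = trim($item->textContent);
        }
    }
}

print_r($ads);
```

Note that children(1) is zero-based, so it picks the second child; in this fragment that is the ‘ol’ list, which is why the XPath counterpart asks for position 2.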

The complete source is given below.

<?php
/* update your path accordingly */
include_once 'libs/simplehtmldom/simple_html_dom.php';

$search_term = "mobiles";
// urlencode() keeps search terms with spaces or symbols from breaking the URL
$url = "http://www.google.co.in/search?hl=en&q=" . urlencode($search_term);
$html = file_get_html($url);

/*
 Get all table cells having the id attribute 'rhsline'.
 The list of sponsored links is in the 'ol' tag, as can be
 seen in the page's DOM tree, so we use the 'children' function
 on the $data object to get the sponsored links.
*/
$data = $html->find('td[id=rhsline]');

/*
 Make sure that sponsored ads are available;
 some keywords do not have sponsored ads.
*/
if (!empty($data)) {
    echo $data[0]->children(1);
}

In the next example we will grab the list of contents from the latest Wired magazine issue.

$ret =  $html->find('div[id=this_month] div[class=story]');
foreach($ret as $story)
    echo $story->find('a', 0) . "<br>";

This prints the first link of each story block as a full HTML anchor element. To print just the link text, we can use the ‘plaintext’ property.

echo $story->find('a', 0)->plaintext . "<br>";
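
The difference between the full element and its plain text can be sketched without the library. The example below assumes a hypothetical story block (the markup and link are made up) and uses PHP’s built-in DOMDocument, where serialising a node plays the role of echoing the element and textContent plays the role of ‘plaintext’.

```php
<?php
// Hypothetical markup standing in for one story block; not copied from the real page.
$fragment = '<div class="story"><a href="/story/1">Inside the <b>robot</b> lab</a></div>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($fragment);
libxml_clear_errors();

$link = $doc->getElementsByTagName('a')->item(0);

// Serialising the node yields the whole element, tags included.
$asHtml = $doc->saveHTML($link);
// textContent yields only the text a reader would see.
$asText = $link->textContent;

echo $asHtml, "\n";
echo $asText, "\n";
```

The second line of output contains no markup at all, which is usually what you want when storing titles or feeding them to other text processing.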

The complete source is given below.

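<?php
/* update your path accordingly */
include_once 'libs/simplehtmldom/simple_html_dom.php';
$url = "http://www.wired.com/wired/";
$html = file_get_html($url);
// Each story block in this month's issue holds a link to the article
$ret = $html->find('div[id=this_month] div[class=story]');
foreach ($ret as $story) {
    $link = $story->find('a', 0);
    // Skip story blocks that have no link
    if ($link) {
        echo $link->plaintext . "<br>";
    }
}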

Now that you have seen how simple it is to scrape a web page you can read the simplehtmldom manual to get more details on the various functions available in the library and start creating your own web scrapers.

I’m available for various kinds of web scraping projects.
Contact me with details at ‘metapix [at] gmail.com’ for a price quote. My hourly rates are given here.

26 thoughts on “Web scraping tutorial”

  1. Tip: if you write a web scraper, it’s likely that the scraper function will repeat, for example fetching page1.html, page2.html, …, page10.html. In that case, set enough delay between requests or you’ll be kicked off the site, as they will see you as spam.

  2. Nice post. Can anyone tell me how we can scrape through JavaScript? I need it.

  3. Excellent post.. I’m looking for some help with scraping javascript. I’ve been trying to use YQL, but in vain. Could anybody please help me out…

  4. Hi…

    I heard about web scraping from my friend, but I have some queries. Please explain:

    Is web scraping a part of hacking, or is hacking possible using scraping?

    Is it a legal procedure?

    What’s the career and future scope in this?

  5. Hi Sushil!

    Web scraping is not hacking. Web scraping is just a technique to gather text information from a web page, just like web robots do. Many sites these days provide APIs to access information from their sites.

    Regarding the legal issues, it depends on the particular site’s ‘Terms & conditions’. Check the ‘Legal Issues’ section of this article:


    If you want to make a career in scraping, do it in a broad way. That is, also learn about text processing and text mining. These are all related fields. But you will have to market your skills yourself. Most companies don’t advertise for such skills, even though they may need them.

    The best way to get started in scraping is to learn Perl and regular expressions. You can do web scraping in PHP, but it’s like digging a hole with a screwdriver; it can be done, but it will take a hell of a lot longer than if you had used a shovel. Simple scraping in PHP is OK, but for complex things nothing beats Perl. There are loads of libraries in Perl for the same.

    Even if after some time your career in scraping does not pan out, you will have learned a whole lot about text processing and regular expressions, which can be invaluable in any software field.

  6. Hi,

    I am using this concept to fetch flight schedules from airline sites to my site, and it works. But when I get an airline site URL like http://www.someairline.com/schedule.cgi, I am not able to fetch any details because no parameter is passed in the URL. In such a case, how do I do web scraping in PHP?

    Please Help me,


  7. I hope this thread is still being monitored.

    I have set up a test using the exact same code as above.

    First run and I get this:
    Call to undefined function mb_detect_encoding() in /home/vision/public_html/gunbot/libs/simplehtmldom/simple_html_dom.php on line 1234

    Any ideas?
    I’m under a deadline.
    Thanks ahead of time.

  8. Crappy. What if the page’s code changes? We need a semantic-web-enabled architecture.

  9. Awesome article. You can also do scraping easily using the HTML DOM parser library.
    Here is some sample code:

    Scrape All Links

    // This will find all links
    foreach($html->find('a') as $element)
        echo $element->href . '<br>';


  10. Hello Sushil,
    Web scraping is a legal process, not hacking.
    The biggest example of web scraping is Google.
    Google collects data from various websites.

    If you have any job related to web scraping, contact me.


  11. Nice example. Frankly I wouldn’t have thought PHP could be considered as a tool for web scraping. Then again it might make sense when you want to incorporate external web content in your own website. Still I am not sure whether it wouldn’t be better to run the scraper as a background process (e.g. in Python) to collect and store the data instead of using PHP to make these calls every single time. Another alternative would be a client side Javascript

    I wrote a tutorial of my own on web scraping – here I would consider using Scrapy or some other Python script.

  12. Thanks for the tutorial, very useful.

    Regarding the Google search, how can you add more than one search term and echo the results on the same page? For example, you would have the results for mobile, camera, ipad, etc.

  13. Hi Mr Sameer,

    I have thousands of iPhone IMEIs, and I need to know whether each one is already activated and its activation date, using the website https://selfsolve.apple.com. What I do now is search one by one and copy and paste, but it is time consuming. Can you help me?

    Example of IMEI: 359249062928270
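
The tip in the first comment about spacing out requests can be sketched as follows. This is only an illustration under stated assumptions: the function name fetch_politely, the stand-in fetcher and the example.com URLs are hypothetical, and a suitable delay depends on the site you are scraping.

```php
<?php
/**
 * Fetch each URL in turn, pausing between requests.
 * $fetch is any callable that takes a URL and returns the page body;
 * it is injected so the pacing logic can be tried without network access.
 */
function fetch_politely(array $urls, callable $fetch, float $delaySeconds = 2.0): array
{
    $pages = [];
    $lastIndex = count($urls) - 1;
    foreach (array_values($urls) as $i => $url) {
        $pages[$url] = $fetch($url);
        // No need to wait after the final request.
        if ($i < $lastIndex) {
            usleep((int) round($delaySeconds * 1000000));
        }
    }
    return $pages;
}

// Stand-in fetcher for demonstration; a real one might wrap file_get_contents().
$fakeFetch = function (string $url): string {
    return "body of {$url}";
};

// Build page1.html .. page3.html, as in the comment's example.
$urls = [];
for ($n = 1; $n <= 3; $n++) {
    $urls[] = "http://example.com/page{$n}.html";
}

$pages = fetch_politely($urls, $fakeFetch, 0.1);
echo count($pages), " pages fetched\n";
```

Passing the fetcher in as a parameter keeps the delay logic separate from how pages are downloaded, so the same loop works with file_get_contents(), cURL or simplehtmldom’s file_get_html().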
