Web scraping tutorial


There are three ways to access a website’s data. One is through a browser, another is through an API (if the site provides one), and the last is by parsing the web pages through code. The last approach, known as web scraping, is a technique for extracting information from websites using specially written programs.

In this post we will take a quick look at writing a simple scraper using the simplehtmldom library. But before we continue, a word of caution:

Writing screen scrapers and spiders that consume large amounts of bandwidth, guess passwords, or grab information from a site and use it somewhere else may well violate someone’s rights and will eventually land you in trouble. Before writing a screen scraper, first see if the website offers an RSS feed or an API for the data you are looking for. If not, and you have to use a scraper, check the website’s policies regarding automated tools before proceeding.

Now that we have got the legalities out of the way, let’s start with the examples.

1. Installing simplehtmldom.
Simplehtmldom is a PHP library that simplifies creating web scrapers. It is an HTML DOM parser written in PHP5 that lets you manipulate HTML quickly and easily. It does away with the messy details of regular expressions and uses CSS-selector-style DOM access like that found in jQuery.

First download the library from SourceForge. Unzip it into your PHP includes directory or a directory where you will be testing the code.
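Once the library is unzipped, a quick way to confirm that it loads is to parse a small string. This is only a minimal sketch: the include path is an assumption (adjust it to wherever you unzipped the library) and the HTML fragment is made up for illustration. It uses the library's str_get_html function, which parses a string instead of fetching a URL.

```php
<?php
/* update your path accordingly (this path is an assumption) */
include_once 'libs/simplehtmldom/simple_html_dom.php';

// A small, made-up HTML fragment; no network access is needed.
$sample = '<div id="menu"><a href="/home">Home</a><a href="/about">About</a></div>';

$html = str_get_html($sample);

// CSS-style selector: every 'a' tag inside the element with id "menu".
$links = array();
foreach ($html->find('div[id=menu] a') as $a) {
    // Attributes are read as properties; 'plaintext' is the text without tags.
    $links[$a->href] = $a->plaintext;
}

print_r($links);
```

If this prints the two links, the library is installed correctly and the selector syntax used in the rest of this post will work.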

2. Writing our first scraper.
Now that we are ready with the tools, let’s write our first web scraper. For our first example, let us see how to grab the sponsored links section from a Google search page.

[Image: Google search results page]

Before we can retrieve the required data, we need to know the HTML structure of the page so that we know precisely where the required information is located.

[Image: Firebug view of the page’s DOM tree]

The ‘li’ tags inside the ‘ol’ list contain the actual data we are after. With this information we can use the following code to read all the sponsored links.

$data =  $html->find('td[id=rhsline]');
echo $data[0]->children(1);
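To see what find and children return without depending on Google’s live markup, here is a minimal sketch that runs the same calls on a made-up fragment. The markup below is hypothetical and only imitates the structure described above (a ‘td’ with the id ‘rhsline’ holding a heading followed by an ‘ol’ list); the include path is also an assumption.

```php
<?php
/* update your path accordingly (this path is an assumption) */
include_once 'libs/simplehtmldom/simple_html_dom.php';

// Hypothetical markup standing in for the sponsored links cell.
$sample = '<table><tr><td id="rhsline">'
        . '<h2>Sponsored Links</h2>'
        . '<ol><li>First ad</li><li>Second ad</li></ol>'
        . '</td></tr></table>';

$html = str_get_html($sample);

// find() without an index returns an array of matching elements.
$data = $html->find('td[id=rhsline]');

// children() is zero-based: here children(0) is the h2, children(1) is the ol.
$list = $data[0]->children(1);

// Collect the text of each list item instead of echoing the whole list.
$ads = array();
foreach ($list->find('li') as $li) {
    $ads[] = $li->plaintext;
}

print_r($ads);
```

Collecting the items into an array, rather than echoing the ‘ol’ element directly, makes it easier to store or filter the results later.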

The complete source is given below.

<?php
 
/* update your path accordingly */
include_once 'libs/simplehtmldom/simple_html_dom.php';
 
$search_term = "mobiles";
 
$url = "http://www.google.co.in/search?hl=en&q=" . urlencode($search_term);
 
$html = file_get_html($url);
 
/*
Get the table cell with the id 'rhsline'.
As the list of sponsored links is in the 'ol' tag, as can be
seen from the DOM tree above, we use the 'children' function
on the first matched element to get the sponsored links.
*/
$data =  $html->find('td[id=rhsline]');
 
/*
  Make sure that sponsored ads are available;
  some keywords do not have sponsored ads.
*/
if(isset($data[0]))
    echo $data[0]->children(1);
 
?>
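The script above assumes the page can always be fetched. A minimal sketch of a guard is given below. The helper name fetch_html is hypothetical and not part of simplehtmldom, and the include path is an assumption. Because file_get_contents also reads local paths, a missing local file stands in here for an unreachable page.

```php
<?php
/* update your path accordingly (this path is an assumption) */
include_once 'libs/simplehtmldom/simple_html_dom.php';

/*
  fetch_html() is a hypothetical helper, not part of simplehtmldom.
  It returns a parsed document, or null when the page could not be
  read or parsed.
*/
function fetch_html($url)
{
    // file_get_contents() returns false on failure; @ hides the warning.
    $contents = @file_get_contents($url);
    if ($contents === false)
        return null;

    // str_get_html() is the library's string parser.
    $html = str_get_html($contents);
    return $html ? $html : null;
}

// A path that does not exist, used to show the failure branch.
$html = fetch_html('no-such-page.html');

if ($html === null) {
    echo "Could not fetch the page.";
} else {
    $data = $html->find('td[id=rhsline]');
    if (isset($data[0]))
        echo $data[0]->children(1);
}
```

In a real scraper you would pass the search URL to fetch_html instead of the placeholder file name.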

In the next example we will grab the list of contents from the latest Wired magazine issue.

 
$ret =  $html->find('div[id=this_month] div[class=story]');
 
foreach($ret as $story)
    echo $story->find('a', 0) . "<br>";

This will return all the content section links from the page. To return just the link text, we can use the ‘plaintext’ property.

echo $story->find('a', 0)->plaintext . "<br>";
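The difference between echoing the element and reading its properties can be seen on a small made-up fragment. This is a sketch only: the markup imitates the selectors used above and is not Wired’s actual HTML, and the include path is an assumption.

```php
<?php
/* update your path accordingly (this path is an assumption) */
include_once 'libs/simplehtmldom/simple_html_dom.php';

// Hypothetical markup shaped to match the selectors used in this post.
$sample = '<div id="this_month">'
        . '<div class="story"><a href="/story/1">Cover story</a></div>'
        . '</div>';

$html = str_get_html($sample);

// Passing an index to find() returns a single element instead of an array.
$story = $html->find('div[id=this_month] div[class=story]', 0);
$link  = $story->find('a', 0);

$markup = $link->outertext;  // the whole tag, as echoing the element would print
$text   = $link->plaintext;  // only the text inside the tag
$target = $link->href;       // the value of the href attribute

echo $markup . "\n" . $text . "\n" . $target . "\n";
```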

The complete source is given below.

<?php
 
/* update your path accordingly */
include_once 'libs/simplehtmldom/simple_html_dom.php';
 
$url = "http://www.wired.com/wired/";
 
$html = file_get_html($url);
 
$ret =  $html->find('div[id=this_month] div[class=story]');
 
foreach($ret as $story)
    echo $story->find('a', 0)->plaintext . "<br>";
 
?>

Now that you have seen how simple it is to scrape a web page, you can read the simplehtmldom manual for more details on the functions available in the library and start creating your own web scrapers.


This site is the digital habitat of Sameer Borate, a freelance web developer working in PHP, MySQL and WordPress. I also provide web scraping services, website design and development, and integration of various open source APIs. Contact me at metapix[at]gmail.com for new project requirements and price quotes.

22 Responses

1

jyf1987

March 8th, 2009 at 7:55 am

Well, I found a class on phpclass named phpQuery
that works like jQuery.

2

Joqi

March 9th, 2009 at 9:02 am

Thanks, very useful tool..

4

Hans

April 5th, 2009 at 12:43 pm

Please show the full source !!

sameer

April 5th, 2009 at 11:48 pm

I’ve updated the code above.

7

Akshay

July 12th, 2009 at 10:13 pm

I have created an easy-to-use web scraper in the form of a WordPress plugin. It uses cURL and phpQuery (for parsing). It also provides some output functions like clear, find and replace, output (text / html), as well as caching and error handling capabilities. Here’s the link – http://wordpress.org/extend/plugins/wp-web-scrapper/

8

PHP Coder

August 19th, 2009 at 7:48 am

Tip: if you write a web scraper, it’s likely that the scraper function will repeat, for example fetching page1.html, page2.html, …, page10.html. In that case, set enough delay between requests or you’ll be kicked off the site because they will see you as spam.

9

Joe Duggins

August 29th, 2009 at 8:46 pm

This was really helpful. I just started using this set of php classes, and I’m happy to have a good spot to begin, good work.

10

funkuncut

February 14th, 2010 at 3:28 am

Awesome! this is super useful

11

Subramanyam Srikanth

March 3rd, 2010 at 10:13 pm

Nice post. Can anyone tell me how we can scrape through JavaScript? I need it.

12

Sid

April 11th, 2010 at 7:55 pm

Excellent post. I’m looking for some help with scraping JavaScript. I’ve been trying to use YQL, but in vain. Could anybody please help me out…

13

Sushil

May 3rd, 2010 at 1:08 pm

Hi…

I heard about web scraping from my friend, but I have some queries. Please explain:

Is web scraping a part of hacking, or is hacking possible using scraping?

Is it a legal procedure?

What’s the career and future scope in this?

sameer

May 3rd, 2010 at 10:20 pm

Hi Sushil!

Web scraping is not hacking. Web scraping is just a technique to gather text information from a web page, just like web robots do. Many sites these days provide APIs to access information from their sites.

Regarding the legal issues, it depends on the particular site’s ‘Terms & Conditions’. Check the ‘Legal Issues’ section of this article:

http://en.wikipedia.org/wiki/Web_scraping

If you want to make a career in scraping, do it in a broad way. That is, also learn about text processing and text mining. These are all related fields. But you will have to market your skills yourself. Most companies don’t advertise for such skills, even though they may need them.

The best way to get started in scraping is to learn Perl and regular expressions. You can do web scraping in PHP, but it’s like digging a hole with a screwdriver; it can be done, but it will take a hell of a lot longer than if you had used a shovel. Simple scraping in PHP is OK, but for complex things nothing beats Perl. There are loads of libraries in Perl for this.

Even if after some time your career in scraping does not pan out, you will have learned a whole lot about text processing and regular expressions, which can be invaluable in any software field.

15

JenniC

May 10th, 2010 at 12:57 pm

Nice discussion.

I use web scraping on pages from our own website. I use this script.

http://www.biterscripting.com/helppages/SS_WebPageToText.html

It’s pretty simple to use.

16

Pearls

May 29th, 2010 at 5:49 am

Nice post. I will bookmark it.

17

petric

August 27th, 2012 at 5:40 am

Hi,

I am using this concept to fetch flight schedules from airline sites to my site, and it is working. But when I get an airline site URL like http://www.someairline.com/schedule.cgi, I am not able to fetch any details because there is no parameter passed in the URL. In such a case, how do I do web scraping in PHP?

Please Help me,

Thanks..

18

Cliff

April 1st, 2013 at 1:49 pm

I hope this thread is still being monitored.

I have set up a test using the exact same code as above.

First run and I get this:
Call to undefined function mb_detect_encoding() in /home/vision/public_html/gunbot/libs/simplehtmldom/simple_html_dom.php on line 1234

Any ideas?
I’m under a deadline.
Thanks ahead of time.

sameer

April 5th, 2013 at 6:29 am

Make sure that the ‘php_mbstring’ extension is enabled in your php.ini.

20

software engineer

April 9th, 2013 at 7:49 am

Crappy. What if the page’s code changes? We need a semantic-web-enabled architecture.

21

Sawan

July 14th, 2014 at 5:21 am

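Awesome article… You can also do scraping easily using the HTML DOM parser library.
Here is sample code:

Scrape All Links

<?php

/* update your path accordingly */
include_once 'libs/simplehtmldom/simple_html_dom.php';

$html = new simple_html_dom();
$html->load_file("http://www.google.com");

// This will find all links
foreach($html->find('a') as $element)
    echo $element->href . "<br>";

?>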

22

Dharmesh Hadiyal

August 1st, 2014 at 6:29 am

Hello Sushil,
web scraping is a legal process, not hacking.
The biggest example of web scraping is Google.
Google collects data from various websites.

If you have any job related to web scraping, contact me.
(mr.dharmeshhadiyal@gmail.com)

thanks
