PHP Simple HTML DOM Parser Script

In this post I explain how to scrape data from external websites with the PHP Simple HTML DOM Parser.
Simple HTML DOM Parser is a PHP 5+ class for manipulating HTML elements. It works with both valid HTML and pages that do not pass W3C validation. You can find elements by id, class, tag name and more, and you can also add, delete or alter DOM elements. The one thing to watch out for is memory leaks, which you can avoid as explained later.

Get Started with PHP Simple HTML DOM Parser

After uploading the class file, include it in your script and create an instance of the simple_html_dom class. There are three ways to load HTML into the DOM object:

  • Load HTML from a file
  • Load HTML from a URL
  • Load HTML from a string
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from an HTML file
$html->load_file('path-to-file/example.html');
// Load HTML from a URL
$html->load_file('http://www.yourdomainname.com/');
// Load HTML from a string
$html->load('<html><body>All the Besttttt!</body></html>');

If you load HTML from a URL and want more control over the HTTP request (timeouts, headers, redirects), use cURL to fetch the HTML into a string first, and then load the DOM object from that string.
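
As a minimal sketch of that cURL approach, the helper below fetches a page into a string. The function name fetch_html, the timeout and the user agent string are assumptions made for illustration and are not part of the parser.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): fetch a URL with cURL
// and return the response body as a string, or null on failure.
function fetch_html($url, $timeout = 10)
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,      // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,      // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeout,  // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout,  // seconds allowed for the whole request
        CURLOPT_USERAGENT      => 'example-scraper/1.0', // assumed value
    ));
    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // Treat transport errors and HTTP error codes as a failed fetch.
    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

The returned string can then be passed to $html->load(), after checking that it is not null.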

Find HTML Elements using PHP Simple HTML DOM Parser

You can use the find function to locate DOM elements on the page. It returns an array of element objects or, when you pass an index as the second argument, a single object (null if nothing is found at that index).

Examples:

// Find elements by tag name, for example the <p> tag. Keep in mind that this returns an array of element objects.
$p = $html->find('p');
// Find the element whose id equals a particular value,
// for example the div with id="header"
$main = $html->find('div[id=header]', 0);
// Find the (N)th element, where the first element is 0. Returns an object, or null if no element is found.
$a = $html->find('a', 0);
// Find all elements that have an id attribute
$divs = $html->find('[id]');
// Find only the divs that have an id attribute
$divs = $html->find('div[id]');
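
Because find returns either an array or a single element, the two cases are handled differently. The following is a small sketch, assuming the class file is saved as simple_html_dom.php next to the script; the sample HTML and the variable names are made up for illustration.

```php
<?php
require_once 'simple_html_dom.php'; // assumed location of the class file

$html = new simple_html_dom();
$html->load('<div id="header"><a href="/home">Home</a> <a href="/blog">Blog</a></div>');

// Without an index, find() returns an array, so loop over it.
$links = array();
foreach ($html->find('a') as $a) {
    $links[] = $a->href;
}

// With an index, find() returns one element or null, so check before using it.
$footer = $html->find('div[id=footer]', 0);
$footerText = ($footer === null) ? 'no footer found' : $footer->plaintext;

echo implode(', ', $links), PHP_EOL; // /home, /blog
echo $footerText, PHP_EOL;           // no footer found
```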

Use “selectors” to find DOM Elements:

// Find all elements where id=header. Note that two elements sharing the same id is not valid HTML.
$result = $html->find('#header');
// Find all elements where class=container
$result = $html->find('.container');
// Find elements by several tag names at once, here all <b> and <p> tags
$result = $html->find('b, p');
// Find elements by tag name where a certain attribute exists,
// for example all anchors and images that have a title attribute
$result = $html->find('a[title], img[title]');

Parent, child and sibling elements selection using built-in functions ($result here is a single element, for example one returned by find('#header', 0)):

// returns the parent of a DOM element
$result->parent();
// returns the element's children as an array
$result->children();
// returns the specified child
$result->children(0);
// returns the first child of an element, or null if there is none
$result->first_child();
// returns the last child of an element, or null if there is none
$result->last_child();
// returns the previous sibling of an element, or null if there is none
$result->prev_sibling();
// returns the next sibling of an element, or null if there is none
$result->next_sibling();
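
To see these traversal functions working together, here is a short sketch on a made-up menu. The HTML, the variable names and the location of simple_html_dom.php are assumptions for illustration.

```php
<?php
require_once 'simple_html_dom.php'; // assumed location of the class file

$html = new simple_html_dom();
$html->load('<ul id="menu"><li>Home</li><li>Blog</li><li>Contact</li></ul>');

$menu   = $html->find('#menu', 0);   // a single element, not an array
$first  = $menu->first_child();      // <li>Home</li>
$second = $first->next_sibling();    // <li>Blog</li>
$last   = $menu->last_child();       // <li>Contact</li>

echo count($menu->children()), PHP_EOL;  // 3
echo $second->plaintext, PHP_EOL;        // Blog
echo $second->parent()->id, PHP_EOL;     // menu

// Walking past either end of the list gives null rather than an error.
var_dump($first->prev_sibling() === null); // bool(true)
var_dump($last->next_sibling() === null);  // bool(true)
```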

Attribute Operators:

Attribute selectors support several operators that work like simple regular expressions:

  • [attribute] – Select HTML DOM elements that have a certain attribute
  • [attribute=value] – elements which have the specified attribute with a specific value.
  • [attribute!=value] – elements which don’t have the specified attribute with a specific value.
  • [attribute*=value] – elements with the particular attribute whose value contains the specified value
  • [attribute$=value] – elements with the specified attribute whose value ends with the specified value
  • [attribute^=value] – elements with the specified attribute whose value begins with the specified value
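
The operators above can be tried out on a few sample links. This is only a sketch: the HTML, the variable names and the location of simple_html_dom.php are assumptions for illustration.

```php
<?php
require_once 'simple_html_dom.php'; // assumed location of the class file

$html = new simple_html_dom();
$html->load(
    '<a href="https://example.com/report.pdf">Report</a>'
    . '<a href="http://example.org/index.html" title="Home page">Home</a>'
    . '<a href="https://example.com/about.html">About</a>'
);

$withTitle = $html->find('a[title]');       // anchors that have a title attribute
$secure    = $html->find('a[href^=https]'); // href begins with "https"
$pdfs      = $html->find('a[href$=pdf]');   // href ends with "pdf"
$org       = $html->find('a[href*=org]');   // href contains "org"

echo count($withTitle), PHP_EOL;   // 1
echo count($secure), PHP_EOL;      // 2
echo $pdfs[0]->plaintext, PHP_EOL; // Report
echo $org[0]->plaintext, PHP_EOL;  // Home
```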

Accessing DOM Element Attributes with PHP Simple HTML DOM Parser

Element attributes are exposed as object properties, so you read them like any other property:

$link = $html->find('a',0)->href;

Each element object also has four special properties:

  • tag – returns the tag name
  • innertext – returns inner HTML of an element
  • outertext – returns outer HTML of an element
  • plaintext – returns plain text (without HTML tags)
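
A quick sketch of how these four properties differ on the same element; the sample HTML and the location of simple_html_dom.php are assumptions for illustration.

```php
<?php
require_once 'simple_html_dom.php'; // assumed location of the class file

$html = new simple_html_dom();
$html->load('<div id="intro">Hello <b>World</b></div>');

$div = $html->find('#intro', 0);

echo $div->tag, PHP_EOL;       // div
echo $div->innertext, PHP_EOL; // Hello <b>World</b>
echo $div->outertext, PHP_EOL; // <div id="intro">Hello <b>World</b></div>
echo $div->plaintext, PHP_EOL; // Hello World
```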

Editing HTML Elements with PHP Simple HTML DOM Parser

Editing an attribute is similar to reading its value.

// Change or set attribute value
$a->href = 'http://www.yourdomainname.com';
// Remove an attribute.
$a->href = null;
// Check if attribute exists
if(isset($a->href)) {
	//do something here
}

There are no special functions to append or remove elements, but you can get the same result by rewriting an element's outertext:

// Wrap an element
$result->outertext = '<div class="wrap">' . $result->outertext . '</div>';
// Remove an element
$result->outertext = '';
// Append an element after the current one
$result->outertext = $result->outertext . '<div>header</div>';
// Insert an element before the current one
$result->outertext = '<div>header</div>' . $result->outertext;
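
As a small worked example of the outertext technique, the sketch below strips unwanted paragraphs from a page and prints what is left. The sample HTML, the "ad" class name and the location of simple_html_dom.php are assumptions for illustration; save() is the parser's method for dumping the DOM back into a string.

```php
<?php
require_once 'simple_html_dom.php'; // assumed location of the class file

$html = new simple_html_dom();
$html->load('<div id="page"><p class="ad">Buy now</p><p>Real content</p></div>');

// Remove every paragraph with class "ad" by blanking its outer HTML.
foreach ($html->find('p.ad') as $ad) {
    $ad->outertext = '';
}

// Serialize the modified document back into a string.
$cleaned = $html->save();
echo $cleaned, PHP_EOL; // <div id="page"><p>Real content</p></div>
```

Edits made this way show up when the document is serialized. As far as I know, the parsed tree itself is not rebuilt, so to run find on the edited markup it is safer to load the saved string into a fresh DOM object.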

To save the DOM document, dump the DOM object back into a string with save():

$doc = $html->save();
// Display the page
echo $doc;

Prevent PHP Simple HTML DOM Parser Memory Leak

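Always watch out for memory leaks, because they can slow down your website or exhaust PHP's memory limit. Call the following method once you are done with a DOM object, especially when you parse many pages in a loop, to release the memory it holds.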

$html->clear();
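
Cleanup matters most when many documents are parsed in one script run. A minimal sketch of that pattern follows; the $pages array stands in for HTML fetched elsewhere, and the location of simple_html_dom.php is an assumption.

```php
<?php
require_once 'simple_html_dom.php'; // assumed location of the class file

// Stand-in for HTML strings fetched from several pages.
$pages = array(
    '<html><head><title>Page one</title></head><body></body></html>',
    '<html><head><title>Page two</title></head><body></body></html>',
);

$titles = array();
foreach ($pages as $source) {
    $html = new simple_html_dom();
    $html->load($source);

    // Copy out the plain values you need before releasing the DOM.
    $title = $html->find('title', 0);
    $titles[] = ($title === null) ? '' : $title->plaintext;

    // Release the parsed tree before moving on to the next page.
    $html->clear();
    unset($html);
}

echo implode(', ', $titles), PHP_EOL; // Page one, Page two
```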

Happy Coding!!

About Guest Author: Ravi Makhija
A writer, an Entrepreneur. Curious about the internet of everything. Interested in the cutting edge landscape of mobile apps, SAAS products, web scraping services and latest technology trends. Founder & CEO at Guru Technolabs.

5 thoughts on “PHP Simple HTML DOM Parser Script”

  1. I find PHP Dom complex for simple web scraping jobs. Although it will be relatively fast as compared to SimpleHTMLDom, the simplicity made me choose SimpleHTMLDom over the other.
