Taking website screenshots using PhantomJS

In a previous post we saw how to take webshots of web pages using the ‘wkhtmltoimage’ toolkit. PhantomJS gives us something more flexible. PhantomJS is a headless WebKit browser that is scriptable with JavaScript. It has fast, native support for various web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG. You can programmatically access web page content for scraping, monitoring or testing purposes, including content that is generated by JavaScript on the page, which a plain server-side HTTP fetch cannot see because it never runs the page's scripts. For most purposes you can think of it as a browser without a window that we drive via JavaScript. Here we use it to grab webshots of web pages.

Installation

PhantomJS is available for Mac, Linux and Windows as a compressed package. You only need to extract it to a directory and run it from there, or add that directory to your system path. Sample code to grab the post titles from ‘smashing-magazine.com’ is given below.

example.js

console.log('Loading a web page');

var page = require('webpage').create();
var url = "http://www.smashing-magazine.com/";
page.open(url, function (status) {
    if (status === 'success') {
        // This function runs inside the loaded page, not in the PhantomJS script
        var results = page.evaluate(function () {
            var articles = document.getElementsByTagName("article");
            var titles = [];

            for (var i = 0; i < articles.length; i++) {
                // Depends on the site's markup: the title is expected in the
                // first child of each article's second child node
                titles[i] = articles[i].childNodes[1].childNodes[0].innerHTML;
            }

            return titles;
        });

        for (var i = 0; i < results.length; i++) {
            console.log(results[i]);
        }
    } else {
        console.log('Unable to load ' + url);
    }
    phantom.exit();
});

You can run the above using PhantomJS from the command line.

C:\phantomjs>phantomjs example.js
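
The childNodes indexing in example.js depends on the exact markup of the page, so it tends to break when the site changes its HTML. A minimal sketch of a more tolerant variant is given below. It relies on ‘page.evaluate()’ accepting extra JSON-serializable arguments, which it passes to the function it runs inside the page. The helper name ‘extractTitles’ and the selector ‘article h2’ are assumptions made for illustration; adjust the selector to the markup of the site you are scraping.

```javascript
// Runs inside the loaded page, so it may only use its arguments and the
// page's own globals (such as document).
function extractTitles(selector) {
    var nodes = document.querySelectorAll(selector);
    var titles = [];
    for (var i = 0; i < nodes.length; i++) {
        // textContent ignores markup; the regex strips surrounding whitespace
        titles.push(nodes[i].textContent.replace(/^\s+|\s+$/g, ''));
    }
    return titles; // an array of strings can be passed back to the script
}

// Only drive a page when actually running under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://www.smashing-magazine.com/', function (status) {
        if (status === 'success') {
            // 'article h2' is an assumed selector, not taken from the site
            var titles = page.evaluate(extractTitles, 'article h2');
            console.log(titles.join('\n'));
        } else {
            console.log('Unable to load the page');
        }
        phantom.exit();
    });
}
```

It is run the same way as example.js. The function only touches its argument and the page's ‘document’ because it is serialized and executed in the page, where variables from the PhantomJS script are not visible.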

Grabbing your first webshot

A quick way to get a webshot of a page is the ‘rasterize.js’ script. A shortened version is given below, and the full script is included with the PhantomJS examples. Most of it deals with error detection and setup, while the main task of rendering the image is done by the function ‘page.render()’.

rasterize.js

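var page = require('webpage').create(),
    system = require('system'),
    address, output, size;

if (system.args.length < 3) {
    console.log('Usage: rasterize.js URL filename [paperwidth*paperheight|paperformat]');
    phantom.exit(1);
} else {
    address = system.args[1];
    output = system.args[2];
    page.viewportSize = { width: 2000, height: 600 };
    // An optional third argument sets the paper size for PDF output
    if (system.args.length > 3 && output.substr(-4) === ".pdf") {
        size = system.args[3].split('*');

        page.paperSize = size.length === 2 ?
            { width: size[0], height: size[1], margin: '0px' } :
            { format: system.args[3], orientation: 'portrait', margin: '1cm' };
    }
    page.open(address, function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
            phantom.exit(1); // exit here too, otherwise PhantomJS keeps running
        } else {
            // Give the page a moment to finish painting before rendering
            window.setTimeout(function () {
                page.render(output);
                phantom.exit();
            }, 200);
        }
    });
}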

Now all we need to do is run the code with PhantomJS. The following takes a webshot of cnn.com and saves it to the file ‘cnn-webshot.png’.

C:\phantomjs>phantomjs rasterize.js  http://www.cnn.com  cnn-webshot.png

Below is an example webshot of codediesel.com created using PhantomJS.

Note that you will need to set ‘page.viewportSize’ appropriately. A width that is too small can cut off part of the page or, if the site uses responsive design, return a layout optimized for mobile devices. A width of 2000 pixels, as used in the above example, is a safe value.
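
The viewport controls the width the page is laid out at, while ‘page.clipRect’ controls which part of the laid-out page ends up in the image. The sketch below uses both to capture a single element while the whole page is still loaded and laid out. The helper ‘toClipRect’, the selector ‘#content’ and the output name ‘element.png’ are hypothetical, and the sketch assumes the page is not scrolled, so the element's bounding box is in page coordinates.

```javascript
// Turn an element's bounding box into a clipRect, with optional padding.
function toClipRect(box, padding) {
    var pad = padding || 0;
    return {
        top: Math.max(0, Math.floor(box.top - pad)),   // never above the page
        left: Math.max(0, Math.floor(box.left - pad)), // never left of the page
        width: Math.ceil(box.width + 2 * pad),
        height: Math.ceil(box.height + 2 * pad)
    };
}

// Only drive a page when actually running under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.viewportSize = { width: 2000, height: 600 };
    page.open('http://www.cnn.com', function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
            phantom.exit(1);
            return;
        }
        // Measure the element inside the page; '#content' is an assumed selector
        var box = page.evaluate(function (selector) {
            var el = document.querySelector(selector);
            if (!el) { return null; }
            var r = el.getBoundingClientRect();
            return { top: r.top, left: r.left, width: r.width, height: r.height };
        }, '#content');
        if (box) {
            page.clipRect = toClipRect(box, 10);
            page.render('element.png');
        } else {
            console.log('Element not found');
        }
        phantom.exit();
    });
}
```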

Currently you can save the image in PNG, JPEG or PDF format by giving the output file the appropriate extension.
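
To capture more than one page in a single run, the pages can be opened one after another, starting the next only when the previous one has been rendered. A rough sketch follows. The helper names ‘fileNameFor’ and ‘captureAll’ are made up for this example, and the URLs are taken from the command line instead of being hard-coded.

```javascript
// Derive an output file name from a URL.
function fileNameFor(url) {
    return url
        .replace(/^[a-z]+:\/\//i, '')        // drop the scheme
        .replace(/[^a-z0-9.\-]+/gi, '_')     // replace characters unsafe in file names
        .replace(/^_+|_+$/g, '') + '.png';   // trim stray underscores, add extension
}

// Open and render the URLs one at a time, then call done().
function captureAll(page, urls, done) {
    if (urls.length === 0) {
        done();
        return;
    }
    var url = urls[0];
    page.open(url, function (status) {
        if (status === 'success') {
            page.render(fileNameFor(url));
        } else {
            console.log('Unable to load ' + url);
        }
        // Continue with the remaining URLs whether or not this one loaded
        captureAll(page, urls.slice(1), done);
    });
}

// Only drive a page when actually running under PhantomJS.
if (typeof phantom !== 'undefined') {
    var system = require('system');
    var page = require('webpage').create();
    page.viewportSize = { width: 2000, height: 600 };
    // Everything after the script name is treated as a URL
    captureAll(page, system.args.slice(1), function () {
        phantom.exit();
    });
}
```

A URL that contains characters such as ‘&’ should be wrapped in double quotes on the command line, otherwise the shell interprets them before PhantomJS sees the URL.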

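You can disable image loading during rendering with the ‘--load-images’ option. Note that PhantomJS options go before the script name; anything after the script name is passed to the script as an argument.

C:\phantomjs>phantomjs --load-images=no rasterize.js  http://www.cnn.com  cnn-webshot.png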
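
Image loading can also be switched off from inside a script through ‘page.settings’, which is handy when only some pages should skip images. The sketch below assumes a small helper, ‘applySettings’ (not part of PhantomJS), that copies overrides onto ‘page.settings’; ‘no-images.png’ is a made-up output name.

```javascript
// Copy the given overrides onto page.settings and return the page.
function applySettings(page, overrides) {
    for (var key in overrides) {
        if (Object.prototype.hasOwnProperty.call(overrides, key)) {
            page.settings[key] = overrides[key];
        }
    }
    return page;
}

// Only drive a page when actually running under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = applySettings(require('webpage').create(), {
        loadImages: false   // in-script counterpart of --load-images=no
    });
    page.open('http://www.cnn.com', function (status) {
        if (status === 'success') {
            page.render('no-images.png');
        } else {
            console.log('Unable to load the address!');
        }
        phantom.exit();
    });
}
```

The settings need to be in place before ‘page.open()’ is called, since they are applied when the page is loaded.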

Some more interesting options that can be passed on the command line are given below.

--cookies-file=/path/to/cookies.txt 
specifies the file name to store the persistent cookies.
 
--disk-cache=[yes|no] 
enables disk cache (at desktop services cache storage location;
default is 'no').
 
--ignore-ssl-errors=[yes|no] 
ignores SSL errors, such as expired or self-signed certificate errors 
(default is no). 
 
--load-images=[yes|no] 
loads all inlined images (default is 'yes').
 
--local-to-remote-url-access=[yes|no] 
allows local content to access remote URL (default is no). 
 
--max-disk-cache-size=size 
limits the size of disk cache (in KB) 
 
--output-encoding=encoding 
sets the encoding used for terminal output (default is utf8). 
 
--proxy=address:port 
specifies the proxy server to use (e.g. --proxy=192.168.1.42:8080). 
 
--proxy-type=[http|socks5] 
specifies the type of the proxy server. 
 
--script-encoding=encoding 
sets the encoding used for the starting script (default is utf8). 
 
--web-security=[yes|no] 
enables web security and forbids cross-domain XHR (default is yes).

7 thoughts on “Taking website screenshots using PhantomJS”

  1. Thanks for this post.
    I just wanted to pass along that I got a “parse error” from PhantomJS for “rasterize.js” until I added a comma after “2000” in this line:

    page.viewportSize = { width: 2000 height: 600 };

  2. Awesome post to get started, thanks for taking the time.

    Are you able to get phantom to only take a picture of a particular DIV and all of its contents but still render the whole page?

    Basically I’ve built a tool that previews websites in various devices, and I want to give the user the ability to save the image. The section is made up of 4 iframes that use either a GET variable or an input value to update the src values.

    Any more details/tutorials you know about that could help would be great.

  3. How do you pass URL arguments and hit more than one URL? I always get parsing errors when I try to use any URL arguments (even HTML-encoded values).
