Using PhantomJS to headlessly analyze web pages


One of this year’s most interesting open source projects has been PhantomJS, a headless WebKit with a JavaScript API. It has fast, native support for various web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG. I’ve already used it to automatically scrape data from pages and for testing. In this post we will use PhantomJS along with confess.js to analyze web page performance. confess.js currently has two main functions: providing simple page performance profiles and generating app cache manifests.

While this might seem a poor-quality alternative to the rich UI diagnostic tools available, it provides a nice way to get a quick overview of a page’s performance profile and potential bottlenecks. More importantly, it can be easily extended, run from the command line, automated, and integrated with other scripts.

Once it is installed, the quickest thing we can do with confess.js is generate a simple performance report for a given page. The format of the command is shown below, where URL is the required URL of the web page to analyze, TASK is what you want the tool to do, and CONFIG is the optional location of an alternative configuration file, if you don’t want to use the default config.json. Currently there are three task options: performance, appcache, and cssproperties.

phantomjs confess.js URL TASK [CONFIG]

Performance Task

The command to analyze the performance of a page is shown below. Using the performance task argument will get confess.js to load the page, and then log the sizes and timings of its various parts. It will list the fastest and slowest resources, and also the largest and smallest (subject to the availability of the content-length header). With the verbose configuration enabled (default), it will also list out all the resources loaded as part of the page, and display an ASCII-art waterfall chart of their timings.

phantomjs confess.js http://www.cnn.com/ performance

Here, PhantomJS launches the confess.js script with the ‘performance’ task, loads and analyzes the cnn.com page, and then generates a report something like the following:

Config:
 task: performance
 userAgent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.12 Safari/535.11
 wait: 0
 consolePrefix: #
 verbose: true
 url: http://www.cnn.com/
 configFile: config.json
 
Elapsed load time:  14826ms
   # of resources:      125
 
 Fastest resource:      4ms; data:image/png;base64,iVBORw0KGgoAAAAN...5tIuMa+L2z+BexZXK+OBaruAAAAAElFTkSuQmCC
 Slowest resource:   8966ms; http://i.cdn.turner.com/cnn/.e/img/3.0/global/buttons/Sprite_BT_master.gif
  Total resources: 367813ms
 
Smallest resource:       1b; http://ads.cnn.com/html.ng/site=cnn_in...3325696&tile=0207570641431&domId=478084
 Largest resource:   96926b; http://z.cdn.turner.com/cnn/tmpl_asset...intl_homepage/730/css/intlhplib-min.css
  Total resources:  562904b; (at least)
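To make these summary figures a little more concrete, here is a minimal sketch of how such numbers could be derived from a list of per-resource records. This is not the actual confess.js code: the record shape ({ url, duration, size }) and the name summarizeResources are my own assumptions for illustration. It also suggests why the total time can be far larger than the elapsed load time (durations are presumably summed even though resources load in parallel) and why the total size is reported as "at least" (resources without a known size are skipped).

```javascript
// Hypothetical resource records: { url, duration (ms), size (bytes or null) }.
// A null size stands for a response that carried no content-length header.
function summarizeResources(resources) {
  if (resources.length === 0) {
    return null;
  }
  var summary = {
    count: resources.length,
    fastest: resources[0],
    slowest: resources[0],
    smallest: null,
    largest: null,
    totalDuration: 0,
    totalSize: 0
  };
  resources.forEach(function (r) {
    summary.totalDuration += r.duration;
    if (r.duration < summary.fastest.duration) { summary.fastest = r; }
    if (r.duration > summary.slowest.duration) { summary.slowest = r; }
    // Only resources with a known size take part in the size statistics.
    if (typeof r.size === 'number') {
      summary.totalSize += r.size;
      if (summary.smallest === null || r.size < summary.smallest.size) {
        summary.smallest = r;
      }
      if (summary.largest === null || r.size > summary.largest.size) {
        summary.largest = r;
      }
    }
  });
  return summary;
}

// Made-up sample data, not taken from the cnn.com run above.
var sample = [
  { url: 'http://example.com/', duration: 120, size: 5000 },
  { url: 'http://example.com/a.css', duration: 40, size: 800 },
  { url: 'http://example.com/b.gif', duration: 300, size: null }
];
var report = summarizeResources(sample);
console.log('Slowest resource: ' + report.slowest.duration + 'ms; ' + report.slowest.url);
console.log('Total resources: ' + report.totalSize + 'b; (at least)');
```

In a real PhantomJS script the records would most likely be filled in from the page's onResourceRequested and onResourceReceived callbacks, but how confess.js does this internally is not something I have verified here.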

You also get a text-based waterfall diagram and timing information for various resources.

 8:    638ms;     858b; http://i.cdn.turner.com/cnn/.e/img/3.0/search/btn_search_hp_text.gif
 9:    639ms;     138b; http://i.cdn.turner.com/cnn/.e/img/3.0/global/icons/video_icon.gif
10:    640ms;      43b; http://i.cdn.turner.com/cnn/.e/img/3.0/1px.gif
11:    646ms;      94b; http://i.cdn.turner.com/cnn/.e/img/3.0/global/misc/advertisement.gif
12:    675ms;     229b; http://i.cdn.turner.com/cnn/.element/img.../personalization/35x35_generic_avaar.gif
13:    927ms;     292b; http://i.cdn.turner.com/cnn/.element/img/3.0/personalization/close_bt.gif
14:    931ms;     889b; http://i.cdn.turner.com/cnn/.e/img/3.0/search/search_btn_footer.gif
15:    935ms;    1119b; http://i.cdn.turner.com/cnn/.e/img/3.0/global/footer/pngs/footer_google.png
16:    937ms;     475b; http://i.cdn.turner.com/cnn/.e/img/3.0/global/footer/pngs/footer_cnn_logo.png
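If you want to experiment with this kind of output yourself, a text waterfall is easy to approximate. The sketch below is only an illustration under my own assumptions: the record shape ({ url, start, duration }), the function names, and the bar format are hypothetical and do not reproduce the exact chart that confess.js prints.

```javascript
// Repeat a character n times without String.prototype.repeat,
// which older JavaScript engines do not provide.
function repeatChar(ch, n) {
  return new Array(n + 1).join(ch);
}

// Hypothetical records: { url, start, duration }, with both times in
// milliseconds relative to the start of the page load.
function renderWaterfall(resources, width) {
  // Time at which the last resource finishes (at least 1 to avoid
  // dividing by zero when every time is 0).
  var end = resources.reduce(function (max, r) {
    return Math.max(max, r.start + r.duration);
  }, 1);
  return resources.map(function (r) {
    // Scale start and duration to columns, keeping every bar visible
    // and inside the chart.
    var offset = Math.min(width - 1, Math.floor(r.start / end * width));
    var length = Math.round(r.duration / end * width);
    length = Math.max(1, Math.min(width - offset, length));
    return '|' + repeatChar(' ', offset) + repeatChar('=', length) +
      repeatChar(' ', width - offset - length) + '| ' + r.url;
  });
}

// Made-up timings for two resources loaded one after the other.
var timeline = [
  { url: 'index.html', start: 0, duration: 50 },
  { url: 'style.css', start: 50, duration: 50 }
];
console.log(renderWaterfall(timeline, 10).join('\n'));
```

Each row shows when a resource started and how long it took relative to the whole load, which makes long sequential chains easy to spot.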

appcache Task

Using the appcache task argument will get confess.js to load the page, and then search the DOM and the CSS Object Model (CSSOM) for references to any external resources that the app needs. It can optionally also look for resource request events.

phantomjs confess.js http://www.cnn.com/ appcache

The results go to stdout, but you can redirect them to a file for later use.

phantomjs confess.js http://www.cnn.com/ appcache > app.cache
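For readers who have not seen one, an app cache manifest is a plain text file that starts with the line CACHE MANIFEST and lists URLs under section headers such as CACHE: and NETWORK:. The sketch below shows how a list of collected URLs could be turned into such a file. It is my own illustration, not confess.js code: the name buildManifest is hypothetical, and treating cacheFilter and networkFilter as regular expressions matched against each URL is an assumption based on how they appear in the configuration file.

```javascript
// Build manifest text from a list of URLs.
// Assumption: cacheFilter and networkFilter are regular expression
// strings; networkFilter may be null.
function buildManifest(urls, cacheFilter, networkFilter) {
  var cachePattern = new RegExp(cacheFilter);
  var networkPattern = networkFilter ? new RegExp(networkFilter) : null;
  var seen = Object.create(null); // tracks URLs already listed
  var cache = [];
  var network = [];
  urls.forEach(function (url) {
    if (seen[url]) { return; } // skip duplicates
    seen[url] = true;
    if (networkPattern && networkPattern.test(url)) {
      network.push(url);
    } else if (cachePattern.test(url)) {
      cache.push(url);
    }
  });
  // Design choice of this sketch: with no explicit network entries,
  // fall back to "*" so everything else is fetched online.
  var lines = ['CACHE MANIFEST', '', 'CACHE:'].concat(cache);
  lines = lines.concat(['', 'NETWORK:'], network.length ? network : ['*']);
  return lines.join('\n');
}

// Made-up URLs for illustration.
var manifestText = buildManifest(
  ['a.css', 'a.css', 'http://x/api'],
  '.*',
  'api'
);
console.log(manifestText);
```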

cssproperties Task

Using the cssproperties task argument will get confess.js to load the page, and then parse the styles in the DOM and CSSOM to identify which CSS properties are being used by the page. Note that this can return blank results for some pages or sites.
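To give an idea of what walking the CSSOM involves, here is a rough sketch that collects the property names declared in a document's style sheets. It is not taken from confess.js: the name collectCssProperties is hypothetical, and the sketch only looks at top-level style rules, ignoring inline styles and rules nested inside @media blocks. It takes the document as a parameter so it can be tried outside a browser with a stand-in object.

```javascript
// Collect the distinct CSS property names declared in a document's
// style sheets. `doc` is expected to look like a browser document:
// doc.styleSheets[i].cssRules[j].style is an indexed list of names.
function collectCssProperties(doc) {
  var found = Object.create(null);
  var sheets = doc.styleSheets || [];
  for (var i = 0; i < sheets.length; i++) {
    var rules;
    try {
      // Browsers may refuse access to the rules of a cross-origin
      // style sheet, either by throwing or by returning null.
      rules = sheets[i].cssRules || [];
    } catch (e) {
      rules = [];
    }
    for (var j = 0; j < rules.length; j++) {
      var style = rules[j].style;
      if (!style) { continue; } // e.g. @import rules carry no style
      for (var k = 0; k < style.length; k++) {
        found[style[k]] = true;
      }
    }
  }
  return Object.keys(found).sort();
}

// Stand-in for a real document, for illustration only.
var fakeDocument = {
  styleSheets: [
    { cssRules: [{ style: ['color', 'margin-top'] }, { style: ['color'] }] },
    { cssRules: null },
    { cssRules: [{}] }
  ]
};
console.log(collectCssProperties(fakeDocument).join(', '));
```

Inside PhantomJS, code like this would have to run in the page context, for example through page.evaluate, with the real document passed in.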

Default configuration

The content of the default configuration file, config.json, is shown below. You can edit it directly, or create a new file and specify its name and path as the CONFIG argument on the command line.

{
  "task": "appcache",
  "userAgent": "chrome",
  "userAgentAliases": {
    "iphone": "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X;...
    "android": "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One ...
    "chrome": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 ...
  },
  "wait": 0,
  "consolePrefix": "#",
  "verbose": true,
  "appcache": {
    "urlsFromDocument": true,
    "urlsFromRequests": false,
    "cacheFilter": ".*",
    "networkFilter": null
  }
}
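Comparing this file with the Config section of the performance report earlier suggests two things: the task given on the command line overrides the "task" value in the file, and a short "userAgent" name such as chrome is expanded to the full string from "userAgentAliases". The sketch below illustrates that behaviour. It is an assumption-laden illustration rather than the tool's own code: resolveConfig is a hypothetical name, the merge is deliberately shallow, and the user agent string is a placeholder.

```javascript
// Merge overrides on top of defaults (shallow merge), then expand a
// user agent alias to its full string if one is defined.
function resolveConfig(defaults, overrides) {
  var config = {};
  [defaults, overrides].forEach(function (source) {
    Object.keys(source || {}).forEach(function (key) {
      config[key] = source[key];
    });
  });
  var aliases = config.userAgentAliases || {};
  if (Object.prototype.hasOwnProperty.call(aliases, config.userAgent)) {
    config.userAgent = aliases[config.userAgent];
  }
  return config;
}

// Trimmed-down defaults; the user agent value is a placeholder, not
// the real string from config.json.
var defaultConfig = {
  task: 'appcache',
  userAgent: 'chrome',
  userAgentAliases: { chrome: 'Mozilla/5.0 (placeholder)' },
  verbose: true
};
var effective = resolveConfig(defaultConfig, { task: 'performance' });
console.log(effective.task + ' / ' + effective.userAgent);
```

A user agent that is not a known alias is passed through unchanged, so a full string can still be supplied directly.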

This site is a digital habitat of Sameer Borate, a freelance web developer working in PHP, MySQL and WordPress. I also provide web scraping services, website design and development, and integration of various open source APIs. Contact me at metapix[at]gmail.com for any new project requirements and price quotes.
