Building on Puppeteer: Finding a way beyond PhantomJS

Why we finally need to move on
Almost every full-stack or front-end engineer has some experience with PhantomJS. It is heavily used for integration tests, website monitoring, data scraping and a lot more. However, the last release of PhantomJS was back in January 2016 and it was starting to show its age: it could not reliably run sites using ES6 and other modern technologies. The few hard-working contributors tried to keep it up to date, but it became too much. In August 2017, the repository was officially declared abandoned.

Here at Quantcast, we run the world’s largest AI-driven audience insights and measurement platform, directly measuring over 100MM web destinations. The backbone of our platform is Quantcast Measure. The first step to getting started with Measure is to register your site on quantcast.com and put the supplied JavaScript tag on your web pages. This tag provides us with multidimensional information about the users who visit your site. Because the tag is so important, we need to be able to quickly verify that it has been properly placed on your site. Our original scanner, written in 2006, just grabbed the page HTML and searched it for the tag. This started to fail with the advent of tag managers. About 2 years ago we built a second iteration of the scanner using PhantomJS, and now we once again find ourselves in need of a new scanner because more and more scans are failing.

Tooling options for a site scanner
Having a service last for under two years after a major rewrite is not ideal. Generally we would want a product like this to last 3-4 years before we have to revisit it. One of the main requirements for the next iteration was therefore a well supported and maintained headless browser. We also wanted it to be as lightweight and easy to use as possible, and it needed to be able to monitor all page resources. After a lot of searching around we came up with three options: Electron, Chromeless and Puppeteer.

Electron is built on top of Chromium and NodeJS, so we were hoping that it would abstract away most of the nitty-gritty Chromium details. It looks to be well maintained and easy to install and use. However, we found that it contains a whole lot more functionality than we actually need. There are a few tools people have written on top of it, like NightmareJS, that make it a lot simpler to use, but it doesn’t have the resource monitoring we need. At this writing it also lacks a truly headless mode, so we would need an additional graphics environment dependency such as xvfb.

Chromeless is another popular automated browser tool. It is pretty new and well maintained. It has a very simple API that is very similar to that of the “headful” NightmareJS. However, it is more geared toward crawling, automated testing, and screenshots. It doesn’t really give us the detailed resource-level data we need for this project.

Finally we found Puppeteer. Puppeteer is what the PhantomJS maintainer recommended moving toward. It is also written on top of Chromium, utilizing its new “headless” mode. It is maintained by the Chrome DevTools team and used internally by Google, which gives us confidence that it will be around for a long time. Plus, its APIs are strikingly similar to those of PhantomJS, so we know it will handle what we want. It depends on a rather large Chromium install, but all the configuration is abstracted away. This tool seems to fit our requirements almost perfectly.

Build/Code
We have a seemingly simple set of requirements: given a URL, return any Quantcast tags firing on the page. Puppeteer’s APIs follow a pattern that is very familiar from normal web browsing. First you launch the browser and then open individual tabs, or pages as Puppeteer calls them. A browser can have multiple pages and each can be reused for multiple requests. Because they are being reused, we want to prevent the browser from caching anything so that it doesn’t affect future page loads. We have found that these options seem to do the trick:

const puppeteer = require('puppeteer');

// Launch the browser once; the flags discourage caching between page loads.
let browserPromise = puppeteer.launch({
    timeout: 60000,
    args: ['--aggressive-cache-discard', '--disable-cache', '--disable-application-cache']
});
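As a rough sketch of how the shared browser might then be used, the helper below opens a fresh page for each scan and always closes it afterwards. The name `withPage` is hypothetical and the one-page-per-scan policy is an assumption for illustration. `browser.newPage()`, `page.setCacheEnabled()` and `page.close()` are Puppeteer calls, although `setCacheEnabled` is only available in more recent Puppeteer releases.

```javascript
// Hypothetical helper: run `fn` against a fresh page, then always close the page.
// `browserPromise` is the promise returned by puppeteer.launch(...).
async function withPage(browserPromise, fn) {
  const browser = await browserPromise;
  const page = await browser.newPage();
  try {
    // Per-page cache switch, in addition to the launch flags.
    await page.setCacheEnabled(false);
    return await fn(page);
  } finally {
    // Close the tab even if the scan throws, so pages do not pile up.
    await page.close();
  }
}
```

A scan would then look something like `withPage(browserPromise, page => page.goto(url))`, with the listeners described below attached inside the callback.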

For each page we need to set up an event listener for all the requests made during a given page load. In our case we are specifically looking for finished Quantcast Measure pixel requests, which come from `pixel.quantserve.com`.

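const tags = [];

page.on('requestfinished', request => {
    const resp = request.response();
    // In Puppeteer 1.0 and later, url() and status() are methods rather than properties.
    if (resp && resp.url().indexOf('pixel.quantserve.com/pixel') > -1) {
        tags.push({tag: resp.url(), code: resp.status()});
    }
});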
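The listener can also be wrapped in a small helper so that each scan gets its own list of tags. This is a minimal sketch rather than our production code: `collectTags` and `PIXEL_PATH` are illustrative names, and the accessors are written as method calls (`url()`, `status()`), which is how Puppeteer 1.0 and later expose them.

```javascript
const PIXEL_PATH = 'pixel.quantserve.com/pixel';

// Attach a listener to `page` and return the array it fills during a page load.
function collectTags(page) {
  const tags = [];
  page.on('requestfinished', (request) => {
    const resp = request.response();
    // response() can be null when no response was received.
    if (resp && resp.url().includes(PIXEL_PATH)) {
      tags.push({ tag: resp.url(), code: resp.status() });
    }
  });
  return tags;
}
```

Because the helper only relies on `page.on`, it can be exercised in a unit test with a plain Node `EventEmitter` standing in for the page.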

Once we have that in place we can simply tell the page to go to the URL we want to check. After a few trial scans we began to notice that our tag was not always being caught. We quickly realized why: our tag is asynchronous, so as not to slow down the load time of users’ pages, and sometimes the page load event fired before our tag had finished. Luckily, Puppeteer comes with a way to wait until the network becomes idle (`networkidle0`, no open connections) or semi-idle (`networkidle2`, no more than two open connections) for at least 500 ms. To prevent pages that automatically update (such as Twitter) from never completing, we use the semi-idle event. So let’s try again.

page.goto(url, {waitUntil: ['load', 'networkidle2']});

This turned out to be much better. Scans were succeeding on a regular basis, but the next issue we noticed was that some scans were a little slow or took a large amount of memory. A quick fix is to not actually load any images (except our image tag). Puppeteer allows us to do this by intercepting any request before it happens and aborting it.

await page.setRequestInterception(true);

page.on('request', request => {
    // resourceType() and url() are methods in Puppeteer 1.0 and later.
    if (request.resourceType() === 'image' && request.url().indexOf('pixel.quantserve.com/pixel') === -1) {
        request.abort();
    } else {
        request.continue();
    }
});
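One way to keep the blocking rule easy to test is to pull the decision into a pure function and have the request handler call it. The sketch below assumes Puppeteer 1.0 or later, where `resourceType()` and `url()` are methods; `shouldBlock` and `blockImages` are hypothetical names.

```javascript
// Decide whether a request should be aborted: every image except the Quantcast pixel.
function shouldBlock(resourceType, url) {
  return resourceType === 'image' && !url.includes('pixel.quantserve.com/pixel');
}

// Enable interception on `page`, then abort or continue each request.
async function blockImages(page) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (shouldBlock(request.resourceType(), request.url())) {
      request.abort();
    } else {
      request.continue();
    }
  });
}
```

Once interception is enabled, every request has to be either aborted or continued, otherwise it stalls, which is why the handler always takes one of the two branches.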

The final issue we ran into was that sometimes pages just don’t load in time. We give pages a full minute to load, but time is money and we can’t wait around forever. So we need to catch the case where the page throws a timeout and check whether any of our tags have fired. If so, we can just return them; if not, we record the fact that the page timed out for further investigation.

let result;
try {
    const resp = await page.goto(url, {waitUntil: ['load', 'networkidle2']});
    // status() is a method in Puppeteer 1.0 and later.
    result = {tags: tags, code: resp.status()};
} catch (err) {
    // Tags ended up loading, but the page technically did not load successfully.
    if (tags.length > 0) {
        result = {tags: tags, code: -1};
    } else {
        result = {error: err.message};
    }
}
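Putting the timeout handling into one function makes the three outcomes explicit: a normal load, a timeout with tags already seen, and a timeout with nothing to show. This is a sketch under assumptions: `scanPage` is a hypothetical name, `tags` is the array filled by the `requestfinished` listener, and `status()` is called as a method as in Puppeteer 1.0 and later.

```javascript
// Navigate to `url` and summarize the outcome. `tags` is filled by a
// 'requestfinished' listener that was attached to `page` beforehand.
async function scanPage(page, url, tags) {
  try {
    const resp = await page.goto(url, { waitUntil: ['load', 'networkidle2'] });
    // goto() can resolve to null, so guard before reading the status.
    return { tags, code: resp ? resp.status() : null };
  } catch (err) {
    if (tags.length > 0) {
      // The page did not finish loading, but our tag fired anyway.
      return { tags, code: -1 };
    }
    // Nothing fired: keep the error message for later investigation.
    return { error: err.message };
  }
}
```

Returning a value from each branch, instead of assigning to a shared `result` variable, keeps the function free of outside state and lets a caller tell the outcomes apart by looking at `code` and `error`.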

If you would like to see the full scanning script, we have posted it as a GitHub Gist here.

Outcomes
The results of the new site scanner have been better than we had expected. Of the 300 outstanding scans, 25 previously missed tags were found immediately. Those were 25 potential customers that we would have missed, and who most likely would not have gotten the best use out of our product. Most scans also succeed in under 2 seconds, compared to the 8 seconds they took with PhantomJS. The downside is that Puppeteer is a lot heavier in terms of CPU and more complicated to set up in the cloud than PhantomJS because of its dependency on Chromium. This was expected, and it is a small price to pay for a feature-complete browser that gives us the most accurate scanning. Since the scanner’s release we have seen a drastic reduction in the number of support emails related to site scanning, and our engineers no longer have to waste time on the tedious task of manually checking and adding missed sites. Let’s hope Puppeteer continues to be maintained well into the future.

Where we go from here
As you can imagine, there is still a lot of room for improvement. For one, we can look into re-using page objects to try to reduce CPU usage and squeeze a few more scans out of the box. Another step is to move the scanner to a serverless model using AWS Lambda or a similar technology. This would allow us to parallelize scans almost without limit and without incurring the cost of leaving hardware idle. Regardless of the direction we go, Puppeteer ended up being a great fit for our needs and very easy to use. Because its APIs are so similar to those of PhantomJS, we could replace our existing site scanning script with this one in a matter of days. On top of that, it gives us the peace of mind that it will continue to work and be supported for years to come.

Quantcast