Data Processing

We continuously copy log data from our distributed data centers to a central one and process it every night in parallel on a cluster of machines. Unlike traditional media measurement, no sampling or extrapolation is involved; our core traffic statistics are based on an exact, complete count of events received.

Filtration

We strive for maximum transparency in our reporting, so wherever practical we classify data rather than filtering it. Our processing accepts any event with enough information for us to make sense of it: at minimum, the URL of the page bearing the tag and the ID of the publisher who placed it. We do filter data that isn’t minimally intelligible, as well as traffic on Quantified sites originating from Quantcast itself.
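The acceptance rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production code: the event is modeled as a plain dictionary, and the field names `url`, `publisher_id`, and `from_quantcast` are hypothetical.

```python
from urllib.parse import urlparse


def is_minimally_intelligible(event):
    """True if the event has a parseable page URL and a publisher ID.

    The keys "url" and "publisher_id" are hypothetical field names.
    """
    url = event.get("url")
    publisher_id = event.get("publisher_id")
    if not url or not publisher_id:
        return False
    parsed = urlparse(url)
    # Require both a scheme and a host so that free text is rejected.
    return bool(parsed.scheme and parsed.netloc)


def accept(event):
    """Keep intelligible events unless they are flagged as Quantcast's own traffic.

    "from_quantcast" is an assumed, precomputed boolean flag.
    """
    return is_minimally_intelligible(event) and not event.get("from_quantcast", False)
```

Everything that passes `accept` would then be classified rather than dropped, which keeps the filter deliberately narrow.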

We classify and report separate totals for traffic likely due to non-human activity, using a combination of the IAB’s bots and spiders list and our own proprietary techniques. Note that the nature of what generated a given web page request cannot be known for certain, so our estimates of such traffic are only approximate. We neither filter nor report separately on prefetch activity, nor on internal traffic, that is, visits made by employees of a Quantified publisher to their own site.

How We Define Properties

Although every event we receive identifies the URL of the web page or other media bearing our tag, generally traffic must be aggregated into larger entities to be useful. Our aggregation model is flexible and evolving but includes the following rollups:

  • Domains and subdomains. These are the most straightforward aggregations, coming directly from the URL of the page bearing our tag. We attribute traffic according to the domain hierarchy. A visitor to a.mysite.com and b.mysite.com will also count as visiting mysite.com, but only as one visitor.
  • Quantified networks. Many publishers own or distribute content across multiple sites. We report on their aggregate traffic in a single network profile that shows the combined, unduplicated audience of all media tagged with their p-code.
  • Media labels. Many publishers classify traffic, either for internal analysis or to channelize inventory for sale, along dimensions other than domains and subdomains. For example, a newspaper grouping its print inventory by travel, sports, and business channels might want to report online audience along the same dimensions. Publishers can do so by setting a media label within their tag indicating to what channel or channels traffic should be credited. We report on each channel’s audience just as if it were a web site of its own.
  • Sites vs. distributors. Virtually all online publishers have content distributed through sites they don’t own. Ad networks, publishers of embeddable widgets, Facebook apps, or videos are obvious examples, but even the content of lone blogs is often scanned, cached and redistributed through search engines, for example via the “cached” links in Google’s search results. Quantified network profiles include the whole audience consuming a publisher’s content, whether they view it on the publisher’s own sites or elsewhere. But within a network we report totals for direct and distributed traffic separately.
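To make the domain and subdomain rollup concrete, here is a minimal Python sketch of unduplicated visitor counting along the domain hierarchy. It is an illustration only: the function names and the `(visitor_id, url)` event shape are hypothetical, and treating the last two labels as the registrable domain is a simplifying assumption that does not hold for suffixes such as co.uk.

```python
from collections import defaultdict
from urllib.parse import urlparse


def ancestor_domains(host, min_labels=2):
    """Return the host and each parent domain, down to min_labels labels.

    Assumes the registrable domain is the last two labels (a simplification).
    """
    labels = host.lower().split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - min_labels + 1)]


def unique_visitors_by_domain(events):
    """Count distinct visitors per domain, crediting every ancestor domain.

    `events` is an iterable of (visitor_id, url) pairs, a hypothetical shape.
    """
    visitors = defaultdict(set)
    for visitor_id, url in events:
        host = urlparse(url).hostname
        if host is None:
            continue  # skip URLs with no host component
        for domain in ancestor_domains(host):
            # A set deduplicates: one visitor counts once per domain.
            visitors[domain].add(visitor_id)
    return {domain: len(ids) for domain, ids in visitors.items()}
```

With this sketch, one visitor who views pages on both a.mysite.com and b.mysite.com is counted once for each subdomain and once, not twice, for mysite.com. The same set-based approach would extend to network and media-label rollups by keying on a p-code or label instead of a domain.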