A Brief History of Quantifying Populations
The Hollerith tabulator was invented in response to the tremendous data processing problem of the 1890 U.S. census. It was a marvel of electrical and mechanical technology, and according to Wikipedia “it spawned the data industry.”
One hundred twenty years later at Quantcast we are solving similar, though much harder, data problems. They involve over a billion people and are far more complex than a census: modeling the behavior of the Internet population and comparing it to the behaviors of known buyers of products and services. In contrast to traditional advertising, we don’t group users into broad categories; rather, we make custom advertising decisions for every single Internet user with respect to each of our advertising campaigns. This allows us to be precise and accountable.
The catch is, we need to operate at Internet scale, and the Internet is vast. We trawl through an ocean of data, such as the visit patterns from billions of browsers to hundreds of millions of web destinations. The data arrives in chronological order as pages are visited, but we need to sort it in all kinds of ways for different uses. In our main compute cluster we spend about half of our I/O capacity sorting data. Over 90% of the data we sort comes from jobs larger than 1 terabyte each, and over 10% from jobs larger than 100 terabytes each. The largest production sort to date was over 400 terabytes.
We like to keep capacity ahead of needs, so one day we thought to ourselves, “we ought to sort a petabyte.”
The Scale Challenge
The only public records of sorting at petabyte scale so far came out of Google and Yahoo, because it’s a hard problem. Although sort algorithms such as quicksort were perfected long ago, they assume the data will fit in a computer’s RAM where it can be read and written very quickly. But when data doesn’t fit in one computer’s RAM you need to use external storage, which generally means spinning disks. These are much slower than RAM because you’re not just flipping transistors but sliding disk heads and waiting for disks to spin to the right position. A century after Hollerith, the largest sort operations are still gated by the quality of mechanical devices.
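When the data does not fit in RAM, the classic answer is an external merge sort: sort chunks that do fit in memory, spill each sorted chunk to disk, then merge the spills in a single streaming pass. Here is a minimal single-machine sketch of the idea (not Quantcast's implementation, which distributes this across thousands of disks and many machines):

```python
import heapq
import os
import tempfile

def external_sort(records, chunk_size):
    """Sort an iterable of string records too large for RAM by
    spilling sorted chunks to disk, then k-way merging the spills."""
    spill_files = []
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) >= chunk_size:
            spill_files.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        spill_files.append(_spill(sorted(chunk)))
    # Merge the sorted spill files; heapq.merge streams lazily, so
    # memory use stays proportional to the number of spill files,
    # not the size of the data.
    streams = [open(path) for path in spill_files]
    try:
        for line in heapq.merge(*streams):
            yield line.rstrip("\n")
    finally:
        for f in streams:
            f.close()
        for path in spill_files:
            os.remove(path)

def _spill(sorted_chunk):
    """Write one sorted chunk to a temporary file, one record per line."""
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        for record in sorted_chunk:
            f.write(record + "\n")
    return path
```

The merge pass is where the disk heads earn their keep: each spill file is read sequentially, but with thousands of spills on shared disks the seeks between streams are what gate throughput.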
And it’s not just that a petabyte sort takes longer. All this spinning and sliding wears out the disk drive mechanism, which leads to data loss. A desktop computer with a single drive might suffer a failure once in five or ten years. A petabyte sort using thousands of disks will probably lose at least one before it finishes.
You also need to worry about network optimization. It takes hundreds of disk drives to hold a petabyte, more than a single computer can manage. You need a network of computers, and you’ll have to shuffle the data between them carefully to avoid network bottlenecks.
By analogy, consider sorting books in a library. Say they are at first stacked by publication date and you want to order and shelve them by author. If it’s your home library, and you can reach most books by taking a couple steps, you’ll probably do the job in a few hours. If you’re resorting the Library of Congress, you’ll need significant capacity planning, temporary storage, transportation, bookkeeping, quality assurance, an army of workers and managers to coordinate them — in other words, a robust and scalable process. And however careful you are, you’ll still probably damage a few books in the process.
Sorting a petabyte is somewhat like that. You need load balancing, temporary storage, data transport, bookkeeping, quality assurance, quite a bit of hardware and resource management. And you are not allowed to lose even a single bit of data. If you do, you need to restart from the beginning. No wonder there aren’t many reports of petabyte sorts.
Getting It Done
At Quantcast we had to deal with large amounts of data early on, before Hadoop started to mature. This pushed us to develop our own processing stack, quite similar to Hadoop at the interface level, but focused from the beginning on scalability to petabytes. See our previous articles on the open source distributed file system QFS and our custom MapReduce implementation.
The petabyte sort challenge was larger for us than for Google or Yahoo given that we used significantly less hardware. Also, we do not have large test clusters, so we had to run it on our production cluster, concurrent with production workload, without delaying production data set delivery. Here are the key features that made our result possible.
- The Quantcast File System. Temporary sort spills are kept on QFS rather than on the local file system. This helps two ways. First, it helps balance CPU, disk I/O and network capacity. Writing locally is preferred, but when you run out of space locally you can spread to storage on a remote node seamlessly. Second, it ensures fault-tolerance. A broken disk, a server crash and even silent disk sector corruption get detected by QFS checksums and corrected automatically.
- Quantsort, our custom-built sort architecture. Quantsort is fundamentally better than Hadoop at scheduling as it keeps tasks short while keeping the I/O size large (over 1 MB). In Hadoop MapReduce large sorts require very large tasks that run for hours. When one of these tasks fails towards the end, it needs to be restarted, which can delay the whole job by hours. With Quantsort, tasks take no more than about 10 minutes even for the petabyte sort, so hardware failures don’t cause more than a 10-minute delay even if they happen late in the process.
- Outer-track optimization. The outer edge of a disk moves faster, and therefore reads or writes data faster, than the inner edge. We use a dedicated QFS instance to store all the sort spills, set up on the outer track disk partitions across all disks, which nearly doubles our effective I/O.
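The outer-track effect is easy to estimate: a platter spins at constant angular velocity, so the linear speed under the head, and with it sequential throughput, grows with track radius. A back-of-the-envelope calculation with illustrative platter dimensions (assumed figures for a typical 3.5" drive, not measurements from our hardware):

```python
import math

RPM = 7200               # assumed spindle speed
INNER_RADIUS_MM = 22.0   # assumed innermost usable track
OUTER_RADIUS_MM = 46.0   # assumed outermost usable track

def track_speed_mm_per_s(radius_mm, rpm=RPM):
    """Linear speed of the track passing under the head."""
    return 2 * math.pi * radius_mm * rpm / 60.0

# With zone bit recording the linear bit density is roughly constant,
# so sequential throughput scales with the radius ratio.
ratio = track_speed_mm_per_s(OUTER_RADIUS_MM) / track_speed_mm_per_s(INNER_RADIUS_MM)
print(f"outer/inner throughput ratio: {ratio:.2f}x")
```

With an outer/inner radius ratio around two, reserving outer-track partitions for sort spills roughly doubles effective spill bandwidth, consistent with the near-doubling we observed.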
Results and Comparisons
We ran a generic MapReduce job that sorted about a trillion records of varying lengths, averaging about 1 kilobyte each. The data was taken from our production processes, not produced synthetically. We ran the job on one of our production clusters, concurrently with our normal production workload but at low priority to make sure it would not affect production jobs. No production SLAs were missed due to this experiment.
Our sort finished in 7 hours, 22 minutes, an hour or so slower than Google’s published result and about twice as fast as Yahoo’s. See the table for details. And note that we pulled this off with a fraction of their hardware: Yahoo used three times as many drives, and Google more than 10 times as many.
| | Google | Yahoo | Quantcast |
| --- | --- | --- | --- |
| Elapsed Time | 6 hours, 2 minutes* | 16 hours, 12 minutes | 7 hours, 22 minutes |
| Uncompressed Size | 1 petabyte | 1 petabyte | 1 petabyte |
| Sort Spill Compression Factor | 1x | 1x | 4x |
| Setup | Dedicated | Dedicated | Shared with production |
*Google disclosed faster and larger sorts in a later announcement, but the company did not present details on its experimental setup.
The tests differed in some other ways too. Unlike Google and Yahoo we did not follow the Terasort benchmark rules and used compression during the sort, which lowered our network and I/O load. On the other hand, other production jobs were adding their own network and I/O load during the two times we executed this test. We are now using the petabyte sort as the final phase of our QFS release qualification process.
We’ve built a world-class data stack that gives Ferrari performance on a jalopy budget. Although some of the key ingredients of our petabyte sort are hard to replicate with the Hadoop stack, reserving the outer disk tracks for sort spills is a simple change that could be used to improve Hadoop performance on large jobs. We are glad to share this simple but powerful improvement with the Big Data community.
Written by Silvius Rus, Michael Ovsiannikov and Jim Kelly
It’s every advertiser’s goal to capture their customers’ attention at the right time and place, but with so much content out there across so many devices, that window of opportunity becomes more and more important. As an industry, we need to find relevance or risk getting lost in the noise.
In a webinar with the American Marketing Association (AMA) and Critical Mass this week, we explained how machine learning helps drive better advertising decisions and identified the key ingredients to successfully leverage real-time display buying. Listen to our VP of Performance Engineering, Michael Recce, draw on his background in neuroscience and artificial intelligence to simplify the science behind predictive modeling in the full recording here!
And if you have any questions about getting started with programmatic buying, please shoot a note to firstname.lastname@example.org.
Posted by Stephanie Park, Demand Generation Manager, Performance
On a Friday afternoon in April 2012, I was watching terabytes go by sitting at my desk across from Yerba Buena Gardens, when I saw Jim Kelly, our VP of R&D, walking towards me carrying two beers. We had had a number of conversations about large-scale analytics over the previous weeks. I could tell there was more than beer coming my way.
Big Data Equals Big Value – If You Can Make Sense of It
Quantcast has no shortage of data. Every day we ingest about 50 terabytes of new semi-structured data. More data is good. It makes our predictive models better. It gives us confidence in our research, development and operations, both on the engineering and business side. However, before we get any value out of it, 50 terabytes simply means more bits to take care of, index, process and organize. 50 terabytes a day means 4.5 petabytes a quarter, a scale at which making sense of the data becomes a big problem. Big data is certain to increase process complexity through its sheer scale, but it is not certain to be valuable unless you can make sense of it in a timely manner and at gigantic scale, either through machine learning or as a human. Jim was coming to talk about the “making sense to a human” part.
SQL Helps Change Big Data into Big Value
At Quantcast we all want to make sense of big data, but in very different ways. Sales wants to know our reach in a geographic location in Brazil. Client Services wants to double-check that campaign delivery matches the configuration. Business Intelligence and Product want to find the next big business opportunity. Modeling Scientists want to validate model quality. Engineers want to research and validate assumptions on value distribution and invariants. One way to do all this is to write Java or C++ programs and run them over MapReduce. That will work technically for engineers, but it is out of reach to users outside Engineering. Most of these users came from other companies, where they had used SQL to mine data, and that’s what they’re used to. For non-engineers, SQL is great. It gives them direct access to data. Without SQL they need to work with an engineer. A task that can be done with SQL in minutes may take a week of back and forth with an engineer. SQL is good for engineers as well. While I can write a MapReduce job in minutes, it’s tedious to write and deploy, and it’s easy to make mistakes. SQL makes it much harder to make mistakes and is quicker to write, so it helps even engineers become more productive. These are all good things – except we did not have a way to run SQL over our petabytes of data.
SQL over MapReduce?
I was already an SQL convert, and so was Jim. We were not about to discuss whether it’s good to get SQL to run over our data, but how. “It would take several quarters to get SQL to run over our data,” I said. Jim’s face turned to a grimace. “Don’t say quarters. Is there any way you can get us access to data through SQL quickly?” I thought about it over several sips of an IPA. I like challenges but, seriously, SQL on petabytes, and quickly? Something had to give. I started to think critically about the shortest path to a somewhat usable solution. We had already been running MapReduce at petabyte scale, though we wrote Java code for every job. Could we write a generic Java MapReduce job that could execute any SQL query? Grouping and ordering could be layered over the MapReduce sorter mechanism. Joins against dimension tables could be implemented by broadcasting them to all the MapReduce workers. Joining fact tables could be layered on the join mechanism we had previously developed atop MapReduce. There was still the matter of SQL syntax, which is quite rich and hard to replicate in Java. And, to complicate matters, we also wanted to make it possible to run our Java code as User Defined Functions (UDFs), because our data is semi-structured, and thus needs to be parsed by our Java libraries. And there were some other issues that I decided we should just punt on. None of these issues seemed like showstoppers, so I said, “okay.”
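The dimension-table case is the classic broadcast hash join: ship the small table to every worker, index it in memory once, and probe the index for each fact row streaming by. A minimal single-process sketch (the field names are illustrative, and the real version ran inside our Java MapReduce workers):

```python
def broadcast_hash_join(fact_rows, dim_rows, fact_key, dim_key):
    """Join a large stream of fact rows against a small dimension
    table that has been broadcast to the worker: build a hash index
    over the dimension table once, then probe it per fact row."""
    index = {}
    for row in dim_rows:
        index.setdefault(row[dim_key], []).append(row)
    for fact in fact_rows:
        # Inner join: fact rows with no dimension match are dropped.
        for dim in index.get(fact[fact_key], []):
            merged = dict(dim)
            merged.update(fact)
            yield merged
```

Because the dimension table fits in each worker's RAM, no shuffle of the fact table is needed; large fact-to-fact joins, by contrast, need the sort-based merge join mentioned above.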
Old Horse, New Tricks
I came up with the fastest hack that allowed us to run SQL at scale on our data. It took five weeks of coding by myself. It supported almost arbitrary SQL expression syntax. It supported grouping, ordering and hash joins. (Later we added merge joins for fact tables.) Here is what came out.
It was essentially a generic MapReduce job that streamed input data to a PostgreSQL server running on the same machine. Each map/reduce process streamed its input into PostgreSQL and then executed a lightly modified version of the query, so that the execution of the complex SQL syntax was delegated entirely to PostgreSQL. From a development perspective I wrote six modules:
- Query parser. A quick and dirty regular expression parser that only recognized anchors such as SELECT, FROM, GROUP BY and a few more. It was somewhat fragile, but it worked fine with care.
- Grouping and sorting translation to MapReduce grouping. This simply required translating SQL grouping and sorting expressions to our MapReduce sort key formulas.
- Schema translation. Our data is stored in a proprietary binary or text format described by Java classes and custom parsers. I used Java class reflection to infer the field names and types and bind them to the query. To infer the types and names of the SQL query output we execute a stripped-down version of the query on an empty table during the preflight routine on a local PostgreSQL server.
- Split and rewrite the query for the map/combine and reduce phases. Some parts of the query cannot be run in both phases. For instance, the map phase should run the GROUP BY part of the query in order to benefit from a combining effect, but it must not run the HAVING clause or it would produce incorrect results.
- Generic mapper and generic reducer. Every mapper inserts up to 4 MB of data into table T1 in the PostgreSQL instance on the same hardware host. It then continues to insert values into a different table T2 while it starts executing the query on table T1, emitting the query result rows (possibly grouped and aggregated) to the distributed MapReduce sorter. This pipelining of insertion and execution continues until the end of input. The reducer works similarly. Additionally, it runs the combiner query repeatedly until all data for a reduce group fits in 4 MB. (If this is not possible the query gets aborted; this has not happened in practice.) Once data has been combined, the final query is executed, including the HAVING clause. For most queries the input to the reducer already fits in 4 MB per group, so there is only one stage in the reducer.
- User defined functions. I used Java reflection to recognize and invoke arbitrary static functions on fields. The Java calls are made between the time when rows are read from a file to the point they get sent to PostgreSQL. This has some limitations but it worked in many cases. One example is extracting a particular value from a serialized protocol buffer passed as a string field, encoded base64.
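To make the phase split concrete, here is a minimal sketch of the map/combine versus reduce query rewrite, with Python's built-in sqlite3 standing in for the per-host PostgreSQL instance (the queries, table names and batching are illustrative, not our production code):

```python
import sqlite3

# Map/combine phase: partial aggregation per key. The HAVING clause
# is deliberately omitted here -- it may only run after all partial
# aggregates for a key have been combined, or results would be wrong.
MAP_QUERY = "SELECT url, COUNT(*) AS hits FROM t1 GROUP BY url"
# Reduce phase: combine partial counts, then apply HAVING.
REDUCE_QUERY = ("SELECT url, SUM(hits) AS hits FROM t1 "
                "GROUP BY url HAVING SUM(hits) >= 2")

def run_phase(rows, columns, query):
    """Load one batch of rows into a local database table and run the
    phase-specific query (sqlite3 stands in for PostgreSQL here)."""
    db = sqlite3.connect(":memory:")
    db.execute(f"CREATE TABLE t1 ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    db.executemany(f"INSERT INTO t1 VALUES ({placeholders})", rows)
    return db.execute(query).fetchall()

# Two mappers each see part of the input...
part1 = run_phase([("/a",), ("/a",), ("/b",)], ["url"], MAP_QUERY)
part2 = run_phase([("/b",), ("/c",)], ["url"], MAP_QUERY)
# ...and the reducer combines their partial aggregates and
# applies HAVING only on the fully combined counts.
final = run_phase(part1 + part2, ["url", "hits"], REDUCE_QUERY)
print(final)
```

Note that "/b" appears once in each mapper's output; running HAVING in the map phase would have incorrectly discarded it.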
At the end of the five weeks the code was functional at scale. We did not have a PostgreSQL specialist on staff so we treated it as a black box. As we use it only as a block streaming processor, we set it up with a limited RAM budget and made sure it never had to touch a spinning disk. We did not have to change our MapReduce implementation at all. Our cluster SQL was implemented as yet another MapReduce pipeline.
We tested it all on a job that ran over hundreds of terabytes on a 1,000-node cluster with only twenty terabytes of RAM. This ensured that we pushed data through all the pipes, and it made it through correctly and consistently. All data was read from the distributed file system and written back to it. It ran on our unmodified data and it produced data in our common format, ready for use by all our other tools. Jim was happy, and the IPA was well spent.
We received great value from the five-week prototype. While there were certainly unpolished spots, such as syntax limitations and lack of custom aggregators, it bought us time, helped determine the business value by sifting through scores of petabytes, and helped us learn what our users wanted.
This initial prototype puts us well on our way to producing a full-fledged SQL-based analytics solution. There will be more about that soon, so stay tuned!
By Silvius Rus, Director of Big Data Platforms, and Jim Kelly, Vice President of R&D
Quantcast recently contributed a case study on our use of the Ganglia Monitoring System for an O’Reilly book. We think our case study provides an interesting look at our petabyte-scale infrastructure, so we’re publishing the case study as a white paper.
Here’s an excerpt to whet your appetite:
Quantcast offers free direct audience measurement for hundreds of millions of web destinations and powers the ability to target any online audience across tens of billions of events per day. Operating at this scale requires Quantcast to have an expansive and reliable monitoring platform. In order to stay on top of its operations, Quantcast collects billions of monitoring metrics per day. The Ganglia infrastructure we’ve developed lets us collect these metrics from a wide variety of sources and deliver them to several different kinds of consumers; analyze, report and alert on them in real time; and provide our product teams with a platform for performing their own analysis and reporting.
The monitoring infrastructure at Quantcast collects and stores about 2 million unique metrics from a number of data centers around the world, for a total of almost 12 billion metrics per day, all of which are made available to our monitoring and visualization tools within seconds of their collection. This infrastructure rests on Ganglia’s Atlas-like shoulders.
If you’d like to learn more about how Quantcast uses the Ganglia Monitoring System, click here to download the white paper!
- Posted by Adam Compton, Platform Operations Engineer
The Application Developers Alliance is one of the leading associations supporting app developers as creators, innovators and entrepreneurs. Quantcast is proud to be working with the Alliance to provide support and resources for its 25,000 members.
This week, the Alliance featured Quantcast developer Kevin Smith as its “Developer Spotlight” of the week. In the feature, Kevin describes his inevitable path to becoming a developer. Like many other avid gamers and developers, Kevin started coding around age 11 or 12 with QBasic. “I made everything from small text-based games to programs to help with homework,” says Kevin. The curiosity to program, and the opportunity to explore endless possibilities, is what Kevin says most developers have in common.
Kevin joined Quantcast to lead mobile measurement development. At Quantcast, Kevin works on making the Measure for Apps SDK more accessible to all kinds of app developers. Read the rest of his profile and his predictions for the next 10 years on the Alliance site here.
Posted by Maryam Motamedi, Product Marketing Manager, Quantcast Measure
Since 2006, over 100 million destinations have used Quantcast for their audience and traffic measurement. Many publishers log in to the Measure dashboard multiple times a day and have asked us for a programmatic way to access their data. That’s why we’re introducing the Quantcast Measure API (Application Programming Interface), now in Beta. The Measure API provides publishers with a flexible tool to retrieve and analyze their Quantcast data, allowing them to:
- Pull their audience and traffic data into an internal tool or dashboard
- Combine Quantcast data with other data sources for new insights
- Save audience data to track historical trends
More Customization and Insight
The Quantcast Measure dashboard provides a wealth of information, yet some publishers prefer to only see their daily people number, or already use a customized dashboard to monitor their site. With the Quantcast Measure API, publishers can choose exactly what Measure data they want to see, whether at the network, site, or subdomain level, and easily import it into their custom reports and dashboards.
The API also provides publishers new ways to analyze their Quantcast data – such as combining it with data from other sources. For instance, a travel site might analyze how the weather impacts their regional site traffic by combining Quantcast data with data from a weather provider. We’re looking forward to learning about all the creative ways publishers will use their Quantcast Measure data to generate enhanced insights.
Applying for Beta Access
Publishers with a Quantcast account can learn more about the API and apply to participate in the Beta program at our developer site here. Note that this API only provides publishers access to data for their owned properties.
Please contact us with any questions or feedback. We value your input and would love to hear from you as we expand the capabilities of our API.
Posted by Jon Katz, Product Manager, Quantcast Measure
It’s April 2011. Quantcast is running one of the largest MapReduce clusters out there. Engineers write Java code that gets executed across the whole cluster efficiently on terabytes of data. Wonderful!
Unfortunately, not everyone who needs access to data is an engineer. Also, even engineers don’t feel like writing a new MapReduce job every time they want to take a different look at a data set.
We realized we needed a more productive data analysis tool. The first thought was to get SQL to run on our petabytes of data, but that seemed like a large undertaking, as SQL data access semantics are fairly sophisticated, which implies a fairly large friction area with Quantcast’s MapReduce implementation. Another, simpler solution was to get Google’s recently open-sourced Sawzall to run on Quantcast’s MapReduce cluster. Although as a language Sawzall is not as easy to use as SQL, especially for non-engineers, it’s still much simpler than Java. And it seemed reasonably easy to integrate Sawzall with Quantcast’s MapReduce implementation, as its interface to MapReduce is much narrower and better defined than SQL’s.
This theoretical “ease of integration with MapReduce” turned out (surprise!) to be harder than expected in practice. First, Sawzall was not open-sourced with a MapReduce harness, but only as a compilation/execution engine plus a command line tool with little practical utility. Second, Sawzall runs best on protocol buffers, but Quantcast stores its data in different binary and text formats. Third, Quantcast’s MapReduce implementation, although based on Hadoop, lacked streaming capability because it was branched off an old Hadoop version that predates streaming.
We integrated Sawzall execution with our cluster software by writing a generic MapReduce program that executes any given Sawzall program. The mappers and reducers stream data to and from Sawzall executors on each cluster worker machine. The data translation mechanisms (from Quantcast’s format to/from Google’s protocol buffers and the Sawzall binary output format) are pre-compiled before the job starts.
Dynamic Schemas + Multiple Type Systems = Need Dynamic Code Generation
The main difficulty came from the fact that the schema of the Sawzall output depends heavily on the content of the program, and can contain nesting and arrays. A “top(10)[country] of frequency” table will produce tuples with two fields, where the second field is an array of 10 elements. This entirely dynamic schema behavior did not sit well with our MapReduce framework, where schemas were determined and fixed once the MapReduce job was written. Moreover, we strongly preferred to flatten the output so that it could be exported to PostgreSQL. The solution was to deterministically flatten Sawzall output schemas. We implemented this schema flattener and educated users accordingly.
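A deterministic flattener of this kind can be sketched in a few lines. The example below uses Python dicts and lists as stand-ins for Sawzall's tuple and array types (field names are illustrative); the key property is that the same nested schema always maps to the same flat column names:

```python
def flatten_row(row, prefix=""):
    """Deterministically flatten a nested output row (dicts for
    tuples, lists for fixed-length arrays) into flat
    column-name -> value pairs, suitable for a relational table."""
    flat = {}
    if isinstance(row, dict):
        for name, value in row.items():
            flat.update(flatten_row(value, f"{prefix}{name}_"))
    elif isinstance(row, list):
        # Array elements become numbered columns: top_0, top_1, ...
        for i, value in enumerate(row):
            flat.update(flatten_row(value, f"{prefix}{i}_"))
    else:
        flat[prefix.rstrip("_")] = row
    return flat
```

For example, a row like `{"country": "US", "top": ["/a", "/b"]}` flattens to the columns `country`, `top_0` and `top_1`, which map directly onto a fixed PostgreSQL table schema.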
Also, when we started to draw boxes and connect them with lines we realized we had to exchange data over several type systems. Quantcast data is generally represented in the Java type system. Sawzall expects data in the Google Protocol Buffer type system, but it produces results in its own type system and encoding. We wrote code to handle all these type translations. For efficiency, rather than writing generic translation libraries, we built a code generator that created custom translation code for every job based on the Sawzall table semantics. The custom code was compiled with javac (for the Java components) and with gcc (for the C++ components) and shipped to mappers/reducers as byte code or object code, where it was loaded by each mapper/reducer at their start time.
Cool Language Wrangling, But How Well Did It Actually Work?
We let the data speak: numbers collected over two years of operation showed that the Sawzall integration increased the velocity of our engineering and product teams, and helped the client services team.
Overall the Sawzall integration was a success. However, there were two main issues that over time have grown into larger problems. First, most of our code base is written in Java, while the Sawzall engine is in C++. This limits the level of interoperability and leads to code duplication and skew, which in turn leads to conflicting analytics results. Second, while the Sawzall language provided a productivity boost for scripts between 2 and 20 lines, it became hard to maintain larger scripts, especially for users outside engineering. Our current analytics direction has shifted away from Sawzall, and is based primarily on SQL + Java user defined functions and aggregators. More about SQL in a later post.
Written by Silvius Rus, Director, Big Data Platforms
With all of the excitement around the iOS 7 release, we took a look to see how people are adopting and upgrading to the new software and we are seeing some interesting trends.
Trends Among Devices
Our data shows a significant difference between the iPhone and iPad iOS 7 upgrade rates, with iPhones generally upgraded faster than iPads. Presumably, this is due to the greater utility and ubiquity of smartphones versus tablets as the primary device for the user.
When analyzing adoption patterns between newer and older iPads, we see little difference in adoption patterns. This seems to support the idea that the slower upgrade rate on iPad devices is due to their role as less portable devices and less related to the freshness of technology.
The chart above demonstrates that when looking at data three days after the iOS 7 release, around 47% of iPads had been upgraded to the new software. By contrast, iPhones were generally upgraded at a faster rate, with the newer phone users exhibiting more early-adopter behavior. Roughly 53% of iPhone 4 and 4S devices and 66% of iPhone 5 devices had been upgraded to the latest OS version.
As a whole, males showed a slight positive bias in upgrading to the new operating system. Within the early-adopter group of iPhone 5 users, males had upgraded at about a 5% higher rate than females. iPad users showed a male bias as well, with males indicating 6% faster adoption.
Those who were already using newer OS versions upgraded at a faster rate as well. About 42% of those using a version of 6.1.x upgraded to 7.0 within two days, in contrast to 11% of the group on a version of 6.0.x.
Leveraging Quantcast Measure for Apps to See How Your Audience is Connecting
Insights like these can help you optimize your app for your audience so you can reach the sweet spot in terms of adoption. Using Quantcast Measure for Apps, the Top OS Versions feature will show you which operating systems your audience is connecting on.
Quantified publisher Topix is a great example. Their rate of adoption between day two of the release and day four shows significant increase in iOS 7 usage between the two samples. (Please note: The percentages add up to well over 100% in the second sample, due to multi-OS attribution over the past 30 days.)
We hope you find these insights interesting. To learn more about Quantcast Measure for Apps, visit here.
Posted by Maryam Motamedi, Product Marketing Manager, Wayne Yang, Engineering Manager, & An Jiang, Software Engineer, Quantcast Measure
This week in Italy, the Very Large Data Bases (VLDB) Conference is happening, and our very own Silvius Rus is presenting on the Quantcast File System (QFS). QFS was released to open source in September 2012, and it is the industry’s first high-performance alternative to the Hadoop Distributed File System (HDFS).
Evolved from intensive research and development around the Kosmos Distributed File System (KFS, also known as CloudStore), QFS delivers a higher-capability alternative to HDFS for batch data processing, significantly improving data I/O speeds and reducing the disk space required to reliably store massive data sets by 50 percent.
The VLDB conference is an international forum for data management and database research and is considered one of the top international scientific conferences in Big Data processing. In fact, Microsoft Academic Search ranks VLDB third among more than 3,500 Computer Science venues.
We’ll share more insight on the presentation and paper on QFS on this blog shortly. In the meantime check out VLDB for more information and insight.
Posted by Patrick Hornung, Sr. Manager, Communications
We’ve upgraded our measurement methodology to use both first-party and third-party cookie measurement, providing a more comprehensive view of web and mobile web traffic.
Quantcast Measure launched in 2006 using third-party cookies to aggregate traffic from multiple sites into single audience reports, and calculate audience demographics based on cross-site visitation patterns.
With the rise of smartphones and tablets there has been rapid growth in mobile web browsing. Today media consumption is truly cross-platform, and we want to ensure that Quantcast Measure maintains the most comprehensive and accurate traffic and audience measurement possible. So, throughout 2013 we’ve been working hard on a major measurement update to deliver traffic reporting that incorporates both first- and third-party cookies for even greater measurement accuracy regardless of the device used for media consumption.
With this change, your Quantcast Measure numbers will align more closely with analytics packages that use first-party cookies, such as Google Analytics and Omniture, while maintaining the flexibility to measure multi-site audiences. Your website may see an increase in measured traffic, particularly for mobile web audiences. The measurement update is applied retroactively to your historical data to ensure consistent reporting over time.
We’re committed to delivering the most accurate and robust cross-platform audience measurement. We think you’ll like this change and please let us know if you have any questions, comments or suggestions.
Posted by Jon Katz, Product Manager, and Sean McCormick, Director of Engineering, Measurement & Insights