A Brief History of Quantifying Populations
The Hollerith tabulator was invented in response to the tremendous data processing problem of the 1890 U.S. census. It was a marvel of electrical and mechanical technology, and according to Wikipedia “it spawned the data industry.”
One hundred twenty years later at Quantcast we are solving similar, though much harder data problems. They involve over a billion people and are much more complex than a census: modeling the behavior of the Internet population and comparing it to the behaviors of known buyers of products and services. In contrast to traditional advertising, we don’t group users in broad categories, but rather we make custom advertising decisions for every single Internet user with respect to each of our advertising campaigns. This allows us to be precise and accountable.
The catch is, we need to operate at Internet scale, and the Internet is vast. We trawl through an ocean of data, such as the visit patterns from billions of browsers to hundreds of millions of web destinations. The data arrives to us in chronological order as pages get visited, but we need to sort it in all kinds of ways for different uses. In our main compute cluster we spend about half of our I/O capacity sorting data. Over 90% of the data we sort is from jobs larger than 1 terabyte each. Over 10% is from jobs larger than 100 terabytes each. The largest production sort was over 400 terabytes.
We like to keep capacity ahead of needs, so one day we thought to ourselves, “we ought to sort a petabyte.”
The Scale Challenge
The only public records of sorting at petabyte scale so far came out of Google and Yahoo, because it’s a hard problem. Although sort algorithms such as quicksort were perfected long ago, they assume the data will fit in a computer’s RAM where it can be read and written very quickly. But when data doesn’t fit in one computer’s RAM you need to use external storage, which generally means spinning disks. These are much slower than RAM because you’re not just flipping transistors but sliding disk heads and waiting for disks to spin to the right position. A century after Hollerith, the largest sort operations are still gated by the quality of mechanical devices.
And it’s not just that a petabyte sort takes longer. All this spinning and sliding wears the disk drive mechanism off, which leads to data loss. A desktop computer with a single drive might suffer a failure once in five or ten years. A petabyte sort using thousands of disks will probably lose at least one before it finishes.
You also need to worry about network optimization. It takes hundreds of disk drives to hold a petabyte, more than a single computer can manage. You need a network of computers, and you’ll have to shuffle the data between them carefully to avoid network bottlenecks.
By analogy, consider sorting books in a library. Say they are at first stacked by publication date and you want to order and shelve them by author. If it’s your home library, and you can reach most books by taking a couple steps, you’ll probably do the job in a few hours. If you’re resorting the Library of Congress, you’ll need significant capacity planning, temporary storage, transportation, bookkeeping, quality assurance, an army of workers and managers to coordinate them — in other words, a robust and scalable process. And however careful you are, you’ll still probably damage a few books in the process.
Sorting a petabyte is somewhat like that. You need load balancing, temporary storage, data transport, bookkeeping, quality assurance, quite a bit of hardware and resource management. And you are not allowed to lose even a single bit of data. If you do, you need to restart from the beginning. No wonder there aren’t many reports of petabyte sorts.
Getting It Done
At Quantcast we had to deal with large amounts of data early, before Hadoop started to mature. This pushed us to develop our own processing stack, quite similar to Hadoop at interface level, but focused from the beginning on scalability to petabytes. See our previous articles on the open source distributed file system QFS and our custom MapReduce implementation.
The petabyte sort challenge was larger for us than for Google or Yahoo given that we used significantly less hardware. Also, we do not have large test clusters, so we had to run it on our production cluster, concurrent with production workload, without delaying production data set delivery. Here are the key features that made our result possible.
- The Quantcast File System. Temporary sort spills are kept on QFS rather than on the local file system. This helps two ways. First, it helps balance CPU, disk I/O and network capacity. Writing locally is preferred, but when you run out of space locally you can spread to storage on a remote node seamlessly. Second, it ensures fault-tolerance. A broken disk, a server crash and even silent disk sector corruption get detected by QFS checksums and corrected automatically.
- Quantsort, our custom-built sort architecture. Quantsort is fundamentally better than Hadoop at scheduling as it keeps tasks short while keeping the I/O size large (over 1 MB). In Hadoop MapReduce large sorts require very large tasks that run for hours. When one of these tasks fails towards the end, it needs to be restarted, which can delay the whole job by hours. With Quantsort, tasks take no more than about 10 minutes even for the petabyte sort, so hardware failures don’t cause more than a 10-minute delay even if they happen late in the process.
- Outer-track optimization. The outer edge of a disk moves faster, and therefore reads or writes data faster, than the inner edge. We use a dedicated QFS instance to store all the sort spills, set up on the outer track disk partitions across all disks, which nearly doubles our effective I/O.
Results and Comparisons
We ran a generic MapReduce job that sorted about a trillion records of varying lengths averaging about 1 kilobyte. The data was taken from our production processes, not produced synthetically. We also used production data, not synthetic data. We ran the job on one of our production clusters, concurrently with our normal production workload but at low priority to make sure it would not affect production jobs. No production SLAs were missed due to this experiment.
Our sort finished in 7 hours, 22 minutes, an hour or so slower than Google’s published result and about twice as fast Yahoo’s. See the table for details. And note that we pulled this off with a fraction of their hardware. Yahoo used three times as many drives, Google more than 10 times as many.
|Elapsed Time||6 hours, 2 minutes*||16 hours, 12 minutes||7 hours, 22 minutes|
|Uncompressed Size||1 petabyte||1 petabyte||1 petabyte|
|Sort Spill Compression Factor||1x||1x||4x|
|Setup||Dedicated||Dedicated||Shared with production|
*Google disclosed faster and larger sorts in a later announcement, but the company did not present details on its experimental setup.
The tests differed in some other ways too. Unlike Google and Yahoo we did not follow the Terasort benchmark rules and used compression during the sort, which lowered our network and I/O load. On the other hand, other production jobs were adding their own network and I/O load during the two times we executed this test. We are now using the petabyte sort as the final phase of our QFS release qualification process.
We’ve built a world-class data stack that gives Ferrari performance on a jalopy budget. Although some of the key ingredients of our petabyte sort are hard to replicate with the Hadoop stack, reserving the outer disk tracks for sort spills is a simple change that could be used to improve Hadoop performance on large jobs. We are glad to share this simple but powerful improvement with the Big Data community.
Written by Silvius Rus, Michael Ovsiannikov and Jim Kelly