In a previous post, we described Hadoop’s architectural limitations for large jobs of terabytes and beyond. The more tasks a job uses, the less efficient its disk I/O becomes. Fewer, larger tasks keep the I/O efficient but create operational and scheduling problems that get more painful the larger the job gets.

Quantcast developed an improved architecture called Quantsort and has been running it internally for three years. Here’s how it works.

Quantsort Design

Quantsort guarantees a large disk I/O size throughout all map-reduce stages, pushing disk I/O, the now constrained resource, to its limits. The number of map tasks does not directly affect the I/O size at any step of the map-reduce algorithm, so jobs can be broken down into tens or hundreds of thousands of short tasks. When a task fails, it takes seconds or minutes to redo it even for 100 TB sort jobs. Tasks fill up the available task slots nicely, and task preemption is unnecessary. With task duration of three minutes, on a cluster with 12,000 worker slots you get about 4000 free slots per minute that can be given to higher-priority jobs.

Quantcast quantsort data flow chart

Quantsort’s algorithm is altogether different from Hadoop’s. Instead of creating sorted runs early and then shuffling them later, it shuffles map output immediately into about 1024 partition files (a fixed number), in a process we call fanout. Because this number is fixed, independent of the number of map or reduce tasks, it keeps the seeks required in the merge to a manageable number.

Fanout requires many map processes to write to the same file concurrently—often 10,000 or more on our large production jobs. It works thanks to the concurrent append feature of the QFS file system, which uses RAM buffering and multiple physical machines to ensure disk writes are large (1 MB or more) and efficient.

Fanout also leverages QFS’s replication feature. Because a large job will put the entire cluster to work for several hours collecting and sorting data, a disk failure affecting a fanout file is not unlikely and can be particularly expensive. Re-creating the file involves re-running thousands of map tasks. We therefore configure QFS to make a second copy of each fanout file, so the job is more likely to complete.

Separate sorter tasks then split each partition file into a run for each reduce task and sort the runs. The resulting runs, grouped by reduce task and sorted, are also significantly larger than 1 MB and can be read efficiently.

Quantsort Advantage

The following table compares the four main bottlenecks of a theoretical 10 TB uncompressed sort job, using the Hadoop and Quantsort algorithms. It assumes 10,000 tasks in both map and reduce phases and 2500 disk drives in the cluster and excludes the I/O associated with map input and reduce output.

Quantcast quantsort data comparison table

Disk Seeks: Our hypothetical cluster has 2500 drives and can therefore execute 125,000 seeks per second. Quantsort needs 8 seconds; Hadoop needs 800 seconds.

Disk I/O: Our 2500 drives can deliver about 1 Tbps of disk I/O. Quantsort requires 55 TB of disk I/O: 20 TB for fanout (replication 2), 10 TB for sort input, 15 TB for sort output (encoded Reed-Solomon), and 10 TB for the merge (reading directly from QFS). Hadoop requires at least 60 TB of disk I/O: 10 TB for sort spills, 10 TB for merge input, 10 TB for merge output, 10 TB of shuffle input, 10 TB of shuffle output, and 10 TB of reduce-side merge input. When two passes are needed, an extra 20 TB is needed. 55 TB at 1 Tbps disk I/O takes 440 seconds. 60 TB takes 480 seconds and 80 TB takes 640 seconds.

Core switch I/O: Using commodity network hardware, disk I/O can be matched efficiently to core network 1:1 up to hundreds of 12-disk servers, so our hypothetical network can keep up with our 2500 hypothetical disks. Quantsort will load it more heavily than Hadoop. Assuming 1 Tbps core switch bandwidth, the Hadoop sort would fully load the network for 80 seconds, Quantsort for 184 seconds.

Task failure recovery time: This is perhaps the most critical differentiator. The fact that Quantsort failures take three minutes to recover leads to dramatic improvement in practice. It means jobs complete faster, pleasing users. And it saves the organization from preemptive scheduling and the lost capacity it implies, since cluster resources are constantly becoming available to the scheduler to service higher-priority jobs.

Quantsort in Practice

We would share a 20 TB sort benchmark comparing Quantsort to Hadoop on our production cluster, but our impatience got the better of us. One hour into our benchmark, Hadoop was just restarting some failed map tasks, and we ran out of dedicated experiment time. Quantsort had been finished for half an hour.

We can say Quantsort runs well. We’ve had it live for over three years, ever since its inaugural 100 TB sort in 2009. Today it sorts over a petabyte per day, including dozens of jobs of 10 TB or more. Its record on a production job is 971 TB, which took 6.4 hours.

Written by Silvius Rus, Jim Kelly, and Michael Ovsiannikov on behalf of the Cluster R&D Team at Quantcast.