(Or How to Run Sawzall on 1 Terabyte of Text in 7 Seconds)
By Silvius Rus, on behalf of the Cluster Team at Quantcast
A few weeks ago I was talking to our CTO. I was looking for wisdom on how to convince Java users to start programming in a new language. We had just ported Sawzall, a highly productive mapreduce language from Google, to run on our proprietary Mapreduce cluster based loosely on Hadoop. Even though Sawzall gives a clear productivity boost, old habits take extra convincing to change. The advice I got was “make it run in seconds rather than minutes and everyone will jump at it”.
Being able to run Sawzall in seconds on large data sets can certainly be a game changer. Instead of using it to run batches of jobs, it can be run interactively to “find the needle in the haystack”. It can give engineers, data modelers and business analysts a handle on petabytes of data, quickly. It’s empowering. It makes you feel like you have the data in front of you. Once you have it, it’s hard to remember how you lived without it.
- Sawzall language and runtime (open-sourced by Google). Expressive, productive.
- Proprietary distributed file system. As fast as (the hardware) can be.
- Simplified mapreduce implementation. Unnervingly simple and efficient, unorthodox.
- Tornado web server (open-sourced by Facebook). Scalable, easy to use.
- New HTTP based job management mechanism. Lightweight as it should.
The first thing you need in order to run Sawzall programs is, well, a Sawzall implementation. We picked up Google’s open source implementation. Next you need to get Sawzall to read and write your data. At Quantcast we store data in a proprietary format. We decided to convert it on the fly to protocol buffers, feed it to Sawzall, and then translate it back from Sawzall output to Quantcast’s format.
We got Sawzall to run on our Mapreduce cluster but job latency on medium data sets was high, as the whole system was optimized only for overall cluster throughput. It was taking minutes, not seconds, to run a Mapreduce on 1 TB of text starting from disk. It does not make sense to cache data in RAM because users select this input set from a much larger data repository and we don’t know the selection before the job is launched.
Performance analysis showed the slowness came from starting many tasks at once, from task stragglers caused by disk or network bottlenecks, and from the separation of the map-sort-reduce phases. We could spend months on a thorough solution to these problems but we’re a nimble startup, so we set out to do it in weeks. My bet was that we implement quickly a Sawzall Mapreduce implementation that is extremely fast at the cost of absolute accuracy and repeatability.
First we had to get the user buy-in. My promise was that we could speed it up 10x by dropping the 10% slowest work units. Users believed this would be great for research, prototyping and debugging, all of which take a large part of the development cycle. Essentially, it’s like “grep” or “find” to Unix programmers.
The Main Course
The 1st design decision was to drop stragglers instead of dealing with them. This simplifies design greatly. No worries about the long latency of a distributed file system transient failure, or an overcommitted CPU, a thrashing OS or a nuked network link. Drop, drop, drop.
The 2nd design decision was to drop the sort phase of mapreduce and to run the reducer concurrently with the mapper. We keep a single reducer per Mapreduce job. The reducer keeps all the reduce groups in a RAM hash table. It uses at most 1 GB-ish of RAM, which is usually enough as Sawzall is very good at combining records within the mapper. Mappers are very simple. Each mapper launches a given Sawzall program on a given file and sends Sawzall table output to the reducer.
The 3rd design decision was to rewrite the job management mechanism so it can start many tasks in a second. We built a single master on top of the Tornado single threaded asynchronous HTTP server. The workers themselves, one per core, are babysat locally by daemontools. They open long-lived HTTP connections to the master and wait for work. The master holds all these connections until a job gets posted by a user (over HTTP). Once it gets the job input list, it immediately replies to all the waiting clients. It does so within a second, and uses less than a second of CPU during the execution of a Sawzall query on 1 TB of text.
As with all good recipes, there are secret ingredients. More about them in a future post.
1st week: Code furiously. Learn as I go. Get a Sawzall mapreduce to run on a single machine. Add nice features like optional combiner spills on SIGALRM, or early exit for “find first” jobs. Exhilarating. Adrenaline rush. Feels like success is near.
2nd week: Realize that most engineers don’t really care for results in Sawzall format. Write a binary record converter. Technicalities sneak in. The Quantcast file writer is in Java. Should I throw a JVM in the mix? Bit the bullet and rewrote it in C++, so Sawzall results can be written directly into Quantcast files. This is starting to look good, but now users heard rumors and would like to try it out. Ouch, it needs a job management system. We have Hadoop, but Hadoop seems to prefer minutes to seconds and right now I only care for seconds. What to do, what to do?
3rd week: Write a job management mechanism that holds its own at scale. Tornado saves the day. Impressive. Try out a reasonably large end-to-end test (1 TB uncompressed). Nervousness is warranted. The largest I tried in early tests was 100 GB. Now I’m running on a reasonably large data set and there are other unrelated jobs, some larger than mine, running at the same time, competing for disk, network and CPU as well. Eyes looking over my shoulders, mental drum roll. It’s short and sweet: it simply works and completes in 7 seconds. Prompt to prompt, from clicking “go” to having results on the local machine in a Quantcast format file.
Draw positive expletives from colleagues. Priceless.
The Not So Small Print
All this would not have been possible without the Cluster Team’s earlier experience of getting Sawzall to run over Hadoop. That took longer than three weeks and it’s worth a blog entry by itself.
All this would not have been possible without Quantcast’s amazing file system, which didn’t have to be tweaked in any way for this new development. It just worked well.
All this would not have been possible without Quantcast’s Ops engineers, who produced a sound and scalable deployment mechanism so I could push the system to production the very hour I finished development.
In case the diagram above isn’t clear, let’s spell it out. We built a lightweight, scalable, fast though approximate Sawzall Mapreduce implementation. It ran a simple Sawzall program on 1 TB of text in 7 seconds starting from disk and it doesn’t rely on any prior setup (no Hadoop required).
Do you think you can beat the 7 seconds mark? Bring it on! We’re looking to hire bright cluster engineers to push the state of the art in large scale computing on commodity hardware.