It’s April 2011. Quantcast is running one of the largest MapReduce clusters out there. Engineers write Java code that gets executed across the whole cluster efficiently on terabytes of data. Wonderful!
Unfortunately, not everyone who needs access to data is an engineer. Also, even engineers don’t feel like writing a new MapReduce job every time they want to take a different look at a data set.
We realized we needed a more productive data analysis tool. Our first thought was to get SQL to run on our petabytes of data, but that seemed like a large undertaking: SQL’s data access semantics are fairly sophisticated, which implies a fairly large friction area with Quantcast’s MapReduce implementation. A simpler alternative was to get Google’s recently open-sourced Sawzall to run on Quantcast’s MapReduce cluster. Although as a language Sawzall is not as easy to use as SQL, especially for non-engineers, it is still much simpler than Java. And it seemed reasonably easy to integrate Sawzall with Quantcast’s MapReduce implementation, as its interface to MapReduce is much narrower and better defined than SQL’s.
This theoretical “ease of integration with MapReduce” turned out (surprise!) to be harder than expected in practice. First, Sawzall was not open-sourced with a MapReduce harness, only as a compilation/execution engine plus a command-line tool with little practical utility. Second, Sawzall runs best on protocol buffers, but Quantcast stores its data in different binary and text formats. Third, Quantcast’s MapReduce implementation, although based on Hadoop, lacked streaming because it had been branched off an old Hadoop version that predates streaming support.
We integrated Sawzall execution with our cluster software by writing a generic MapReduce program that executes any given Sawzall program. The mappers and reducers stream data to and from Sawzall executors on each cluster worker machine. The data translation mechanisms (from Quantcast’s format to/from Google’s protocol buffers and the Sawzall binary output format) are pre-compiled before the job starts.
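The mapper side of that generic MapReduce program can be sketched roughly as follows. This is a minimal illustration, not the actual Quantcast code: the class name, the line-oriented framing, and the executor command are all assumptions. A production version would also pump stdin and stdout concurrently to avoid deadlocking on a full pipe buffer.

```java
import java.io.*;
import java.util.*;

// Sketch of a mapper (or reducer) streaming records through an external
// Sawzall executor process on the worker machine. Class name, command,
// and one-record-per-line framing are illustrative assumptions.
public class SawzallPipe {
    private final List<String> command;

    public SawzallPipe(List<String> command) {
        this.command = command;
    }

    // Feed each input record to the executor's stdin, then collect the
    // executor's stdout, one output record per line. For large inputs,
    // real code would read and write concurrently instead of sequentially.
    public List<String> run(List<String> records)
            throws IOException, InterruptedException {
        Process proc = new ProcessBuilder(command).start();
        try (BufferedWriter in = new BufferedWriter(
                new OutputStreamWriter(proc.getOutputStream()))) {
            for (String rec : records) {
                in.write(rec);
                in.newLine();
            }
        } // closing stdin signals end-of-input to the executor
        List<String> out = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.add(line);
            }
        }
        proc.waitFor();
        return out;
    }
}
```

On the cluster the command would invoke the Sawzall execution engine with the compiled program; the same pipe pattern works for any line-oriented filter.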
Dynamic Schemas + Multiple Type Systems = Need Dynamic Code Generation
The main difficulty came from the fact that the schema of the Sawzall output depends heavily on the content of the program, and can contain nesting and arrays. A top(10)[country] of frequency table will produce tuples with two fields, where the second field is an array of 10 elements. This entirely dynamic schema behavior did not fit well with our MapReduce framework, where schemas were determined and fixed once the MapReduce job was written. Moreover, we strongly preferred to flatten the output so that it could be exported to PostgreSQL. The solution was to deterministically flatten Sawzall output schemas. We implemented this schema flattener and educated users accordingly.
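To make the idea concrete, here is a minimal sketch of what deterministic flattening means for the top(10)[country] example above: one (country, array-of-10) tuple becomes ten flat (country, rank, value) rows that load cleanly into a PostgreSQL table. The class and method names are illustrative, not the actual flattener.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of deterministic schema flattening for a nested Sawzall result.
// A tuple of (key, array) is expanded into one flat row per array element,
// with the array index materialized as an explicit "rank" column so the
// rows are self-describing when exported to a relational table.
public class SchemaFlattener {
    public static List<String[]> flatten(String key, List<String> values) {
        List<String[]> rows = new ArrayList<>();
        for (int rank = 0; rank < values.size(); rank++) {
            // Columns: key, 1-based rank within the array, element value.
            rows.add(new String[] {
                key, Integer.toString(rank + 1), values.get(rank)
            });
        }
        return rows;
    }
}
```

Because the expansion depends only on the output schema (not on the data), the same Sawzall program always produces the same flat column layout, which is what makes the PostgreSQL export predictable.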
Also, once we started to draw boxes and connect them with lines, we realized we had to exchange data across several type systems. Quantcast data is generally represented in the Java type system. Sawzall expects data in the Google Protocol Buffer type system, but it produces results in its own type system and encoding. We wrote code to handle all these type translations. For efficiency, rather than writing generic translation libraries, we built a code generator that created custom translation code for every job based on the Sawzall table semantics. The custom code was compiled with javac (for the Java components) and with gcc (for the C++ components) and shipped to mappers/reducers as byte code or object code, where it was loaded by each mapper/reducer at start time.
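The Java half of that generate-compile-load cycle can be sketched with the JDK’s built-in compiler API. Everything here is illustrative: the generated class name, its translate method, and its toUpperCase body stand in for the real per-job translation logic derived from the Sawzall table semantics.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of per-job code generation: emit a Java source file at job-setup
// time, compile it with javac, and load the resulting byte code with a
// fresh class loader, as each mapper/reducer would at start time.
public class TranslatorGen {
    public static Class<?> generateAndLoad(Path workDir) throws Exception {
        // In the real system this source would be generated from the
        // Sawzall table semantics; the body here is a placeholder.
        String source =
            "public class RecordTranslator {\n" +
            "  public static String translate(String record) {\n" +
            "    return record.toUpperCase();\n" +
            "  }\n" +
            "}\n";
        Path src = workDir.resolve("RecordTranslator.java");
        Files.write(src, source.getBytes(StandardCharsets.UTF_8));

        // Compile with the JDK's built-in javac; the .class file lands
        // next to the source.
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        int rc = javac.run(null, null, null, src.toString());
        if (rc != 0) throw new IllegalStateException("compilation failed");

        // Load the freshly produced byte code from the work directory.
        URLClassLoader loader =
            new URLClassLoader(new URL[] { workDir.toUri().toURL() });
        return Class.forName("RecordTranslator", true, loader);
    }
}
```

The C++ half follows the same shape with gcc producing object code; only the shipping and loading mechanics differ.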
Cool Language Wrangling, But How Well Did It Actually Work?
We’ll let the data speak. The numbers were collected over two years of operation.
The Quantcast Sawzall integration increased the velocity of our engineering and product teams, and helped the client services team.
Overall the Sawzall integration was a success. However, two main issues have grown into larger problems over time. First, most of our code base is written in Java, while the Sawzall engine is in C++. This limits interoperability and leads to code duplication and skew, which in turn leads to conflicting analytics results. Second, while the Sawzall language provided a productivity boost for scripts between 2 and 20 lines, larger scripts became hard to maintain, especially for users outside engineering. Our current analytics direction has shifted away from Sawzall, and is based primarily on SQL + Java user-defined functions and aggregators. More about SQL in a later post.
Written by Silvius Rus, Director, Big Data Platforms