Big Data Babylon (Βαβυλών, בָּבֶל, بابل)

Over the last few years Big Data has become increasingly important to the world. Business value is increasingly discovered by mining large amounts of data. However, the Big Data technology landscape is very fragmented. There are a number of established database/analytics providers, there's Hadoop (in a number of flavors) and there are myriad homegrown systems. For instance, Facebook and Google alone have publicized six analytics tools, all loosely based on SQL yet all different: Hive, Peregrine, Presto, Dremel, Tenzing and F1. While each of these systems solved important business problems, most of them do not follow industry standards, due to development time pressure or to technology limitations imposed by scalability. They do not speak exactly the same language and do not see data in exactly the same way. Moreover, the data they process is often produced by programs written in a language significantly different from SQL, such as C++ or Java. When you put it all together it mostly works, except when it doesn't. And when it doesn't, the error is often very hard to understand or track down, because it is not tied to a specific application code change.

Data Languages at Quantcast

At Quantcast we’ve certainly experienced some of this while processing a few exabytes of data in various formats. At first we used mostly Hadoop/Java and PostgreSQL to process data. Then we adopted Sawzall, which increased our productivity and made our data available to non-engineers. Prior to Sawzall, product managers would normally have to work with an engineer for days to analyze a data set, but with Sawzall they could now analyze large data sets themselves.

Using Sawzall, however, brought a new set of challenges. It didn't play all that well with our Java codebase. We had to translate back and forth between four type systems: Java, protobuf, Sawzall, and PostgreSQL. We saw issues with signed vs. unsigned integers, character encoding, wide vs. narrow types, and nested vs. flat structures. Such mismatches are dangerous because they can go undetected at first and then cause subtle behavior changes downstream.

System       Language          Input Type    Output Type
MapReduce    Java              Java Subset   Java Subset
Sawzall      Sawzall and C++   protobuf      Sawzall
PostgreSQL   SQL               SQL           SQL
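
To make the wide-vs-narrow and signedness issues concrete, here is a minimal Java sketch (illustrative only, not our production code). A count that fits an unsigned 32-bit protobuf field silently changes sign when narrowed into a signed Java int, while a checked conversion fails loudly:

    // Minimal sketch of the wide-vs-narrow pitfall (illustrative only).
    public class NarrowingPitfall {
        public static void main(String[] args) {
            // A count that fits an unsigned 32-bit field (e.g. protobuf uint32)
            // but exceeds the range of a signed 32-bit Java int.
            long unsignedCount = 3_000_000_000L;

            // A naive cast silently flips the sign; downstream code keeps
            // running on corrupted data.
            int narrowed = (int) unsignedCount;
            System.out.println("naive cast:   " + narrowed);   // -1294967296

            // A checked conversion fails loudly instead of corrupting data.
            try {
                int checked = Math.toIntExact(unsignedCount);
                System.out.println("checked cast: " + checked);
            } catch (ArithmeticException e) {
                System.out.println("checked cast rejected: " + e.getMessage());
            }
        }
    }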

Surviving Babylon: It Helps to Know Ahead

Given all of the risks around throwing a mix of useful but different languages at an already unwieldy Big Data problem, we wanted to highlight one of the ways we made sense of it all.

Our first measure was organizational. We standardized on a single reference type system: essentially the subset of Java basic types, and arrays thereof, that have a precise match in PostgreSQL.
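
As a rough illustration (the class and the exact mapping below are hypothetical, not our actual code), the reference type system amounts to a small whitelist of Java types paired with PostgreSQL types of identical range and semantics:

    import java.util.Map;

    // Hypothetical sketch of a reference type system: a closed set of Java
    // basic types (and arrays thereof) with exact PostgreSQL counterparts.
    public final class ReferenceTypes {
        public static final Map<Class<?>, String> JAVA_TO_POSTGRES = Map.of(
            Boolean.class,  "boolean",
            Integer.class,  "integer",
            Long.class,     "bigint",
            Double.class,   "double precision",
            String.class,   "text",
            byte[].class,   "bytea",
            long[].class,   "bigint[]",
            double[].class, "double precision[]",
            String[].class, "text[]"
        );

        // Any type outside this whitelist is rejected before it can flow
        // between systems.
        public static boolean isSupported(Class<?> javaType) {
            return JAVA_TO_POSTGRES.containsKey(javaType);
        }

        private ReferenceTypes() {}
    }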

On the system side, we took control of type conversions. Essentially all data passing between systems either goes through Java or is written in a binary format compatible with our Java type-system subset. For instance, Sawzall programs produce output in a proprietary type system. We (1) wrote a translator that automatically converts this output to protobuf and (2) limited the translator to the types that can be safely translated to our reference type system.
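
Point (2) can be sketched roughly as follows (the source-side type names are illustrative stand-ins, not the real Sawzall output types or our real translator): every output column type either has a lossless mapping into the reference type system, or the translation is rejected outright.

    // Hypothetical sketch: translate source-side column types into the
    // reference type system, refusing anything without a lossless mapping.
    public final class SafeTranslation {
        // Illustrative stand-ins for the types a Sawzall-like output table
        // might declare.
        enum SourceType { INT64, FLOAT64, STRING, BYTES, UNSIGNED_FINGERPRINT }

        static Class<?> toReferenceType(SourceType t) {
            switch (t) {
                case INT64:   return Long.class;    // 64-bit signed: exact match
                case FLOAT64: return Double.class;  // 64-bit IEEE: exact match
                case STRING:  return String.class;
                case BYTES:   return byte[].class;
                default:
                    // No safe mapping (e.g. unsigned 64-bit): fail at
                    // translation time rather than reinterpret bits silently.
                    throw new IllegalArgumentException(
                        "no safe mapping to the reference type system: " + t);
            }
        }

        private SafeTranslation() {}
    }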

Perhaps the most successful measure in dealing with multiple type systems was designing the system so that type checking is performed entirely before data processing starts, and aborting execution early on mismatches. In other words, if you are about to run a 10-hour data pipeline and the data shapes or types will not match at some point, it helps to give up after seconds instead of wasting hours. Even complex data pipelines made of a mix of steps written in Sawzall, Java and PostgreSQL are fully and precisely type-checked at execution planning time, across type systems and across languages.
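
A minimal sketch of the fail-fast idea (the interfaces are hypothetical, not our actual engine): the planner propagates record formats through every step of the pipeline and throws on the first mismatch, before a single input record is read.

    import java.util.List;

    // Hypothetical sketch of planning-time type checking across a pipeline.
    public final class PlanChecker {
        // A record format is an ordered list of (name, reference type) columns.
        public record Column(String name, Class<?> type) {}
        public record RecordFormat(List<Column> columns) {}

        // Every step, whatever language it is written in, must be able to
        // report the exact output format it produces from a given input
        // format, or throw if the program does not type-check against it.
        public interface Step {
            RecordFormat outputFormat(RecordFormat input);
        }

        public static RecordFormat check(RecordFormat source, List<Step> steps) {
            RecordFormat current = source;
            for (Step step : steps) {
                // Any mismatch surfaces here, seconds into planning,
                // instead of hours into execution.
                current = step.outputFormat(current);
            }
            return current;   // schema of the final result
        }

        private PlanChecker() {}
    }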

Precise and Complete Type Checking Across Multiple Languages

Here is how we got full type-checking across Java, Sawzall and PostgreSQL.

  1. Our data pipeline engine is written in Java. The pipeline execution plan is produced entirely before starting execution. Type checking happens in Java while the execution plan is being assembled. The data record abstraction is a flat row in which each element is a basic Java type or array thereof.
  2. Each language must provide a record schema translator that, given a program in that language and an input record format, produces the exact output record format. All input/output record formats conform to the data record abstraction requirements. Also, each language should provide a type checking mechanism to ensure that the program is valid and matches the input/output format.
    • In Java, this requirement is provided by declaring record formats as Java classes containing only basic field types (or arrays). A Java pipeline step takes as input objects of such a class and produces as output objects of such a class. We require programmers to specify input/output classes statically. This information becomes available to the checker through reflection. (A small reflection sketch follows the list.)
    • In Sawzall, this requirement is fairly easily satisfied on the input side. Flat rows containing Java types map easily into protobufs, which serve as the Sawzall input schema. We construct the protobuf definition at pipeline assembly time and compile the Sawzall program against it, so type checking continues fully within the Sawzall program. On the output side, however, the schema and types change depending on the text of the Sawzall program. We developed a schema translator that grabs Sawzall output table schemas from the output of static analysis and converts them to record formats that can be processed by the global checker.
    • We interact with PostgreSQL through JDBC. At planning time we execute the SQL query against an empty table with a schema translated from the input record format. This accomplishes three goals. First, it forces PostgreSQL to validate that the query matches the input record format. Second, it forces PostgreSQL to perform type checking throughout its expressions and aggregators. Third, it forces PostgreSQL to produce an empty result set that contains the output schema definition. We translate this back into our standard record format so the checker can carry it forward to its consumer, be it Java, Sawzall or PostgreSQL. (A small JDBC sketch of this trick follows the list.)
  3. In addition to the pipeline assembly time checks, we also generate code to perform binary data translation between the different type systems. The code includes runtime checks or conversions such as the ones needed to avoid overflow.
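
The Java bullet above can be sketched as follows (class and field names are placeholders, not our actual code): a step's declared input/output classes double as its schema, which the checker reads through reflection.

    import java.lang.reflect.Field;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: derive a record format from a pipeline step's
    // declared input/output class via reflection.
    public final class RecordFormats {
        // Example record format declared as a plain class of basic fields.
        public static class ClickRecord {
            public long timestampMillis;
            public String url;
            public double bidPrice;
        }

        public record Column(String name, Class<?> type) {}

        // The checker reads field names and types reflectively, so a step's
        // declared classes double as its schema.
        public static List<Column> schemaOf(Class<?> recordClass) {
            List<Column> columns = new ArrayList<>();
            for (Field f : recordClass.getDeclaredFields()) {
                columns.add(new Column(f.getName(), f.getType()));
            }
            return columns;
        }

        public static void main(String[] args) {
            schemaOf(ClickRecord.class).forEach(System.out::println);
        }

        private RecordFormats() {}
    }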
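
The PostgreSQL bullet can be sketched with plain JDBC (the connection string, table and query are placeholders): create an empty table with the schema translated from the input record format, run the step's query against it, and read the output schema from the metadata of the empty result set.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;

    // Hedged sketch of the JDBC trick: type-check a SQL step by running it
    // against an empty table and reading the schema of the empty result.
    public final class SqlStepCheck {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/scratch", "user", "password");
                 Statement st = conn.createStatement()) {

                // 1. Empty table whose schema mirrors the step's input record format.
                st.execute("CREATE TEMP TABLE step_input (user_id bigint, score double precision)");

                // 2. Executing the step's query forces PostgreSQL to validate it
                //    against that schema and to type-check every expression.
                try (ResultSet rs = st.executeQuery(
                        "SELECT user_id, avg(score) AS avg_score "
                        + "FROM step_input GROUP BY user_id")) {

                    // 3. The empty result set still carries the output schema,
                    //    which we can translate back into the reference format.
                    ResultSetMetaData md = rs.getMetaData();
                    for (int i = 1; i <= md.getColumnCount(); i++) {
                        System.out.println(
                            md.getColumnLabel(i) + " : " + md.getColumnTypeName(i));
                    }
                }
            }
        }
    }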

Bottom Line

We’ve shown how we managed to use Java, Sawzall and PostgreSQL together to solve Big Data problems reliably. Each language has its strengths, but using them together can cause trouble. While they do seem to work together well at first sight, proving that they do work together well is both nontrivial and needed in order to provide reliability at scale. At Quantcast we have transferred hundreds of petabytes between Java, Sawzall and PostgreSQL with the certainty of early and complete type checking across components written in these three languages.

Written by Silvius Rus on behalf of the Big Data Platforms team at Quantcast.