The Quantcast File System (QFS) was first introduced in early 2013 as an alternative to the Hadoop Distributed File System (HDFS) for large-scale batch data processing and storage. Today we’re introducing QFS 1.1.4 to the community, with a number of substantial improvements and new features, including 1) network communication integrity and confidentiality and 2) user authentication and authorization.
Network Communication Integrity and Confidentiality
QFS can now encrypt network traffic between clients and both the meta and chunk servers, as well as between the meta and chunk servers themselves. This helps ensure that traffic is not snooped or altered in transit, and allows clients to communicate securely with meta and chunk servers across the open Internet.
User Authentication and Authorization
QFS now supports validating that the user attempting to access a resource is who they say they are (as verified by Kerberos or an x509 certificate) and also enables Unix-style file permissions.
To address the higher CPU requirement that comes with the authentication and encryption features, the chunk server can be configured to utilize multiple CPUs to serve client requests and perform data replication and recovery. This feature can be used with or without QFS security enabled.
This release also delivers a number of enhancements designed to simplify QFS administration and make QFS deployment less error prone. Please refer to the release notes below for more details.
New features since QFS 1.0.2:
- QFS security modules: network communication integrity and confidentiality achieved through encryption, and user authentication and authorization. Supported authentication protocols: Kerberos 5 and x509 certificate chains with TLS. The detailed design document is available here.
- Chunk servers can now be configured to use multiple CPUs to handle network IO, authentication, network communication encryption, and chunk recovery.
- Automatically create and use a unique file system ID. This guards against misconfigurations that could result in use of chunk files and directories that belong to a different, possibly prior file system.
- Support for arbitrary data and recovery stripe combinations by using Jerasure library. Pluggable API to allow addition of other redundancy encoding schemes.
- QFS now supports striped sparse files with redundancy encoding schemes (RS). Chunk recovery no longer has to rely on logical file size, and can handle an arbitrarily positioned end of an RS chunk block within a sparse file.
- Hadoop Java shim: implement the QFS file system object as a subclass of org.apache.hadoop.fs.AbstractFileSystem in order to make QFS work with YARN / Hadoop 2.5. Rework block location reporting for striped files: report the locations of the data stripes for each striped chunk block.
- Improved chunk directory health detection and recovery. The chunk directory availability tests, done with read, write, rename, etc., now include IO timeout thresholds in order to avoid chunk directories oscillating between “usable” and “unusable” states.
- Chunk storage tiers. The storage tier range can be set per file. Presently no automatic background storage tier change for existing files is supported. The storage tier range is taken into account during initial chunk placement and re-replication/recovery. A file copy operation can be used to force tier assignment for chunks that already exist.
- Storage tiers can be configured for a directory (“path”). By setting a storage tier range for a directory, all subdirectories and files will inherit the directory storage tier range as default.
- Implemented IPv6 support.
Notable bug fixes and optimizations:
- Fix a zero chunk size bug triggered when a chunk server stops and then starts using a chunk directory again. Rework the “chunk available” protocol and policies to prevent data loss in the case of cluster-wide disk subsystem overload (typically due to external factors or a DoS attack) resulting in sporadic IO timeouts.
- Fix chunk server write append state machine buffer flush threshold limit calculation.
- Fix a race between meta server checkpoint write start and transaction log close. On very rare occasions it caused an extraneous log entry to be written, resulting in log replay failure.
- Fix client library path cache invalidation and insertion to correctly handle concurrent i-node number changes performed by simultaneously running QFS clients.
- Optimize the N × N buffer list traversal in the RS recovery code. This optimization should have a very noticeable effect on chunk recovery by the chunk server, as it typically works with 256 buffer entries of 4 KB each.
- Fix QFS client RS recovery to work on 32-bit systems where the highest-order bit of a 32-bit address is 1.
- Use a single meta server connection per client, instead of two, in order to reduce the meta server’s memory and CPU consumption with large numbers of clients.
- Improve automatic build environment and external compilation dependencies detection.
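As an illustration of the buffer-list optimization above: the quadratic cost arises when every access restarts its scan from the head of the buffer list, and the remedy is to carry the scan position forward across accesses. This is a schematic Python sketch of the idea, not the actual QFS C++ code; the buffer sizes and offsets are invented.

```python
def gather_quadratic(buffers, offsets):
    """For each offset, find the containing buffer by scanning the
    buffer list from the head every time: O(N * N) overall."""
    out = []
    for off in sorted(offsets):
        pos, i = 0, 0
        while pos + len(buffers[i]) <= off:  # restart scan on each access
            pos += len(buffers[i])
            i += 1
        out.append(buffers[i])
    return out

def gather_linear(buffers, offsets):
    """Visit offsets in increasing order and remember where the previous
    scan stopped, making the whole pass O(N)."""
    out, pos, i = [], 0, 0
    for off in sorted(offsets):
        while pos + len(buffers[i]) <= off:
            pos += len(buffers[i])
            i += 1
        out.append(buffers[i])
    return out

bufs = [b"aaaa", b"bb", b"ccc"]  # toy stand-ins for 4 KB buffers
assert gather_quadratic(bufs, [7, 0, 5]) == gather_linear(bufs, [7, 0, 5])
```

With hundreds of buffers per chunk, as in the recovery path described above, avoiding the repeated rescans removes the quadratic term entirely.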
Posted by Michael Ovsiannikov, Principal Engineer, and Wei Lu, Software Engineer
On Monday, April 7, a security vulnerability known as “Heartbleed” was disclosed in the OpenSSL cryptography library. On that same day, Quantcast’s engineering team confirmed that Quantcast.com, like half a million other websites, was vulnerable to the bug. In response, we updated our website to use the newly patched version of OpenSSL, then pushed new SSL certificates to our servers and revoked our old certificates. All Quantcast services have been running updated versions of OpenSSL with new certificates since noon PDT on Tuesday, April 8.
While we have no reason to believe any information was compromised, it is generally a good practice for Quantcast.com users to update their passwords on Quantcast.com and, due to the scope of this vulnerability, on any sites where they might store important information. To change your password on Quantcast.com, while logged in, visit your account page, fill out the form, scroll to the bottom, and hit “save.” For more information, please reach out to us at firstname.lastname@example.org.
Posted by the Quantcast Engineering Team
Over the last few years Big Data has become more important and critical to the world. Increasingly, business value is discovered by mining large amounts of data. However, the Big Data technology landscape is very fragmented. There are a number of established database/analytics providers, there’s Hadoop (in a number of flavors) and there are myriad homegrown systems. For instance, Facebook and Google alone have publicized six analytics tools all loosely based on the SQL language, yet different: Hive, Peregrine, Presto, Dremel, Tenzing and F1. While each of these systems solved important business problems, most of them do not follow industry standards, due to development time pressure or technology limitations imposed by scalability. They do not speak exactly the same language and do not see data in exactly the same way. Moreover, the data they process is often produced by programs written in a language significantly different from SQL, such as C++ or Java. When you put it all together it mostly works, except when it doesn’t. And when it doesn’t, it is often very hard to understand or track down the error, because it is not related to a specific application code change.
Data Languages at Quantcast
At Quantcast we’ve certainly experienced some of this while processing a few exabytes of data in various formats. At first we used mostly Hadoop/Java and PostgreSQL to process data. Then we adopted Sawzall, which increased our productivity and made our data available to non-engineers. Prior to Sawzall, product managers would normally have to work with an engineer for days to analyze a data set, but with Sawzall they could now analyze large data sets themselves.
Using Sawzall, however, brought a new set of challenges. It didn’t play all that well with our Java codebase. We had to translate back and forth between four type systems: Java, protobuf, Sawzall, and PostgreSQL. We’ve seen issues with signed vs. unsigned integers, character encoding, wide vs. narrow types, and nested vs. flat structures. These issues can cause serious problems because they may go undetected at first, causing subtle behavior changes downstream.
| System | Language | Input Type | Output Type |
|---|---|---|---|
| MapReduce | Java | Java Subset | Java Subset |
| Sawzall | Sawzall and C++ | protobuf | Sawzall |
Surviving Babylon: It Helps to Know Ahead
Given all of the risks around throwing a mix of useful but different languages at an already unwieldy Big Data problem, we wanted to highlight one of the ways we made sense of it all.
Our first measure was organizational. We standardized a single reference type system, which was essentially a subset of Java basic types and arrays that have a precise match in PostgreSQL.
On the system side, we took control of type conversions. Essentially all data passing between systems either goes through Java or is made in a binary format that is compatible with our Java type system subset. For instance, Sawzall programs produce output in a proprietary type system. We have (1) written a translator that automatically converts to protobuf and (2) limited the capability of the translator to only types that can be safely translated to our reference type system.
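As a sketch of what such a restricted translator can look like (the type names and the helper below are hypothetical illustrations, not Quantcast’s actual code):

```python
# Hypothetical reference type system: only types with a precise match in
# Java, protobuf, Sawzall and PostgreSQL alike are allowed through.
SAFE_TYPES = {"int64", "float64", "string", "bytes"}

def check_schema(fields):
    """fields: list of (name, type) pairs, with arrays of basic types
    spelled e.g. 'int64[]'. Refuse any type without a safe mapping
    rather than guess a lossy conversion."""
    for name, ftype in fields:
        base = ftype[:-2] if ftype.endswith("[]") else ftype
        if base not in SAFE_TYPES:
            raise TypeError(f"field {name!r}: no safe mapping for {ftype!r}")
    return fields

check_schema([("uid", "int64"), ("tags", "string[]")])  # passes
```

Refusing unsafe types at translation time, rather than coercing them, is what keeps subtle sign, width and encoding bugs from slipping through.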
Perhaps the most successful measure in dealing with multiple type systems was designing the system so that type checking is performed entirely before data processing starts, and programming it to abort execution early on mismatches. In other words, if you’re going to run a 10-hour data pipeline, it really helps to give up after seconds instead of wasting hours when you know the data shapes/types won’t match at some point. Even complex data pipelines made of a mix of steps written in Sawzall, Java and PostgreSQL are precisely and completely type-checked at execution planning time, across type systems and across languages.
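The fail-fast planning idea can be sketched as follows. The `Step` class here is a simplified, hypothetical stand-in for a Java, Sawzall or PostgreSQL pipeline step; it checks only field names where the real system checks full record formats.

```python
class Step:
    """A pipeline step declaring which record fields it consumes and
    produces (hypothetical; real steps carry fully typed schemas)."""
    def __init__(self, name, consumes, produces):
        self.name, self.consumes, self.produces = name, consumes, produces

    def output_schema(self, schema):
        missing = set(self.consumes) - set(schema)
        if missing:
            raise TypeError(f"step {self.name!r} is missing input fields "
                            f"{sorted(missing)}")
        return self.produces

def plan_pipeline(steps, input_schema):
    """Propagate schemas through every step before any data is touched,
    so a 10-hour pipeline fails in seconds on a shape mismatch."""
    schema = input_schema
    for step in steps:
        schema = step.output_schema(schema)
    return schema  # final output schema; only now would execution start

steps = [Step("parse", ["raw"], ["country", "visits"]),
         Step("aggregate", ["country", "visits"], ["country", "total"])]
assert plan_pipeline(steps, ["raw"]) == ["country", "total"]
```

A mismatched step raises at planning time, before any mapper, Sawzall program or SQL query runs.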
Precise and Complete Type Checking Across Multiple Languages
Here is how we got full type-checking across Java, Sawzall and PostgreSQL.
- Our data pipeline engine is written in Java. The pipeline execution plan is produced entirely before starting execution. Type checking happens in Java while the execution plan is being assembled. The data record abstraction is a flat row in which each element is a basic Java type or array thereof.
- Each language must provide a record schema translator that, given a program in that language and an input record format, produces the exact output record format. All input/output record formats conform to the data record abstraction requirements. Also, each language should provide a type checking mechanism to ensure that the program is valid and matches the input/output format.
- In Java, this requirement is provided by declaring record formats as Java classes containing only basic field types (or arrays). A Java pipeline step takes as input objects of such a class and produces as output objects of such a class. We require programmers to specify input/output classes statically. This information becomes available to the checker through reflection.
- In Sawzall, this requirement is fairly easily satisfied on the input side. Flat rows containing Java types map easily into protobufs, the Sawzall input schema. We construct the protobuf definition at pipeline assembly time and compile the Sawzall program against it to continue type-checking fully within the Sawzall program. On the output side, however, the schema and types will change depending on the text of the Sawzall program. We developed a schema translator that grabs Sawzall output table schemas from the output of static analysis and converts them to record formats that can be processed by the global checker.
- We interact with PostgreSQL through JDBC. At planning time we execute the SQL query against an empty table with a schema translated from the input record format. This accomplishes three goals. First, it forces PostgreSQL to validate that the query matches the input record format. Second, it forces PostgreSQL to perform type checking throughout its expressions and aggregators. Third, it forces PostgreSQL to produce an empty result set that contains the output schema definition. We translate this back into our standard record format so the checker can carry it forward to its consumer, be it Java, Sawzall or PostgreSQL.
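The empty-table trick in the last bullet works with any SQL engine that reports result metadata. Here is a minimal sketch using SQLite in place of PostgreSQL; the table, columns and query are invented for illustration.

```python
import sqlite3

def output_schema(query, input_schema):
    """Infer a query's output column names by executing it against an
    empty table built from the input record format. No rows are
    processed; the engine still validates the query against the schema
    and reports the shape of the (empty) result set."""
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in input_schema)
    conn.execute(f"CREATE TABLE input_records ({cols})")
    cur = conn.execute(query)  # empty table: zero rows, full metadata
    return [d[0] for d in cur.description]

schema = output_schema(
    "SELECT country, COUNT(*) AS visits FROM input_records GROUP BY country",
    [("country", "TEXT"), ("ts", "INTEGER"), ("url", "TEXT")],
)
assert schema == ["country", "visits"]
```

The inferred schema is then handed to the global checker, exactly as the bullet describes for the JDBC result set.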
We’ve shown how we managed to use Java, Sawzall and PostgreSQL together to solve Big Data problems reliably. Each language has its strengths, but using them together can cause trouble. While they do seem to work together well at first sight, proving that they do work together well is both nontrivial and needed in order to provide reliability at scale. At Quantcast we have transferred hundreds of petabytes between Java, Sawzall and PostgreSQL with the certainty of early and complete type checking across components written in these three languages.
Written by Silvius Rus on behalf of the Big Data Platforms team at Quantcast.
A Brief History of Quantifying Populations
The Hollerith tabulator was invented in response to the tremendous data processing problem of the 1890 U.S. census. It was a marvel of electrical and mechanical technology, and according to Wikipedia “it spawned the data industry.”
One hundred twenty years later at Quantcast we are solving similar, though much harder data problems. They involve over a billion people and are much more complex than a census: modeling the behavior of the Internet population and comparing it to the behaviors of known buyers of products and services. In contrast to traditional advertising, we don’t group users in broad categories, but rather we make custom advertising decisions for every single Internet user with respect to each of our advertising campaigns. This allows us to be precise and accountable.
The catch is, we need to operate at Internet scale, and the Internet is vast. We trawl through an ocean of data, such as the visit patterns from billions of browsers to hundreds of millions of web destinations. The data arrives in chronological order as pages get visited, but we need to sort it in all kinds of ways for different uses. In our main compute cluster we spend about half of our I/O capacity sorting data. Over 90% of the data we sort is from jobs larger than 1 terabyte each. Over 10% is from jobs larger than 100 terabytes each. The largest production sort was over 400 terabytes.
We like to keep capacity ahead of needs, so one day we thought to ourselves, “we ought to sort a petabyte.”
The Scale Challenge
The only public records of sorting at petabyte scale so far came out of Google and Yahoo, because it’s a hard problem. Although sort algorithms such as quicksort were perfected long ago, they assume the data will fit in a computer’s RAM where it can be read and written very quickly. But when data doesn’t fit in one computer’s RAM you need to use external storage, which generally means spinning disks. These are much slower than RAM because you’re not just flipping transistors but sliding disk heads and waiting for disks to spin to the right position. A century after Hollerith, the largest sort operations are still gated by the quality of mechanical devices.
And it’s not just that a petabyte sort takes longer. All this spinning and sliding wears out the disk drive mechanism, which leads to data loss. A desktop computer with a single drive might suffer a failure once in five or ten years. A petabyte sort using thousands of disks will probably lose at least one before it finishes.
You also need to worry about network optimization. It takes hundreds of disk drives to hold a petabyte, more than a single computer can manage. You need a network of computers, and you’ll have to shuffle the data between them carefully to avoid network bottlenecks.
By analogy, consider sorting books in a library. Say they are at first stacked by publication date and you want to order and shelve them by author. If it’s your home library, and you can reach most books by taking a couple steps, you’ll probably do the job in a few hours. If you’re resorting the Library of Congress, you’ll need significant capacity planning, temporary storage, transportation, bookkeeping, quality assurance, an army of workers and managers to coordinate them — in other words, a robust and scalable process. And however careful you are, you’ll still probably damage a few books in the process.
Sorting a petabyte is somewhat like that. You need load balancing, temporary storage, data transport, bookkeeping, quality assurance, quite a bit of hardware and resource management. And you are not allowed to lose even a single bit of data. If you do, you need to restart from the beginning. No wonder there aren’t many reports of petabyte sorts.
Getting It Done
At Quantcast we had to deal with large amounts of data early, before Hadoop started to mature. This pushed us to develop our own processing stack, quite similar to Hadoop at interface level, but focused from the beginning on scalability to petabytes. See our previous articles on the open source distributed file system QFS and our custom MapReduce implementation.
The petabyte sort challenge was larger for us than for Google or Yahoo given that we used significantly less hardware. Also, we do not have large test clusters, so we had to run it on our production cluster, concurrent with production workload, without delaying production data set delivery. Here are the key features that made our result possible.
- The Quantcast File System. Temporary sort spills are kept on QFS rather than on the local file system. This helps two ways. First, it helps balance CPU, disk I/O and network capacity. Writing locally is preferred, but when you run out of space locally you can spread to storage on a remote node seamlessly. Second, it ensures fault-tolerance. A broken disk, a server crash and even silent disk sector corruption get detected by QFS checksums and corrected automatically.
- Quantsort, our custom-built sort architecture. Quantsort is fundamentally better than Hadoop at scheduling as it keeps tasks short while keeping the I/O size large (over 1 MB). In Hadoop MapReduce large sorts require very large tasks that run for hours. When one of these tasks fails towards the end, it needs to be restarted, which can delay the whole job by hours. With Quantsort, tasks take no more than about 10 minutes even for the petabyte sort, so hardware failures don’t cause more than a 10-minute delay even if they happen late in the process.
- Outer-track optimization. The outer edge of a disk moves faster, and therefore reads or writes data faster, than the inner edge. We use a dedicated QFS instance to store all the sort spills, set up on the outer track disk partitions across all disks, which nearly doubles our effective I/O.
Results and Comparisons
We ran a generic MapReduce job that sorted about a trillion records of varying lengths, averaging about 1 kilobyte each. The data was taken from our production processes, not produced synthetically. We ran the job on one of our production clusters, concurrently with our normal production workload but at low priority, to make sure it would not affect production jobs. No production SLAs were missed due to this experiment.
Our sort finished in 7 hours, 22 minutes, an hour or so slower than Google’s published result and about twice as fast as Yahoo’s. See the table for details. And note that we pulled this off with a fraction of their hardware: Yahoo used three times as many drives, Google more than 10 times as many.
| | Google | Yahoo | Quantcast |
|---|---|---|---|
| Elapsed Time | 6 hours, 2 minutes* | 16 hours, 12 minutes | 7 hours, 22 minutes |
| Uncompressed Size | 1 petabyte | 1 petabyte | 1 petabyte |
| Sort Spill Compression Factor | 1x | 1x | 4x |
| Setup | Dedicated | Dedicated | Shared with production |
*Google disclosed faster and larger sorts in a later announcement, but the company did not present details on its experimental setup.
The tests differed in some other ways too. Unlike Google and Yahoo we did not follow the Terasort benchmark rules and used compression during the sort, which lowered our network and I/O load. On the other hand, other production jobs were adding their own network and I/O load during the two times we executed this test. We are now using the petabyte sort as the final phase of our QFS release qualification process.
We’ve built a world-class data stack that gives Ferrari performance on a jalopy budget. Although some of the key ingredients of our petabyte sort are hard to replicate with the Hadoop stack, reserving the outer disk tracks for sort spills is a simple change that could be used to improve Hadoop performance on large jobs. We are glad to share this simple but powerful improvement with the Big Data community.
Written by Silvius Rus, Michael Ovsiannikov and Jim Kelly
It’s every advertiser’s goal to capture their customers’ attention at the right time and place, but with so much content out there across so many devices, that window of opportunity becomes ever more important. As an industry, we need to find relevance or risk getting lost in the noise.
In a webinar with the American Marketing Association (AMA) and Critical Mass this week, we explained how machine learning helps drive better advertising decisions and identified the key ingredients to successfully leverage real-time display buying. Listen to our VP of Performance Engineering, Michael Recce, leverage his background in neuroscience and artificial intelligence to simplify the science behind predictive modeling, in the full recording here!
And if you have any questions about getting started with programmatic buying, please shoot a note to email@example.com.
Posted by Stephanie Park, Demand Generation Manager, Performance
On a Friday afternoon in April 2012, I was watching terabytes go by sitting at my desk across from Yerba Buena Gardens, when I saw Jim Kelly, our VP of R&D, walking towards me carrying two beers. We had had a number of conversations about large-scale analytics over the previous weeks. I could tell there was more than beer coming my way.
Big Data Equals Big Value – If You Can Make Sense of It
Quantcast has no shortage of data. Every day we ingest about 50 terabytes of new semi-structured data. More data is good. It makes our predictive models better. It gives us confidence in our research, development and operations, on both the engineering and business sides. However, before we get any value out of it, 50 terabytes simply means more bits to take care of, index, process and organize. 50 terabytes a day means 4.5 petabytes a quarter, a scale at which making sense of the data becomes a big problem. Big data is certain to increase process complexity through its sheer scale, but it is not certain to be valuable unless you can make sense of it, in a timely fashion and at gigantic scale, either through machine learning or as a human. Jim was coming to talk about the “making sense to a human” part.
SQL Helps Change Big Data into Big Value
At Quantcast we all want to make sense of big data, but in very different ways. Sales wants to know our reach in a geographic location in Brazil. Client Services wants to double-check that campaign delivery matches the configuration. Business Intelligence and Product want to find the next big business opportunity. Modeling Scientists want to validate model quality. Engineers want to research and validate assumptions on value distributions and invariants. One way to do all this is to write Java or C++ programs and run them over MapReduce. That works technically for engineers, but it is out of reach for users outside Engineering. Most of these users came from other companies, where they had used SQL to mine data, and that’s what they’re used to. For non-engineers, SQL is great. It gives them direct access to data. Without SQL they need to work with an engineer; a task that can be done with SQL in minutes may take a week of back and forth with an engineer. SQL is good for engineers as well. While I can write a MapReduce job in minutes, it’s tedious to write and deploy, and it’s easy to make mistakes. SQL makes mistakes much harder to make and is quicker to write, so it helps even engineers become more productive. These are all good things, except we did not have a way to run SQL over our petabytes of data.
SQL over MapReduce?
I was already an SQL convert, and so was Jim. We were not about to discuss whether it was good to get SQL to run over our data, but how. “It would take several quarters to get SQL to run over our data,” I said. Jim’s face turned to a grimace. “Don’t say quarters. Is there any way you can get us access to data through SQL quickly?” I thought about it over several sips of an IPA. I like challenges but, seriously, SQL on petabytes, and quickly? Something had to give. I started to think critically about the shortest path to a somewhat usable solution. We had already been running MapReduce at petabyte scale, though we wrote Java code for every job. Could we write a generic Java MapReduce job that could execute any SQL query? Grouping and ordering could be layered over the MapReduce sorter mechanism. Joins against dimension tables could be implemented by broadcasting them to all the MapReduce workers. Joining fact tables could be layered on the join mechanism we had previously developed atop MapReduce. There was still the matter of SQL syntax, which is quite rich and hard to replicate in Java. And, to complicate matters, we also wanted to make it possible to run our Java code as User Defined Functions (UDFs), because our data is semi-structured and thus needs to be parsed by our Java libraries. There were some other issues that I decided we should just punt on. None of these seemed like showstoppers, so I said, “okay.”
Old Horse, New Tricks
I came up with the fastest hack that allowed us to run SQL at scale on our data. It took five weeks of coding by myself. It supported almost arbitrary SQL expression syntax. It supported grouping, ordering and hash joins. (Later we added merge joins for fact tables.) Here is what came out.
It was essentially a generic MapReduce job that streamed input data to a PostgreSQL server running on the same machine. Each map/reduce process streamed its input into PostgreSQL and then executed a lightly modified version of the query, so that execution of the complex SQL syntax was delegated entirely to PostgreSQL. From a development perspective I wrote six modules:
- Query parser. A quick and dirty regular expression parser that only recognized anchors such as SELECT, FROM, GROUP BY and a few more. It was somewhat fragile, but with care it worked fine.
- Grouping and sorting translation to MapReduce grouping. This simply required translating SQL grouping and sorting expressions to our MapReduce sort key formulas.
- Schema translation. Our data is stored in a proprietary binary or text format described by Java classes and custom parsers. I used Java class reflection to infer the field names and types and bind them to the query. To infer the types and names of the SQL query output we execute a stripped-down version of the query on an empty table during the preflight routine on a local PostgreSQL server.
- Split and rewrite the query for the map/combine and reduce phases. Some parts of the query cannot be run in both phases. For instance, the map phase should run the GROUP BY part of the query in order to benefit from a combining effect, but it must not run the HAVING clause or it would produce incorrect results.
- Generic mapper and generic reducer. Every mapper inserts up to 4 MB of data into table T1 in the PostgreSQL instance on the same hardware host. It then continues to insert values into a different table T2 while executing the query on table T1, emitting the query result rows (possibly grouped and aggregated) to the distributed MapReduce sorter. This pipelining of insertion and execution continues until the end of input. The reducer works similarly; additionally, it runs the combiner query repeatedly until all data for a reduce group fits in 4 MB. (If this is not possible the query gets aborted; it has not happened in practice.) Once data has been combined, the final query is executed, including the HAVING clause. For most queries the input to the reducer already fits in 4 MB per group, so there is only one stage in the reducer.
- User defined functions. I used Java reflection to recognize and invoke arbitrary static functions on fields. The Java calls are made between the time when rows are read from a file to the point they get sent to PostgreSQL. This has some limitations but it worked in many cases. One example is extracting a particular value from a serialized protocol buffer passed as a string field, encoded base64.
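The map/reduce query split in the fourth module is a correctness issue, not just an optimization. The sketch below uses SQLite as a stand-in for the per-host PostgreSQL instance, with invented data, to show why the map/combine phase must run GROUP BY but not HAVING:

```python
import sqlite3

def run_query(rows, sql):
    """Load (key, count) rows into an in-memory table and run sql over
    them, mimicking one map or reduce process streaming into PostgreSQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (k TEXT, cnt INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    return conn.execute(sql).fetchall()

partitions = [[("a", 1), ("a", 1), ("b", 1)],   # mapper 1's input split
              [("a", 1), ("b", 1)]]             # mapper 2's input split

# Map/combine phase: partial aggregation, HAVING deliberately stripped.
partials = [row for part in partitions
            for row in run_query(part, "SELECT k, SUM(cnt) FROM t GROUP BY k")]

# Reduce phase: merge the partial counts, then apply HAVING.
final = run_query(partials,
                  "SELECT k, SUM(cnt) FROM t GROUP BY k HAVING SUM(cnt) >= 3")
assert final == [("a", 3)]
# Had the map phase applied HAVING, no partial count reaches 3 and the
# surviving group would have been lost entirely.
```

The same reasoning determines which other clauses may run in the combining phase and which must wait for the reducer.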
At the end of the five weeks the code was functional at scale. We did not have a PostgreSQL specialist on staff, so we treated it as a black box. As we used it only as a block streaming processor, we set it up with a limited RAM budget and made sure it never had to touch a spinning disk. We did not have to change our MapReduce implementation at all. Our cluster SQL was implemented as yet another MapReduce pipeline.
We tested it all on a job that ran over hundreds of terabytes on a 1,000-node cluster with only twenty terabytes of RAM. This ensured that we pushed data through all the pipes, and it made it through correctly and consistently. All data was read from the distributed file system and written back to it. It ran on our unmodified data and it produced data in our common format, ready for use by all our other tools. Jim was happy, and the IPA was well spent.
We received great value from the five-week prototype. While there were certainly unpolished spots, such as syntax limitations and lack of custom aggregators, it bought us time, helped determine the business value by sifting through scores of petabytes, and helped us learn what our users wanted.
This initial prototype puts us well on our way to producing a full-fledged SQL-based analytics solution. There will be more about that soon, so stay tuned!
By Silvius Rus, Director of Big Data Platforms, and Jim Kelly, Vice President of R&D
Quantcast recently contributed a case study on our use of the Ganglia Monitoring System for an O’Reilly book. We think our case study provides an interesting look at our petabyte-scale infrastructure, so we’re publishing the case study as a white paper.
Here’s an excerpt to whet your appetite:
Quantcast offers free direct audience measurement for hundreds of millions of web destinations and powers the ability to target any online audience across tens of billions of events per day. Operating at this scale requires Quantcast to have an expansive and reliable monitoring platform. In order to stay on top of its operations, Quantcast collects billions of monitoring metrics per day. The Ganglia infrastructure we’ve developed lets us collect these metrics from a wide variety of sources and deliver them to several different kinds of consumers; analyze, report and alert on them in real time; and provide our product teams with a platform for performing their own analysis and reporting.
The monitoring infrastructure at Quantcast collects and stores about 2 million unique metrics from a number of data centers around the world, for a total of almost 12 billion metrics per day, all of which are made available to our monitoring and visualization tools within seconds of their collection. This infrastructure rests on Ganglia’s Atlas-like shoulders.
If you’d like to learn more about how Quantcast uses the Ganglia Monitoring System, click here to download the white paper!
Posted by Adam Compton, Platform Operations Engineer
The Application Developers Alliance is one of the leading associations supporting app developers as creators, innovators and entrepreneurs. Quantcast is proud to be working with the Alliance to provide support and resources for its 25,000 members.
This week, the Alliance featured Quantcast developer Kevin Smith as its “Developer Spotlight” of the week. In the feature, Kevin describes his inevitable path to becoming a developer. Like many other avid gamers and developers, Kevin started coding around age 11 or 12 with QBasic. “I made everything from small text-based games to programs to help with homework,” says Kevin. The curiosity to program, and the opportunity to explore endless possibilities, are what Kevin says most developers have in common.
Kevin joined Quantcast to lead mobile measurement development. At Quantcast, Kevin works on making the Measure for Apps SDK more accessible to all kinds of app developers. Read the rest of his profile and his predictions for the next 10 years on the Alliance site here.
Posted by Maryam Motamedi, Product Marketing Manager, Quantcast Measure
Since 2006, over 100 million destinations have used Quantcast for their audience and traffic measurement. Many publishers log in to the Measure dashboard multiple times a day and have asked us for a programmatic way to access their data. That’s why we’re introducing the Quantcast Measure API (Application Programming Interface), now in Beta. The Measure API provides publishers with a flexible tool to retrieve and analyze their Quantcast data, allowing them to:
- Pull their audience and traffic data into an internal tool or dashboard
- Combine Quantcast data with other data sources for new insights
- Save audience data to track historical trends
More Customization and Insight
The Quantcast Measure dashboard provides a wealth of information, yet some publishers prefer to see only their daily people number, or already use a customized dashboard to monitor their site. With the Quantcast Measure API, publishers can choose exactly what Measure data they want to see, whether at the network, site, or subdomain level, and easily import it into their custom reports and dashboards.
The API also provides publishers with new ways to analyze their Quantcast data, such as combining it with data from other sources. For instance, a travel site might analyze how weather affects its regional traffic by combining Quantcast data with data from a weather provider. We’re looking forward to learning about all the creative ways publishers will use their Quantcast Measure data to generate enhanced insights.
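As a rough illustration of the weather-and-traffic idea, the sketch below joins daily traffic records with weather observations by date. The field names and record shapes here are purely hypothetical, not the actual Measure API response format.

```python
# Hypothetical sketch: joining daily traffic figures (as an API client might
# hold them) with weather observations by date. Field names and structure are
# illustrative, not the actual Measure API schema.

def combine_traffic_weather(traffic, weather):
    """Merge two lists of daily records on their 'date' field."""
    weather_by_date = {w["date"]: w for w in weather}
    combined = []
    for t in traffic:
        w = weather_by_date.get(t["date"], {})
        combined.append({
            "date": t["date"],
            "visits": t["visits"],
            "condition": w.get("condition", "unknown"),
        })
    return combined

traffic = [{"date": "2013-04-01", "visits": 1200},
           {"date": "2013-04-02", "visits": 800}]
weather = [{"date": "2013-04-01", "condition": "sunny"},
           {"date": "2013-04-02", "condition": "rain"}]

rows = combine_traffic_weather(traffic, weather)
```

From here, a publisher could correlate the `condition` column against `visits` in whatever reporting tool they already use.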
Applying for Beta Access
Publishers with a Quantcast account can learn more about the API and apply to participate in the Beta program at our developer site here. Note that this API only provides publishers access to data for their owned properties.
Please contact us with any questions or feedback. We value your input and would love to hear from you as we expand the capabilities of our API.
Posted by Jon Katz, Product Manager, Quantcast Measure
It’s April 2011. Quantcast is running one of the largest MapReduce clusters out there. Engineers write Java code that gets executed across the whole cluster efficiently on terabytes of data. Wonderful!
Unfortunately, not everyone who needs access to data is an engineer. Also, even engineers don’t feel like writing a new MapReduce job every time they want to take a different look at a data set.
We realized we needed a more productive data analysis tool. Our first thought was to get SQL to run on our petabytes of data, but that seemed like a large undertaking: SQL data access semantics are fairly sophisticated, which implies a fairly large friction area with Quantcast’s MapReduce implementation. A simpler alternative was to get Google’s recently open-sourced Sawzall to run on Quantcast’s MapReduce cluster. Although Sawzall as a language is not as easy to use as SQL, especially for non-engineers, it’s still much simpler than Java. And it seemed reasonably easy to integrate Sawzall with Quantcast’s MapReduce implementation, as its interface to MapReduce is much narrower and better defined than SQL’s.
This theoretical “ease of integration with MapReduce” turned out (surprise!) to be harder than expected in practice. First, Sawzall was not open-sourced with a MapReduce harness, only as a compilation/execution engine plus a command-line tool of little practical utility. Second, Sawzall runs best on protocol buffers, but Quantcast stores its data in different binary and text formats. Third, Quantcast’s MapReduce, although based on Hadoop, lacked streaming support because it was branched off an old Hadoop version that predates the feature.
We integrated Sawzall execution with our cluster software by writing a generic MapReduce program that executes any given Sawzall program. The mappers and reducers stream data to and from Sawzall executors on each cluster worker machine. The data translation mechanisms (from Quantcast’s format to/from Google’s protocol buffers and the Sawzall binary output format) are pre-compiled before the job starts.
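The core of that streaming approach can be sketched as a wrapper that pipes newline-delimited records through an external executor process over stdin/stdout. This is a minimal stand-in, not our actual Java MapReduce harness; the child command below merely upper-cases records where the real executor ran the compiled Sawzall program.

```python
# Minimal sketch of the streaming idea: a "mapper" hands each record to an
# external executor process over stdin/stdout, the way the generic MapReduce
# wrapper streamed records to and from Sawzall executors. The child command
# here is a stand-in for a real Sawzall executor.
import subprocess
import sys

def run_through_executor(records, command):
    """Pipe newline-delimited records through an external process."""
    proc = subprocess.run(
        command,
        input="\n".join(records) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()

# Stand-in executor: upper-cases each record.
out = run_through_executor(
    ["us,3", "uk,1"],
    [sys.executable, "-c",
     "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())"],
)
```

The appeal of this shape is that the wrapper is generic: swapping in a different executor program requires no change to the MapReduce job itself.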
Dynamic Schemas + Multiple Type Systems = Need Dynamic Code Generation
The main difficulty came from the fact that the schema of the Sawzall output depends heavily on the content of the program, and can contain nesting and arrays. A top(10)[country] of frequency table produces tuples with two fields, where the second field is an array of 10 elements. This entirely dynamic schema behavior did not mesh well with our MapReduce framework, where schemas were determined and fixed when the MapReduce job was written. Moreover, we strongly preferred to flatten the output so that it could be exported to PostgreSQL. The solution was to deterministically flatten Sawzall output schemas. We implemented this schema flattener and educated users accordingly.
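To make the flattening concrete, here is a toy version of the idea: a top(N)[key] table emits (key, [N elements]) tuples, and flattening expands each array into one row per element with an explicit rank column, so the result fits a fixed relational schema. The input shape is illustrative, not the exact Sawzall encoding.

```python
# Sketch of deterministic flattening: a top(N)[key] table emits
# (key, [N elements]); flattening expands the array into one row per element
# with an explicit rank column, yielding a fixed-width relational schema.

def flatten_top_table(rows):
    """Expand (key, [v0, v1, ...]) tuples into (key, rank, value) rows."""
    flat = []
    for key, values in rows:
        for rank, value in enumerate(values):
            flat.append((key, rank, value))
    return flat

flat = flatten_top_table([("us", [9, 7]), ("uk", [4, 2])])
```

Because the expansion depends only on the schema, not the data, the same flattening is applied identically on every run, which is what makes a stable PostgreSQL export possible.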
Also, when we started to draw boxes and connect them with lines, we realized we had to exchange data across several type systems. Quantcast data is generally represented in the Java type system. Sawzall expects data in the Google Protocol Buffer type system, but produces results in its own type system and encoding. We wrote code to handle all these type translations. For efficiency, rather than writing generic translation libraries, we built a code generator that created custom translation code for every job based on the Sawzall table semantics. The custom code was compiled with javac (for the Java components) and gcc (for the C++ components) and shipped to mappers and reducers as byte code or object code, where each loaded it at startup.
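The per-job code generation can be illustrated with a toy generator: given a flat output schema, it emits source for a record-translation function and compiles it at job start, much as we generated custom Java/C++ translators per Sawzall job. The schema, field names, and generated function are all illustrative.

```python
# Toy per-job code generation: emit source for a translation function from a
# flat schema, then compile it at startup. This mirrors the shape of the
# generated Java/C++ translators, not their actual content.

def generate_translator(schema):
    """Build a function that coerces a raw CSV record to typed fields."""
    converters = {"int": "int", "str": "str", "float": "float"}
    body_parts = [
        f"{converters[typ]}(fields[{i}])"
        for i, (_, typ) in enumerate(schema)
    ]
    source = (
        "def translate(record):\n"
        "    fields = record.split(',')\n"
        f"    return ({', '.join(body_parts)},)\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["translate"]

translate = generate_translator([("country", "str"), ("count", "int")])
row = translate("us,42")
```

Generating a specialized translator per job avoids the per-record dispatch cost of a generic, schema-interpreting translation library.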
Cool Language Wrangling, But How Well Did It Actually Work?
We let the data, collected over two years of operation, speak for itself: the Sawzall integration increased the velocity of our engineering and product teams, and helped the client services team.
Overall the Sawzall integration was a success. However, there were two main issues that over time have grown into larger problems. First, most of our code base is written in Java, while the Sawzall engine is in C++. This limits the level of interoperability and leads to code duplication and skew, which in turn leads to conflicting analytics results. Second, while the Sawzall language provided a productivity boost for scripts between 2 and 20 lines, it became hard to maintain larger scripts, especially for users outside engineering. Our current analytics direction has shifted away from Sawzall, and is based primarily on SQL + Java user defined functions and aggregators. More about SQL in a later post.
Written by Silvius Rus, Director, Big Data Platforms