Since Quantcast started in 2006, we’ve always had big data processing challenges relative to our budget and patience, so we pay a lot of attention to MapReduce performance and efficiency. One of the bottlenecks we ran into related to broadcasting job configuration data, so below we describe what we learned and how you can avoid this bottleneck in your own MapReduce jobs.
By job configuration data, we mean a blob of application-specific data that mappers or reducers need in order to process the main data set. For example, you might write a web log analysis that should skip a handful of URLs. Rather than hard-code the URLs in your map workers, the typical practice with Hadoop is to construct a list of URLs on your central launch node, write it to a file in the distributed file system, and have the map workers read it from there at startup before they start processing the web logs.
At small scale this works beautifully—your code stays clean, the job runs smoothly, and adding new URLs to the list is easy. But as your data volume grows, your configuration data will tend to grow as well, and you’ll want to use more map and reduce workers to process it. By the time you’re trying to send a 100 MB configuration file to thousands of workers, the broadcast will be a severe bottleneck, and jobs will take forever to start (if they start at all), while the cluster sits mostly idle.
Your Own Hadoop Botnet
What causes the slowdown? Your job becomes a victim of your computing power. With a few thousand Hadoop workers at your command, you can slice through enormous computing problems quickly. Or you can launch a potent denial-of-service attack on your file system. While you’re waiting for your job to start doing the first, it’s actually doing the second.
Here’s how that happens. Suppose you’re sending a 100 MB configuration file to 16,000 workers on 1,000 machines. When Hadoop writes the configuration file to its distributed file system (HDFS), it provisions ten replicas of it rather than the usual three to spread read requests across more data nodes. Multiple workers running on a single machine make only one request for the configuration file and share it among themselves, so each of 10 data nodes will have to serve 100 MB to 100 clients. On a gigabit network that takes 80 seconds, during which workers will sit idle waiting for data. That’s over a week of CPU wasted.
And that’s a best-case analysis. In fact the read requests will generally not be spread evenly. Hadoop will write the first three replicas synchronously, then start the workers, and then write the remaining seven. Many workers will start before all 10 replicas are available, so the first replicas will bear an even heavier load. Replication, then, is not the answer. The more replicas you make, the fewer machines each will have to serve, but the longer the replication process itself will take.
Options for Faster Broadcasts
The broadcast bottleneck became a big problem for us once our jobs and cluster got big enough. At Quantcast many production jobs have large configuration sets and do run on 1,000 machines, so we were seeing many slow (and even failing) jobs. We had tried increasing the replication factor beyond 10x, but that didn’t work well, as most workers still pulled from the first few replicas. We also considered building a completely different broadcast mechanism, perhaps using network broadcast protocols or self-establishing trees, but those sounded operationally complicated. They would leave us with yet another system to qualify, benchmark, deploy, log, audit, maintain, upgrade, monitor and debug.
After studying the problem ourselves and talking with HDFS’s developers, we realized there’s the most leverage in making the initial write spread its data to more places, but more efficiently than simple replication does. We identified a series of simpler improvements that do just that and deliver progressively faster broadcasts:
1. Decrease the block size in HDFS. If you turn it down to 1 MB, HDFS will write a 100 MB file as 100 physical files on many different data servers. Each data server will have to serve as many clients as before, but because the file size is much smaller, it will put much less stress on its network link. You can also turn down the replication count. In our performance test (Figure 1), we used 1 MB blocks and a replication count of three and got broadcasts to run 82% faster than default Hadoop.
2. Use the Quantcast File System’s (QFS) striping feature to achieve a similar data distribution and even better performance. For our broadcasts, we use QFS with 128-way striping and 2x replication, which spreads files across 256 machines and provides some resilience to data loss. This ran another 49% faster than #1.
3. Run QFS in RAM for even faster performance. Since the broadcast data won’t be needed after the last worker reads it, persisting it to disk isn’t necessary and slows the job down. 2x replication still provides fault tolerance, should a machine go down while the job is running. This ran yet another 44% faster than #2.
Try It Yourself
The good news is, this is a high-quality problem. If you’re noticing broadcast bottlenecks, your business is probably booming, your data processing infrastructure is up and running at scale, and you’re able to focus on improving efficiency. The even better news is we’re making it easy to improve. The HDFS change just requires a parameter adjustment, and QFS is easy to download, set up as an additional file system, and use for broadcasts.
Written by Silvius Rus, Jim Kelly and Michael Ovsiannikov on behalf of the Cluster R&D Team at Quantcast