
Big Data at Your Command

Hadoop’s sluggish command line

We at Quantcast started using Hadoop’s file system, HDFS, in 2006. We’ve run its command-line tool, hadoop fs, hundreds of thousands of times and have never been happy with the perceptible lag after we hit the enter key. Stat a file, and there goes half a second. Run hadoop fs -dus, another second. Your grandmother might not understand how we could get worked up over half a second, but when building high-performance systems, half a second is a constant reminder that our system still has a little flab.

And even grandma will complain when she runs hadoop fs in a bash loop. She’ll be waiting seconds or minutes to get her directories checked, and…well…she’s not getting any younger.

The problem is that every single time you perform a file operation, you start a new Java virtual machine (JVM). That startup alone takes hundreds of milliseconds of wall-clock time and burns hundreds of milliseconds of CPU as well. Clearly a waste of time and resources.
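You can see the overhead for yourself by timing a trivial metadata operation. In the sketch below the paths are placeholders, and the exact numbers will depend on your cluster and hardware:

    # Time a single metadata lookup through the Hadoop shell.
    # Most of the elapsed time is JVM startup, not file system work.
    time hadoop fs -stat /user/example/somefile

    # The cost compounds in a loop: every iteration pays for a fresh JVM.
    for i in $(seq 1 10); do
        hadoop fs -stat "/user/example/dir$i" > /dev/null
    done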

QFS: Far-flung storage, up-close responsiveness

In 2011 we switched entirely from HDFS to QFS, the open-source Quantcast File System. Reliability and performance went up dramatically. Yet even though QFS is implemented in C++ and is lightning fast, all our command-line interaction with it was still sluggish. We were leveraging QFS’s Hadoop compatibility and going through Hadoop’s command-line tools. In other words, we were still starting a fresh JVM every time we typed hadoop fs.

We recently completed development of a command-line tool called (appropriately) qfs, a C++ rewrite of hadoop fs for QFS, which makes QFS much quicker than HDFS from the command line. We compared timings when accessing QFS via the HDFS command-line tool and via the new native QFS tool. The stat test timed a single hadoop fs -stat command. The script test timed a simple script that used the -stat, -touchz and -ls commands to check ten directory paths, create missing ones and list them, roughly as sketched below.
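Here is a sketch of what such a script looks like through the Hadoop shell; the paths are illustrative, not the ones used in the measurement:

    # Check ten paths: stat each one, create any that are missing with
    # -touchz (which creates a zero-length entry), then list them.
    # Every hadoop fs invocation below starts its own JVM.
    for i in $(seq 1 10); do
        path="/user/example/dir$i"
        hadoop fs -stat "$path" > /dev/null 2>&1 || hadoop fs -touchz "$path"
        hadoop fs -ls "$path"
    done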

The differences are vast: the new qfs tool speeds up operations by two orders of magnitude. It finishes almost before it starts.
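For comparison, here is the stat test again through the native tool. This is a sketch: the -fs flag and qfs:// URI below mirror the hadoop fs conventions, and qfs://metahost:20000 is a placeholder for your own metaserver address, so check the tool’s built-in help for the exact syntax of your build.

    # The same metadata lookup, this time through the C++ qfs tool.
    # No JVM starts, so the command returns almost immediately.
    # qfs://metahost:20000 stands in for your metaserver host and port.
    time qfs -fs qfs://metahost:20000 -stat /user/example/somefile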

Note that you can also use QFS+FUSE (Filesystem in Userspace) to access your files through familiar UNIX commands such as cat, grep, ls or find. The QFS FUSE code is entirely C++ and thus does not require a JVM, whereas HDFS+FUSE does.
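As a rough sketch, assuming the FUSE client binary is called qfs_fuse and your metaserver runs at metahost:20000 (both placeholders, and the mount options may differ by version), the mount looks something like this:

    # Mount QFS at /mnt/qfs through FUSE (binary name and options are
    # illustrative; adjust for your installation).
    mkdir -p /mnt/qfs
    qfs_fuse metahost:20000 /mnt/qfs -o allow_other,ro

    # Ordinary UNIX tools now work on QFS paths, with no JVM involved.
    ls /mnt/qfs
    grep -r "pattern" /mnt/qfs/logs

    # Unmount when done.
    fusermount -u /mnt/qfs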

If you’d like to use the qfs tool yourself, you can download it from GitHub, build it and run it. What you can’t do is park this new sports car in your old garage. Although the QFS back end supports the HDFS client, the reverse isn’t true: the high-performance qfs tool requires the high-performance QFS file system. Here’s another reason to give it a try.
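Getting it running looks roughly like this, assuming the repository at github.com/quantcast/qfs and its top-level Makefile; build prerequisites and output paths can vary between releases, so treat the paths below as illustrative:

    # Fetch and build QFS, including the command-line tools.
    git clone https://github.com/quantcast/qfs.git
    cd qfs
    make

    # The qfs tool typically lands under the build output directory;
    # adjust the path (and the metaserver address) for your setup.
    ./build/release/bin/tools/qfs -fs qfs://metahost:20000 -ls /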

Written by Silvius Rus, Jim Kelly and Michael Ovsiannikov on behalf of the Big Data Storage Team at Quantcast