Working with big data isn’t easy — or cheap.
Big data typically demands massive amounts of storage and computing, electricity and space to run and cool all the hardware, plus ample staffing and resources. As any organisation that works with large data clusters in production can tell you, the associated costs can multiply quickly, making efficiency paramount.
Introducing QFS 1.0
Developed at Quantcast and now released to the open source community, the Quantcast File System (QFS) is an alternative to the Hadoop Distributed File System (HDFS) for large-scale batch data processing. It is a production-hardened, 100% open-source distributed file system that is fully integrated with Hadoop and delivers significantly improved performance while consuming 50% less disk space.
Why Open Source?
File systems are critical infrastructure and need to be solid. We believe the scrutiny and collaboration the open source community can provide are the best way to advance QFS, while providing a huge benefit in return to other organisations that adopt it.
We welcome your questions, comments and contributions to the QFS project and look forward to collaborating on its continued evolution. We’ll be checking our changes into GitHub and packaging new releases periodically.
More Processing Power — Less Hardware
Compact data storage requirements mean fewer hard drives to purchase and power. Faster data throughput means more processing power and better results. We built QFS to deliver both.
Key Features of QFS
- Reed-Solomon (RS) error correction. Unreachable machines and dead disk drives are the rule rather than the exception on a large cluster, so tolerating missing data is critical. HDFS uses triple replication, which expands data 3x. QFS uses only half the disk space by leveraging the same error correction technique that CDs and DVDs use, providing better recovery power with only a 1.5x expansion.
- Higher write throughput. Leaner data encoding doesn’t just save disk space; it also means less data to write. Since every job on QFS writes only half as much physical data, it puts half as much load on the cluster. Jobs write data faster, and more can run at the same time.
- Faster reads. A hard drive is the slowest component in a cluster, with a top read speed of about 50 MB/s. HDFS reads each data block from a single drive and therefore inherits the same speed limit. QFS reads every block from six drives in parallel, making its top theoretical read speed 300 MB/s. This translates into a big speed boost for real-world jobs.
- Direct I/O. The fastest way to read data from (or write data to) a drive is in large, sequential bursts. Normal file I/O APIs permit the operating system to buffer data and swap disk time between different processes, which breaks up big, efficient bursts into small, inefficient ones. QFS uses low-level APIs that give it more control to ensure that disk access stays optimal.
- Fixed memory. QFS is implemented in C++ and carefully manages its own memory within a fixed footprint. That means fast operations, with no interrupts for garbage collection. It’s also a good neighbour to other processes on the same machine, as it never asks the OS for more memory at the risk of swapping and extra disk activity. QFS’s memory management helps keep performance high and administration simple.
- Proven reliability. Reliability is critical for a file system and earned only through time and hard work in a suitably demanding environment. Quantcast’s data processing demands have grown steadily with our business. We receive over 40 terabytes daily, and our daily MapReduce processing can exceed 20 petabytes.
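
The storage and read-speed figures in the feature list above follow from simple arithmetic, sketched below in Python. The sketch assumes each block is striped across six data drives plus three parity stripes. The three-parity figure is an assumption chosen because it matches the 1.5x expansion quoted above, and the function names are hypothetical, not part of any QFS API.

```python
from fractions import Fraction


def expansion_factor(data_stripes: int, parity_stripes: int) -> Fraction:
    """Physical bytes stored per logical byte under striped erasure coding."""
    return Fraction(data_stripes + parity_stripes, data_stripes)


def parallel_read_mb_s(drives: int, per_drive_mb_s: int) -> int:
    """Theoretical peak read speed when a block is read from several drives at once."""
    return drives * per_drive_mb_s


# Figures quoted in the feature list.
REPLICATION_EXPANSION = Fraction(3)  # triple replication stores three full copies
PER_DRIVE_MB_S = 50                  # approximate top read speed of one hard drive
DATA_STRIPES = 6                     # drives read in parallel per block

# Assumption: three parity stripes, consistent with the quoted 1.5x expansion.
PARITY_STRIPES = 3

rs_expansion = expansion_factor(DATA_STRIPES, PARITY_STRIPES)
disk_saving = 1 - rs_expansion / REPLICATION_EXPANSION

print(f"RS expansion: {float(rs_expansion)}x")
print(f"Disk saving vs. triple replication: {float(disk_saving):.0%}")
print(f"Peak read speed: {parallel_read_mb_s(DATA_STRIPES, PER_DRIVE_MB_S)} MB/s")

# Under this assumed layout a block survives the loss of any three stripes,
# whereas triple replication survives the loss of two of its three copies.
print(f"Losses tolerated: RS {PARITY_STRIPES}, replication 2")
```

Writing half as much physical data per job is the same ratio seen from the write side: 1.5x divided by 3x.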