in Engineering

Quantcast File System (QFS): Version 1.1.4

QFS_Blog_Header_1000x340

The Quantcast File System (QFS) was first introduced in early 2013 as an alternative to the Hadoop Distributed File System (HDFS) for large-scale batch data processing and storage. Today we’re introducing 1.1.4 to the community, with a number of substantial improvements and new features including 1) network communication integrity and confidentiality and 2) user authentication and authorization.

Network Communication Integrity and Confidentiality

QFS can now encrypt network traffic from the client to both meta and chunk servers, in addition to between meta and chunk servers. This helps to ensure that traffic is not snooped or altered in transit, and allows clients to communicate securely with meta and chunk servers across the open Internet.

User Authentication and Authorization

QFS now supports validating that the user attempting to access a resource is who they say they are (as verified by Kerberos or an x509 certificate) and also enables Unix-style file permissions.

To address the higher CPU requirement that comes with the authentication and encryption features, the chunk server can be configured to utilize multiple CPUs to serve client requests and perform data replication and recovery. This feature can be used with or without QFS security enabled.

Additional Enhancements

This release also delivers a number of enhancements designed to simplify QFS administration and make QFS deployment less error prone. Please refer to the release notes below for more details.

New features since QFS 1.0.2:

  • QFS security modules: network communication integrity and confidentiality achieved through encryption, and user authentication and authorization. Supported authentication protocols: Kerberos 5 and x509 certificate chains with TLS. The detailed design document is available here.
  • Chunk servers can now be configured to use multiple CPUs to handle network IO, authentication, network communication encryption, and chunk recovery.
  • Automatically create and use a unique file system ID. This guards against misconfigurations that could result in use of chunk files and directories that belong to a different, possibly prior file system.
  • Support for arbitrary data and recovery stripe combinations by using Jerasure library. Pluggable API to allow addition of other redundancy encoding schemes.
  • QFS now supports striped sparse files with redundancy encoding schemes (RS). Chunk recovery no longer has to rely on logical file size, and can handle an arbitrarily-positioned end of RS chunk block within a spare file.
  • Hadoop java shim: implement QFS file system object as a subclass derived from org.apache.hadoop.fs.AbstractFileSystem in order to make QFS work with Yarn / Hadoop 2.5. Rework block locations reporting with striped files. Report locations of data stripes for each striped chunk block.
  • Improved chunk directory health recovery detection. The chunk directory availability tests, done with read, write, rename etc., now include IO timeout thresholds in order to avoid chunk directory oscillation between “usable” and “unusable” states.
  • Chunk storage tiers. Storage tier range can be set per file. Presently no automatic background storage tier change for existing files is supported. Storage tier range is taken into account by initial chunk placement or re-replication/recovery. A file copy operation can be used to force tier assignment for chunks that already exist.
  • Storage tiers can be configured for a directory (“path”). By setting a storage tier range for a directory, all subdirectories and files will inherit the directory storage tier range as default.
  • Implemented IPv6 support.

Notable bug fixes and optimizations:

  • Fix zero chunk size bug when chunk server stops and then starts using chunk directory again. Rework “chunk available” protocol and policies to prevent data loss in the case of cluster wide disk subsystem overload (typically due to external factors or DoS attack) resulting in sporadic IO timeouts.
  • Fix chunk server write append state machine buffer flush threshold limit calculation.
  • Fix race between meta server checkpoint write start and transaction log close. It caused (in very rare occasions) writing an extraneous log entry, resulting in log replay failure.
  • Fix client library path cache invalidation and insertion to correctly handle concurrent i-node number changes performed by simultaneously running QFS clients.
  • Optimize N * N buffer list traversal in RS recovery code. This optimization should have a very noticeable effect on chunk recovery by the chunk server as it typically has 256 buffer entries of 4KB buffers.
  • Fix QFS client RS recovery to work on 32 bit systems where the highest order bit of 32 bit address is 1.
  • Use a single meta server connection per client, instead of two, in order to reduce meta server’s memory and cpu consumption with large number of clients.
  • Improve automatic build environment and external compilation dependencies detection.

Posted by Michael Ovsiannikov, Principal Engineer, and Wei Lu, Software Engineer