We’ve submitted a patch to the Ganglia Monitoring System project to add support for compressing a batch of metrics before sending them to the collector service. Since we collect around 140,000 metrics per second (roughly 12 billion metrics per day), even a small reduction in size adds up. This, however, is no small reduction; this patch reduces the amount of data transmitted by up to 92%.
This blog post will discuss the nature of the change and why we needed it, and talk a little bit about the technical details.
And That’s The Way It Is
Ganglia operates by running a Ganglia monitor daemon (gmond) on each machine whose metrics you’re interested in collecting, and then running a Ganglia collector daemon (gmetad) to retrieve those metrics and persist them in a central location. The metrics are transmitted as a big XML tree:
GANGLIA_XML
|- CLUSTER
   |- HOST
   |  |- METRIC
   |  |- METRIC
   |  |- METRIC
   |- HOST
      |- METRIC
      |- METRIC
      ...
This structure is easy to analyze and debug, but it’s not particularly space-efficient, especially when sending a lot of metrics over a long-distance link very frequently. Compressing the XML tree seemed like an easy improvement to make with a large potential benefit.
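To get a feel for why XML like this compresses well, here is a minimal sketch in Python that builds a small synthetic tree with the same nesting and gzips it. The host names, metric names, and attribute values are made up for illustration, and the ratio it prints says nothing about the real figures we see in production.

```python
import gzip
import xml.etree.ElementTree as ET


def build_sample_tree(hosts: int = 3, metrics_per_host: int = 20) -> bytes:
    """Build a synthetic GANGLIA_XML > CLUSTER > HOST > METRIC document.

    All names and values here are hypothetical placeholders.
    """
    root = ET.Element("GANGLIA_XML")
    cluster = ET.SubElement(root, "CLUSTER", NAME="example")
    for h in range(hosts):
        host = ET.SubElement(cluster, "HOST", NAME=f"host{h}.example.com")
        for m in range(metrics_per_host):
            ET.SubElement(
                host, "METRIC",
                NAME=f"metric_{m}", VAL=str(m * 1.5), TYPE="float",
            )
    return ET.tostring(root)  # returns bytes


xml_bytes = build_sample_tree()
compressed = gzip.compress(xml_bytes)

# Repeated tag and attribute names are what gzip exploits here.
print(f"raw: {len(xml_bytes)} bytes, gzipped: {len(compressed)} bytes")
```

The repeated element and attribute names are the bulk of the payload, which is why a general-purpose compressor does so well on it.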
Motivation: WANs Are Expensive
Quantcast has datacenters located all over the world. Using Ganglia to collect metrics from these datacenters means sending a lot of traffic across WAN connections, which are usually pretty expensive (and often slow and high-latency). Adding the ability to compress these metrics before sending them reduces the amount of raw data we send over these links, which saves us money. Also, transmitting less data over a high-latency link gets the metrics into the collector daemon that much faster, which improves our capability to monitor our global infrastructure in real time.
The patch has two main components: changing gmond to optionally gzip its metrics before sending them, and changing gmetad to automatically detect a compressed stream and decompress it before processing.
We added an option to gmond (-z) that causes it to compress its XML tree with gzip before emitting it. Since gmond was already using APR, we implemented this change by storing the compressed data in the per-socket data structure APR provides. This enabled us to make a very clean patch that makes as few changes as possible to the existing gmond code paths.
We changed gmetad (both the C and the Python versions) to automatically detect a compressed stream. Upon receiving the data stream, gmetad looks for the gzip header bytes as defined in RFC 1952; if those bytes are detected, gmetad attempts to decompress the stream and then resumes processing as normal. This patch also involved very little change; if those bytes are not detected, gmetad behaves exactly as it did before.
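The detection step on the collector side can be sketched in a few lines of Python. This is an illustration of the idea, not the code from the patch: the function name maybe_decompress is hypothetical, and it assumes the whole payload has already been read into memory. The two magic bytes (0x1f, 0x8b) are the ID1 and ID2 fields from RFC 1952.

```python
import gzip

# ID1 and ID2, the first two bytes of every gzip member (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def maybe_decompress(payload: bytes) -> bytes:
    """Return the XML bytes, decompressing only if the payload looks gzipped.

    Hypothetical helper: a payload without the gzip magic bytes is
    passed through untouched, so uncompressed senders keep working.
    """
    if payload[:2] == GZIP_MAGIC:
        return gzip.decompress(payload)
    return payload


plain = b"<GANGLIA_XML></GANGLIA_XML>"
print(maybe_decompress(plain) == plain)
print(maybe_decompress(gzip.compress(plain)) == plain)
```

Because an XML document starts with "<", it can never be mistaken for a gzip stream, which is what lets compressed and uncompressed senders share the same collector.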
We’ve made several other changes to Ganglia, and we’re planning to continue cleaning them up and submitting them back to the community. We’re hopeful that these changes will make Ganglia a more useful tool for everybody.
Did this post increase your interest in Quantcast by 92% as well? Come work with us and see what improvements you can make, both internally and shared with the world.
Posted by Adam Compton, Platform Operations Engineer