Speed and agility in data analysis can be a big competitive advantage in business. We at Quantcast are always hungry for up-to-date information about key business metrics. So hungry, in fact, that we decided to take an aggressive step and empower our internal business users to work with their data faster and more efficiently via a custom self-serve analytics platform. This post covers how we pulled that off.
In thinking about our desired self-serve analytics platform, we came up with the following requirements:
- Remove dependency on engineering teams to enable custom data sets
- Keep the solution code-free
- Enable a Single Source of Truth for our key business metrics
- Help users upload and act on data within minutes (rather than days)
- Let users upload data using formats they already know (e.g., CSV/Excel)
- Eliminate any learning curve around SQL syntax for creating or altering tables
Our analysis found most existing solutions to be either overkill or overly complicated for non-engineering users, and more expensive than our needs dictated, so we decided to roll our own.
In accordance with our requirements, we needed to keep our user interface relatively simple. We initially considered building our own UI, but after examination, we realized that the existing AWS console tools had a sufficiently gentle learning curve.
We also decided to restrict the size of data our users could upload on their own using this solution, primarily to protect our infrastructure.
The solution that we implemented was fairly simple:
- Require that the data be in CSV files that meet three requirements:
- File name should be of the format <username>_<business_unit>_<dataset_name>.csv
- File should contain a header row (we use this to name the columns in the table)
- File size should not exceed [X] GB
- Users upload the CSV file to a predetermined S3 location
- A back-end process periodically grabs these files; determines the table structure, column names, and data types; and creates the table under a dedicated schema in our Data Warehouse (Redshift)
- A notification is sent to the user that this data is now available for querying
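The filename contract above can be enforced with a small validation helper before any processing happens. Here is a minimal sketch in Python; the function name, the regex, and the assumption that individual fields contain no underscores are ours for illustration, not part of the original tooling:

```python
import re

# Hypothetical validator for the <username>_<business_unit>_<dataset_name>.csv
# contract. Assumes each field is alphanumeric (no underscores within a field),
# so the three parts can be split unambiguously.
FILENAME_RE = re.compile(
    r"^(?P<username>[A-Za-z0-9]+)_"
    r"(?P<business_unit>[A-Za-z0-9]+)_"
    r"(?P<dataset_name>[A-Za-z0-9]+)\.csv$"
)

def parse_upload_name(filename):
    """Return (username, business_unit, dataset_name), or None if the
    filename does not match the upload contract."""
    m = FILENAME_RE.match(filename)
    if not m:
        return None
    return m.group("username"), m.group("business_unit"), m.group("dataset_name")
```

A file named `jdoe_sales_leads.csv` would parse into user `jdoe`, business unit `sales`, and data set `leads`; anything that fails to parse can be rejected before it ever reaches the warehouse.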
We implemented this using a combination of scripts and Pentaho PDI community edition. Scripts were primarily used as a wrapper for the PDI transformation.
The <business_unit>_<dataset_name> portion is used as the name of the table that we create, and <username> is used for sending out notifications when the upload succeeds or fails.
On the back end, the orchestration works as follows:
- Poll the S3 location for newly uploaded files
- On detection of a new file, pull the file down to an EC2 instance
- Validate that the structure meets the contract defined above
- Sample the data set to identify data types such as varchar, boolean, int, etc.
- Generate DDL (i.e., table definition)
- Connect to the DW, create the table, and load rows using the COPY command
- Email the user who uploaded the file on success, or on failure at any of the above steps
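The type-sampling and DDL-generation steps are the heart of the flow. Our implementation lives in Pentaho PDI transformations, but the idea can be sketched in plain Python; the function names, the sample size, and the fallback column widths below are illustrative assumptions:

```python
import csv
import io

def infer_type(values):
    """Guess a Redshift column type from sampled string values.
    A simplification of the sampling step: try BOOLEAN, then BIGINT,
    then DOUBLE PRECISION, and fall back to VARCHAR."""
    non_null = [v for v in values if v != ""]
    if not non_null:
        return "VARCHAR(256)"
    if all(v.lower() in ("true", "false") for v in non_null):
        return "BOOLEAN"
    try:
        for v in non_null:
            int(v)
        return "BIGINT"
    except ValueError:
        pass
    try:
        for v in non_null:
            float(v)
        return "DOUBLE PRECISION"
    except ValueError:
        pass
    # Leave headroom beyond the longest sampled value.
    return "VARCHAR(%d)" % max(256, 2 * max(len(v) for v in non_null))

def generate_ddl(schema, table, csv_text, sample_rows=1000):
    """Build a CREATE TABLE statement from the header row and a sample
    of the data, mirroring the orchestration steps above."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    sample = [row for _, row in zip(range(sample_rows), reader)]
    cols = []
    for i, name in enumerate(header):
        col_type = infer_type([row[i] for row in sample])
        cols.append('"%s" %s' % (name.strip().lower().replace(" ", "_"), col_type))
    return 'CREATE TABLE "%s"."%s" (\n  %s\n);' % (schema, table, ",\n  ".join(cols))
```

After the table exists, the actual load is a single Redshift COPY from the S3 location, which is far faster than row-by-row inserts for bulk data.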
Experimentation vs. Production-Ready Data
Though the primary use case for Bring Your Own Data (BYOD) was quick experimentation, our users also wanted a way to promote data they deemed ready for enterprise metrics into production. To do this, we implemented two flows:
- Prototype flow — Default mechanism: allows users to change the structure of the tables created by adding columns to, or removing columns from, the CSV file
- Production flow — Graduation flow: fine-tune the data structure to add distribution and sort keys, as well as standardize the structure of the table
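To make the graduation step concrete, here is a hedged sketch of the kind of SQL a production flow might emit for Redshift: a tuned table with distribution and sort keys, plus a deep copy from the prototype table. The helper name and signature are hypothetical; only the DISTKEY/SORTKEY clauses are standard Redshift DDL:

```python
def graduation_ddl(schema, table, columns, distkey, sortkeys):
    """Emit a tuned CREATE TABLE plus a deep-copy INSERT for graduating
    a prototype table to production. `columns` is a list of
    (name, type) pairs; `distkey` is one column name; `sortkeys` is a
    list of column names. Illustrative only."""
    col_sql = ",\n  ".join('"%s" %s' % (name, col_type) for name, col_type in columns)
    sort_sql = ", ".join('"%s"' % k for k in sortkeys)
    return (
        'CREATE TABLE "{s}"."{t}_prod" (\n  {c}\n)\n'
        'DISTKEY("{d}")\nSORTKEY({k});\n'
        'INSERT INTO "{s}"."{t}_prod" SELECT * FROM "{s}"."{t}";'
    ).format(s=schema, t=table, c=col_sql, d=distkey, k=sort_sql)
```

Choosing a distribution key that matches common join columns, and sort keys that match common filter predicates, is what makes the graduated table suitable for enterprise-grade query loads.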
As part of our initial launch, we saw fewer than 500 data set uploads per month. Due to the ease of use and the amount of time this feature saves our users, we are currently at about 2,500 uploads per month. In one of our recent DW user satisfaction surveys, BYOD was rated as one of the best features. Our analysts are now able to roll out or change metrics 3 to 4 times faster, and some of our critical sales pipeline metrics are now populated through this pipeline.
In the near future, we’ll provide a web app through which users can preview and upload files interactively and schedule imports, offer webhooks to set up data pulls directly from APIs, and leverage the Redshift Spectrum query service to allow querying large data sets directly from S3.
Sound fun? Join our team!