Why you should monitor data at the object storage layer
Monitoring data quality directly at the object storage layer prevents subtle downstream bugs, speeds up issue detection, and avoids unnecessary data processing.
A major goal of data quality monitoring is to ensure that data consumers like BI dashboards, ML training jobs or data APIs have a sufficient guarantee of the quality and integrity of the backing data assets.
A lot of attention has been given to observing data quality at the database and warehouse layers. This makes sense, given that endpoints consume directly from a database or warehouse, and transform tasks such as dbt workflows both read from and write to them.
Nevertheless, we have seen many cases where detecting issues after data has been materialised into a datastore is a sub-optimal workflow:
Compute cycles are wasted on load and transform steps before data issues are identified, and subsets of data then need to be re-processed
Increasingly popular formats like Parquet, Iceberg and Delta Lake allow analysts to bypass the warehouse and query directly on object storage
Non-analytics data, like some types of training data, may never make it into a database or warehouse but is still used for business-critical applications
In this post, we outline an object-storage monitoring approach that covers the gaps identified above and facilitates a faster and cheaper path to data remediation.
Note: For the remainder of this post, we focus on AWS S3, but the concepts are broadly applicable across different forms of object storage.
Challenges
Measuring the data integrity of S3 objects is more than just ensuring valid checksums or successful object creation. The object itself needs to be inspected to ensure that columns don't have anomalies and that data isn't missing, so that downstream processing can be halted as soon as an issue is found.
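As a rough illustration, the sketch below shows the kind of object-level inspection this implies: read a freshly written Parquet file and check its schema, row count and null rates before anything downstream consumes it. This is not Lariat's implementation; the expected columns and the null-rate threshold are made-up examples.

```python
import io

import boto3
import pandas as pd

# Hypothetical expected schema for objects under a given prefix
EXPECTED_COLUMNS = {"user_id", "event_time", "amount"}


def inspect_object(bucket: str, key: str) -> dict:
    """Read a Parquet object from S3 and flag schema and missing-data anomalies."""
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    df = pd.read_parquet(io.BytesIO(body))

    issues = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
    if df.empty:
        issues.append("object contains zero rows")
    else:
        # Fraction of nulls per expected column that is actually present
        null_rates = df[list(EXPECTED_COLUMNS & set(df.columns))].isna().mean()
        for col, rate in null_rates.items():
            if rate > 0.05:  # arbitrary example threshold
                issues.append(f"column {col} is {rate:.1%} null")

    return {"bucket": bucket, "key": key, "row_count": len(df), "issues": issues}
```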
There are a few difficulties that arise from trying to do this:
Sifting through a large number of file-ingest events (anywhere from 1,000 to 100,000 per day) to figure out which ones have issues is cumbersome
Managing a high volume of data sources from paid partners or external services is taxing to data engineering teams
Context about failed processing is lost when data is partially written
Poorly formatted files can be written because not all formats on object storage enforce a schema
Approach
An object storage monitoring system that solves the difficulties outlined above should:
Track and compute metrics on columns broken out by dimensions
Monitor success/failure of a task consuming an object and flag issues when relevant
Aggregate metrics across files of the same dataset (i.e. a unique bucket & object prefix) to understand systematic issues
Keep track of a timeline of events that mutate an object
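To make these requirements concrete, one possible shape for the per-object metric records such a system could emit is sketched below; the field names are illustrative rather than an actual schema.

```python
from dataclasses import dataclass


@dataclass
class ObjectMetric:
    """One column metric, for one object, broken out by one dimension value."""

    dataset: str          # bucket + object prefix, e.g. "s3://acme-raw/campaign_perf/"
    object_key: str       # the individual file the metric was computed on
    column: str           # e.g. "advertising_id"
    metric: str           # e.g. "count", "null_fraction"
    dimension: str        # e.g. "device_type"
    dimension_value: str  # e.g. "mobile"
    value: float
    event_name: str       # the S3 event that triggered the computation
    event_time: str       # ISO-8601 timestamp, feeding the object's event timeline
```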
We propose an event-driven system that can read a variety of data formats. The event-driven approach suits this problem more than polling because it makes it easier to get a near real-time view into object health given the underlying variance in the rate of object writes. Additionally, the system would be more resistant to changes in the schedule of data writes, and have fewer wasted resources from unnecessary polling.
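A minimal sketch of that event-driven entry point, assuming the bucket's event notifications are delivered to an AWS Lambda function (the same records arrive via SQS/SNS with an extra envelope): only the S3 event record structure below is standard; the handler itself is illustrative.

```python
def handle_s3_event(event: dict, context) -> None:
    """Lambda-style handler for S3 event notifications."""
    for record in event.get("Records", []):
        event_name = record["eventName"]        # e.g. "ObjectCreated:Put"
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        if event_name.startswith("ObjectCreated"):
            # A real agent would compute column statistics here and forward
            # them to the analytics layer.
            print(f"inspect s3://{bucket}/{key}")
        else:
            # Copies, deletes, etc. are appended to the object's event timeline.
            print(f"timeline event {event_name} for s3://{bucket}/{key}")
```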
The Agent
As seen in the above diagram, an agent’s job is to consume all manner of S3 events (e.g. ObjectCreated:Copy or ObjectRemoved:Delete). On ObjectCreation, it calculates statistics associated with the file and forwards the relevant information to the analytics layer. The agent consumes a lightweight configuration that specifies the bucket and object prefix of a dataset along with columns and dimensions of interest.
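The configuration itself can stay small. A hypothetical example, with made-up bucket, prefix and column names:

```python
# Each entry maps one dataset (bucket + object prefix) to the columns and
# dimensions the agent should compute metrics on. Values are illustrative.
DATASET_CONFIGS = [
    {
        "bucket": "acme-raw-data",
        "prefix": "vendor_feeds/campaign_performance/",
        "format": "parquet",
        "columns": ["advertising_id", "cpm"],   # columns to compute metrics on
        "dimensions": ["device_type"],          # dimensions to break metrics out by
    },
]


def config_for(bucket: str, key: str) -> dict | None:
    """Match an incoming object to the dataset it belongs to, if any."""
    for cfg in DATASET_CONFIGS:
        if cfg["bucket"] == bucket and key.startswith(cfg["prefix"]):
            return cfg
    return None
```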
The system treats a bucket and object prefix pair as a “dataset”. Although traditional tables and schemata do not exist on S3, data engineers generally treat objects under a particular path structure as logically similar and expect them to share a schema.
The Analytics Layer
The analytics layer is a data store containing pre-aggregated data and an interactive query service through which one can:
Get a clearer understanding of flagged issues by digging deeper into object-level metrics through interactive queries
Flag issues with data providers and track SLA breaches
Gather failed S3 events and re-push to SQS/SNS once data is fixed
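For the last point, re-driving an ingest is mostly a matter of re-publishing the stored event payloads. A rough boto3 sketch, assuming the failed S3 events were persisted by the analytics layer and using a hypothetical queue URL:

```python
import json

import boto3

sqs = boto3.client("sqs")
INGEST_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-queue"  # hypothetical


def replay_failed_events(failed_events: list[dict]) -> None:
    """Re-push stored S3 event payloads so downstream consumers retry them."""
    for event in failed_events:
        sqs.send_message(QueueUrl=INGEST_QUEUE_URL, MessageBody=json.dumps(event))
```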
While different types of data stores can serve as a sink for this data, the design properties that should be taken into account are high dimensionality, rollup support, and fast time-range filtering/pivoting. The data store serves as a pre-aggregation layer so that interactive queries can be performed without having to replay sequences of events.
For example, if ingesting Google Campaign Performance data periodically, a config could specify columns like advertising_id and cpm, and dimensions like device_type. One could then gather the count of advertising_id broken out by device_type. That file-level data can be rolled up to the dataset level, or counts can be rolled up across all device_types to understand global trends over time.
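Using pandas as a stand-in for the analytics layer's query service, those rollups might look like the sketch below; the per-file metric rows are made up.

```python
import pandas as pd

# Hypothetical per-file aggregates emitted by the agent for this dataset
metric_rows = pd.DataFrame(
    [
        ("s3://acme-raw/campaign_perf/", "part-0000.parquet", "mobile", 1200, "2023-06-01T00:00"),
        ("s3://acme-raw/campaign_perf/", "part-0001.parquet", "desktop", 800, "2023-06-01T00:00"),
        ("s3://acme-raw/campaign_perf/", "part-0002.parquet", "mobile", 1150, "2023-06-01T01:00"),
    ],
    columns=["dataset", "object_key", "device_type", "advertising_id_count", "ingest_hour"],
)

# Roll file-level counts up to the dataset level, broken out by device_type.
by_device = metric_rows.groupby(["dataset", "device_type"])["advertising_id_count"].sum()

# Roll up across all device_types to see the global trend over time.
global_trend = metric_rows.groupby("ingest_hour")["advertising_id_count"].sum()

print(by_device)
print(global_trend)
```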
Conclusion
Tracking data at the object storage level enables proactive monitoring of training datasets, analytical datasets, and the other use cases for which object storage usage is growing.
By moving data quality checks as early in the pipeline as possible, a significant portion of bugs, especially those stemming from data providers, can be triaged before they reach downstream consumers. This fosters greater transparency and reliability across the data organisation and avoids wasted compute with less engineering effort.
At Lariat, we’ve designed our own version of this. Check out our product release or try it for free (self-service & no credit card required) if you’re interested!