A Guide for Building High-Quality Data Products
68 pages of best practices on defining data products, testing strategies, managing ownership, and measuring data quality
Testing data well is easy at first… but then it suddenly gets pretty hard. Alerts are going off left and right in Slack channels, dozens of tests have been failing for months, and it’s unclear who the owner is.
At first, this leads to small annoyances, but over time it erodes the team’s trust in its own tests and can leave you not much better off than when you started.
I’ve come across hundreds of teams with varying degrees of this problem — the “broken windows” of data tests.
The root cause of the problem is that data is complex. Data is not static and sometimes changes retroactively. Source systems are inherently complex. And people & processes, especially at scale, are complex too. While there’s no easy solution, I believe there’s a framework that can add structure to most of these challenges.
The recipe goes as follows (the steps should be followed in order):
Define data use cases as products — start by defining how data is used
Clarify ownership and severity — establish accountability and test what matters
Deploy testing & monitoring strategy — add tests intentionally and continuously optimize
Manage and resolve incidents — streamline issue resolution
Review quality metrics — continuously improve based on data-driven insights
Over the past few months, Petr and I put our heads together to create a tool-agnostic guide for implementing this framework. I think there’ll be something interesting for you in there, whether you’re on a 5-person data team or a data team at a Fortune 500 company.
Download the guide in PDF here.
Here are a few of my favourite anecdotes:
Data products
The data team at Aiven started with high-level products such as Sales and Marketing but realized they needed to go a step deeper to have the most impact. — “If the Marketing data product has an issue, that may be fine. However, if the Attribution data product within Marketing has an issue, we must immediately jump on it. This is the level of detail our data products need to be able to capture.”
One pitfall when defining data product priority, especially in larger data teams with many stakeholders, is that people will have widely different opinions on what’s important. You should establish clear criteria for how priorities are assigned; after all, not everything can be P1. We recommend working with senior stakeholders to agree on the priorities and get their buy-in.
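One lightweight way to make this concrete is to register each data product in code, next to the models that feed it. Here’s a minimal sketch assuming you use dbt exposures; the product name, owner, model reference, and the priority field under meta are illustrative conventions, not something the guide or dbt prescribes:

```yaml
# models/marketing/exposures.yml (hypothetical file)
version: 2

exposures:
  - name: attribution_dashboard          # illustrative data product
    label: "Marketing Attribution"
    type: dashboard
    maturity: high
    owner:
      name: Marketing Analytics
      email: marketing-data@example.com
    depends_on:
      - ref('fct_attribution_touchpoints')   # hypothetical model name
    meta:
      priority: P1                       # our own convention; dbt does not interpret this
```

Defining products at this level of granularity is what lets you say “Attribution is P1” rather than “Marketing is P1”.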
Testing & monitoring
Mechanically applying the same tests everywhere, without awareness of the broader data pipeline, does not add more safety to our data.
If we already monitor our dbt jobs correctly, why would we deploy freshness anomaly monitors on tables created by models in those pipelines?
Testing sources is a high-leverage activity: we are verifying the quality of the data that feeds every other model in the system, at the point with the largest number of downstream dependencies and attributable usage.
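In dbt, this usually means source freshness checks plus a handful of column tests on the raw tables. A minimal sketch; the source, schema, table, and column names are placeholders, and the loaded_at_field assumes your loader writes an ingestion timestamp:

```yaml
# models/staging/salesforce/sources.yml (hypothetical file)
version: 2

sources:
  - name: salesforce                    # placeholder source name
    schema: raw_salesforce
    loaded_at_field: _loaded_at         # assumed ingestion timestamp column
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: opportunities
        columns:
          - name: opportunity_id
            tests:
              - unique
              - not_null
```

Because every downstream model inherits from these tables, a single freshness or uniqueness failure here explains far more than the same test repeated on every derived model.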
Ownership & incident management
One of the top pitfalls we see is when teams spend a lot of time mapping out and defining ownership, but let it sit stale on a Confluence page that gradually gets out of sync with reality.
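One way to avoid the stale-wiki problem is to keep ownership in the same repository as the models it covers, so it gets reviewed whenever the code changes. A minimal sketch using dbt groups (available in dbt 1.5+); the group name, contact details, and the meta fields are illustrative conventions rather than anything from the guide:

```yaml
# models/marketing/_groups.yml (hypothetical file)
version: 2

groups:
  - name: marketing_analytics
    owner:
      name: "Marketing Analytics Team"   # illustrative owner
      email: marketing-data@example.com

models:
  - name: fct_attribution_touchpoints    # hypothetical model
    group: marketing_analytics
    meta:
      severity: P1                       # our own convention
      alert_channel: "#marketing-data-alerts"
```

Because the owner lives next to the model definition, a pull request that changes the model also surfaces who is on the hook for it.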
If you’re not specific about expectations for on-call (e.g., we only look at issues within business hours), people will start adopting different expectations, which can create an uneven workload across the team.
Measuring data quality
Measuring metrics such as model test coverage without a clear end goal in mind can create a false sense of security and lead teams to optimize toward the wrong goals.