The business-critical data warehouse
How the data warehouse moved on from analytics and the role AI has played in accelerating this
Two years ago, in January 2022, I wrote an article titled "We've only scratched the surface of the full potential for the data warehouse."
The article's main point was that the data warehouse's potential market is 10-100x larger than its market at the time.
Although data moves fast, two years is not that long ago. The data warehouse was already well on the way to shifting from mainly being used for analytics and reporting to directly powering sales, operations, finance, and marketing. For example, marketing teams started using Hightouch to send lead lists to Marketo, and sales teams were moving to product-led sales motions where leads are scored and weighted in real time based on data in the data warehouse.
So, how has the role of the data warehouse changed going into 2024?
Despite a macroeconomic slowdown and tech valuations taking a hit, companies' investments in data warehouses kept increasing.
“Our revenue was $2.1 billion, $1.2 billion, and $592.0 million for the fiscal years ended January 31, 2023, 2022, and 2021” – Snowflake
“We crossed $1.5B revenue run rate at over 50% revenue year-over-year growth with the second quarter representing the strongest quarterly incremental revenue growth in Databricks’ history” – Databricks
For what it’s worth, analysts who predict these numbers for a living estimate a compound annual growth rate (CAGR) of 22.7% for the 2023-2030 forecast period.
In other words, the party is far from over for the data warehouse.
AI stole the show
While we saw more finance, sales, operations, and marketing use cases move to the data warehouse, AI and ML stole the show.
The best illustration is the chip maker NVIDIA, which grew revenue from $10.8 billion in 2020 to $26.9 billion in 2022 and whose stock price tripled in the last 12 months.
While all the buzz may be around LLMs and generative AI, in my experience, many ML use cases are business-centric, old-school applications that are now accessible to more people.
These are systems such as a customer lifetime value (CLTV) model used to more precisely predict the right amount to bid per customer for online ads or leads, or churn predictions used by sales teams. A significant portion of the ML workflow, such as data preparation, feature store building, and data quality monitoring and management, now more often happens in the data warehouse.
An example of a data warehouse-centric ML system
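To make the "old-school application" point concrete, here is a minimal sketch of how a CLTV prediction could feed an ad-bidding decision. The function name, margin target, and conversion rate are illustrative assumptions, not a prescribed formula.

```python
# Hypothetical sketch: turning a CLTV prediction into a per-customer ad bid cap.
# The formula and default parameters are illustrative assumptions.

def max_bid(predicted_cltv: float, target_margin: float = 0.3,
            conversion_rate: float = 0.02) -> float:
    """Highest cost-per-click we can pay while keeping the target margin.

    predicted_cltv: lifetime gross profit predicted for this lead segment.
    conversion_rate: expected click-to-customer conversion rate.
    """
    # Expected profit per click, discounted by the margin we want to keep.
    return predicted_cltv * conversion_rate * (1 - target_margin)

print(max_bid(500.0))  # a $500-CLTV segment supports a bid of $7.00 per click
```

In practice the CLTV prediction itself would come from a model scored over warehouse tables; the point is that the surrounding business logic is simple arithmetic once the data is in place.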
The increased demand for data warehouse-centric ML systems is a big deal for a few reasons:
We’ve only tapped into single-digit percentage points of most companies' total opportunity for ML use cases.
70-80% of the ML workflow, including data collection, preprocessing, and data quality management, has moved to the data warehouse.
More people can participate in the ML workflow.
Searching on LinkedIn for the title “machine learning engineer” shows around 45,000 people worldwide, compared to millions of analysts. While this number may be off by orders of magnitude, it still holds that the number of people well-versed in the data warehouse, such as analysts, analytics engineers, and some product managers, far exceeds the number of machine learning engineers with the experience required to build end-to-end systems from scratch.
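The "ML workflow in the warehouse" shift above mostly means feature preparation happening in SQL rather than in bespoke pipelines. A minimal sketch, using Python's built-in SQLite as a stand-in for the warehouse (table and column names are hypothetical):

```python
# Sketch of warehouse-side feature preparation. An in-memory SQLite database
# stands in for the warehouse; table and column names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (customer_id TEXT, event_day INTEGER, logins INTEGER);
INSERT INTO events VALUES
  ('a', 1, 5), ('a', 2, 3), ('b', 1, 1);
""")

# Aggregate raw events into per-customer features -- the kind of step that
# increasingly runs as a SQL model in the warehouse rather than in Python code.
features = conn.execute("""
    SELECT customer_id,
           SUM(logins)               AS total_logins,
           COUNT(DISTINCT event_day) AS active_days
    FROM events
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()

print(features)  # [('a', 8, 2), ('b', 1, 1)]
```

Because the feature logic is plain SQL, an analyst or analytics engineer can own it without touching the model-training code downstream.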
What does this mean?
Just because we see more demand and ROI from the data warehouse doesn’t mean it’s problem-free. Among the most common frustrations I hear from data teams is the sheer complexity of their data stack. It’s not uncommon for a team of 10-20 data people to have over a thousand data models (heck, Siemens has 85,000 models). These naturally build up over time as data-hungry stakeholders ask for more insights. This complexity makes it much more difficult to understand what’s happening and makes the data error-prone and less suitable for business-critical applications.
I’m seeing teams take these steps to address this.
Splitting the analytics and business-critical stack–the combination of the complexity of the analytics stack and the need for more reliable data has increasingly led teams to split out their business-critical use cases. This often means a separate database or schema, a separate dbt project, and limits on who can contribute and make edits. For analytics use cases, you may prefer fresh data even if it's not 100% accurate; for business-critical applications such as retraining an ML model or showing customer data in a shared portal, companies more often than not value accuracy over freshness.
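The "accuracy over freshness" trade-off can be made mechanical: a business-critical consumer only promotes a new data snapshot when its quality checks pass, and otherwise keeps serving the last verified one. A hedged sketch, with illustrative check names:

```python
# Hedged sketch of "accuracy over freshness": only promote a new snapshot
# when every quality check passes. The checks and data shape are assumptions.

def promote_if_valid(current, candidate, checks):
    """Return the candidate snapshot only if all quality checks pass."""
    if all(check(candidate) for check in checks):
        return candidate
    return current  # stale but trusted beats fresh but wrong

checks = [
    lambda rows: len(rows) > 0,                        # completeness
    lambda rows: all(r["amount"] >= 0 for r in rows),  # validity
]

trusted = [{"amount": 10}]
fresh_but_bad = [{"amount": -3}]  # a negative amount fails the validity check

print(promote_if_valid(trusted, fresh_but_bad, checks))  # [{'amount': 10}]
```

In a real stack, the same idea shows up as tests gating a promotion from a staging schema to the business-critical schema.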
Increased focus on data quality–dbt surveyed thousands of analytics engineers and found that data quality is the area where most companies are looking to increase investment. This is also reflected in larger organizations moving to a quality-first mindset and adopting tools like dbt.
Data as a product–companies are looking for ways to measure the reliability and SLAs of their important data and increasingly refer to these datasets as data products. Finger-in-the-air estimations of how a data product is performing are no longer good enough; clear reliability metrics broken down by areas such as timeliness, completeness, and accuracy are becoming more sought after.
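Those reliability dimensions are straightforward to compute once defined. A minimal sketch of scoring a data product on timeliness and completeness; the thresholds and field names are assumptions:

```python
# Illustrative sketch of a data-product reliability report. The 24-hour SLA
# window and the required "email" field are made-up examples.
from datetime import datetime, timedelta, timezone

def reliability_report(rows, last_refreshed, max_age=timedelta(hours=24)):
    now = datetime.now(timezone.utc)
    return {
        # timeliness: was the product refreshed within its SLA window?
        "timely": now - last_refreshed <= max_age,
        # completeness: share of rows where the required field is populated
        "completeness": sum(r.get("email") is not None for r in rows) / len(rows),
    }

rows = [{"email": "x@example.com"}, {"email": None}]
report = reliability_report(rows, datetime.now(timezone.utc))
print(report["completeness"])  # 0.5
```

Accuracy is harder to automate and usually needs a reference source to compare against, which is one reason it tends to be reported separately.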
The good news is that, by and large, this makes it more exciting to work in a data role. Your work has a more direct business impact, and external mandates on reliability mean you can take the time to build robust products instead of throwing spreadsheets over the wall. All of this is likely here to stay.
If you have some thoughts, let me know!