Very large data teams
What happens when data teams grow to hundreds of people and tens of thousands of data models and end-users
I went to dbt’s annual conference, Coalesce, where the top theme was how to run dbt in the enterprise. Seeing how very large organizations use dbt was interesting in itself, but it also highlighted a broader point.
Very large data teams are moving from working in independent silos, with separate tech stacks and no interconnection, to operating as one unit.
Siemens – case in point
Tobi and Nuno from Siemens went on stage and shared how Siemens uses dbt Cloud and a data mesh, and the impact this has had on the company-wide data evolution.
The initial reaction from the audience was, “Oooh, this is a whole different scale.” A few things stood out from their setup:
800+ dbt projects in a single dbt instance
550+ active dbt developers
2,500 daily jobs running
85,000 dbt models
And the results they achieved are solid enough that any other company in their shoes will likely want to do the same:
6h → 25 min reduction in the run time of the daily load
98% cost reduction
70,000+ end-users being served downstream
Siemens is not the only very large data team
dbt shared the now-infamous stats on the share of dbt project installs by number of models. 5% of projects have more than 5,000 models, which amounts to more than 1,000 projects.
While 5% of dbt projects may seem like a side note, that figure understates the number of data people working in large data teams.
How many data people can we assume work in very large teams, then?
What do we know?
We know the number of projects, segmented by number of dbt models.
What don’t we know?
We don’t know the average data team size for a given number of dbt models in a project.
We don’t know the average number of dbt projects each team has.
In the simplest scenario, if we assume that each team only has one project, most data developers are working in very large data teams (150,000 vs 132,500 for all other teams combined).
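The arithmetic behind this comparison can be sketched in a few lines of Python. The two developer totals (150,000 and 132,500) and the count of roughly 1,000 very large projects come from the text above. The figure of 150 developers per very large project is an assumption, picked here only because it reproduces the 150,000 total; the per-segment team sizes actually used are not stated in the text. The function and variable names are illustrative.

```python
def developers_in_segment(projects, devs_per_project):
    """Estimate developers in a segment, assuming one project per team."""
    return projects * devs_per_project


# From the text: more than 1,000 projects have over 5,000 models.
# ASSUMPTION: 150 developers per such project (hypothetical, chosen only
# because 1,000 * 150 matches the 150,000 figure quoted above).
very_large_devs = developers_in_segment(1_000, 150)

# From the text: developers in all other teams combined.
other_devs = 132_500

# Share of all data developers who work in very large teams.
share_very_large = very_large_devs / (very_large_devs + other_devs)

print(very_large_devs)             # 150000
print(f"{share_very_large:.1%}")   # 53.1%
```

Under these assumptions, a little over half of all data developers would sit in very large teams, even though those teams account for only 5% of projects. Changing the assumed team size or the number of projects per team shifts the result directly, which is why the estimate should be read as a rough upper bound.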
Even if this is a gross overestimate, LinkedIn shows that companies like Siemens and Coca-Cola have more than 1,000 people in data roles, and dbt has taken an increasingly important role at the center of the stack.
It’s not hard to believe that there are hundreds of companies like this, and if there aren’t yet, I expect there soon will be. If these companies can do what the team at Siemens did (a 98% cost reduction and a run time cut from 6h to 25 min), this may be a sign of what’s ahead of us in the coming years.
What does the future of very large data teams look like?
I previously wrote about how data teams were getting larger. But what happens when data teams get very large?
You can no longer operate without ownership.
With >10,000 models, thousands of end-users, and 100s of data developers, everything breaks without ownership.
Development time would balloon if you had to keep a mental map of 10,000 upstream models built by teams in different locations, made up of people you’ll never meet.
Error management and debugging become unmanageable with thousands of issues, and no single Slack alerting channel can keep up at this scale.
Scale brings an increased need for specialized engineers.
Data teams will have to work more like engineers by keeping intermediate transformation logic private and only exposing well-crafted data marts.
Thoughtful architecture is needed for core data marts. The impact of architecture decisions on usability, cost, and resource management grows sharply with scale, which creates a need for more specialized data engineers with deep expertise.
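To make the idea of private intermediate logic and exposed data marts concrete, here is a minimal Python sketch of a check that flags any model reading another team's non-public model. Everything in it is hypothetical: the `Model` class, the `owner` and `public` fields, and the model and team names are made up for illustration, and this is plain Python rather than dbt's own mechanism for controlling access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    owner: str               # team that owns the model
    public: bool             # True only for data marts meant for other teams
    depends_on: tuple = ()   # names of upstream models


def find_access_violations(models):
    """Return (model, upstream) pairs where a model reads another
    team's non-public model."""
    by_name = {m.name: m for m in models}
    violations = []
    for model in models:
        for upstream_name in model.depends_on:
            upstream = by_name[upstream_name]
            if upstream.owner != model.owner and not upstream.public:
                violations.append((model.name, upstream_name))
    return violations


# Hypothetical example: the sales team exposes one mart and keeps its
# intermediate model private; finance has one valid and one invalid reference.
models = [
    Model("int_orders_cleaned", owner="sales", public=False),
    Model("mart_orders", owner="sales", public=True,
          depends_on=("int_orders_cleaned",)),
    Model("finance_revenue", owner="finance", public=False,
          depends_on=("mart_orders",)),
    Model("finance_shortcut", owner="finance", public=False,
          depends_on=("int_orders_cleaned",)),
]

print(find_access_violations(models))
# [('finance_shortcut', 'int_orders_cleaned')]
```

A check like this could run in CI so that a team can refactor its intermediate models freely, knowing that other teams depend only on the marts it has chosen to expose.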
In the article “The next big step forwards for analytics engineering”, Tristan Handy (CEO, dbt) details what he thinks data teams need in order to operate at scale, and it’s well worth a read.
Teams owning their own data. This is how analytics scales – Tristan Handy.
Chad Sanderson also writes, in “How Scale Kills Data Teams”, about the cultural changes that scale makes necessary.
The organizational complexity of managing changes to many different data assets all at once creates hurdles for each stakeholder. Inevitably, pointing the finger at the other side is about as far as anyone gets toward actually solving the problem - Chad Sanderson.
I'd love to speak with you if you’re in a large data team and recognize these challenges.