yes! That's why we built Hamilton (https://github.com/stitchfix/hamilton): to help ensure that the human/team side of feature/data engineering can scale to any size. It brings in software engineering best practices and prescribes a paradigm that keeps things ordered, no matter the scale.
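A minimal sketch of the paradigm Hamilton prescribes (this is an illustrative toy resolver, not the library's own API): each function is named after the value it produces, and its parameter names declare its upstream dependencies, so the dependency graph falls out of plain Python functions.

```python
import inspect


def signups(raw_events: list) -> int:
    """Count signup events."""
    return sum(1 for e in raw_events if e == "signup")


def visits(raw_events: list) -> int:
    """Count visit events."""
    return sum(1 for e in raw_events if e == "visit")


def conversion_rate(signups: int, visits: int) -> float:
    """Signups per visit; depends on the two functions above by name."""
    return signups / visits


def resolve(name: str, funcs: dict, inputs: dict):
    """Toy recursive resolver: compute `name` by resolving its parameter names first."""
    if name in inputs:
        return inputs[name]
    fn = funcs[name]
    deps = inspect.signature(fn).parameters
    return fn(**{d: resolve(d, funcs, inputs) for d in deps})


funcs = {f.__name__: f for f in (signups, visits, conversion_rate)}
events = ["visit", "visit", "signup", "visit", "signup"]
rate = resolve("conversion_rate", funcs, {"raw_events": events})  # 2 signups / 3 visits
```

Because dependencies are just function signatures, the "feature code" stays declarative and testable no matter how many functions a team adds.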
Great article! Any thoughts about Data Product Manager roles?
I don't have any experience working with data product managers but have spoken to some companies that have them. What's your take?
I transitioned from product to data product 3 months ago when I noticed that many products our team was building were data products (APIs, data sets, dashboards, ML algos, data apps). I feel this is happening as more companies and PMs notice this as well.
This was a brilliant read, so insightful.
> Some data teams have already started making progress on only exposing certain well-crafted data models to people outside their own team.
Do you have any more information on this? I am very interested in this.
The companies I've spoken to who've made the most progress are taking this approach:
1) Define model ownership (finance, marketing...) in the dbt meta tag for each dbt model
2) Define if a model is public or private in the yml file
3) Only people who work within the same domain are able to access private data models in that domain
4) If a user accesses a private model outside their own domain Airflow throws an error
Companies I've spoken to that are going down this path are mostly still early on and building it from scratch. There are also implementation details, such as how you define the user <> domain mapping, that, as far as I understand, are being handled in different ways.
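The four steps above can be sketched as a simple access check (a hypothetical illustration only: the model metadata, user mapping, and function names here are made up, not dbt's or Airflow's real APIs):

```python
# Hypothetical registry mirroring what steps 1 and 2 would store in each
# dbt model's meta tag / yml file: an owning domain and a public/private flag.
MODELS = {
    "finance.revenue_daily": {"owner": "finance", "access": "private"},
    "finance.kpi_summary": {"owner": "finance", "access": "public"},
    "marketing.attribution": {"owner": "marketing", "access": "private"},
}

# Hypothetical user <> domain mapping (one of the details companies handle differently).
USER_DOMAIN = {"alice": "finance", "bob": "marketing"}


def check_access(user: str, model: str) -> None:
    """Steps 3 and 4: raise if a user reads a private model outside their domain."""
    meta = MODELS[model]
    if meta["access"] == "private" and USER_DOMAIN[user] != meta["owner"]:
        raise PermissionError(
            f"{user} ({USER_DOMAIN[user]}) cannot access private model "
            f"{model} owned by {meta['owner']}"
        )


check_access("alice", "finance.revenue_daily")  # same domain: allowed
check_access("bob", "finance.kpi_summary")      # public model: allowed
try:
    check_access("bob", "finance.revenue_daily")  # cross-domain private: blocked
except PermissionError as err:
    print(err)
```

In the setup described above, a check like this would run inside the orchestrator (hence "Airflow throws an error"), with the registry generated from the dbt project's yml files rather than hardcoded.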
Thanks for the great article, Mikkel!
We're actively thinking about capabilities in dbt that could support splitting up monolithic projects (with thousands of models) into a set of smaller projects — each of which would be faster to run, easier to reason about, have clearer lines of ownership (one project == one team), and could be treated as contracted "services" by other teams' projects.
Some initial discussion in this direction, including public/private models, and how a team might version their public models: https://github.com/dbt-labs/dbt-core/discussions/5244
I've had a chance to run those ideas past some users already, and I'm always looking for more. If you or any of the folks you mention above would be interested in talking, let me know
This sounds really interesting! I'll keep an eye on it
One of the symptoms of a growing data team that comes up in a lot of conversations is the long data access request workflow. It always ends with the engineer who set up the process becoming the bottleneck in that process.