Spreadsheets: The Duct Tape of the Modern Data Stack
How to build robust data pipelines when spreadsheets are involved
“So, what does your data stack look like?”
Head of Data: “Glad you asked! We run a CI/CD pipeline with 80% test coverage in dbt and have end-end observability”
Or at least that’s how the story goes. If you peek under the hood of a modern company's data stack you’ll most likely find all the usual suspects; dbt, Airflow, Fivetran, BigQuery and Looker. But one guest always shows up to the party uninvited. Hello Mr. Spreadsheet.
Like it or not, spreadsheets play a critical role in holding the data stack together and we need to adjust to that reality.
Spreadsheets are the duct tape of the data stack. Something everyone knows how to use, solves an imminent problem but is often not a durable solution.
Need to import revenue targets from the finance team for FY2022 into Looker? Hello spreadsheet
Need the sales team to create a custom mapping between countries and sales clusters? Hi there spreadsheet
Need the support team’s manually grouped customer feedback on NPS comments imported into the data warehouse? Howdy spreadsheet
Want to build a custom quality assurance system based on Google Forms and spreadsheets that hundreds of people use (more on this later)? 👋 spreadsheet
Importing spreadsheets is neither bad nor good but we should think more carefully about the role they play and how to use them responsibly.
Spreadsheets are not so bad… maybe
When I was in high school I took a class on Microsoft Access. I’ve since forgotten everything about it but Googled it when I was writing this post. In theory, Microsoft Access had a great vision
“Create your own database apps easily in formats that serve your business best” - Microsoft Access
In practice, spreadsheets combined with the Modern Data Stack offer a better alternative.
Spreadsheets are the interface that allows anyone to quickly and easily bring data into the data warehouse.
Importing data from spreadsheets into the data warehouse has a level of product/market fit you can only envy and willingly makes us jump through hoops.
Spreadsheets are easy to use: In the same way that Figma has invited the entire company into the design process, spreadsheets do the same for data. They allow anyone across operations, sales or customer support to contribute to the data stack by creating new groupings, raw input and data that goes straight into the dashboards they use every day
Spreadsheets are quick to use: Need a new mapping from country to sales pod? There’s a spreadsheet for that and you can import it into your data warehouse in less than five minutes
Spreadsheets are the language of most functions: What does finance, operations and FP&A have in common? They all use spreadsheets. Those targets from the finance team that you need to import into Looker to show how your region is performing? You guessed it, they’re already in a spreadsheet
Spreadsheets spark creativity: Spreadsheets can be used in creative ways to build business applications. At Monzo, one of our analysts built an entire quality assurance system that automatically assigns tasks for manual review for hundreds of support agents, collects data from Google Forms and imports it into BigQuery and Looker
With great ease-of-use comes great responsibility; spreadsheets are often overused, contribute to hard-to-reproduce datasets and are prone to manual input errors.
Spreadsheets are too easily modified. Although access to spreadsheets can be restricted, more often than not too many people have edit access. This makes spreadsheets prone to manual input errors which break data pipelines or introduce unnoticed data quality issues to the data warehouse
Spreadsheets are not (really) version controlled: Want to reproduce a data pull you did one year ago that has a dependency on a spreadsheet? Chances are that someone has edited the spreadsheet since then and you can’t get the numbers to match up. Spreadsheets technically have their own take on version control but in practice this is not nearly as good as Git-based version control for chasing down issues in your data pipelines
There are some technical gnarly challenges that hint to that importing spreadsheets was not always thought of as a first-class citizen in the data stack
BigQuery’s auto-detect schema is flaky: Ever tried to create a spreadsheet import of a sheet with hundreds of columns in BigQuery? If so, you’ve probably realised that Google sometimes gets the column type and name wrong leaving you with a lot of manual work
The physics of spreadsheets get in the way: If you get really creative you may end up in situations where the five million cell limit in spreadsheets gets in the way
Six practical spreadsheet tips from the trenches
Until someone invents a better tool to get user generated input to the data warehouse we’re stuck with spreadsheets - and maybe that’s not so bad. Here are six tips to make life easier when using spreadsheet as part of your data stack.
Keep edit access to imported spreadsheets restricted. Otherwise, you can be sure someone will find it in your company-wide Google drive folder and make changes that break your pipeline that you may never find out about
Name changes for important spreadsheet updates. Google sheets let you name versions of spreadsheet changes. Doing this will do wonders if you come back to investigate why a metric has changed one year later
Convert the spreadsheet import to a JSON array to have a version controlled lookup table that is persistent through time. Despite a little extra effort, this allows you to keep the data version controlled and easily reproducible
Write tests for imported spreadsheets. If you’re relying on a spreadsheet for important data models, treat it as such. Write tests against the spreadsheet by specifying expected and acceptable values and get notified if they’re breached
Have a business owner who’s responsible for quality. This should be an end-user of the spreadsheet logic who can spot if something is out of the ordinary. As a data person you easily end up supporting dozen of spreadsheet imports and may never catch subtle changes otherwise
(techy tip) Column generator. Remember that nasty bug above where BigQuery doesn’t autogenerate a scheme from the spreadsheet? Ben from Monzo has created this schema generator template to the rescue
What’s next - a wish list for the Modern Data Stack
In The Dark Knight the police commissioner says of Batman: “... he’s the hero Gotham deserves, but not the one it needs right now”. Spreadsheets are both the hero we deserve and need and the best tool we have for letting anyone into the data process.
The Modern Data Stack can learn a lot from spreadsheets.
Spreadsheets democratise access to inputting data to the data warehouse in a way that’s both fast and reliable so the data team doesn’t have to get involved with each small change.
Spreadsheets invite the rest of the company into the data process so data quality and availability is owned by everyone and not just the data team.
Spreadsheets provide better reproducibility in a way that lets anyone see visually what’s changed without being an expert in data lineage or Git.
Spreadsheets give anyone with good ideas the tools to do something about it. The quality assurance system I mentioned earlier is one example. We need more tools that get out of the way and let anyone be creative no matter their level of technical expertise.
If you have any tips on how to make spreadsheet imports more reliable let me know.