A Data Pipeline unifies your team’s data in one place.
It is not a specific piece of technology – it's a way of working.
A pipeline combines your data sources into a single database (whether that's in Sheets, BigQuery, or elsewhere).
A pipeline outputs visualization-ready data, which can be used in your reporting tool of choice (Data Studio, Sheets, Tableau, Looker, etc).
Unifying your team's data workflows within one pipeline gives you a few distinct advantages:
Everyone on the team uses the same data at all times. No more apples-to-oranges comparisons, or discussions about where the data came from.
Your team drastically reduces time spent on data wrangling, manipulation and analysis. Most of our clients are able to avoid hiring a full-time data analyst as a result.
Users of data (marketers, salespeople, account managers, etc) are free to explore + analyze data, without concern of breaking anything.
Each data pipeline has a designated data warehouse, which acts as a database and stores all of your raw + modeled data.
Our team leans on Google BigQuery as our warehouse of choice.
As a general rule, if you’re currently using Google Sheets, then it’d be silly to use any data warehouse except BigQuery.
Once you’ve picked a data warehouse, 3 elements come together to form your pipeline:
Most teams we work with use marketing + sales data from a similar set of platforms.
The first order of business in building your pipeline is deploying a means of getting this raw data into your warehouse on a regular schedule.
Each of these data sources has an API (Application Programming Interface) that allows you to pull data out of it in an automated fashion.
Pulling this raw data into your data warehouse using APIs is *nothing fancy*.
There are so many tools performing this same function that at this point we consider the "ETL" segment of a data pipeline to be a commodity.
Fivetran, Funnel, Segment, Stitch and Supermetrics all perform this basic function – each has a slightly different cost model, but under the hood they all do the same thing.
We’ve built enough pipelines that for many common data integrations we’ve built our own internal tools to replace the need for external SaaS tools.
So for writing raw data to your warehouse, we generally use a combination of Stitch with our own internal tools. We’re a Stitch integration partner.
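The extract step really is as plain as it sounds. Here's a minimal Python sketch of what any of these connectors does under the hood: page through a source API until it's exhausted, collecting rows that are ready to load into a warehouse table. The `fetch_page` function and its data are stand-ins for a real API client, not any specific platform's interface.

```python
# Minimal sketch of the "extract" step in an ETL connector.
# fetch_page is a stand-in for a real API call; the rows and the
# "page2" cursor value below are purely illustrative.

def fetch_page(cursor=None):
    """Return (rows, next_cursor) for one page of a paginated API."""
    pages = {
        None: ([{"id": 1, "spend": 120.0}, {"id": 2, "spend": 80.0}], "page2"),
        "page2": ([{"id": 3, "spend": 45.5}], None),
    }
    return pages[cursor]

def extract_all():
    """Follow pagination cursors until the source has no more pages."""
    rows, cursor = [], None
    while True:
        page, cursor = fetch_page(cursor)
        rows.extend(page)
        if cursor is None:
            return rows

# These rows would then be written to a warehouse table on a schedule.
rows = extract_all()
```

A real connector adds authentication, retries and incremental loading on top, but the core loop is no more exotic than this.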
Models are the business logic + data processing that need to happen to get your raw data ready for reporting.
If your team is currently working in distributed sets of spreadsheets, this data modeling step is usually the biggest drag – people process data in different ways, with different formulas, which leads to different results.
Moving to a data pipeline allows you to define your logic in a single set of SQL queries, rather than in scattered spreadsheet formulas.
In our opinion, this is the single biggest win of moving from Sheets to a data pipeline. You create one place to modify business logic, which then gets propagated to all of your reports.
For example, there are always settings you'll use to process your raw data. With a data pipeline, you define that logic in one place and have it applied throughout the entire pipeline.
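To make "logic in one place" concrete, here's a small Python sketch. The channel-grouping rules are invented for illustration; the point is that every report calls the same function against the same mapping, so results always agree.

```python
# One place to define business logic: a single settings mapping,
# applied everywhere – instead of the same rules re-implemented
# (slightly differently) across many spreadsheets.
# The channel rules below are illustrative, not from any real client.

CHANNEL_RULES = {
    "google / cpc": "Paid Search",
    "facebook / paid": "Paid Social",
    "(direct) / (none)": "Direct",
}

def classify_channel(source_medium):
    """Map a raw source/medium string to a reporting channel."""
    return CHANNEL_RULES.get(source_medium, "Other")

# Every report uses the same function, so there's no drift.
sessions = ["google / cpc", "facebook / paid", "newsletter / email"]
channels = [classify_channel(s) for s in sessions]
```

Change `CHANNEL_RULES` once, and every downstream report picks up the new definition on the next run – the same property you get from centralizing logic in SQL models.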
We do this using a SQL modeling framework called dbt.
With dbt, SQL models can be run on any schedule (hourly, daily, or on-demand) so that your data is always fresh + ready.
We like dbt because it brings an engineering discipline to our work – SQL models are stored in a GitHub repository, with full access to version history.
When SQL models run, whatever’s in the Git repo gets updated in your data warehouse, so you always know which logic is being applied.
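The core idea behind a modeling framework like dbt can be sketched in a few lines: models are named queries that can depend on other models, and the framework figures out the order to run them in. The model names and SQL fragments below are illustrative, and this toy resolver is our own sketch, not dbt's implementation.

```python
# Toy sketch of dependency-ordered model runs, the idea at the heart
# of dbt. Model names and SQL strings here are purely illustrative.

models = {
    "stg_ads": {"sql": "select * from raw.ads", "deps": []},
    "stg_orders": {"sql": "select * from raw.orders", "deps": []},
    "revenue_by_channel": {
        "sql": "select channel, sum(revenue) from stg_orders group by 1",
        "deps": ["stg_ads", "stg_orders"],
    },
}

def run_order(models):
    """Return model names so each runs after everything it depends on."""
    ordered, seen = [], set()

    def visit(name):
        if name in seen:
            return
        for dep in models[name]["deps"]:
            visit(dep)
        seen.add(name)
        ordered.append(name)

    for name in models:
        visit(name)
    return ordered

plan = run_order(models)
```

Because the models live in one version-controlled project, the warehouse always reflects exactly the logic in the repo – which is the guarantee described above.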
Reporting is super important – it's the business end of the pipeline, the part most people on your team will use to access data.
In our opinion, the job of a data pipeline is to get data ready for reporting.
A well-built data pipeline will allow people to report on data from the tool they’re comfortable with: whether that’s Sheets, Google Data Studio, Looker, Tableau or something else.
In many of the data pipelines we've built, teams end up using the exact same reporting tools (usually Data Studio) that they used before the pipeline existed.
The difference is, with a data pipeline the data is delivered to them in a standardized, automated and reliable way.
We believe data pipelines free individual people to be creative + exploratory in how they use their data, because there’s no way for them to break the underlying dataset.
When building out data pipelines for digital agency, ecommerce and SaaS teams, we stick to four principles:
To control your destiny, never store your data in any closed platform that you do not own + control outright.
No SQL query should ever be stored on a contributor’s local machine. If business logic runs as part of the pipeline, it must be visible to everyone on the team.
Pick the best tool for each job, rather than relying on “one-size-fits-all” solutions.
Pick the best way to access raw data, the best way to model it, and the best way to report on it. In almost all cases, these are vastly different tools.
Should a new best-in-class tool emerge, you have the freedom to migrate to it, without disrupting your entire pipeline.
SOP or it didn’t happen.
We’re happy to help map out your options. Book some time to connect with David, the Commissioner of CIFL.