WTF is a Data Pipeline?


David Krevitt

Lover of laziness, connoisseur of lean-back capitalism. Potentially the #1 user of Google Sheets in the world.

A Data Pipeline unifies your team’s data in one place.

It is not a specific piece of technology. It is a way of liberating your team’s data.

A pipeline combines three things into a single database (whether that’s in Sheets, BigQuery, or elsewhere):

  1. The raw data that your team uses
  2. The standardized business logic your team applies to that data
  3. The formatting of data that’s required for reporting

A pipeline outputs visualization-ready data, which can be used in your reporting tool of choice (Data Studio, Sheets, Tableau, Looker, etc).

 


 

Why use a data pipeline

Unifying your team’s data workflows within one pipeline gives you three unique advantages:

Speed ⚡

Everyone on the team uses the same data at all times. No more apples-to-oranges comparisons, or discussions about where the data came from.

Cost 💰

A pipeline drastically reduces time spent on data wrangling, manipulation and analysis. Most of our clients are able to avoid hiring a full-time data analyst.

Creativity 🧠

Users of data (marketers, salespeople, account managers, etc) are free to explore + analyze data, without concern of breaking anything.

 

How data pipelines flow

Each data pipeline has a designated data warehouse, which acts as a database and stores all of your raw + modeled data.

Our team leans on Google BigQuery as our warehouse of choice, for three reasons:

  1. It integrates seamlessly with the rest of the Google stack: Sheets, Data Studio, and Google Cloud Platform.
  2. It’s lightning fast, and can scale to any data size.
  3. It’s affordable to start: BigQuery costs start at $0, and scale up with data storage + query size.

As a general rule, if you’re currently using Google Sheets, then it’d be silly to use any data warehouse except BigQuery.
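
To give a feel for what working with BigQuery looks like, here’s a minimal sketch of running a query from Python with the google-cloud-bigquery client. The project, dataset and table names are placeholders, and it assumes you’ve already set up Google Cloud credentials:

```python
# A minimal sketch of querying BigQuery from Python. The project, dataset
# and table names below are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")

query = """
    SELECT campaign, SUM(cost) AS total_cost
    FROM `your-gcp-project.raw_data.ad_spend`  -- hypothetical raw table
    GROUP BY campaign
    ORDER BY total_cost DESC
"""

# BigQuery's free tier covers the first 1 TB of query processing per month,
# so small exploratory queries like this typically cost nothing.
for row in client.query(query).result():
    print(row.campaign, row.total_cost)
```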

Once you’ve picked a data warehouse, 3 elements come together to form your pipeline:

 

1. Raw data feeds

Most teams we work with pull marketing + sales data from a similar set of platforms.

The first order of business in building your pipeline is deploying a means of getting this raw data into your warehouse on a regular schedule.

Each of these data sources has an API (Application Programming Interface) that allows you to pull data out of it in an automated fashion.

Pulling this raw data into your data warehouse using APIs is *nothing fancy*.
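
To make “nothing fancy” concrete, here’s a rough sketch of what one of these feed jobs can look like under the hood: hit a hypothetical ad-platform API, then append the rows to a raw BigQuery table. In practice you’d run a script like this on a schedule and keep the token in an environment variable:

```python
# A rough sketch of a raw data feed: pull rows from a platform's API and
# append them to a BigQuery table. The endpoint, token and table name are
# hypothetical stand-ins for whatever platform you're pulling from.
import requests
from google.cloud import bigquery

API_URL = "https://api.example-ads-platform.com/v1/campaign_stats"  # hypothetical
API_TOKEN = "..."  # load from an environment variable in real life

def pull_raw_rows():
    """Fetch the latest campaign stats from the (hypothetical) API."""
    resp = requests.get(API_URL, headers={"Authorization": f"Bearer {API_TOKEN}"})
    resp.raise_for_status()
    return resp.json()["rows"]  # a list of dicts, one per campaign per day

def load_to_bigquery(rows):
    """Append the raw rows to BigQuery, letting it infer the schema."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )
    job = client.load_table_from_json(
        rows, "raw_data.campaign_stats", job_config=job_config
    )
    job.result()  # wait for the load job to finish

if __name__ == "__main__":
    load_to_bigquery(pull_raw_rows())
```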


There are so many tools performing this same function that, at this point, we consider this “ETL” segment of a data pipeline to be a commodity.

Fivetran, Funnel, Segment, Stitch and Supermetrics all perform this basic function – each has a slightly different cost model, but under the hood they all do the same thing.

We’ve built enough pipelines that, for many common data integrations, we’ve developed our own internal tools to replace the need for external SaaS products.

So for writing raw data to your warehouse, we generally use a combination of Stitch and our own internal tools. (We’re a Stitch integration partner.)

 

2. Data models

Models are the business logic + data processing that need to happen to get your raw data ready for reporting.

If your team is currently working in distributed sets of spreadsheets, this data modeling step is usually the biggest drag – people process data in different ways, with different formulas, which leads to different results.

Moving to a data pipeline allows you to define your logic in a single set of SQL queries, rather than in scattered spreadsheet formulas.

In our opinion, this is the single biggest win of moving from Sheets to a data pipeline. You create one place to modify business logic, which then gets propagated to all of your reports.

For example, there are always settings that you’ll use to process your raw data.

With a data pipeline, you can set this logic in one place and have it applied throughout the entire pipeline.

We do this using a SQL modeling framework called dbt.

With dbt, SQL models can be run on any schedule (hourly, daily, or on-demand) so that your data is always fresh + ready.

We like dbt because it brings an engineering discipline to our work – SQL models are stored in a GitHub repository, with full access to version history.

When SQL models run, whatever’s in the Git repo gets updated in your data warehouse, so you always know which logic is being applied.
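
As a rough sketch of the scheduling side, here’s a small script that shells out to the dbt CLI, which a scheduler (cron, Cloud Scheduler, Airflow, and so on) can run hourly or daily. It assumes dbt is installed and your profiles.yml already points at your warehouse:

```python
# A minimal sketch of running dbt models on a schedule by shelling out to
# the dbt CLI. Assumes dbt is installed and configured for your warehouse.
import subprocess
import sys

def run_dbt():
    """Rebuild every model in the project, then run the tests that guard them."""
    for command in (["dbt", "run"], ["dbt", "test"]):
        result = subprocess.run(command, capture_output=True, text=True)
        print(result.stdout)
        if result.returncode != 0:
            # Surface the failure so the scheduler can alert on it.
            print(result.stderr, file=sys.stderr)
            sys.exit(result.returncode)

if __name__ == "__main__":
    run_dbt()
```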

 

3. Data visualizations

Reporting is super important – it’s the business end of the pipeline, and the part most people on your team will use to access data.

In our opinion, the job of a data pipeline is to get data ready for reporting.

A well-built data pipeline will allow people to report on data from the tool they’re comfortable with: whether that’s Sheets, Google Data Studio, Looker, Tableau or something else.
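
As one example of that flexibility, a reporting-ready table produced by your models can be pulled straight into a DataFrame for ad-hoc analysis, or pushed on to Sheets or a BI tool from there. The table name below is a placeholder for whatever your models output, and it assumes pandas is installed alongside the BigQuery client:

```python
# A small sketch of the "business end" of the pipeline: pull a modeled,
# reporting-ready table into a DataFrame. The table name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()
df = client.query(
    "SELECT * FROM `your-gcp-project.reporting.campaign_performance`"
).to_dataframe()

print(df.head())
```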


In many of the data pipelines we’ve built, teams end up using the exact same reporting tool (usually Data Studio) that they used before the pipeline existed.

The difference is, with a data pipeline the data is delivered to them in a standardized, automated and reliable way.

We believe data pipelines free individual people to be creative + exploratory in how they use their data, because there’s no way for them to break the underlying dataset.

 

CIFL principles for data pipelines

When building out data pipelines for digital agency, ecommerce and SaaS teams, we stick to four principles:

Own your data 🔒

To control your destiny, never store your data in any closed platform that you do not own + control outright.

Store SQL queries in a repo ☁️

No SQL query should ever be stored on a contributor’s local machine. If business logic runs as part of the pipeline, it must be visible to everyone on the team.

Separate your pipeline layers 🛠️

Pick the best tool for each job, rather than relying on “one-size-fits-all” solutions.

Pick the best way to access raw data, the best way to model it, and the best way to report on it. In almost all cases, these are vastly different tools.

Should a new best-in-class tool emerge, you have the freedom to migrate to it, without disrupting your entire pipeline.

Document everything  📝

SOP or it didn’t happen.

 

Building your data pipeline

We’re happy to help map out your options. Book some time to connect with David, the Commissioner of CIFL.

Ready to build your pipeline?

My name is David, and I help companies automate their data analysis in BigQuery. I’m standing by to chat about how we can help you get more done.
