Learn Just Enough BigQuery SQL

The day will come when you reach the end of the spreadsheet.

When your Sheets become too overloaded with data and formulas to carry on. When your Sheets hit the 5 million cell hard cap.

But there is life after Sheets.

You don’t need to pack up your Sheets, quit your job and burn your credit cards - you can just upgrade your toolbelt to work with larger datasets.

Enter the wonderful world of cloud data warehouses, data lakes, lakehouses and penthouses - offering unlimited data analysis power with lightning speed.

If you keep reading, I promise you will learn to write your first SQL query (in Google's BigQuery cloud data warehouse) today, using the Google Analytics sample dataset.

Below are 13 video tutorials to get you up and running, which are excerpted from our free course, Getting Started with BigQuery. If you're interested in following along with the quizzes below, login or signup for the course.

Let’s dive in!

You may know more than you think

If you already know the Google Sheets query function, you’re more than halfway to writing SQL in BigQuery.

The query function syntax is like so:

=query(range, "SELECT * WHERE x = y")

In BigQuery SQL (and most other forms of SQL), the only key difference is that you reference a table (with a FROM parameter), instead of a spreadsheet range:

SELECT * FROM table WHERE x = y

Other than that, you’ll find the logic ( AND / OR ) and math syntax to be very similar.
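To make that concrete, here's the same filter written both ways - the range, table and column names below are just placeholders:

-- In Sheets:
-- =query(A1:C100, "SELECT A, B WHERE B > 100")

-- In BigQuery:
SELECT channel, visits FROM table WHERE visits > 100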

Access the Google Analytics sample dataset

Before starting to use BigQuery, you must create a project.

You should see the $300 free trial offer pop up if you’re creating your first Google Cloud project, so there’s no risk of you being billed as part of this tutorial.

Even if that offer doesn’t show up, the data queried via the Google Analytics sample dataset is so small it falls within BigQuery’s free tier.

Once that’s up and running, you can access the Google Analytics sample dataset here.

Editor's note: Google has updated the BigQuery UI since these videos were recorded - but fear not! The SQL syntax covered here is still the same.

Writing your first SELECT query

Let’s break down a basic SELECT query, pulling visits, transactions and revenue by channel from our Google Analytics dataset:

SELECT  
date,
channelGrouping as channel,
totals.visits,
totals.transactions,
totals.transactionRevenue
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 
LIMIT 1000

Each SQL query must contain at least 2 parameters:

  • SELECT: defines the columns you’d like to pull
  • FROM: defines the table to pull them from

Throughout this walkthrough, we’ll be focusing on the holy trinity of marketing metrics: visits, transactions and revenue (from which you can calculate conversion rate and AOV):

SELECT  
date,
channelGrouping as channel,
totals.visits,
totals.transactions,
totals.transactionRevenue

You can rename any column using ‘as’ (see channel above), if you’d rather use a column name different from the one present in the database.

For the FROM parameter, in BigQuery there are 3 layers included in each table name:

  • Project ID
  • Dataset
  • Table

They come together as project-id.dataset.table - in our example:

FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 

The LIMIT parameter above defines the number of rows to return - including a limit is just a good SQL practice, even though for BigQuery it’s not strictly necessary (LIMIT caps the rows returned, but doesn’t reduce the data scanned or the query cost).

Keep in mind that order is CRITICAL with these parameters - there’s an order of operations, just like in arithmetic.

SELECT always comes first, then FROM, and so on as we go through these examples (the order in the examples is always the order you’ll want to use).
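As a cheat sheet, here's the full skeleton we'll build up over the rest of this walkthrough - treat the table and column names as placeholders, and note the clause order:

SELECT channel, sum(visits) as visits
FROM `project-id.dataset.table`
WHERE channel in ('Direct', 'Organic Search')
GROUP BY channel
ORDER BY visits desc
LIMIT 1000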

Filtering data with WHERE

Adding a WHERE parameter to our query allows us to filter our results based on specific logic.

For example, what if we wanted to pull GA sessions for only the “Organic Search” channel?

Adding to our basic SELECT statement above, we’d layer on a WHERE parameter:

SELECT  
date,
channelGrouping as channel,
totals.visits,
totals.transactions,
totals.transactionRevenue
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 
WHERE channelGrouping = 'Organic Search'

Following the WHERE parameter, you can introduce any logic just like you would in an IF formula: =, !=, <, <=, >, >=.
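For example, to keep only sessions that actually converted, you could filter on the transactions column:

WHERE totals.transactions >= 1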

You can even pull multiple values using “in”:

WHERE channelGrouping in ('Direct', 'Organic Search')

To add a second logic statement after your initial WHERE, you simply add an AND or OR (WHERE is only for the first bit of logic):

WHERE channelGrouping in ('Direct', 'Organic Search') AND date = '20170801'

Ordering results with ORDER BY

Often you’ll want to display results in a specific order.

Building on our query above, what if we wanted to display our most lucrative (highest revenue) sessions first?

You’d add an ORDER BY parameter to the end of your query, like so:

SELECT  
date,
channelGrouping as channel,
totals.visits,
totals.transactions,
totals.transactionRevenue
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 
ORDER BY totals.transactionRevenue desc
LIMIT 1000

The basic structure of an ORDER BY parameter is:

ORDER BY columnname direction (either asc for ascending or desc for descending)

If you don’t truly need to order results in a certain way, then you can leave out the ORDER BY - it can be an unnecessary drain on performance when running large queries.

Calculating aggregate totals with GROUP BY

Most of the time, you won’t just need to query out your raw data - you’ll want to perform some aggregate math across a slice of your dataset (by channel, device type, etc).

For example, what if we want to sum visits, transactions and revenue by channel?

There are two changes required to your query to make this happen:

  • Wrap the columns you want to run math on in an aggregate function - SUM(), COUNT(), COUNT(DISTINCT()), MAX(), or MIN()
  • Add a GROUP BY parameter after your WHERE logic - all of the columns not being aggregated must be present in the GROUP BY

Let’s take a look:

SELECT  
channelGrouping as channel,
sum(totals.visits) as visits,
sum(totals.transactions) as transactions,
sum(totals.transactionRevenue) as revenue
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 
WHERE channelGrouping in ('Organic Search', 'Direct')
GROUP BY channel
ORDER BY transactions desc
LIMIT 1000

Notice how since we’re only grouping by channel, all of the other metrics (visits, transactions, revenue) are wrapped in a SUM function.

Writing arithmetic within queries

You’ll often want to calculate metrics based on your metrics: for example conversion rate (transactions / visits), or average order value (revenue / transactions).

You can do that math inline right in your query, by using +, -, * or /.

See the conv_rate and aov columns below:

SELECT  
date,
channelGrouping as channel,
sum(totals.visits) as visits,
sum(totals.transactions) / sum(totals.visits) as conv_rate,
sum(totals.transactions) as transactions,
sum(totals.transactionRevenue) / sum(totals.transactions) as aov,
sum(totals.transactionRevenue) as revenue
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 
GROUP BY date, channel
ORDER BY transactions desc
LIMIT 1000

Division can be tricky though, since if you divide by zero your query will throw an error.

To do division safely in queries, you can wrap it in what’s called a CASE statement, so that the math only runs if the denominator is greater than 0:

CASE WHEN sum(totals.visits) > 0 THEN sum(totals.transactions) / sum(totals.visits) ELSE 0 END as conv_rate

CASE statements are very useful - basically the same as an IF statement in Sheets. You can add multiple WHEN / THEN conditions to mimic a nested IF statement.
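For instance, here's a sketch of a multi-WHEN CASE that buckets sessions by transaction count - the bucket names are made up for illustration:

CASE
  WHEN totals.transactions >= 2 THEN 'repeat buyer'
  WHEN totals.transactions = 1 THEN 'single purchase'
  ELSE 'no purchase'
END as buyer_type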

For now, to perform division you can just use that basic CASE syntax above, to check that the denominator is greater than 0 before running the math.

So the final query to calculate conversion rate and AOV would look like:

SELECT  
date,
channelGrouping as channel,
sum(totals.visits) as visits,
CASE WHEN sum(totals.visits) > 0
  THEN sum(totals.transactions) / sum(totals.visits) 
  ELSE 0 END as conv_rate,
sum(totals.transactions) as transactions,
CASE WHEN sum(totals.transactions) > 0 
  THEN sum(totals.transactionRevenue) / sum(totals.transactions) 
  ELSE 0 END as aov,
sum(totals.transactionRevenue) as revenue
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 
GROUP BY date, channel
ORDER BY transactions desc
LIMIT 1000

If you don't mind your division statement returning null (rather than 0) when your divisor is 0, you can also use SAFE_DIVIDE(numerator, denominator) to save yourself the CASE statement.  
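For example, the conv_rate column above could be written more compactly as:

SAFE_DIVIDE(sum(totals.transactions), sum(totals.visits)) as conv_rate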

Aggregating by day, week and month

If you’re working with marketing data, looking at changes over time will be critical for you.

Thankfully, SQL has built-in date functions to make that easy. Let’s try grouping sessions by day of the month, week of the year, and month + year.

The key functions are:

  • EXTRACT(DATE PART from column) - DATE PART can be DAY, WEEK, MONTH, YEAR, and more (full docs here)
  • FORMAT_DATE("date syntax", column) - date syntax can be %Y-%m for year and month (full docs here)

Let’s see those in action:

SELECT
date,
EXTRACT(DAY from date) as day_of_month,
EXTRACT(WEEK from date) as week_of_year,
FORMAT_DATE("%Y-%m", date) AS yyyymm

Note that due to a nuance in the sample GA dataset (the date being formatted as a string instead of a date), you’ll actually have to first use the PARSE_DATE function (docs here) to get the date column into a true date format before running the EXTRACT and FORMAT_DATE functions:

SELECT
date,
EXTRACT(DAY from date) as day_of_month,
EXTRACT(WEEK from date) as week_of_year,
FORMAT_DATE("%Y-%m", date) AS yyyymm,
visits
FROM (

  SELECT  
  PARSE_DATE('%Y%m%d', date) as date,
  channelGrouping,
  totals.visits as visits
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 
  WHERE channelGrouping in ('Organic Search', 'Direct')
  ORDER BY totals.visits desc
  LIMIT 1000

)

Let’s talk a bit about this nested query structure - you’ll find it comes in handy often when you have to run multiple layers of math or functions.

Nesting queries

In our date example, we first had to run the PARSE_DATE function on our date column, to make it a proper date field rather than a string:

SELECT 
PARSE_DATE('%Y%m%d', date) as date_value
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 

Once we had that done, then we could run our day, week_of_year, and yyyymm functions on that pre-processed date_value column - by merely adding a new SELECT statement around the query we’d already written.

In effect, we’re querying the output of a previous query, rather than querying a BigQuery table directly:

SELECT
date_value,
EXTRACT(DAY from date_value) as day,
EXTRACT(WEEK from date_value) as week_of_year,
FORMAT_DATE("%Y-%m", date_value) AS yyyymm
FROM
(
  SELECT 
  PARSE_DATE('%Y%m%d', date) as date_value
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 
)

This way, instead of having to repeat the PARSE_DATE function 3 times (for each of the day, week_of_year and yyyymm columns), you can write it once, and then reference it in a later query.

Nesting is critical for keeping your queries simple, but beware - using more than 2 or 3 levels of nesting will make you want to pull your hair out later on.

If you find yourself needing to write a really complex, multi-level nested query, then I’d recommend learning to use a framework like dbt (getdbt.com) to be able to reference SQL queries within other queries.
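As a teaser, in dbt each query lives in its own file, and you reference the output of one model from another with ref() instead of nesting - a minimal sketch, assuming a hypothetical upstream model named parsed_sessions that exposes the date_value and visits columns:

-- models/sessions_by_month.sql (hypothetical dbt model)
SELECT
FORMAT_DATE("%Y-%m", date_value) AS yyyymm,
sum(visits) as visits
FROM {{ ref('parsed_sessions') }}
GROUP BY yyyymm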

Unnesting RECORD arrays

Remember those weird field types that contain sub-columns? Check out totals for example:

totals.visits,
totals.transactions,
totals.transactionRevenue

The column ‘totals’ is what’s called a RECORD in BigQuery - long story short, it’s a set of nested sub-columns within a single row of data. RECORDs like hits can also be repeated, making them effectively an array of data within a single row.

Since the sample GA data is at the session level (each row = 1 session), and each session can have a number of hits, the ‘hits’ columns are also structured like this.

To access these nested RECORD columns, there’s a specific parameter to pass in your query:

CROSS JOIN UNNEST(hits)

This will flatten the array, and make it queryable using basic SQL (see BQ docs here).

SELECT  
date,
channelGrouping,
isEntrance,
page.pagePath,
totals.transactions,
totals.transactionRevenue
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 
CROSS JOIN UNNEST(hits)
WHERE date = '20170801'

Once you unnest the hits RECORD, you’re able to query the sub-columns by removing the ‘hits.’ before the column name (hits.page.pagePath becomes queryable as page.pagePath, hits.item.productName -> item.productName, etc).

For example, let’s say we wanted to filter out only entrance hits, when a user first lands on your site. There’s a sub-column of the hits RECORD called hits.isEntrance. If it equals true, then that row is, er, an entrance.

Let’s query out only entrance hits:

SELECT  
date,
channelGrouping,
isEntrance,
page.pagePath,
totals.transactions,
totals.transactionRevenue
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 
CROSS JOIN UNNEST(hits)
WHERE date = '20170801'
AND isEntrance = true

Being able to wield CROSS JOIN UNNEST will open up the true power of BigQuery for you, as lots of other APIs (Shopify, FB Ads, etc) make use of BigQuery’s nested array column functionality.
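For example, the GA export nests a product array inside each hit - here's a sketch of unnesting two levels deep to get one row per product (field names per the GA export schema):

SELECT
date,
prod.v2ProductName,
prod.productQuantity
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
CROSS JOIN UNNEST(hits) as h
CROSS JOIN UNNEST(h.product) as prod
WHERE date = '20170801'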

Quiz break 1!

Let’s check in with your knowledge so far, and answer a few questions using the Google Analytics sample dataset for 8/1/2017.

To take the quiz, login or signup for the free course, Getting Started with BigQuery.

Joining tables

Our handy Google Analytics sample dataset lives within one BigQuery table, but the data you’ll be working with generally won’t be so clean.

It’ll live in multiple tables across different datasets, and you’ll have to do some gymnastics to join it together.

There are a number of ways to join tables together (INNER JOINS, FULL OUTER JOINS, AUSTRALIAN JOINS, BRAZILIAN JOINS), but in BigQuery we mainly use straight LEFT JOINS (you can read up on the rest of those join types at w3schools).

A LEFT JOIN is when you take all of one table (your first table), and join rows from a second table to it only where they match a certain logic. It’s basically a VLOOKUP formula in Google Sheets.

Let’s look at an example - what if we wanted to calculate the population by US state using BigQuery public datasets?

We’d have to join together the 2010 Census dataset by ZIP code with the US ZIP codes dataset, which will allow us to lookup the state that each ZIP code belongs to.

The joining part of our SQL query comes when we select our tables:

FROM `bigquery-public-data.utility_us.zipcode_area` a
LEFT JOIN `bigquery-public-data.census_bureau_usa.population_by_zip_2010` b
ON (
  a.zipcode = b.zipcode
)

To set up your join, you first give each table you’re joining an alias (a and b in our case), to make referencing their columns easier.

Then, in the ‘ON’ parameter, you specify the logic for your join - the columns that need to equal each other to join them together.

You still SELECT and GROUP BY columns in the same way - except you now have access to columns from both tables a (states by zipcode) and b (population by zipcode) - you can select specific columns by adding the table alias (a. or b.) before the column name:

SELECT  
a.zipcode,
a.state_code,
sum(b.population) population
FROM `bigquery-public-data.utility_us.zipcode_area` a
LEFT JOIN `bigquery-public-data.census_bureau_usa.population_by_zip_2010` b
ON (
  a.zipcode = b.zipcode
)
WHERE b.minimum_age is null
AND b.maximum_age is null
AND b.gender is null
GROUP BY a.zipcode, a.state_code

That query’s a bit tough to read though - we’re doing a lot of other logic in the WHERE statement.

A helpful hint when joining tables is to use a WITH statement beforehand to declare CTEs (common table expressions), which allows you to do pre-processing before your actual joins.

For example:

WITH zipcodes as (
  SELECT
  zipcode,
  state_code 
  FROM `bigquery-public-data.utility_us.zipcode_area`
),

census as (
  SELECT
  zipcode,
  sum(population) as population
  FROM `bigquery-public-data.census_bureau_usa.population_by_zip_2010`
  WHERE minimum_age is null
  AND maximum_age is null
  AND gender is null
  GROUP BY zipcode
)

SELECT 
zipcodes.zipcode,
zipcodes.state_code,
census.population
FROM zipcodes
LEFT JOIN census 
ON (
  zipcodes.zipcode = census.zipcode
)

At the top of the query, you can define each table you’ll use, and do any filtering + grouping beforehand.

Then, when you join your tables together, you’re doing a straight join rather than also doing some math after the fact. That’s just the style we like to use when writing SQL - not critical if you prefer straight joining, but it helps a lot with readability.

 

Window (analytic) functions

It’s pretty common when working with marketing datasets to want to calculate a % of total column (ie the % of total revenue coming from a given channel for the period), or the difference from the average (to filter for anomalies).

BigQuery allows you to use window (or analytic) functions to perform this type of math - where you calculate some math on your query in aggregate, but write the results to each row in the dataset.

Using our sample Google Analytics dataset, let’s calculate each channel’s percentage of total pageviews.

First, we’ll query out total pageviews by channel:

SELECT
channelGrouping,
sum(totals.pageViews) as pageviews
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 
GROUP BY channelGrouping

Then, we can wrap a window function around this query to calculate the total pageviews across all channels, as well as the pct of total pageviews for a given channel.

The basic syntax of a window function is:

sum(pageviews) OVER (PARTITION BY date) as total_pageviews

The key elements here are the function (sum), which will aggregate the sum total for each partition in the window.

The PARTITION BY statement basically behaves like a GROUP BY - here we’re saying group by date, since we want to know the total pageviews for each date.

Put the whole query together, and it looks like so:

SELECT
date,
channelGrouping,
pageviews,
sum(pageviews) OVER w1 as total_pageviews,
pageviews / sum(pageviews) OVER w1 as pct_of_pageviews
FROM (
  SELECT
  date,
  channelGrouping,
  sum(totals.pageViews) as pageviews
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 
  GROUP BY channelGrouping, date
)
WINDOW w1 as (PARTITION BY date)
ORDER BY pct_of_pageviews desc

Notice how, since we’re using the same WINDOW (PARTITION BY xx) twice, we define it once at the end of our query (WINDOW w1 as) and reference it with OVER w1, instead of writing it out twice.

Once you have your feet wet in BigQuery, I highly recommend digging into these advanced analytic functions (and don’t be afraid to read the docs). They’ll open up an entire new world of analysis possibilities.

We most commonly end up using these analytic functions:

  • last_value()
  • first_value()
  • sum()
  • max()
  • min()
  • avg()
  • rank()
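For a quick taste, here's a sketch using rank() to order channels by pageviews within each date, building on the pageviews query from the previous section:

SELECT
date,
channelGrouping,
pageviews,
rank() OVER (PARTITION BY date ORDER BY pageviews desc) as channel_rank
FROM (
  SELECT
  date,
  channelGrouping,
  sum(totals.pageViews) as pageviews
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 
  GROUP BY date, channelGrouping
)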

Deduping query results

BigQuery is often used as an append-only database - you can do merging + overwriting of tables, but given storage is so cheap it's usually better to append new data and just query what you need.

This means that you could end up with duplicate values for a given unique row - for example, if you’re using Stitch to push Google Analytics (or any API’s) data to BigQuery, you’ll have to dedupe it before using it.

Fortunately, this is easy to do using window functions - the usage can seem a bit complex at first, but bear with me.

From the sample Google Analytics dataset, let’s say we want to pull out the last hit on a given day for each channelGrouping. Let’s use a window (aka analytic) function:

first_value(visitStartTime) over (PARTITION BY channelGrouping ORDER BY visitStartTime desc) lv

The key elements here are the function (first_value), and the PARTITION BY of channelGrouping (which behaves like a GROUP BY).

The ORDER BY is required if you want to pull a first_value, last_value, or rank - since we want the latest timestamp, we’re going to pull the first_value with visitStartTime sorted descending.

To ultimately answer our question of what was the last hit of the day for each channelGrouping, we also have to SELECT only values where the visitStartTime is equal to the last value:

SELECT * FROM (

  SELECT  
  date,
  channelGrouping,
  totals.hits,
  visitStartTime,
  first_value(visitStartTime) over (PARTITION BY channelGrouping ORDER BY visitStartTime desc) lv
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` 
)
WHERE visitStartTime = lv

Tools like Stitch, Fivetran or Supermetrics that write data from APIs to BigQuery will always have a system column that ticks up either a unique integer or timestamp for each row written to the database (in Stitch’s case it’s the _sdc_sequence column).

Similarly to how we used visitStartTime as the field to ORDER BY above, you can duplicate the same query structure using _sdc_sequence to dedupe data from Stitch.

For example, this is how we deduplicate FB Ads data:

SELECT * 
FROM ( 
    SELECT 
        date_start, 
        campaign_id, 
        campaign_name, 
        ad_id, 
        account_name, 
        spend, 
        reach, 
        inline_link_clicks, 
        _sdc_sequence, 
        first_value(_sdc_sequence) OVER (PARTITION BY date_start, ad_id, campaign_id ORDER BY _sdc_sequence DESC) lv 
    FROM {{ target.project }}.fb_ads.ads_insights 
    ) 
WHERE lv = _sdc_sequence

It may seem complex at first, but you’ll end up using this same pattern to dedupe BigQuery data so often that it’ll become second nature.

Quiz break 2!

Let’s test your knowledge on some of these more advanced topics (joining + window functions), again using the Google Analytics sample dataset for 8/1/2017, and also layering in US 2010 census data and US zip code + state mappings.

To take the quiz, login or signup for the free course, Getting Started with BigQuery.  

BigQuery Nuts and Bolts

When it comes time to put your BigQuery knowledge into practice, there are some practical concerns to go over:

  1. How much does it cost?
  2. How can you save your queries to be re-run in the future?
  3. How will you get data into BigQuery?

BigQuery billing

For the most part, the datasets we’re using for marketing data analysis qualify as small data in the relative BigQuery sense.

It’s a platform designed to quickly query very large volumes of data, so analyzing a few million rows of Google Analytics data is no biggie.

For that reason, running BigQuery queries is very inexpensive - you’re charged for the data each query scans, rather than for the data you’re storing in the database (storage is billed too, but it’s very cheap).

Your first 1TB of queries is free, and the rate is only $5.00 per TB after that (BQ docs here).
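To put that in perspective: a query that scans a full 1 GB table, run every day for a month, works out to roughly 30 GB scanned - about 3% of the monthly free tier.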

As an example, we have never incurred BigQuery costs of over $10 per month for any Agency Data Pipeline implementation we’ve done.

BigQuery does support table partitioning and clustering to cut down on query costs - in our experience though, these haven’t been truly necessary with marketing datasets.
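If you do want to experiment with them, here's a minimal sketch of the DDL - the project, dataset and table names are hypothetical:

CREATE TABLE `your-project.your_dataset.sessions_partitioned`
PARTITION BY session_date
CLUSTER BY channel
AS
SELECT
PARSE_DATE('%Y%m%d', date) as session_date,
channelGrouping as channel,
totals.visits as visits
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`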

The bottom line: BigQuery is very inexpensive relative to the speed + value it brings to your organization.  

Saving queries with dbt

One thing we highly recommend doing to keep your query volumes down is building any SQL queries that you’ll use frequently into data models, using a framework like dbt.

This will allow you to run them once a day, and create much smaller tables that you can then query directly, rather than having to bootstrap them (and incur the cost) every time you want to run them.
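A sketch of what one of those daily models might look like in dbt - the config line is standard dbt, and the rest is plain BigQuery SQL:

{{ config(materialized='table') }}

SELECT
channelGrouping as channel,
sum(totals.visits) as visits,
sum(totals.transactionRevenue) as revenue
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
GROUP BY channel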

For just a brief intro to dbt, check out this excerpt from our Build your Agency Data Pipeline course:

If there’s one next step I recommend, it’d be learning dbt - it’ll take your SQL capabilities to the next level. 

Pushing data to BigQuery from Sheets

When we used to do data pipeline consulting, we'd find ourselves pushing lots of data from Sheets up to BigQuery.

For APIs like Google Analytics or FB Ads, we used off-the-shelf ETL tools (Fivetran, Stitch, Supermetrics, etc) to push data to BigQuery.

But there was always data that we needed to manually push from Sheets to BigQuery:

  • Mappings between GA UTM tags (source / medium / campaign) and higher-level channel names
  • Lists of active data feeds (ie all FB Ads accounts) to be joined together
  • Lists of team member names + their client assignments, for team-level reporting

To help automate this process, we built a Sheets to BigQuery Connector script that does a few handy things for us:

It creates BigQuery tables, pushes data from Sheets to BQ, and allows us to easily write queries to pull data back down from BQ to Sheets (for QC or reporting).

Grab it for free from the CIFL BigQuery course here.

You made it!

Now that you’re fluent at SQL in BigQuery, what will you do - go to Disneyworld potentially?

There are a few next destinations on CIFL we’d recommend: the free Getting Started with BigQuery course (to take the quizzes above), and the Build your Agency Data Pipeline course to go deeper on dbt.
