Modeling Customer Retention in BigQuery SQL

customer retention modeling

David Krevitt

Lover of laziness, connoisseur of lean-back capitalism. Potentially the #1 user of Google Sheets in the world.

Retention and its sister metric, churn, seem tantalizing simple to calculate:

Retention (aka rebuy rate) = Returning customers this period / Customers last period

 

Churn = 100% – Retention

Easy right?

But when it comes time to aggregate up your retention + churn rates each month end, you quickly realize that you have a problem:

Customers who didn’t repeat purchase this month, by definition have no data.

Welcome to the core problem of calculating retention + churn…you must first generate data to fill the gaps of what each customer of your business didn’t do.

Allow me to show you 3-step method for filling these gaps in BigQuery SQL – at CIFL, we’ve used this methodology to calculate for both retention at SaaS businesses and rebuy rates at Ecommerce shops.

This method can also be flexibly adapted to calculate weekly, monthly, quarterly or annual retention rates.

The code snippets below are written in BigQuery standard SQL, but will work with slight modification in any flavor of SQL.

 

If you prefer video, here’s a 10-minute walkthrough of the topic – code snippets mentioned in the video are listed down the page.

Step 0: Generate a Date Range

To fill in the gaps in each customer’s history, first we need a time series to work from.

In BigQuery, we use the GENERATE_DATE_ARRAY function – this statement will generate a series of dates from 1/1/2020 until the current date.

GENERATE_DATE_ARRAY(DATE('2020-01-01'), CURRENT_DATE(), INTERVAL 1 DAY)

Taking this a step further, you can use this array to generate an array of week start dates (Sundays or Mondays) or month start / end dates.

These Gists contain the exact code that we use to generate these weekly / monthly series in our own BigQuery data pipelines:

Generate Week Start Date Array

Generate Monthend Date Array

 

Step 1: Generate Each Customer’s Individual Time Series

Now that we’ve bootstrapped the date series that we want to fill in for each customer, the next step is to join them together with a straight JOIN.

This will generate a series of dates for each customer – you can calculate which transactions belong to each period with a simple CASE statement:

SELECT
user_id,
week_start,
week_end,
week_number,
sum(CASE WHEN date >= week_start AND date <= week_end THEN transactions_created ELSE 0 END) as transactions_created
FROM date_spine
JOIN transactions
ON date_spine.site = transactions.site
GROUP BY user_id, week_start, week_end, week_number

 

Since we generally build pipelines for organizations that work on multiple sites, we always have a ‘site’ column to join by – but you could substitute that for any dummy column.

The CASE statement is really the key there – it only counts the customer’s transactions if they belong to the relevant week (or month, quarter, year, etc) period.

So at the end of this first layer of our query, we have a time series of each customer’s behavior for each time period.

 

Step 2: Calculate Lifetime + Previous Period Behavior

To calculate retention, we need to know three things:

We already know what they did in the current period (from the previous step of the query).

Calculating Lifetime Buyer Behavior

To answer if this is their first active periodwe need to know their lifetime count of transactions.  If this period’s total = their lifetime total, then generally they’ll be a new customer in this period – in which case we’ll want to separate them out of the retention calculation.

We can calculate this by using a window (aka analytic) function, which allows us to quickly map out each buyer’s lifetime behavior:

sum(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_lifetime

 

Notice the PARTITION BY parameter there is strictly user_id – this is analogous to GROUP BY in a regular query.

Calculating Previous Period Buyer Behavior

We can use a similar window function to pluck out that customer’s previous period behavior:

lag(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_prev

 

The LAG and LEAD functions will allow you to iterate over the specified partition.  This window generates a series of each user’s behavior, ordered in ascending order.

So if we’re lagging from the current row, we’ll pull in their previous week’s transactions.

Put it all together, and the query looks like so:

SELECT
user_id,
week_start,
week_end,
week_number,
transactions_created,
sum(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_lifetime,
lag(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_prev
FROM ( {{ step 1 query }} )

 

In the 3rd (and final) step of this query, we’ll map out how this period / last period / lifetime behavior transactions into retention rates.

 

Step 3: Translate Behavior into Retention or Churn

Now that we’ve got the table set, it’s time to actually dig into calculating retention rates.

We do this using four CASE statements, to flag a buyer as either one of four buckets:

Those status flags map out into CASE statements in a query like so:

SELECT
user_id,
week_start,
week_end,
week_number,
CASE WHEN transactions_created = transactions_created_lifetime and transactions_created > 0 THEN 1 ELSE 0 END as new_flag,
CASE WHEN transactions_created_prev > 0 AND transactions_created = 0 THEN 1 ELSE 0 END as churns_flag,
CASE WHEN transactions_created > 0 AND transactions_created_prev = 0 and transactions_created != transactions_created_lifetime THEN 1 ELSE 0 END as reactivations_flag,
CASE WHEN transactions_created > 0 AND transactions_created_prev > 0 THEN 1 ELSE 0 END as retentions_flag
FROM {{ step 2 query }}

 

You could combine these 4 CASE statements into one ‘retention_status’ CASE statement, but we prefer separating them out as singular flag fields.

This allows us to use simple calculated fields in reporting to calculate retention rates:

Retention = sum(retentions_flag) / (sum(retentions_flag) + sum(churns_flag))

You could also include reactivations in the denominator there – if more people reactivated vs churned, you could have retention > 100% for the period.

And to the contrary, for our sister metric churn:

Churn = sum(churns_flag) / (sum(retentions_flag) + sum(churns_flag))

Separating out each user into one of these four buckets (new, retained, churned, reactivated) also makes generating cohort analysis charts very simple.

This chart, for example, breaks all monthly orders into new vs existing buyers (retained + reactivated):

shopify cohort analysis

You can also, of course, visualize a classic retention rate curve over time:

ecommerce retention rate curve

 

Putting it All Together

If you’re ready to dive into the full retention rate query, check out this Gist:

To sum it up, we calculate retention by writing a query in 3 layers:

  1. Generate a date series for each user
  2. Use window functions to calculate lifetime and previous period behavior
  3. Use CASE statements to translate this period / previous period / lifetime behavior into a retention status for that period

To modify this Gist for any time period, just swap out the weekly variables (week_start, week_end, etc) with the columns present in your monthly, quarterly or annual date date spine from Step 0 above.

Curious about diving deeper into these types of metrics for your business?

Check out our Ecommerce or SaaS data pipeline services, in which our team will stand up a full BigQuery data pipeline for your business.

Ready to build your pipeline?

My name is David, and I help companies automate their data analysis in BigQuery. I’m standing by to chat about how we can help you get more done.

FIND OUT MORE