Retention and its sister metric, churn, seem tantalizing simple to calculate:
Retention (aka rebuy rate) = Returning customers this period / Customers last period
Churn = 100% – Retention
Easy right?
But when it comes time to aggregate up your retention + churn rates each month end, you quickly realize that you have a problem:
Customers who didn’t repeat purchase this month, by definition have no data.
Welcome to the core problem of calculating retention + churn…you must first generate data to fill the gaps of what each customer of your business didn’t do.
Allow me to show you 3-step method for filling these gaps in BigQuery SQL – at CIFL, we’ve used this methodology to calculate for both retention at SaaS businesses and rebuy rates at Ecommerce shops.
This method can also be flexibly adapted to calculate weekly, monthly, quarterly or annual retention rates.
The code snippets below are written in BigQuery standard SQL, but will work with slight modification in any flavor of SQL.
If you prefer video, here’s a 10-minute walkthrough of the topic – code snippets mentioned in the video are listed down the page.
To fill in the gaps in each customer’s history, first we need a time series to work from.
In BigQuery, we use the GENERATE_DATE_ARRAY function – this statement will generate a series of dates from 1/1/2020 until the current date.
GENERATE_DATE_ARRAY(DATE('2020-01-01'), CURRENT_DATE(), INTERVAL 1 DAY)
Taking this a step further, you can use this array to generate an array of week start dates (Sundays or Mondays) or month start / end dates.
These Gists contain the exact code that we use to generate these weekly / monthly series in our own BigQuery data pipelines:
Now that we’ve bootstrapped the date series that we want to fill in for each customer, the next step is to join them together with a straight JOIN.
This will generate a series of dates for each customer – you can calculate which transactions belong to each period with a simple CASE statement:
SELECT user_id, week_start, week_end, week_number, sum(CASE WHEN date >= week_start AND date <= week_end THEN transactions_created ELSE 0 END) as transactions_created FROM date_spine JOIN transactions ON date_spine.site = transactions.site GROUP BY user_id, week_start, week_end, week_number
Since we generally build pipelines for organizations that work on multiple sites, we always have a ‘site’ column to join by – but you could substitute that for any dummy column.
The CASE statement is really the key there – it only counts the customer’s transactions if they belong to the relevant week (or month, quarter, year, etc) period.
So at the end of this first layer of our query, we have a time series of each customer’s behavior for each time period.
To calculate retention, we need to know three things:
We already know what they did in the current period (from the previous step of the query).
To answer if this is their first active period, we need to know their lifetime count of transactions. If this period’s total = their lifetime total, then generally they’ll be a new customer in this period – in which case we’ll want to separate them out of the retention calculation.
We can calculate this by using a window (aka analytic) function, which allows us to quickly map out each buyer’s lifetime behavior:
sum(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_lifetime
Notice the PARTITION BY parameter there is strictly user_id – this is analogous to GROUP BY in a regular query.
We can use a similar window function to pluck out that customer’s previous period behavior:
lag(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_prev
The LAG and LEAD functions will allow you to iterate over the specified partition. This window generates a series of each user’s behavior, ordered in ascending order.
So if we’re lagging from the current row, we’ll pull in their previous week’s transactions.
Put it all together, and the query looks like so:
SELECT user_id, week_start, week_end, week_number, transactions_created, sum(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_lifetime, lag(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_prev FROM ( {{ step 1 query }} )
In the 3rd (and final) step of this query, we’ll map out how this period / last period / lifetime behavior transactions into retention rates.
Now that we’ve got the table set, it’s time to actually dig into calculating retention rates.
We do this using four CASE statements, to flag a buyer as either one of four buckets:
Those status flags map out into CASE statements in a query like so:
SELECT user_id, week_start, week_end, week_number, CASE WHEN transactions_created = transactions_created_lifetime and transactions_created > 0 THEN 1 ELSE 0 END as new_flag, CASE WHEN transactions_created_prev > 0 AND transactions_created = 0 THEN 1 ELSE 0 END as churns_flag, CASE WHEN transactions_created > 0 AND transactions_created_prev = 0 and transactions_created != transactions_created_lifetime THEN 1 ELSE 0 END as reactivations_flag, CASE WHEN transactions_created > 0 AND transactions_created_prev > 0 THEN 1 ELSE 0 END as retentions_flag FROM {{ step 2 query }}
You could combine these 4 CASE statements into one ‘retention_status’ CASE statement, but we prefer separating them out as singular flag fields.
This allows us to use simple calculated fields in reporting to calculate retention rates:
Retention = sum(retentions_flag) / (sum(retentions_flag) + sum(churns_flag))
You could also include reactivations in the denominator there – if more people reactivated vs churned, you could have retention > 100% for the period.
And to the contrary, for our sister metric churn:
Churn = sum(churns_flag) / (sum(retentions_flag) + sum(churns_flag))
Separating out each user into one of these four buckets (new, retained, churned, reactivated) also makes generating cohort analysis charts very simple.
This chart, for example, breaks all monthly orders into new vs existing buyers (retained + reactivated):
You can also, of course, visualize a classic retention rate curve over time:
If you’re ready to dive into the full retention rate query, check out this Gist:
To sum it up, we calculate retention by writing a query in 3 layers:
To modify this Gist for any time period, just swap out the weekly variables (week_start, week_end, etc) with the columns present in your monthly, quarterly or annual date date spine from Step 0 above.
Curious about diving deeper into these types of metrics for your business?
Check out our Ecommerce or SaaS data pipeline services, in which our team will stand up a full BigQuery data pipeline for your business.
Get off the ground quickly, with customizable data pipeline Recipes for BigQuery.
GET COOKING