Published on

Modeling Customer Retention and Churn in SQL

Authors

Retention and its sister metric, churn, seem tantalizing simple to calculate:

Retention = Returning customers this period / Customers last period

Churn = 100% - Retention

Easy right?

But when it comes time to roll up your retention + churn rates each monthend, you quickly realize that you have a problem:

Customers who didn't repeat purchase this month, by definition have no data.

Welcome to the core problem of calculating retention + churn...you must first generate data to fill the gaps of what each customer of your business didn't do in a given period.

Allow me to show you 3-step method for filling these gaps in SQL. This methodology is relevant for both calcualting retention for a SaaS business, and rebuy rates for Ecommerce brands.

This method demonstrates weekly retention rates / churn (useful for 'weekly active' retention in a SaaS product), but can easily adapted to monthly, quarterly or annual retention rates.

The code snippets below are written in BigQuery standard SQL, but will work with slight modification in any flavor of SQL.

If you prefer video, here's a 10-minute walkthrough of the topic - code snippets mentioned in the video are listed down the page.

Step 0: Generate a Date Range

To fill in the gaps in each customer's history, first we need a time series to work from.

In BigQuery, we use the GENERATE_DATE_ARRAY function - this statement will generate a series of dates from 1/1/2020 until the current date.

GENERATE_DATE_ARRAY(DATE('2020-01-01'), CURRENT_DATE(), INTERVAL 1 DAY)

Taking this a step further, you can use this array to generate an array of week start dates (Sundays or Mondays) or month start / end dates.

These snippets generate these weekly / monthly date series - you can modify the hardcoded start date (currently '2019-01-01') to whatever your business requires:

Generate Week Start Date Array

SELECT
date_in_range,
day_number,
week_number,
min(date_in_range) over (partition by week_number) week_start,
max(date_in_range) over (partition by week_number) week_end
FROM (
	SELECT 
	date_in_range,
	date_diff(date_in_range, cast('2019-01-01' as date), DAY)+1 as day_number,
	cast(trunc(date_diff(date_in_range, cast('2019-01-01' as date), DAY)/7)+1 as int64) as week_number
	FROM UNNEST(
    	GENERATE_DATE_ARRAY(DATE('2019-01-01'), CURRENT_DATE(), INTERVAL 1 DAY)
	) AS date_in_range
)

Generate Monthend Date Array

SELECT
date_in_range,
date_in_range_bom,
date_in_range_bom_mom,
date_in_range_yoy
FROM
(
	SELECT 
	date_in_range,
	date_in_range_bom,
	date_sub(date_in_range_bom, INTERVAL 1 MONTH) date_in_range_bom_mom,
	date_sub(date_in_range, INTERVAL 1 YEAR) date_in_range_yoy,
	first_value(date_in_range) over (partition by yyyymm order by date_in_range desc) monthend_date_in_range
	FROM
	( 
		SELECT 
		date_in_range,
		date_trunc( date_in_range, MONTH) date_in_range_bom
		FROM UNNEST(
		    GENERATE_DATE_ARRAY(DATE('2020-01-31'), date_add(CURRENT_DATE(), INTERVAL 31 DAY), INTERVAL 1 DAY)
		) AS date_in_range
	)
)
WHERE date_in_range = monthend_date_in_range

Step 1: Generate Each Customer's Individual Time Series

Now that we've bootstrapped the date series that we want to fill in for each customer, the next step is to join them together with a straight JOIN.

This will generate a series of dates for each customer - you can calculate which transactions belong to each period with a simple CASE statement:

with date_spine as (
    SELECT * FROM the_date_array_you_just_created
),

transactions as (
    SELECT 
        user_id,
        transaction_date,
        transactions_created,
    FROM your_transaction_table
)

SELECT
    user_id,
    week_start,
    week_end,
    week_number,
    sum(CASE WHEN date >= week_start AND transaction_date <= week_end THEN transactions_created ELSE 0 END) as transactions_created
FROM date_spine
JOIN transactions
GROUP BY user_id, week_start, week_end, week_number

The CASE statement is really the key there - it only counts the customer's transactions if they belong to the relevant week (or month, quarter, year, etc) period.

There are many many ways to perform this join, but the important thing is to have date ranges on the left, so that each user_id has an entry for each relevant time period. We're calculating retention at the user level here, but this can be whatever your appropriate entity is (user, organization, account, etc).

You could get fancy and restrict the join at this level to only dates following their first transaction, rather than later on - implementation is up to you! Recommend reading through this entire approach first, and then adapting it to your situation.

Step 2: Calculate Lifetime + Previous Period Behavior

To calculate retention, we need to know three things:

  • Is this a buyer's first active period?
  • Did they fulfill the desired behavior (buying, logging in, maintaining a subscription, etc) in the prior period?
  • Did they "" in the current period?

We already know what they did in the current period (from the previous step of the query).

Calculating Lifetime Buyer Behavior

To answer if this is their first active period, we need to know their lifetime count of transactions. If this period's total = their lifetime total, then generally they'll be a new customer in this period - in which case we'll want to separate them out of the retention calculation.

We can calculate this by using a window (aka analytic) function, which allows us to quickly map out each buyer's lifetime behavior:

sum(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_lifetime

Notice the PARTITION BY parameter there is strictly user_id - this is analogous to GROUP BY in a regular query.

Calculating Previous Period Buyer Behavior

We can use a similar window function to pluck out that customer's previous period behavior:

lag(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_prev

The LAG and LEAD functions will allow you to iterate over the specified partition.  This window generates a series of each user's behavior, ordered in ascending order.

So if we're lagging from the current row, we'll pull in their previous week's transactions.

Put it all together, and the query looks like so:

SELECT
    user_id,
    week_start,
    week_end,
    week_number,
    transactions_created,
    sum(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_lifetime,
    lag(transactions_created) over (PARTITION BY user_id ORDER BY week_start asc) as transactions_created_prev
FROM ( {{ step 1 query }} )

We can use lag and lead here, because we've already constructed a full time series for each user.

In the 3rd (and final) step of this query, we'll map out how this period / last period / lifetime behavior translates into retention rates.

Step 3: Translate Behavior into Retention or Churn

Now that we've got the table set, it's time to actually dig into calculating retention rates.

We do this using four CASE statements, to flag a buyer as either one of four buckets:

  • New: this period = lifetime
  • Churned: this period = 0, previous period > 0
  • Retained: this period > 0, previous period > 0
  • Reactivated: this period > 0, previous period = 0, this period < lifetime

Those status flags map out into CASE statements in a query like so:

SELECT
    user_id,
    week_start,
    week_end,
    week_number,
    CASE WHEN transactions_created = transactions_created_lifetime and transactions_created > 0 THEN 1 ELSE 0 END as new_flag,
    CASE WHEN transactions_created_prev > 0 AND transactions_created = 0 THEN 1 ELSE 0 END as churns_flag,
    CASE WHEN transactions_created > 0 AND transactions_created_prev = 0 and transactions_created != transactions_created_lifetime THEN 1 ELSE 0 END as reactivations_flag,
    CASE WHEN transactions_created > 0 AND transactions_created_prev > 0 THEN 1 ELSE 0 END as retentions_flag
FROM {{ step 2 query }}

You could combine these 4 CASE statements into one 'retention_status' CASE statement, but we prefer separating them out as individual flag fields.

This allows us to use simple calculated fields in your reporting tool of choice to calculate retention rates:

sum(retentions_flag) / (sum(retentions_flag) + sum(churns_flag))

You could also include reactivations in the denominator there - if more people reactivated vs churned, you could have net retention greater than 100% for the period.

And to the contrary, for our sister metric churn:

sum(churns_flag) / (sum(retentions_flag) + sum(churns_flag))

Separating out each user into one of these four buckets (new, retained, churned, reactivated) also makes generating cohort analysis charts very simple.

Putting it All Together

If you're ready to dive into a full retention rate query, here you go! This could easily be adapted to run with CTEs (common table expressions) rather than nested SQL, if that's your bag:

SELECT
user_id,
week_start,
week_end,
week_number,
CASE WHEN transactions_created = transactions_created_lifetime and transactions_created > 0 THEN 1 ELSE 0 END as new_flag,
CASE WHEN transactions_created_prev > 0 AND transactions_created = 0 THEN 1 ELSE 0 END as churns_flag,
CASE WHEN transactions_created > 0 AND transactions_created_prev = 0 and transactions_created != transactions_created_lifetime THEN 1 ELSE 0 END as reactivations_flag,
CASE WHEN transactions_created > 0 AND transactions_created_prev > 0 THEN 1 ELSE 0 END as retentions_flag
FROM (

	SELECT
	user_id,
	week_start,
	week_end,
	week_number,
	transactions_created,
	sum(transactions_created) over weekly as transactions_created_lifetime,
	lag(transactions_created) over weekly as transactions_created_prev
	FROM (

		SELECT
		user_id,
		week_start,
		week_end,
		week_number,
		sum(CASE WHEN date >= week_start AND date <= week_end THEN transactions_created ELSE 0 END) as transactions_created,
		FROM date_spine
		JOIN transactions
		ON date_spine.site = transactions.site
		GROUP BY user_id,
		week_start,
		week_end,
		week_number
	)
	WINDOW weekly as (PARTITION BY user_id ORDER BY week_start asc)
)

To sum it up, we calculate retention by writing a query in 3 layers:

  1. Generate a date series for each user
  2. Use window functions to calculate lifetime and previous period behavior
  3. Use CASE statements to translate this period / previous period / lifetime behavior into a retention status for that period

To modify these snippets for any time period, just swap out the weekly variables (week_start, week_end, etc) with the columns present in your monthly, quarterly or annual date date spine from Step 0 above.

Build cacheworthy data experiences

Stay updated with the latest tutorials + events, dedicated to helping you build wonderful in-app data experiences.

Subscribe to the newsletter