Cohort Retention in SQL
From Raw Events to a Decision-Ready Table
Most retention tables are not wrong because the SQL is complicated.
They are wrong because the definitions are loose.
Someone says, “Let’s look at retention,” a query gets written, a heatmap shows up in a dashboard, and suddenly everyone is talking about Week 1 and Month 1 as if those numbers are objective facts. They usually are not. They are the result of choices. What counts as the start of a user’s journey? What counts as a return? What exactly is a week? What timezone are we using? Are we measuring one user once per period, or accidentally counting heavy users multiple times?
That is the real work in cohort retention. Not the division. Not the pivot table. The real work is deciding what story the table is allowed to tell.
At its core, cohort analysis is simple. You group users by a shared starting point, then measure what those users do in later periods. That is the common backbone behind most cohort SQL tutorials and warehouse implementations.
What makes it tricky is that small choices can change the story enough to change the decision.
So in this piece, I want to show you how I think about cohort retention in SQL when I want something that is not just presentable, but actually trustworthy. We will walk through a small sample dataset, turn it into a retention table step by step, and discuss the parts that often go wrong: cohort definition, return-event design, week boundaries, duplicate activity, partial cohorts, and interpretation.
Start with the question, not the query
Before touching SQL, I like to ask one uncomfortable question:
What exactly do I want this retention table to help me decide?
That question matters because different cohort definitions answer different business questions.
If I group users by the week they signed up, I am usually asking something about onboarding, activation, or acquisition quality. I want to know whether new users are sticking around after entering the funnel.
If I group users by the week they first did something meaningful, I am asking something slightly different. I am saying that signup is not the real beginning of value. Maybe the real beginning is the first login, the first purchase, the first report built, or the first document uploaded. In that case, I am less interested in the funnel entry and more interested in what happens once a user actually starts using the product.
Both are valid. But they are not interchangeable.
The same holds for the return event. If I define retention as “any page view,” my table might look reassuring while hiding the fact that users are not doing anything meaningful. If I define retention as “purchase,” the metric might be more valuable but also much sparser. There is no universally correct event. There is only one event that is more or less aligned with the value loop you care about.
Then there is the time bucket. This is the part people often treat as neutral, even though it really isn’t. A daily retention table tells a different story than a weekly one. A weekly table tells a different story than a monthly one. And even the idea of a “week” is less fixed than people think. BigQuery, for example, distinguishes between WEEK, WEEK(<WEEKDAY>), and ISOWEEK, and those choices affect how dates are grouped and how period differences are calculated.
That is why I think of cohort retention as a design problem before I think of it as a SQL problem.
The version we’re building here
To make this concrete, let’s keep the example small and explicit.
In this walkthrough:
A user’s cohort is the week of their first login
Retention means they performed a login in a later week
The table uses calendar weeks
Each user should count at most once per week
That last condition matters a lot. If a user logs in ten times in the same week, they are still one retained user for that week. Retention is about whether someone came back in the period, not how noisy their event stream was.
Sample data
Here is a tiny events table we can use end-to-end.



