In an ideal world, every product change would be tested with a randomized controlled trial. Reality is messier. Sometimes you cannot randomize — the feature already shipped to everyone, legal will not let you hold out a group, or the sample is too small for the test to have any power.
When I am in that situation, I go to quasi-experimental methods. Here is my playbook.
## The problem with observational data
The hard problem is confounding. People who use a new feature are not the same as people who do not. Maybe they are more engaged, more tech-savvy, or signed up during a specific campaign. Comparing adopters and non-adopters tells you almost nothing about the feature itself.
I see teams make this mistake a lot. They announce a “win” that is really just selection bias. Nobody checks.
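A tiny simulation makes the danger concrete. This is a hypothetical setup I made up for illustration: engagement drives both adoption and the outcome, the feature's true effect is exactly zero, and yet the naive adopter-vs-non-adopter comparison reports a large "win".

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Engagement drives both adoption and the outcome; the feature does nothing.
engagement = rng.normal(size=n)
adopted = rng.random(n) < 1 / (1 + np.exp(-engagement))  # engaged users adopt more
outcome = engagement + rng.normal(size=n)                # true feature effect: zero

naive_effect = outcome[adopted].mean() - outcome[~adopted].mean()
print(f"naive adopter-vs-non-adopter gap: {naive_effect:.2f}")  # substantially non-zero
```

The gap here is pure selection bias: conditioning on adoption selects for high engagement, which also predicts the outcome.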
## Method 1 — Difference-in-Differences
When a feature rolls out at different times to different groups — say, by region or by platform — DiD can work well. The key assumption is parallel trends: treated and control groups would have moved together without the treatment.
Always plot the pre-trends. If they are not parallel, DiD will mislead you: even small violations of the parallel-trends assumption produce meaningful bias in the naive comparison.
```python
# Simplified DiD: two-way fixed effects with a treated-by-post interaction.
# Assumes df has columns: outcome, treated, post, group, time.
import statsmodels.formula.api as smf

model = smf.ols("outcome ~ treated * post + C(group) + C(time)", data=df)
results = model.fit()
# The treated:post coefficient is your treatment effect
print(results.params["treated:post"])
```
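The pre-trends check itself can be as simple as regressing the treated-minus-control gap on time in the pre-period and confirming the slope is near zero. A minimal sketch on made-up data, where the trends are parallel by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
periods = np.arange(-4, 0)  # four pre-treatment periods

# Toy pre-period means: same slope in both groups, different levels.
control = 10 + 0.5 * periods + rng.normal(scale=0.05, size=4)
treated = 12 + 0.5 * periods + rng.normal(scale=0.05, size=4)

# Parallel trends means the treated-minus-control gap is flat pre-treatment.
gap = treated - control
drift = np.polyfit(periods, gap, 1)[0]  # slope of the pre-period gap
print(f"pre-trend drift: {drift:.3f}")  # near zero when trends are parallel
```

In practice you would plot both series rather than rely on a single slope, but the flat-gap intuition is the same.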
## Method 2 — Synthetic control
When you have one treated unit and many possible controls, synthetic control builds a weighted mix of the controls that best matches the treated unit before the intervention. Then you measure the gap after.
I use this a lot for geo experiments. It handles the noise of real markets better than a simple comparison.
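The core of the method is a constrained regression: find non-negative donor weights whose blend tracks the treated unit before the intervention. A sketch using non-negative least squares on simulated data (a full implementation would also constrain the weights to sum to 1, as in the standard synthetic control formulation):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
T_pre, n_controls = 20, 5

# Simulated pre-period series: the treated unit is (secretly) a blend of controls.
controls = rng.normal(size=(T_pre, n_controls)).cumsum(axis=0)
true_w = np.array([0.6, 0.4, 0.0, 0.0, 0.0])
treated = controls @ true_w + rng.normal(scale=0.01, size=T_pre)

# Non-negative least squares recovers the donor weights from pre-period fit.
w, _ = nnls(controls, treated)
synthetic = controls @ w
rmse = np.sqrt(np.mean((treated - synthetic) ** 2))
print("weights:", np.round(w, 2))
print(f"pre-period RMSE: {rmse:.3f}")
```

After the intervention, the treatment effect is the gap between the treated series and `controls @ w` extended forward in time.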
## Method 3 — Regression discontinuity
If treatment is assigned by a threshold (e.g., users above some engagement score get the feature), RD uses the jump at the threshold. Users just above and just below the cutoff are nearly identical, so you get local randomization for free.
This one is underused. A lot of product features have natural cutoffs that nobody exploits.
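The standard estimator is a local linear regression fit separately on each side of the cutoff within a bandwidth, with the effect read off as the jump between the two fits at the threshold. A sketch on simulated data with a known jump of 2.0 (the bandwidth `h = 0.2` is an arbitrary choice here; real analyses select it data-dependently):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Simulated running variable with a true jump of 2.0 at the cutoff.
score = rng.uniform(-1, 1, size=n)
cutoff = 0.0
above = score >= cutoff
outcome = 1.0 + 0.8 * score + 2.0 * above + rng.normal(scale=0.5, size=n)

# Local linear fit on each side of the cutoff, evaluated at the cutoff.
h = 0.2
left = (score < cutoff) & (score > cutoff - h)
right = (score >= cutoff) & (score < cutoff + h)
b_left = np.polyfit(score[left], outcome[left], 1)
b_right = np.polyfit(score[right], outcome[right], 1)
jump = np.polyval(b_right, cutoff) - np.polyval(b_left, cutoff)
print(f"estimated jump at cutoff: {jump:.2f}")  # close to the true effect of 2.0
```

Note the estimate is local: it tells you the effect for users near the cutoff, not for everyone.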
## When to use what
| Method | Best when | Key assumption |
|---|---|---|
| DiD | Staggered rollout | Parallel trends |
| Synthetic Control | Single treated unit | Pre-treatment fit |
| RD | Assignment cutoff | Continuity at cutoff |
## Bottom line
No method is perfect. The best you can do is combine several and see if they tell a consistent story. When they disagree, that is usually where the interesting learning happens — it means you are missing something about the underlying dynamics, and the disagreement is a hint about what.