In an ideal world, every product change would be tested with a randomized controlled trial. Reality is messier. Sometimes you cannot randomize — the feature already shipped to everyone, legal will not let you hold out a group, or the sample is too small for the test to have any power.
When I am in that situation, I go to quasi-experimental methods. Here is my playbook.
## The problem with observational data
The hard problem is confounding. People who use a new feature are not the same as people who do not. Maybe they are more engaged, more tech-savvy, or signed up during a specific campaign. Comparing adopters and non-adopters tells you almost nothing about the feature itself.
I see teams make this mistake a lot. They announce a “win” that is really just selection bias. Nobody checks.
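A tiny simulation makes the danger concrete. This is a hypothetical setup I made up for illustration: engagement drives both adoption and the outcome, the feature's true effect is exactly zero, and yet the naive adopter-vs-non-adopter comparison reports a large "win".

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Engagement drives both adoption and the outcome; the feature does nothing.
engagement = rng.normal(size=n)
adopted = rng.random(n) < 1 / (1 + np.exp(-engagement))  # engaged users adopt more
outcome = engagement + rng.normal(size=n)                # true feature effect: zero

naive_effect = outcome[adopted].mean() - outcome[~adopted].mean()
print(f"naive adopter-vs-non-adopter gap: {naive_effect:.2f}")  # substantially non-zero
```

The gap here is pure selection bias: conditioning on adoption selects for high engagement, which also predicts the outcome.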
## Method 1 — Difference-in-Differences
When a feature rolls out at different times to different groups — say, by region or by platform — DiD can work well. The key assumption is parallel trends: treated and control groups would have moved together without the treatment.
Always plot the pre-trends. If they are not parallel, DiD will mislead you: even small violations of the parallel-trends assumption produce meaningful bias in the naive comparison.
```python
# Simplified DiD: two-way fixed effects with a treated-by-post interaction.
# Assumes df has columns: outcome, treated, post, group, time.
import statsmodels.formula.api as smf

model = smf.ols("outcome ~ treated * post + C(group) + C(time)", data=df)
results = model.fit()
# The treated:post coefficient is your treatment effect
print(results.params["treated:post"])
```
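The pre-trends check itself can be as simple as regressing the treated-minus-control gap on time in the pre-period and confirming the slope is near zero. A minimal sketch on made-up data, where the trends are parallel by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
periods = np.arange(-4, 0)  # four pre-treatment periods

# Toy pre-period means: same slope in both groups, different levels.
control = 10 + 0.5 * periods + rng.normal(scale=0.05, size=4)
treated = 12 + 0.5 * periods + rng.normal(scale=0.05, size=4)

# Parallel trends means the treated-minus-control gap is flat pre-treatment.
gap = treated - control
drift = np.polyfit(periods, gap, 1)[0]  # slope of the pre-period gap
print(f"pre-trend drift: {drift:.3f}")  # near zero when trends are parallel
```

In practice you would plot both series rather than rely on a single slope, but the flat-gap intuition is the same.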
## Method 2 — Synthetic control
When you have one treated unit and many possible controls, synthetic control builds a weighted mix of the controls that best matches the treated unit before the intervention. Then you measure the gap after.
I use this a lot for geo experiments. It handles the noise of real markets better than a simple comparison.
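The core of the method is a constrained regression: find non-negative donor weights whose blend tracks the treated unit before the intervention. A sketch using non-negative least squares on simulated data (a full implementation would also constrain the weights to sum to 1, as in the standard synthetic control formulation):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
T_pre, n_controls = 20, 5

# Simulated pre-period series: the treated unit is (secretly) a blend of controls.
controls = rng.normal(size=(T_pre, n_controls)).cumsum(axis=0)
true_w = np.array([0.6, 0.4, 0.0, 0.0, 0.0])
treated = controls @ true_w + rng.normal(scale=0.01, size=T_pre)

# Non-negative least squares recovers the donor weights from pre-period fit.
w, _ = nnls(controls, treated)
synthetic = controls @ w
rmse = np.sqrt(np.mean((treated - synthetic) ** 2))
print("weights:", np.round(w, 2))
print(f"pre-period RMSE: {rmse:.3f}")
```

After the intervention, the treatment effect is the gap between the treated series and `controls @ w` extended forward in time.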
## Method 3 — Regression discontinuity
If treatment is assigned by a threshold (e.g., users above some engagement score get the feature), RD uses the jump at the threshold. Users just above and just below the cutoff are nearly identical, so you get local randomization for free.
This one is underused. A lot of product features have natural cutoffs that nobody exploits.
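The standard estimator is a local linear regression fit separately on each side of the cutoff within a bandwidth, with the effect read off as the jump between the two fits at the threshold. A sketch on simulated data with a known jump of 2.0 (the bandwidth `h = 0.2` is an arbitrary choice here; real analyses select it data-dependently):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Simulated running variable with a true jump of 2.0 at the cutoff.
score = rng.uniform(-1, 1, size=n)
cutoff = 0.0
above = score >= cutoff
outcome = 1.0 + 0.8 * score + 2.0 * above + rng.normal(scale=0.5, size=n)

# Local linear fit on each side of the cutoff, evaluated at the cutoff.
h = 0.2
left = (score < cutoff) & (score > cutoff - h)
right = (score >= cutoff) & (score < cutoff + h)
b_left = np.polyfit(score[left], outcome[left], 1)
b_right = np.polyfit(score[right], outcome[right], 1)
jump = np.polyval(b_right, cutoff) - np.polyval(b_left, cutoff)
print(f"estimated jump at cutoff: {jump:.2f}")  # close to the true effect of 2.0
```

Note the estimate is local: it tells you the effect for users near the cutoff, not for everyone.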
## When to use what
| Method | Best when | Key assumption |
|---|---|---|
| DiD | Staggered rollout | Parallel trends |
| Synthetic Control | Single treated unit | Pre-treatment fit |
| RD | Assignment cutoff | Continuity at cutoff |
## Bottom line
No method is perfect. The best you can do is combine several and see if they tell a consistent story. When they disagree, that is usually where the interesting learning happens — it means you are missing something about the underlying dynamics, and the disagreement is a hint about what.