What is A/B Testing?
A/B testing is a controlled experiment where two or more variants of a webpage, message, or feature are shown to randomised user cohorts simultaneously, with the winner chosen by a statistically significant lift on a defined success metric.
A/B testing replaces opinion with evidence. Done well, it is the engine of compounding conversion improvement; done badly, it produces false positives that destroy revenue. The discipline lives in three places: sample size (run the test until it reaches its pre-calculated sample size, not until yesterday’s numbers look good), test isolation (no overlapping experiments contaminating each other), and decision rules agreed before the test runs. Treat A/B testing like a research function, not a button-pushing exercise.
What it includes
- Primary success metric and minimum detectable effect defined upfront
- Sample-size calculation done before the test ships
- Statistical significance threshold (commonly p < 0.05) agreed in advance
- Isolation from other concurrent tests on the same surface
- Documented hypothesis: what we believe, why, and what would falsify it
- Pre-registered decision rule: what we do with each possible outcome (see the plan sketch after this list)
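Taken together, these elements form a plan that is written down before the test ships. A minimal sketch of what that pre-registration might look like in code; the field names and values are illustrative, not from any particular tool:

```python
# Pre-registration sketch: capture the checklist above as a record that exists
# before the test ships. Field names and values are illustrative.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TestPlan:
    hypothesis: str                    # what we believe, why, what would falsify it
    primary_metric: str                # the single metric the decision hangs on
    baseline_rate: float               # current conversion rate on that metric
    minimum_detectable_effect: float   # smallest relative lift worth detecting
    alpha: float = 0.05                # significance threshold agreed in advance
    power: float = 0.80                # chance of detecting a true effect of MDE size
    decision_rules: dict = field(default_factory=dict)  # outcome -> action

plan = TestPlan(
    hypothesis="Trust strip above the fold lifts qualified consult bookings 15%",
    primary_metric="qualified_consult_bookings",
    baseline_rate=0.04,
    minimum_detectable_effect=0.15,
    decision_rules={
        "significant win": "ship variant",
        "no difference": "keep control, archive variant",
        "significant loss": "keep control, document the learning",
    },
)
```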
How it works
Start from a falsifiable hypothesis
“If we move the trust strip above the fold, qualified consult bookings will lift 15%, because parents need a credibility signal before scrolling.” Falsifiable, measurable, time-boxed.
Size the test
Use a sample-size calculator. Account for daily traffic, baseline conversion rate, minimum detectable effect, significance threshold, and statistical power. Most tests need 2–4 weeks at typical traffic.
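As a concrete sketch, statsmodels’ power calculations can do this arithmetic; the baseline, MDE, and traffic figures below are illustrative:

```python
# Sample-size sketch using statsmodels' power calculations. Any calculator that
# takes baseline rate, MDE, alpha, and power does the same arithmetic.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04                      # current conversion rate
mde = 0.15                           # minimum detectable effect (15% relative lift)
target = baseline * (1 + mde)

effect = abs(proportion_effectsize(baseline, target))   # Cohen's h for two proportions
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)

daily_visitors_per_variant = 400     # illustrative traffic figure
print(f"~{n_per_variant:,.0f} visitors per variant")
print(f"~{n_per_variant / daily_visitors_per_variant:.0f} days at current traffic")
```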
Ship the variant cleanly
Server-side or fast client-side A/B tooling. Flicker (the control rendering briefly before the variant swaps in) is a confound. One change per test, isolated to one surface, no overlapping experiments.
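One common way to keep assignment clean is deterministic, hash-based bucketing, so a returning user always lands in the same variant without any coordination between servers. A minimal sketch; the function and experiment names are illustrative:

```python
# Deterministic assignment sketch: hash the (experiment, user) pair so each user
# sees the same variant on every visit. Real A/B tools do the equivalent internally.
import hashlib

def assign_variant(experiment_id: str, user_id: str, variants=("control", "variant")) -> str:
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)   # stable, roughly uniform split
    return variants[bucket]

print(assign_variant("trust-strip-above-fold", "user-8841"))  # same answer every call
```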
Wait for significance
Do not call a winner on day three. Peeking is the most common cause of false positives. Let the test reach its pre-calculated sample size.
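A sketch of the evaluation step, assuming a two-proportion z-test run once, after the pre-calculated sample size is reached; the counts are illustrative:

```python
# Evaluation sketch: a single two-proportion z-test at the pre-calculated
# sample size, not a daily peek. Counts are illustrative.
from statsmodels.stats.proportion import proportions_ztest

conversions = [412, 486]      # control, variant
visitors = [10_000, 10_000]   # per-variant sample size from the power calculation

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"p = {p_value:.4f}")
if p_value < 0.05:
    print("Significant at the pre-agreed threshold: apply the pre-registered decision rule.")
else:
    print("Not significant: keep control, document the result.")
```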
Decide, document, ship, archive
Winner ships; loser archives. Document the result in a test log so the team can review patterns over time. Failed tests teach more than wins.
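A minimal sketch of what an archived entry might look like, assuming a JSON-lines test log; the file name and fields are illustrative:

```python
# Archive sketch: append each finished test to a JSON-lines log so patterns
# can be reviewed later. File name and fields are illustrative.
import json, datetime

def log_test(path: str, entry: dict) -> None:
    entry = {"logged_at": datetime.date.today().isoformat(), **entry}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

log_test("ab_test_log.jsonl", {
    "test": "trust-strip-above-fold",
    "hypothesis": "Credibility signal above the fold lifts consult bookings 15%",
    "result": "no significant difference (p = 0.21)",
    "decision": "keep control, archive variant",
})
```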
Frequently asked
When should we A/B test?
When traffic is sufficient to reach significance in 2–6 weeks. Below roughly 1,000 conversions per month per variant, statistical power is too low to learn anything useful from a split test; at that scale, ship and learn qualitatively instead.
How big should the test lift be?
Plan for the minimum lift you actually care about. Required sample size scales roughly with the inverse square of the effect, so tests powered to detect a 1% lift waste months for marginal gains. Most growth teams pre-commit to a 10–20% minimum detectable effect (MDE) on primary metrics.
Frequentist or Bayesian?
The frequentist approach (p-values) is the default in most tools and well understood. Bayesian frameworks (probability of one variant being better) communicate more intuitively to non-technical stakeholders. Both work; consistency matters more than which one you pick.
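For the Bayesian framing, a minimal sketch: with Beta(1, 1) priors on each conversion rate, “probability the variant beats control” can be estimated by sampling the posteriors. The counts are illustrative:

```python
# Bayesian framing sketch: Beta posteriors on each conversion rate, compared by
# Monte Carlo sampling. Counts are illustrative.
import numpy as np

rng = np.random.default_rng(seed=7)
control = rng.beta(1 + 412, 1 + 10_000 - 412, size=100_000)   # posterior draws, control
variant = rng.beta(1 + 486, 1 + 10_000 - 486, size=100_000)   # posterior draws, variant

p_variant_better = (variant > control).mean()
print(f"P(variant beats control) = {p_variant_better:.1%}")
```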