A/B Testing
Super Simple Guide

From "what the f**k even is a p-value?" to running your first real A/B test: everything explained simply. No PhD required. No boring textbooks. Just vibes and stats. 🔥


What is A/B Testing?

Imagine you're a scientist, but instead of test tubes, you use websites. A/B testing is just a controlled experiment on real users.

🍕
Pizza Shop Analogy 🍕

You own a pizza shop and wonder: "If I put a red button instead of a blue button on my online order page, will more people click 'Order Now'?"

So you show Group A the blue button and Group B the red button, at the same time, randomly, and measure which gets more clicks. That's all A/B testing is!

A

Control Group

The current version: what already exists. Your baseline.

🔵 Blue Button
B

Treatment Group

The new version: the thing you changed. Your hypothesis.

🔴 Red Button

Why can't I just… change it and see what happens?

Because the world changes! If you change your button on a Monday, Tuesday traffic might be different anyway. Running A & B at the same time removes external noise.

🎯

Randomization

Each user is randomly assigned to A or B, like flipping a coin.

⚖️

Isolation

Only ONE thing changes. Everything else stays the same.

📏

Measurement

You define a metric upfront: click rate, sign-up rate, revenue, etc.

💡

Key term, Conversion Rate: the fraction of users who do the thing you want.
If 50 out of 1,000 users clicked the button: CR = 50/1000 = 5% = 0.05
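In code, the conversion rate is just a division. A tiny Python sketch of the example above:

```python
# 50 out of 1,000 users clicked the button.
clicks = 50
visitors = 1000

conversion_rate = clicks / visitors
print(conversion_rate)  # 0.05, i.e. 5%
```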

Your Guess: The Hypothesis

Before you run your test, you write down your guess. AND you write down the "boring guess" too, the one that says nothing special happened. Here's why both matter! 🕵️

🕵️
Think like a Detective! 🔍

Imagine a detective. They don't just say "I think the butler did it!" and stop there. They start by saying: "I don't know who did it yet." Then they collect clues.

In A/B testing, we do the same thing! We start by saying "nothing changed" and then we try to prove ourselves wrong with data. Cool, right?

H₀

The Boring Guess (Null Hypothesis)

This is the "nothing special happened" guess. We assume the red button and the blue button do exactly the same. No difference at all. Zero. Zilch. 😐

H₀: p_B − p_A = 0
Null Hypothesis

In plain English: "The red button gets the same number of clicks as the blue button. Nothing changed."

H₁

Your Exciting Guess (Alternative Hypothesis)

This is YOUR guess, the one you're trying to prove! You think the red button is different (better or worse) than the blue one. 🎉

H₁: p_B − p_A ≠ 0
Alternative Hypothesis (Two-Tailed)

In plain English: "The red button gets MORE or FEWER clicks than the blue one. Something actually changed!"

Two ways to guess 🎯

Type | Your guess | Use it when…
Two-Tailed | p_B ≠ p_A | "It could be better OR worse, I'm not sure which!" (Use this one most of the time.)
One-Tailed | p_B > p_A | "I only care if it's better!" (Easier to prove, but be careful: you're making a big assumption.)
⚠️

Super important rule: Write your guess BEFORE you look at any data! If you look at the results first and then make up a guess, that's cheating: it's called HARKing (Hypothesizing After the Results are Known) and it gives you fake answers. Always guess first, look later! 🙅

Standard Error (SE)

SE is your "wiggle amount": it tells you how much your answer might move around just by luck, even if nothing actually changed. 🌊

🎳
The Bowling Friend Analogy 🎳

Imagine your friend says they're really good at bowling. You watch them play just 3 games. They got lucky once and knocked everything down! But is that their real skill?

Now imagine watching them play 300 games. Now you have a much better idea of how good they really are.

SE works the same way. The more people you test (more games!), the smaller your SE gets, and the more confident you are. More data = smaller wiggle = better answer! 🎯

The SE Formula for one group

When you're measuring how many people clicked your button, SE looks like this:

SE = √( p × (1 − p) / n )
Standard Error of a Proportion
p = your click rate (e.g. 10% = 0.10)
n = how many people you tested
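Here's that formula as a small Python sketch (the function name is just for illustration):

```python
import math

def standard_error(p, n):
    """SE of a proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

# A 10% click rate measured on 1,000 users:
print(round(standard_error(0.10, 1000), 4))  # 0.0095
```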

The SE Formula for comparing A vs B (the real one!)

You want to know if A and B are actually different. So you mix their data together and calculate one big SE:

p̂ = (x_A + x_B) / (n_A + n_B)
SE_diff = √( p̂ × (1 − p̂) × (1/n_A + 1/n_B) )
Pooled Standard Error
p̂ = combined click rate from BOTH groups
x_A, x_B = number of clicks in each group
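And the pooled version in Python (again, the function name is made up for illustration). Note that it takes raw counts, not rates, because the pooled rate p̂ needs the actual click numbers:

```python
import math

def pooled_se(x_a, n_a, x_b, n_b):
    """Pooled standard error for the difference between two proportions."""
    p_hat = (x_a + x_b) / (n_a + n_b)  # combined click rate from both groups
    return math.sqrt(p_hat * (1 - p_hat) * (1 / n_a + 1 / n_b))

# 100 clicks out of 1,000 in group A, 120 out of 1,000 in group B:
print(round(pooled_se(100, 1000, 120, 1000), 4))  # 0.014
```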
Live SE Calculator (interactive widget: enter each group's click rate and the sample size to see SE for group A, SE for group B, and the pooled SE of the difference).

Easy rule to remember: More people tested = smaller SE = more trustworthy answer! If you double the number of people you test, your SE gets about 1.4× smaller. So always test enough people! 💪

The Z-Score

Now you know the difference between A and B, and how "wiggly" that difference is (SE). The Z-score asks: "Is this difference big enough to be real, or is it just luck?" 🎲

📏
The Height Contest Analogy 📏

Imagine all kids at school line up by height. Most are in the middle (average). Very few are super tall or super short. That's what a bell curve looks like!

Z-score tells you: "How far away from normal is my result?"

If Z is close to 0 → totally normal, probably just luck. If Z is ±2 or bigger → wow, that's very unusual! Something real probably happened! 🎉

Z = (p_B − p_A) / SE_diff
Z-Statistic (Test Statistic)
p_B − p_A = how different A and B are
SE_diff = the wiggle amount (standard error)
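Putting the pieces together in Python (a sketch, with an illustrative function name and the same toy numbers as before, 10% vs 12% on 1,000 users each):

```python
import math

def z_score(x_a, n_a, x_b, n_b):
    """Z = (p_B - p_A) / pooled SE of the difference."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_hat = (x_a + x_b) / (n_a + n_b)
    se_diff = math.sqrt(p_hat * (1 - p_hat) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se_diff

print(round(z_score(100, 1000, 120, 1000), 2))  # 1.43, below the 1.96 bar
```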

What does Z actually tell me? 🤔

Think of Z like a surprise score. A small Z means "nothing weird here." A big Z means "whoa, this is surprising!" Here's a cheat sheet:

Z Value | Meaning | How common?
|Z| < 1.0 | Tiny signal, easily explained by noise | Very common even if nothing changed
|Z| ≈ 1.64 | Borderline: used for one-tailed 95% tests | 5% chance if H₀ is true (one side)
|Z| ≈ 1.96 | The magic number for two-tailed 95% significance | 5% chance if H₀ is true (both sides)
|Z| > 2.58 | Strong signal: 99% significance threshold | 1% chance if H₀ is true
Z-Score Visualizer (interactive widget: slide a Z value to see where it lands on the bell curve).

The P-Value

This is the most famous number in all of data science, and also the most confused! Let's make it crystal clear. 💎

🪙
The Magic Coin 🪙

Imagine your friend shows you a coin and flips it 10 times. It lands on heads 9 times! You think, "wait, is this coin magic (rigged)?"

The p-value answers: "If the coin is totally normal and fair, how likely is it to get 9 heads out of 10 just by accident?"

The answer is about 1%, very unlikely! So you'd say: "That coin is probably rigged!" 🎉

In A/B testing: "If NOTHING actually changed, how likely is it that we'd see THIS big a difference just by luck?" A tiny p-value means it's very unlikely to be just luck!
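You can check the coin math yourself with a couple of lines of Python:

```python
from math import comb

# Chance of 9 OR MORE heads in 10 flips of a perfectly fair coin:
# count the ways to get 9 or 10 heads, out of 2^10 possible flip sequences.
p = sum(comb(10, k) for k in (9, 10)) / 2 ** 10
print(round(p, 4))  # 0.0107, about 1%: very unlikely!
```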

The Proper Definition (in plain words)

P-value = "If nothing actually changed, how likely is it that I'd see results this big or bigger, just by random luck?"

🚨

Common mistakes about p-value: don't fall for these!
❌ It does NOT mean "the chance that I'm wrong": it's more subtle than that
❌ It does NOT tell you HOW MUCH better B is than A
❌ A low p-value does NOT mean the effect is big or important!

How to calculate p-value from Z 🔢

p-value = 2 × (1 − Φ(|Z|))
Two-tailed p-value from Z
Φ = the standard normal CDF, the function that reads the area off the bell curve
The "2×" is because we check BOTH sides (better OR worse)
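Φ has no neat closed form, but Python's math.erf gets you there without any extra libraries. A minimal sketch:

```python
import math

def normal_cdf(z):
    """Phi(z): area under the standard bell curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def p_value_two_tailed(z):
    """Two-tailed p-value: 2 * (1 - Phi(|z|))."""
    return 2 * (1 - normal_cdf(abs(z)))

print(round(p_value_two_tailed(1.96), 3))  # 0.05, right at the threshold
print(round(p_value_two_tailed(2.58), 4))  # 0.0099, just under 1%
```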

The Magic Threshold: Alpha (α) 🎯

Before the test, you decide: "How unlikely does my result need to be before I believe something real happened?" The most common answer is 5% (α = 0.05). It's like a jury saying "I need 95% certainty before saying guilty!"

p-value < 0.05

YES! 🎉 The result is probably real! If nothing had actually changed, a difference this big would show up less than 5% of the time. Ship the new version!

🤷

p-value ≥ 0.05

Not sure yet. A difference this size could easily be plain luck. Collect more data before deciding.

⚠️

"Significant" doesn't mean "amazing" or "huge"! Even a tiny, boring change can have a low p-value if you test millions of people. Always also check: How big is the actual difference? Is it worth caring about? 🤔

Confidence Intervals (CI)

Instead of just ONE number, a CI gives you a whole range! It says "the truth is probably somewhere in HERE." Way more honest! 📦

🌦️
The Weather App Analogy 🌦️

What's more helpful?
📺 "Tomorrow will be exactly 21°C"
📱 "Tomorrow will be between 18°C and 24°C"

The second one is more honest because it admits: "We're not 100% sure, but we're pretty confident it's in this range!"

A 95% Confidence Interval works the same way. It says: "We're 95% sure the true effect is somewhere in this range."
If we ran the same experiment 100 times, the real answer would land inside our range about 95 times! 🎯

CI = (p_B − p_A) ± z* × SE_diff
95% Confidence Interval
z* = 1.96 for 95% CI
z* = 2.576 for 99% CI
z* = 1.645 for 90% CI
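As a Python sketch, reusing the pooled SE from earlier to stay consistent with the formula above (textbooks sometimes use an unpooled SE for CIs; the function name and toy counts are just for illustration):

```python
import math

def confidence_interval(x_a, n_a, x_b, n_b, z_star=1.96):
    """(p_B - p_A) +/- z* x SE_diff, using the pooled SE as above."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_hat = (x_a + x_b) / (n_a + n_b)
    se_diff = math.sqrt(p_hat * (1 - p_hat) * (1 / n_a + 1 / n_b))
    diff = p_b - p_a
    return diff - z_star * se_diff, diff + z_star * se_diff

# Same toy numbers: 10% vs 12% on 1,000 users each.
lo, hi = confidence_interval(100, 1000, 120, 1000)
print(round(lo, 4), round(hi, 4))  # -0.0074 0.0474, crosses zero!
```

The range crosses zero, which matches the Z-score of about 1.43 we got earlier: not significant yet.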

How to read your CI like a pro 🏆

Your CI says... | What it means in real life
[+2%, +8%], all positive | 🎉 B is almost certainly better! The entire range is good news. Launch it!
[-8%, -2%], all negative | 🚫 B is almost certainly worse! Don't use it. Throw it away.
[-1%, +5%], crosses zero | 🤔 We're not sure! It might be good or bad. Test more people before deciding!
[+0.01%, +0.02%], tiny range | 😒 Yes it's real, but who cares? The effect is so tiny it doesn't matter in real life.
💡

Cool connection: CI and p-value always agree! If zero is NOT inside your 95% CI → p-value is less than 0.05 → it's significant! They're just two ways of saying the same thing. 🤝

The Two Ways to be Wrong

Even when you do everything right, statistics can trick you! There are exactly two ways this can happen. Let's meet them! 👾

🚨
The School Fire Drill Analogy 🚨

Type I Error, the False Alarm: The alarm goes off during lunch. Everyone runs outside. But oops: there was no real fire! You thought something was happening when it wasn't. In A/B testing: you said "B is better!" but it actually wasn't. Just luck tricked you. 🙈

Type II Error, the Missed Fire: There IS a real fire in the art room, but the alarm never went off! Nobody noticed. In A/B testing: B genuinely WAS better, but your test didn't catch it. You missed the real thing. 😬

 | H₀ is TRUE (no real effect) | H₀ is FALSE (real effect exists)
You REJECT H₀ (say "significant!") | 🚨 Type I Error: False Positive, rate = α | Correct! True Positive, rate = Power
You KEEP H₀ (say "not significant") | Correct! True Negative, rate = 1 − α | 😶 Type II Error: False Negative, rate = β
α, Alpha

Your False Alarm Rate

You get to choose this! Usually we set α = 0.05, meaning: "I'm okay with being tricked by luck about 5% of the time, 1 in 20 experiments." Lower α = harder to fool you, but you need more data.

β, Beta

Your Miss Rate

This is how often you miss a real effect. Usually β = 0.20, meaning you're okay missing real effects 20% of the time. Test more people → β goes down → you miss fewer real effects!

Power & How Many People to Test

Power is your "chance of catching the real thing." Sample size is how many people you need to test. They work together! 💪

🔭
The Telescope Analogy 🔭

Imagine you're looking for a tiny star in the sky. With a small toy telescope, you'll probably miss it. With a giant telescope, you can see everything! Statistical Power is like the size of your telescope: the higher the power, the more likely you'll spot a real effect if it exists. Most scientists aim for at least 80% power, meaning they'll catch a real effect 8 times out of 10.

Power = 1 − β
Statistical Power
The goal: Power ≥ 80%, catch real effects 8 out of 10 times!

To get 80% power, you need enough people. Too few → you'll miss real things. The formula for how many people you need:

n = ( z_α/2 + z_β )² × [ p_A(1−p_A) + p_B(1−p_B) ] / (p_B − p_A)²
Required Sample Size Per Group
z_α/2 = 1.96 for α = 0.05 two-tailed
z_β = 0.842 for 80% power
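Here's the formula as a Python sketch, with the z values fixed to the defaults above (α = 0.05 two-tailed, 80% power); the function name is just for illustration:

```python
import math

def sample_size_per_group(p_a, p_b):
    """Required users PER GROUP to detect a lift from p_a to p_b."""
    z_alpha = 1.96  # z_{alpha/2} for alpha = 0.05, two-tailed
    z_beta = 0.842  # z_beta for 80% power
    variance_sum = p_a * (1 - p_a) + p_b * (1 - p_b)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p_b - p_a) ** 2
    return math.ceil(n)  # round UP: you can't test a fraction of a person

# To detect a lift from 10% to 12% conversion:
print(sample_size_per_group(0.10, 0.12))  # 3840 per group!
```

Notice how many people a modest 2-point lift demands: nearly 4,000 per group. That's why "Too Few People" is on the mistakes list below.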

MDE: The Smallest Effect That Counts 🔍

MDE = Minimum Detectable Effect. It's the smallest difference you want to be able to catch. Imagine trying to notice if someone grew 1 mm taller vs 1 foot taller: the smaller the difference, the more people you need to test! A small MDE needs a LOT more people.

🎯

The golden rule: MDE, sample size, power, and α are all connected like a seesaw. Make one more demanding (smaller MDE, smaller α, or higher power) and the number of people you need goes up. Use the calculator below to find the right balance! ⚖️

Sample Size Calculator (interactive widget: enter your baseline conversion rate and the lift you want to detect to see the required users per group, total users, the detected CR for B, and the absolute lift).

🚫 Mistakes to Never Make

1

Peeking at Results Early 👀

Imagine peeking at your exam paper before you're done. If you stop the test early the moment you see a good result, you're cheating! You'll think you found something real when it's just lucky noise. Wait until you reach your planned sample size!

2

Testing 20 Things at Once 🎰

If you test 20 different button colors at once with a 5% false-alarm chance each, you'll get about 1 fake winner on average, and there's roughly a 64% chance (1 − 0.95²⁰) of at least one, just by accident. Test one thing at a time, or use special math to correct for it (like the Bonferroni correction).
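Don't take my word for it: here's a quick simulation. Under H₀ a p-value is equally likely to land anywhere between 0 and 1, so each test is modeled as a 5% "false alarm" coin flip (all names here are illustrative):

```python
import random

random.seed(42)  # fixed seed so the simulation is repeatable

def at_least_one_false_alarm(n_tests=20, alpha=0.05):
    """Run n_tests A/A tests (no real effect) and report whether
    at least one comes up 'significant' purely by chance."""
    return any(random.random() < alpha for _ in range(n_tests))

trials = 10_000
hits = sum(at_least_one_false_alarm() for _ in range(trials))
print(hits / trials)  # about 0.64: likely, though not guaranteed
```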

3

Too Few People 🐜

Testing 50 people is like flipping a coin 3 times and deciding if it's rigged. Not enough! Use the sample size calculator above to know the minimum before you start.

4

The "New Toy" Effect 🆕

People click new things just because they're new and shiny, not because they're actually better! Always run your test long enough that the novelty wears off. Usually at least 1-2 full weeks.

5

Simpson's Paradox 🤯

The overall data says B wins, but when you look at phone users vs computer users separately, BOTH groups actually prefer A! The groups were different sizes and tricked you. Always look at segments separately, not just the total!

🧮 Full A/B Test Calculator

Enter your experiment results below and get a complete statistical analysis.

(Interactive widget: shows CR for each group, absolute and relative lift, pooled SE, Z-score, p-value, and the 95% CI.)
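If you'd rather run the numbers yourself, here's a minimal Python version of such a calculator, stitching together every formula from this guide (the function name and toy inputs are made up for illustration):

```python
import math

def ab_test(x_a, n_a, x_b, n_b, alpha=0.05):
    """End-to-end two-proportion z-test: rates, lift, SE, Z, p-value, 95% CI.
    A sketch of the steps in this guide, not a full stats library."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_hat = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(p_hat * (1 - p_hat) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    p_value = 2 * (1 - phi)                            # two-tailed
    margin = 1.96 * se                                 # z* for a 95% CI
    return {
        "cr_a": p_a,
        "cr_b": p_b,
        "absolute_lift": p_b - p_a,
        "relative_lift": (p_b - p_a) / p_a,
        "pooled_se": se,
        "z": z,
        "p_value": p_value,
        "ci_95": (p_b - p_a - margin, p_b - p_a + margin),
        "significant": p_value < alpha,
    }

# Toy experiment: 500/10,000 clicks in A vs 580/10,000 in B.
result = ab_test(500, 10_000, 580, 10_000)
print(f"Z = {result['z']:.2f}, p = {result['p_value']:.4f}")
```

With these toy numbers (5.0% vs 5.8% on 10,000 users each), Z lands around 2.5 and p around 0.012, so that lift would be significant at α = 0.05.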

📋 The Cheat Sheet

Everything you need, on one page.

CR
Conversion Rate
Conversions ÷ Visitors
SE
Standard Error
√(p·(1−p)/n): variability of your estimate
Z
Z-Score / Test Statistic
(p_B - p_A) / SE_diff
p
P-Value
P(data this extreme | H₀ true)
α
Significance Level
Threshold for p-value. Usually 0.05
β
Type II Error Rate
Prob. of missing a real effect. Usually 0.20
1-β
Statistical Power
Prob. of detecting a real effect. ≥ 80%
CI
Confidence Interval
estimate ± z* × SE
MDE
Min. Detectable Effect
Smallest lift your test is powered to detect
H₀
Null Hypothesis
"No difference": the default assumption
H₁
Alternative Hypothesis
What you're trying to prove
n
Sample Size (per group)
Use the (z_α/2 + z_β)² formula above to calculate

⚡ The A/B Testing Checklist

☐ Define your metric before starting
☐ Write down H₀ and H₁ upfront
☐ Calculate required sample size
☐ Randomly assign users to A or B
☐ Run for full weeks (avoid day-of-week bias)
☐ Don't peek mid-test!
☐ Check for SRM (Sample Ratio Mismatch)
☐ Segment results by device, user type
☐ Report effect size AND p-value
☐ Document and share learnings

Made with 💚 for everyone who's ever stared at a p-value in confusion

Remember: statistics is just a tool for making better decisions under uncertainty.