A/B Testing
Super Simple Guide
From "what the f**k even is a p-value?" to running your first real A/B test , everything explained simply. No PhD required. No boring textbooks. Just vibes and stats. 🔥
What is A/B Testing?
Imagine you're a scientist, but instead of test tubes, you use websites. A/B testing is just a controlled experiment on real users.
You own a pizza shop and wonder: "If I put a red button instead of a
blue button on my online order page, will more people click 'Order Now'?"
So you show Group A the blue button and Group B the red button, at the
same time, randomly, and measure which gets more clicks. That's all A/B testing is!
Control Group
The current version: what already exists. Your baseline.
🔵 Blue Button

Treatment Group
The new version: the thing you changed. Your hypothesis.
🔴 Red Button

Why can't I just… change it and see what happens?
Because the world changes! If you change your button on a Monday, Tuesday traffic might be different anyway. Running A & B at the same time removes external noise.
Randomization
Each user is randomly assigned to A or B, like flipping a coin.
Isolation
Only ONE thing changes. Everything else stays the same.
Measurement
You define a metric upfront: click rate, sign-up rate, revenue, etc.
Key term: Conversion Rate (CR) is the fraction of users who do the thing you want.
If 50 out of 1,000 users clicked the button: CR = 50/1000 = 0.05 = 5%
Your Guess: The Hypothesis
Before you run your test, you write down your guess. AND you write down the "boring guess" too, the one that says nothing special happened. Here's why both matter! 🕵️
Imagine a detective. They don't just say "I think the butler did it!" and
stop there. They start by saying: "I don't know who did it yet." Then they collect clues.
In
A/B testing, we do the same thing! We start by saying "nothing changed" and then we try to
prove ourselves wrong with data. Cool, right?
The Boring Guess (Null Hypothesis)
This is the "nothing special happened" guess. We assume the red button and the blue button do exactly the same. No difference at all. Zero. Zilch. 😐
In plain English: "The red button gets the same number of clicks as the blue button. Nothing changed."
Your Exciting Guess (Alternative Hypothesis)
This is YOUR guess: the one you're trying to prove! You think the red button is different (better or worse) than the blue one. 🎉
In plain English: "The red button gets MORE or FEWER clicks than the blue one. Something actually changed!"
Two ways to guess 🎯
| Type | Your guess | Use it when… |
|---|---|---|
| Two-Tailed | p_B ≠ p_A | "It could be better OR worse, I'm not sure which!" (Use this one most of the time.) |
| One-Tailed | p_B > p_A | "I only care if it's better!" (Easier to prove, but be careful: you're making a big assumption.) |
Super important rule: Write your guess BEFORE you look at any data! If you look at the results first and then make up a guess, that's cheating: it's called HARKing (Hypothesizing After the Results are Known) and it gives you fake answers. Always guess first, look later! 🙅
Standard Error (SE)
SE is your "wiggle amount", it tells you how much your answer might move around just by luck, even if nothing actually changed. 🌊
Imagine your friend says they're really good at bowling. You watch them
play just 3 games. They got lucky once and knocked everything down! But is that their real
skill?
Now imagine watching them play 300 games. Now you have a much better idea of
how good they really are.
SE works the same way. The more people you test (more
games!), the smaller your SE gets, and the more confident you are. More data = smaller wiggle = better
answer! 🎯
The SE Formula for one group
When you're measuring how many people clicked your button, SE looks like this:
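This is the standard formula for the SE of a single conversion rate (p = your conversion rate, n = how many people saw that version):

SE = √( p × (1 − p) / n )

With the pizza-shop numbers from earlier (p = 0.05, n = 1,000): SE = √(0.05 × 0.95 / 1000) ≈ 0.0069, or about 0.7 percentage points of wiggle.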
The SE Formula for comparing A vs B (the real one!)
You want to know if A and B are actually different. So you mix their data together and calculate one big SE:
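The usual pooled version (mash both groups into one combined rate p̂ first):

p̂ = (clicks_A + clicks_B) / (n_A + n_B)
SE_pooled = √( p̂ × (1 − p̂) × (1/n_A + 1/n_B) )

Say the blue button got 50 clicks out of 1,000 and the red one got 70 out of 1,000 (made-up numbers, just for illustration): p̂ = 120/2000 = 0.06, and SE_pooled = √(0.06 × 0.94 × 0.002) ≈ 0.0106.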
Easy rule to remember: More people tested = smaller SE = more trustworthy answer! If you double the number of people you test, your SE shrinks by a factor of about 1.4 (that's √2). So always test enough people! 💪
The Z-Score
Now you know the difference between A and B, and how "wiggly" that difference is (SE). The Z-score asks: "Is this difference big enough to be real, or is it just luck?" 🎲
Imagine all kids at school line up by height. Most are in the middle
(average). Very few are super tall or super short. That's what a bell curve looks like!
Z-score tells
you: "How far away from normal is my result?"
If Z is close to 0 → totally normal,
probably just luck. If Z is ±2 or bigger → wow, that's very unusual! Something real probably happened! 🎉
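The recipe itself is just "the difference divided by its wiggle," using the pooled SE from the last section:

Z = (p_B − p_A) / SE_pooled

With the made-up pizza numbers above (5% vs 7% on 1,000 visitors each, SE_pooled ≈ 0.0106): Z ≈ 0.02 / 0.0106 ≈ 1.9.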
What does Z actually tell me? 🤔
Think of Z like a surprise score. A small Z means "nothing weird here." A big Z means "whoa, this is surprising!" Here's a cheat sheet:
| Z Value | Meaning | How common? |
|---|---|---|
| |Z| < 1.0 | Tiny signal, easily explained by noise | Very common even if nothing changed |
| |Z| ≈ 1.64 | Borderline: used for one-tailed 95% tests | 5% chance if H₀ is true (one side) |
| |Z| ≈ 1.96 | The magic number for two-tailed 95% significance | 5% chance if H₀ is true (both sides) |
| |Z| > 2.58 | Strong signal: the 99% significance threshold | 1% chance if H₀ is true |
The P-Value
This is the most famous number in all of data science, and also the most confused! Let's make it crystal clear. 💎
Imagine your friend shows you a coin and flips it 10
times. It lands on heads 9 times! You think, "wait, is this coin magic
(rigged)?"
The p-value answers: "If the coin is totally normal and fair, how likely is it to get
9 heads out of 10 just by accident?"
The answer is about 1%, very unlikely! So you'd say:
"That coin is probably rigged!" 🎉
In A/B testing: "If NOTHING actually
changed, how likely is it that we'd see THIS big a difference just by luck?" A tiny p-value means
it's very unlikely to be just luck!
The Proper Definition (in plain words)
P-value = "If nothing actually changed, how likely is it that I'd see results this big or bigger, just by random luck?"
Common mistakes about the p-value (don't fall for these!)
❌ It does NOT mean "the chance that I'm wrong": it's more subtle than that
❌ It does NOT tell you HOW MUCH better B is than A
❌ A low p-value does NOT mean the effect is big or important!
How to calculate p-value from Z 🔢
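In plain words: take your Z and measure how much of the bell curve sits even farther out than it, on both sides for a two-tailed test. Here's a quick sketch in Python using scipy (just one convenient tool, not something this guide requires):

```python
from scipy.stats import norm

z = 1.9                                # the Z from the made-up pizza example
p_two_tailed = 2 * norm.sf(abs(z))     # sf(z) = area of the bell curve beyond z
p_one_tailed = norm.sf(z)              # only if you pre-registered a one-tailed test
print(round(p_two_tailed, 3))          # ≈ 0.057
```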
The Magic Threshold: Alpha (α) 🎯
Before the test, you decide: "How unlikely does my result need to be before I believe something real happened?" The most common answer is 5% (α = 0.05). It's like a jury saying "I need 95% certainty before saying guilty!"
p-value < 0.05
YES! 🎉 Statistically significant! A difference this big would be very unlikely if nothing had actually changed. Ship the new version!
p-value ≥ 0.05
Not sure yet. A difference this size could easily show up by luck alone. Need more data before deciding.
"Significant" doesn't mean "amazing" or "huge"! Even a tiny, boring change can have a low p-value if you test millions of people. Always also check: How big is the actual difference? Is it worth caring about? 🤔
Confidence Intervals (CI)
Instead of just ONE number, a CI gives you a whole range! It says "the truth is probably somewhere in HERE." Way more honest! 📦
What's more helpful?
📺 "Tomorrow will be exactly 21°C"
📱 "Tomorrow
will be between 18°C and 24°C"
The second one is more honest because it admits: "We're not 100%
sure, but we're pretty confident it's in this range!"
A 95% Confidence
Interval works the same way. It says: "We're 95% sure the true effect is somewhere in this
range."
If we ran the same experiment 100 times, the real answer would land inside our range about
95 times! 🎯
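The recipe for a 95% CI on the difference in conversion rates (the standard two-proportion version; note it uses each group's own rate rather than the pooled one):

CI = (p_B − p_A) ± 1.96 × √( p_A(1 − p_A)/n_A + p_B(1 − p_B)/n_B )

With the made-up pizza numbers (5% vs 7% on 1,000 visitors each): 0.02 ± 1.96 × 0.0106 ≈ [−0.1%, +4.1%]. It crosses zero, which matches the p ≈ 0.057 we got above: not significant yet.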
How to read your CI like a pro 🏆
| Your CI says... | What it means in real life |
|---|---|
| [+2%, +8%], all positive | 🎉 B is definitely better! The entire range is good news. Launch it! |
| [-8%, -2%], all negative | 🚫 B is definitely worse! Don't use it. Throw it away. |
| [-1%, +5%], crosses zero | 🤔 We're not sure! It might be good or bad. Test more people before deciding! |
| [+0.01%, +0.02%], tiny range | 😒 Yes, it's real, but who cares? The effect is so tiny it doesn't matter in real life. |
Cool connection: CI and p-value always agree! If zero is NOT inside your 95% CI → p-value is less than 0.05 → it's significant! They're just two ways of saying the same thing. 🤝
The Two Ways to be Wrong
Even when you do everything right, statistics can trick you! There are exactly two ways this can happen. Let's meet them! 👾
Type I Error (False Alarm): The alarm goes off during lunch. Everyone runs outside. But oops, there was no real fire! You thought something was happening when it wasn't. In A/B testing: you said "B is better!" but it actually wasn't. It was just luck tricking you. 🙈
Type II Error (Missed Fire): There IS a real fire in the art room, but the alarm never went
off! Nobody noticed. In A/B testing: B genuinely WAS better, but your test didn't catch it. You missed the
real thing. 😬
|  | Nothing really changed (no real effect) | Something really changed (real effect exists) |
|---|---|---|
| You say "significant!" | ❌ Type I Error: false alarm (α) | ✅ Correct! You caught it (power) |
| You say "not significant" | ✅ Correct! Nothing to catch | ❌ Type II Error: you missed it (β) |
Your False Alarm Rate
You get to choose this! Usually we set α = 0.05, meaning: "I'm okay with being tricked by luck about 5% of the time (1 in 20 experiments)." Lower α = harder to fool you, but you need more data.
Your Miss Rate
This is how often you miss a real effect. Usually β = 0.20, meaning you're okay missing real effects 20% of the time. Test more people → β goes down → you miss fewer real effects!
Power & How Many People to Test
Power is your "chance of catching the real thing." Sample size is how many people you need to test. They work together! 💪
Imagine you're looking for a tiny star in the sky. With a small toy telescope, you'll probably miss it. With a giant telescope, you can see everything! Statistical Power is like the size of your telescope: the higher the power, the more likely you'll spot a real effect if it exists. Most scientists aim for at least 80% power, meaning they'll catch a real effect 8 times out of 10.
To get 80% power, you need enough people. Too few → you'll miss real things. The formula for how many people you need:
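A common back-of-the-envelope version for comparing two conversion rates (two-tailed test), sketched in Python; exact numbers vary slightly between calculators, and the function name here is just made up for illustration:

```python
from scipy.stats import norm

def visitors_per_group(baseline_rate, mde, alpha=0.05, power=0.80):
    """Rough sample size per group to detect an absolute lift of `mde`."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05 (two-tailed)
    z_power = norm.ppf(power)           # 0.84 for 80% power
    p_avg = baseline_rate + mde / 2     # average rate across the two groups
    return int(2 * (z_alpha + z_power) ** 2 * p_avg * (1 - p_avg) / mde ** 2) + 1

# Example: 5% baseline, want to catch a 1-point lift (5% -> 6%)
print(visitors_per_group(0.05, 0.01))   # roughly 8,000-and-change per group
```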
MDE: The Smallest Effect That Counts 🔍
MDE = Minimum Detectable Effect. It's the smallest difference you want to be able to catch. Imagine trying to notice if someone grew 1 mm taller vs 1 foot taller: the smaller the difference, the more people you need to test! A small MDE needs a LOT more people.
The golden rule: MDE, sample size, power, and α are all connected like a seesaw: want a smaller MDE, more power, or a stricter α? You'll need more people. Use the calculator below to find the right balance! ⚖️
🚫 Mistakes to Never Make
Peeking at Results Early 👀
Imagine peeking at your exam paper before you're done. If you stop the test early the moment you see a good result, you're cheating! You'll think you found something real when it's just lucky noise. Wait until you reach your planned sample size!
Testing 20 Things at Once 🎰
If you test 20 different button colors at once with a "5% chance of false alarm each", you can expect about 1 fake winner just by accident (roughly a 64% chance of getting at least one). Test one thing at a time, or use special math to correct for it (like the Bonferroni correction).
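Quick worked example of that correction: with 20 tests, Bonferroni says to divide your α by 20, so each individual test has to clear p < 0.05 / 20 = 0.0025 before you declare a winner.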
Too Few People 🐜
Testing 50 people is like flipping a coin 3 times and deciding if it's rigged. Not enough! Use the sample size calculator below to know the minimum before you start.
The "New Toy" Effect 🆕
People click new things just because they're new and shiny, not because they're actually better! Always run your test long enough that the novelty wears off. Usually at least 1-2 full weeks.
Simpson's Paradox 🤯
The overall data says B wins, but when you look at phone users vs computer users separately, BOTH groups actually prefer A! The groups were different sizes and tricked you. Always look at segments separately, not just the total!
🧮 Full A/B Test Calculator
Enter your experiment results below and get a complete statistical analysis.
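Here's a minimal Python sketch of what such a calculator computes, assuming you enter visitor and conversion counts for each group (uses scipy; the function name is just for illustration):

```python
from math import sqrt
from scipy.stats import norm

def ab_test_report(n_a, conv_a, n_b, conv_b, alpha=0.05):
    """Minimal two-proportion A/B analysis: Z, two-tailed p-value, CI at level 1 - alpha."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Pooled SE for the Z-test (H0 assumes both groups share one true rate)
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * norm.sf(abs(z))

    # Unpooled SE for the confidence interval on the difference
    se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = norm.ppf(1 - alpha / 2)
    ci = (diff - z_crit * se_diff, diff + z_crit * se_diff)

    verdict = "significant" if p_value < alpha else "not significant (yet)"
    return {"cr_a": p_a, "cr_b": p_b, "lift": diff, "z": z,
            "p_value": p_value, "ci": ci, "verdict": verdict}

# The made-up pizza-shop numbers from earlier:
print(ab_test_report(n_a=1000, conv_a=50, n_b=1000, conv_b=70))
```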
📋 The Cheat Sheet
Everything you need, on one page.
⚡ The A/B Testing Checklist
Made with 💚 for everyone who's ever stared at a p-value in confusion
Remember: statistics is just a tool for making better decisions under uncertainty.