A/B Test Calculator

Enter your experiment data to check if the conversion rate difference is statistically significant.


📖 What is an A/B Test?

An A/B test (also called a split test or controlled experiment) is a method of comparing two versions of something - a web page, email subject line, button colour, pricing page, or product feature - by randomly assigning users to one version (control) or the other (variant) and measuring which version produces better outcomes. It is the gold standard for evidence-based product and marketing decisions because it isolates the effect of the change being tested from all other variables.

The core question of any A/B test is: is the observed difference in conversion rates real, or could it be explained by random chance? Statistical significance testing answers this question by computing a p-value - the probability of observing a difference this large (or larger) if the two versions truly performed identically. A p-value below your significance threshold (typically 0.05) means the result is statistically significant: you can reject the null hypothesis that there is no difference.

A/B testing uses the two-proportion Z-test. Both groups have binary outcomes (converted or not converted), so each conversion follows a Bernoulli distribution. With sufficient sample sizes, the sampling distribution of the difference in proportions is approximately normal by the Central Limit Theorem, allowing Z-test inference. The pooled standard error uses a combined estimate of the conversion probability under the null hypothesis that both groups share the same rate.

Beyond the p-value, a complete A/B test analysis includes: the confidence interval for the true difference (which tells you the range of plausible lift values), the effect size (absolute and relative lift), statistical power (the probability of detecting a real effect given your sample sizes), and the minimum detectable effect (the smallest lift your test is powered to find). This calculator computes all of these from your experiment data.

📐 Formulas

Z = (p̂_V − p̂_C) / √(p̂(1−p̂) × (1/n_C + 1/n_V))

Where:

p̂_C = c_C / n_C - control conversion rate (conversions / visitors)

p̂_V = c_V / n_V - variant conversion rate

p̂ = (c_C + c_V) / (n_C + n_V) - pooled conversion rate (used only for the null hypothesis)

p-Value (two-tailed): p = 2 × Φ(−|Z|), where Φ is the standard normal CDF.

Confidence interval for difference (unpooled SE):

CI = (p̂_V − p̂_C) ± z_α/2 × √(p̂_C(1−p̂_C)/n_C + p̂_V(1−p̂_V)/n_V)

Minimum Detectable Effect (at 80% power):

MDE = (z_α/2 + z_β) × √(p̂_C(1−p̂_C)/n_C + p̂_C(1−p̂_C)/n_V), where z_α/2 = 1.96 at 95% confidence and z_β = 0.842 for 80% power.

Absolute Lift: p̂_V − p̂_C

Relative Lift: (p̂_V − p̂_C) / p̂_C × 100%
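The formulas above can be combined into one small function. This is a sketch using only the Python standard library; the function name `ab_test` and the returned field names are my own, not part of the calculator:

```python
from statistics import NormalDist

def ab_test(c_c, n_c, c_v, n_v, alpha=0.05):
    """Two-proportion Z-test for an A/B experiment.

    c_c, n_c: control conversions and visitors
    c_v, n_v: variant conversions and visitors
    """
    nd = NormalDist()                      # standard normal distribution
    p_c, p_v = c_c / n_c, c_v / n_v       # per-group conversion rates
    p_pool = (c_c + c_v) / (n_c + n_v)    # pooled rate (null hypothesis)

    # Pooled SE for the Z statistic, unpooled SE for the confidence interval
    se_pooled = (p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v)) ** 0.5
    se_unpooled = (p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v) ** 0.5

    z = (p_v - p_c) / se_pooled
    p_value = 2 * nd.cdf(-abs(z))         # two-tailed p-value
    z_crit = nd.inv_cdf(1 - alpha / 2)    # 1.96 for alpha = 0.05
    ci = ((p_v - p_c) - z_crit * se_unpooled,
          (p_v - p_c) + z_crit * se_unpooled)

    # MDE at 80% power (z_beta = 0.842), using the control rate
    z_beta = nd.inv_cdf(0.80)
    mde = (z_crit + z_beta) * (p_c * (1 - p_c) * (1 / n_c + 1 / n_v)) ** 0.5

    return {"z": z, "p_value": p_value, "ci": ci, "mde": mde,
            "abs_lift": p_v - p_c, "rel_lift": (p_v - p_c) / p_c}
```

For instance, `ab_test(300, 10_000, 366, 10_000)` yields Z ≈ 2.60 and p ≈ 0.009.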

📖 How to Use This Calculator

1. Enter the number of visitors and conversions for your control group (the original, unchanged version).
2. Enter the same for the variant group (the new version you are testing). Ensure each visitor is counted only once and the groups are mutually exclusive.
3. Select your significance level: 95% (α = 0.05) is the industry standard; use 99% for high-stakes decisions. Select two-tailed unless you pre-registered a directional hypothesis.
4. Click Calculate Significance. The verdict at the top tells you immediately whether the result is significant. Review the Z-statistic, p-value, and confidence interval for the full picture.
5. Check the MDE: if it is larger than the effect you expect from the variant, your test is underpowered and you should collect more data before concluding there is no effect.

💡 Example Calculations

Example 1 - E-commerce checkout button test (significant result)

1. Setup: Control (green button) - 10,000 visitors, 300 purchases (3.00% conversion). Variant (orange button) - 10,000 visitors, 366 purchases (3.66% conversion).
2. Pooled rate: p̂ = (300 + 366) / 20,000 = 3.33%. SE = √(0.0333 × 0.9667 × (1/10,000 + 1/10,000)) = 0.00254.
3. Z = (0.0366 − 0.0300) / 0.00254 = 2.60. p-value (two-tailed) = 0.0093 - significant at 99% confidence.
4. Result: Absolute lift = +0.66 pp. Relative lift = +22%. 95% CI: +0.16% to +1.16%. The entire CI is positive - confidently beneficial. Ship the orange button.

Result = Z = 2.60, p = 0.0093 - Significant at 99% confidence
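As a sanity check, the arithmetic in this example (300 of 10,000 vs 366 of 10,000) can be reproduced with a few lines of standard-library Python:

```python
from statistics import NormalDist

# Example 1: control 300/10,000 (3.00%), variant 366/10,000 (3.66%)
p_c, p_v = 300 / 10_000, 366 / 10_000
p_pool = (300 + 366) / 20_000                             # 0.0333

se = (p_pool * (1 - p_pool) * (1/10_000 + 1/10_000)) ** 0.5   # pooled SE
z = (p_v - p_c) / se
p_value = 2 * NormalDist().cdf(-abs(z))                   # two-tailed

print(round(z, 2), round(p_value, 4))                     # 2.6 0.0093
```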

Example 2 - Landing page headline test (not significant)

1. Setup: Control - 4,000 visitors, 180 sign-ups (4.50%). Variant - 4,000 visitors, 198 sign-ups (4.95%). Looks like a 10% relative lift.
2. Pooled rate: p̂ = (180 + 198) / 8,000 = 4.725%. SE = √(0.04725 × 0.95275 × (1/4,000 + 1/4,000)) = 0.00474. Z = (0.0495 − 0.0450) / 0.00474 = 0.95. p-value (two-tailed) = 0.34 - not significant. This difference is easily explained by chance.
3. Power: ~16% - the test is severely underpowered. The MDE at 80% power is ~1.3 percentage points. The observed lift of 0.45 pp is too small to detect reliably at this sample size.
4. Conclusion: Do not call this a win. Either collect more data (you need roughly 35,000 visitors per group to detect a 0.45 pp lift at 95% confidence and 80% power) or accept that the headline change has negligible impact.

Result = Z = 0.95, p = 0.34 - Not significant (severely underpowered, ~16% power)
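The same standard-library check reproduces this example's statistics (180 of 4,000 vs 198 of 4,000):

```python
from statistics import NormalDist

# Example 2: control 180/4,000 (4.50%), variant 198/4,000 (4.95%)
p_c, p_v = 180 / 4_000, 198 / 4_000
p_pool = (180 + 198) / 8_000                              # 4.725%

se = (p_pool * (1 - p_pool) * (1/4_000 + 1/4_000)) ** 0.5     # pooled SE
z = (p_v - p_c) / se
p_value = 2 * NormalDist().cdf(-abs(z))                   # two-tailed

print(round(z, 2), round(p_value, 2))                     # 0.95 0.34
```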

❓ Frequently Asked Questions

How many visitors do I need for an A/B test?
Sample size depends on your baseline conversion rate, the minimum lift you want to detect (MDE), significance level, and desired power. As a rough guide: if your control converts at 3% and you want to detect a 20% relative lift (to 3.6%), you need roughly 15,000–20,000 visitors per variant at 95% confidence and 80% power. The lower your baseline rate or the smaller the effect you want to detect, the more visitors you need. Use the MDE output from this calculator as a guide - if the MDE is larger than the lift you realistically expect, you need more traffic.
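As a rough illustration of this calculation, here is the standard two-proportion approximation (a sketch; `sample_size_per_group` is my own name, and exact figures vary between tools):

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base, rel_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per group for a two-proportion Z-test."""
    nd = NormalDist()
    p_alt = p_base * (1 + rel_lift)            # rate under the expected lift
    delta = p_alt - p_base                     # absolute lift to detect
    z_a = nd.inv_cdf(1 - alpha / 2)            # 1.96 for 95% confidence
    z_b = nd.inv_cdf(power)                    # 0.842 for 80% power
    var = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil((z_a + z_b) ** 2 * var / delta ** 2)
```

For a 3% baseline and a 20% relative lift this approximation gives roughly 14,000 visitors per group; tools that use different approximations (or continuity corrections) will differ somewhat.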
Should I use a one-tailed or two-tailed test for A/B testing?
Best practice is to use a two-tailed test for most A/B tests. A two-tailed test asks 'is there any difference?' whereas a one-tailed test asks 'is the variant better?' The problem with pre-committing to a one-tailed test is that it makes it easier to reach significance (the critical Z is lower) but you can miss cases where the variant is worse. Two-tailed tests are more conservative and reduce the risk of false positives. Only use a one-tailed test if you genuinely cannot act on a negative result and you pre-registered the direction before data collection.
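The difference in difficulty between the two tests is simply the critical value; both can be computed with the standard library:

```python
from statistics import NormalDist

nd = NormalDist()
# Critical Z at alpha = 0.05: the bar the observed Z must clear
z_two_tailed = nd.inv_cdf(1 - 0.05 / 2)    # alpha split across both tails
z_one_tailed = nd.inv_cdf(1 - 0.05)        # all of alpha in one tail

print(round(z_two_tailed, 3), round(z_one_tailed, 3))   # 1.96 1.645
```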
What is absolute lift vs relative lift in an A/B test?
Absolute lift is the raw difference in conversion rate: variant rate minus control rate. If control is 3.0% and variant is 3.6%, the absolute lift is +0.6 percentage points. Relative lift is the percentage change relative to control: 0.6 / 3.0 = +20%. Absolute lift is more conservative and better for decision-making because a 20% relative lift sounds impressive, but if it's 2.0% → 2.4%, the absolute gain is only 0.4 percentage points - which may not justify the engineering cost. Always report both.
What is the difference between statistical significance and business significance?
Statistical significance (p < 0.05) only tells you the result is unlikely to be random noise - it says nothing about whether the effect is large enough to matter in practice. A large enough experiment can find statistically significant differences that are tiny and commercially irrelevant. Business significance asks: is the lift large enough to justify shipping? Does the confidence interval exclude trivial effects? Always combine the p-value with the confidence interval and a minimum business-relevant effect size to make launch decisions.
What does the p-value actually mean?
The p-value is the probability of observing a test statistic as extreme as (or more extreme than) what you observed, assuming the null hypothesis (no difference) is true. A p-value of 0.03 means that if there truly were no difference between control and variant, there would be only a 3% chance of seeing a gap this large by random sampling variation. It does NOT mean there is a 97% probability the variant is truly better - that is a common misconception. Small p-values are evidence against the null hypothesis, not proof of the alternative.
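One way to internalize this definition is to simulate many A/A tests - experiments where both groups share the same true rate - and confirm that p < 0.05 shows up about 5% of the time by chance alone. A sketch (the sample size, rate, and trial count are arbitrary illustrative choices):

```python
import random
from statistics import NormalDist

random.seed(42)
nd = NormalDist()

def p_value(c_c, n_c, c_v, n_v):
    """Two-tailed p-value from the pooled two-proportion Z-test."""
    p_pool = (c_c + c_v) / (n_c + n_v)
    se = (p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v)) ** 0.5
    z = (c_v / n_v - c_c / n_c) / se
    return 2 * nd.cdf(-abs(z))

# 1,000 A/A experiments: both groups truly convert at 3%
n, rate, trials = 1_000, 0.03, 1_000
false_positives = 0
for _ in range(trials):
    c_c = sum(random.random() < rate for _ in range(n))
    c_v = sum(random.random() < rate for _ in range(n))
    if p_value(c_c, n, c_v, n) < 0.05:
        false_positives += 1

print(false_positives / trials)    # close to 0.05
```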
What is statistical power in an A/B test?
Statistical power is the probability that your test will detect a real effect if one exists. 80% power means that if the variant truly has the conversion rate you observed, you would correctly reject the null hypothesis 80% of the time. Low power (below 70%) means your test risks missing real improvements - you might run the experiment, see p = 0.09, and incorrectly conclude there is no difference. Power depends on sample size, effect size, and significance level. The power estimate in this calculator is post-hoc (computed from your observed data), which makes it useful for interpreting a non-significant result.
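The post-hoc power figure can be approximated with the normal approximation (a sketch; `posthoc_power` is my own name):

```python
from statistics import NormalDist

def posthoc_power(p_c, p_v, n_c, n_v, alpha=0.05):
    """Approximate post-hoc power: the chance a test of this size would
    flag the observed difference, if it were the true effect."""
    nd = NormalDist()
    se = (p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v) ** 0.5
    z_crit = nd.inv_cdf(1 - alpha / 2)
    delta = abs(p_v - p_c)
    # Probability |Z| exceeds the critical value when the true lift is delta
    return nd.cdf(delta / se - z_crit) + nd.cdf(-delta / se - z_crit)
```

For instance, rates of 4.50% vs 4.95% with 4,000 visitors per group give roughly 16% power.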
What is the Minimum Detectable Effect (MDE)?
The MDE is the smallest true lift your experiment can reliably detect given your current sample sizes, at 80% power and your chosen significance level. If the MDE is 0.5 percentage points but you expect the variant to improve conversion by only 0.2 percentage points, your test is underpowered and you should collect more data before drawing conclusions. The MDE is computed using the formula: MDE = (z_α/2 + z_β) × √(p(1−p)/n_C + p(1−p)/n_V), where z_α/2 = 1.96 at 95% confidence and z_β = 0.842 for 80% power.
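This formula translates directly into code (a minimal sketch; `p_c` is the baseline control rate):

```python
from statistics import NormalDist

def mde(p_c, n_c, n_v, alpha=0.05, power=0.80):
    """Smallest absolute lift detectable at the given power and alpha."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)     # 1.96 for 95% confidence
    z_b = nd.inv_cdf(power)             # 0.842 for 80% power
    return (z_a + z_b) * (p_c * (1 - p_c) * (1 / n_c + 1 / n_v)) ** 0.5
```

With a 4.5% baseline and 4,000 visitors per group, this gives an MDE of about 1.3 percentage points.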
Can I stop an A/B test early when it reaches significance?
Stopping early ('peeking') inflates the false positive rate significantly. If you check significance every day and stop as soon as p < 0.05, your actual false positive rate can be 20–30% even though you are using a 5% threshold. To combat this, either pre-commit to a fixed sample size and check only once at the end, use sequential testing methods (like sequential probability ratio tests), or use alpha-spending functions. Most product experimentation platforms (Optimizely, VWO, Google Optimize) use sequential or Bayesian methods precisely because of this problem.
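The inflation from peeking is easy to demonstrate by simulation. The sketch below runs A/A experiments (no true difference) and compares stopping at the first "significant" daily peek against checking once at the end; all parameters are arbitrary illustrative choices:

```python
import random
from statistics import NormalDist

random.seed(1)
nd = NormalDist()

def p_value(c_c, n_c, c_v, n_v):
    """Two-tailed p-value from the pooled two-proportion Z-test."""
    p_pool = (c_c + c_v) / (n_c + n_v)
    se = (p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v)) ** 0.5
    if se == 0:
        return 1.0
    z = (c_v / n_v - c_c / n_c) / se
    return 2 * nd.cdf(-abs(z))

def run_aa_test(daily=200, days=20, rate=0.05, peek=False):
    """One A/A experiment; True if it (wrongly) declares significance."""
    c_c = c_v = n = 0
    for _ in range(days):
        n += daily
        c_c += sum(random.random() < rate for _ in range(daily))
        c_v += sum(random.random() < rate for _ in range(daily))
        if peek and p_value(c_c, n, c_v, n) < 0.05:
            return True            # stopped at the first 'significant' peek
    return p_value(c_c, n, c_v, n) < 0.05

trials = 400
peeking = sum(run_aa_test(peek=True) for _ in range(trials)) / trials
fixed = sum(run_aa_test(peek=False) for _ in range(trials)) / trials
print(f"peek daily: {peeking:.0%}, check once: {fixed:.0%}")
```

The peeking false-positive rate comes out several times higher than the fixed-horizon one, even though both use the same 5% threshold.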
What is a two-proportion Z-test?
The two-proportion Z-test is the standard method for comparing two independent binomial proportions. It computes a pooled standard error using the combined conversion rate from both groups, then calculates how many standard errors the observed difference is from zero. The formula is Z = (p_V − p_C) / SE_pooled, where SE_pooled = √(p̂(1−p̂)(1/n_C + 1/n_V)) and p̂ = (c_C + c_V) / (n_C + n_V). This is valid when both groups have at least 5 conversions and 5 non-conversions.