Bonferroni Correction Calculator

Calculate adjusted significance thresholds for multiple hypothesis testing to control the family-wise error rate.


📖 What is the Bonferroni Correction?

The Bonferroni correction is a statistical adjustment applied when multiple hypothesis tests are conducted simultaneously. When you perform many tests, the chance of obtaining at least one false positive result - a Type I error - increases substantially even if all null hypotheses are true. The Bonferroni correction addresses this by dividing your significance threshold α by the number of tests k, requiring each individual test to meet the stricter threshold of α/k to be declared significant.

The correction is named after Italian mathematician Carlo Emilio Bonferroni, whose 1936 inequality underlies the method. The Bonferroni inequality states that the probability that at least one of k events occurs is at most the sum of their individual probabilities. If each test has false positive probability α/k, the probability of at least one false positive across all k tests is at most k × (α/k) = α - exactly the FWER you want to control.

Beyond the simple Bonferroni threshold, this calculator also computes the Šidák correction - the exact threshold for independent tests - and the Holm-Bonferroni step-down procedure, which is uniformly more powerful than simple Bonferroni. For a list of p-values sorted in ascending order, Holm compares the i-th smallest to α/(k−i+1), allowing more hypotheses to be rejected while still controlling FWER.

Multiple testing corrections are critical in genomics (GWAS tests millions of SNPs), clinical trials with multiple endpoints, psychological studies testing many outcomes, pairwise comparisons after ANOVA, and any analysis where you are fishing through many comparisons. Without correction, a researcher who tests 20 independent hypotheses would expect one false positive at the 5% level even if nothing is real - the replication crisis in psychology and medicine is partly attributed to widespread use of multiple comparisons without correction.

📐 Formulas

Bonferroni: α_adj = α / k

Šidák correction (exact for independent tests): α_Šidák = 1 − (1 − α)^(1/k)

Holm-Bonferroni step-down: Sort p-values p_(1) ≤ p_(2) ≤ … ≤ p_(k). Compare p_(i) to α/(k − i + 1). Reject H_(i) if p_(i) < α/(k − i + 1). Stop at the first non-rejection; declare all remaining as non-significant.
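The step-down rule above can be sketched in Python (`holm_bonferroni` is an illustrative helper, not part of the calculator):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down: compare the i-th smallest p-value to alpha/(k - i + 1),
    stopping at the first non-rejection. Returns reject/keep flags in the
    original input order."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])  # indices, ascending by p
    reject = [False] * k
    for rank, idx in enumerate(order):
        # threshold alpha/(k - i + 1) with i = rank + 1
        if p_values[idx] < alpha / (k - rank):
            reject[idx] = True
        else:
            break  # first failure: all remaining (larger) p-values are kept
    return reject
```

For example, `holm_bonferroni([0.01, 0.04, 0.03])` rejects only the first hypothesis: 0.01 clears 0.05/3 ≈ 0.0167, but the next smallest, 0.03, fails its threshold of 0.05/2 = 0.025, so the procedure stops there.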

FWER (unadjusted, k independent tests): FWER = 1 − (1 − α)^k

Variables:

k - number of comparisons (hypothesis tests) in the family

α - desired family-wise error rate (typically 0.05)

α_adj - adjusted per-test significance threshold
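As a sketch, the formulas above translate directly into Python (the function names are illustrative):

```python
def bonferroni_threshold(alpha, k):
    # α_adj = α / k
    return alpha / k

def sidak_threshold(alpha, k):
    # exact for k independent tests: α_Šidák = 1 − (1 − α)^(1/k)
    return 1 - (1 - alpha) ** (1 / k)

def uncorrected_fwer(alpha, k):
    # FWER = 1 − (1 − α)^k for k independent tests run at per-test level α
    return 1 - (1 - alpha) ** k
```

For k = 20 at α = 0.05 this gives a Bonferroni threshold of 0.0025, a Šidák threshold of about 0.00256, and an uncorrected FWER of about 64%.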

📖 How to Use This Calculator

1. Enter k - the total number of hypothesis tests you are performing. For post-ANOVA pairwise comparisons among 5 groups: k = 5×4/2 = 10.
2. Set your desired FWER (α = 0.05 for 5% family-wise error rate). This is the overall false positive probability across all tests combined.
3. Optionally enter your actual p-values as a comma-separated list. The calculator will evaluate each against both Bonferroni and Holm thresholds and flag them as significant or not.
4. Click Calculate Correction. Compare each of your test p-values to the Bonferroni threshold. For the Holm method, follow the table order (sorted ascending) and stop at the first non-significant result.
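Steps 3 and 4 amount to the following check - a minimal sketch, with `evaluate_pvalues` as a hypothetical helper rather than the calculator's actual code:

```python
def evaluate_pvalues(p_string, alpha=0.05):
    """Flag each p-value in a comma-separated list as significant or not
    under the Bonferroni and Holm thresholds."""
    ps = [float(s) for s in p_string.split(",")]
    k = len(ps)
    bonferroni = alpha / k

    # Holm: walk the p-values in ascending order, stop at the first failure
    holm_reject = set()
    for rank, (idx, p) in enumerate(sorted(enumerate(ps), key=lambda t: t[1])):
        if p < alpha / (k - rank):
            holm_reject.add(idx)
        else:
            break

    return [
        {"p": p, "bonferroni": p < bonferroni, "holm": idx in holm_reject}
        for idx, p in enumerate(ps)
    ]
```

For `"0.004, 0.030, 0.020"` the Bonferroni threshold is 0.05/3 ≈ 0.0167, so only 0.004 passes it, while Holm rejects all three (0.004 < 0.0167, 0.020 < 0.025, 0.030 < 0.05).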

💡 Example Calculations

Example 1 - Pairwise comparisons after ANOVA (4 groups, 6 tests)

1. Setup: You test four drug doses against each other: k = 4×3/2 = 6 pairwise tests at FWER = 0.05.
2. Bonferroni threshold: 0.05 / 6 ≈ 0.0083. Any p-value below 0.0083 is significant after correction.
3. Your p-values: 0.001, 0.009, 0.020, 0.040, 0.060, 0.200. After Bonferroni: only 0.001 is significant (0.009 > 0.0083). After Holm: 0.001 (threshold 0.05/6 ≈ 0.0083 ✓) and 0.009 (threshold 0.05/5 = 0.010 ✓) are both significant.
4. Conclusion: Holm finds one extra significant comparison that Bonferroni misses - this is why Holm is preferred over simple Bonferroni when you have multiple p-values.
Result = Bonferroni α ≈ 0.0083; Holm rejects 2 hypotheses vs Bonferroni's 1
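The Bonferroni and Holm thresholds used in this example can be reproduced in a few lines of Python (illustrative only):

```python
alpha, k = 0.05, 6

# Flat Bonferroni threshold applied to every test
bonferroni = alpha / k  # ≈ 0.0083

# Holm thresholds for the 1st..6th smallest p-value: alpha/(k - i + 1)
holm = [alpha / (k - i + 1) for i in range(1, k + 1)]
# ≈ [0.0083, 0.0100, 0.0125, 0.0167, 0.0250, 0.0500]
```

The smallest p-value faces the same threshold under both methods; every later one faces a progressively looser Holm threshold, which is where Holm gains its extra power.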

Example 2 - Clinical trial with multiple outcomes (k = 4)

1. Setup: A trial tests a new drug on 4 outcomes: mortality, hospitalization, quality of life, adverse events. k = 4, α = 0.05.
2. Bonferroni threshold: 0.05 / 4 = 0.0125. Šidák threshold: 1 − (1 − 0.05)^(1/4) = 0.01274 - virtually identical to Bonferroni.
3. Observed p-values: 0.003, 0.018, 0.042, 0.180. Only p = 0.003 survives Bonferroni (0.003 < 0.0125). After Holm: 0.003 (threshold 0.05/4 = 0.0125 ✓), 0.018 (threshold 0.05/3 = 0.0167 - fails), stop. Still only one significant outcome.
4. Implication: The drug significantly reduces only mortality after correction. The hospitalization reduction (p = 0.018) is not significant after correcting for 4 outcomes - a result that would look significant without correction.
Result = Bonferroni α = 0.0125; 1 significant outcome (mortality only)
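This example's numbers can be checked with a short Python sketch (not the calculator's own code):

```python
alpha, k = 0.05, 4
p_values = [0.003, 0.018, 0.042, 0.180]  # already sorted ascending

bonferroni = alpha / k              # 0.0125
sidak = 1 - (1 - alpha) ** (1 / k)  # ≈ 0.01274

bonf_significant = [p for p in p_values if p < bonferroni]

# Holm step-down: i-th smallest vs alpha/(k - i + 1), stop at first failure
holm_significant = []
for i, p in enumerate(p_values):
    if p < alpha / (k - i):
        holm_significant.append(p)
    else:
        break
```

Both procedures retain only mortality (p = 0.003): under Holm, 0.018 must beat 0.05/3 ≈ 0.0167 and fails.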

❓ Frequently Asked Questions

Why do we need multiple testing correction?
When you run multiple hypothesis tests simultaneously, the probability of making at least one false positive (Type I error) increases rapidly - even if each individual test uses a 5% threshold. With k independent tests, the family-wise error rate (FWER) is 1 − (1 − 0.05)^k. For 10 tests, FWER = 40%; for 20 tests, FWER = 64%. Without correction, a scientist running 20 comparisons would expect about one false significant finding by chance alone, even with no real effects. Multiple testing correction controls the FWER at the desired level (e.g. 5%) across the entire family of tests.
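As a quick check of the figures above:

```python
alpha = 0.05
# FWER = 1 − (1 − α)^k for k independent tests, each run at level α
fwer = {k: 1 - (1 - alpha) ** k for k in (1, 10, 20)}
# ≈ {1: 0.05, 10: 0.40, 20: 0.64}
```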
How does Bonferroni correction work?
Bonferroni correction is the simplest approach: divide your significance threshold α by the number of tests k, giving an adjusted threshold of α/k. Each individual test is then compared to this stricter threshold. The logic is conservative: under the Bonferroni inequality, the probability that at least one of the k tests produces a false positive is at most k × (α/k) = α. Bonferroni is slightly conservative even for independent tests - the Šidák correction is exact in that case - and becomes increasingly conservative (over-corrects) when tests are positively correlated, which is common in practice.
What is the Holm-Bonferroni method and how does it differ from Bonferroni?
The Holm-Bonferroni method (Holm, 1979) is a sequential step-down procedure that is always at least as powerful as Bonferroni while still controlling FWER at α. It works by sorting the p-values in ascending order and comparing the i-th smallest p-value to α/(k−i+1) rather than α/k. If the smallest p-value passes, move to the next; as soon as any p-value fails, it and all larger p-values are declared non-significant. Because every p-value except the smallest faces a less strict threshold than the flat α/k, Holm rejects at least as many hypotheses and is therefore more powerful. Use Holm when you have multiple p-values to evaluate.
What is the Šidák correction?
The Šidák correction is an exact alternative to Bonferroni for independent tests. Instead of dividing α by k, it uses the formula α_Šidák = 1 − (1 − α)^(1/k). This is derived from the exact probability that none of k independent tests produce a false positive: (1 − α_adj)^k = 1 − α. Šidák is slightly less conservative than Bonferroni because the Bonferroni correction uses an inequality while Šidák uses the exact formula. For k = 20 at α = 0.05: Bonferroni threshold = 0.0025, Šidák threshold = 0.00256 - very similar in practice.
When is Bonferroni too conservative?
Bonferroni can be excessively conservative in several situations: (1) when tests are highly positively correlated, such as testing multiple related outcomes in the same sample - correlated tests provide less independent information, so the true FWER is much lower than Bonferroni assumes; (2) in GWAS (genome-wide association studies) where hundreds of thousands of correlated SNPs are tested - Bonferroni would require p < 5×10⁻⁸ but the tests are not all independent; (3) in exploratory research where you are happy to accept some false positives in exchange for fewer missed discoveries - here the Benjamini-Hochberg FDR procedure (controlling false discovery rate instead of FWER) is more appropriate.
What is FWER vs FDR in multiple testing?
FWER (family-wise error rate) is the probability of making one or more false positives across all tests - Bonferroni and Holm control this. FDR (false discovery rate) is the expected proportion of significant results that are false positives - Benjamini-Hochberg controls this. FWER control is stricter and appropriate when even one false positive is costly (e.g. clinical trials, regulatory submissions). FDR control is more appropriate in exploratory genomics, transcriptomics, or large-scale screening where you expect many true effects and can tolerate some false positives if you follow up with confirmatory experiments.
How is Bonferroni correction used in GWAS?
Genome-wide association studies (GWAS) test millions of single nucleotide polymorphisms (SNPs) for association with a trait. Applying Bonferroni correction at α = 0.05 across ~1 million independent tests gives a threshold of p < 5×10⁻⁸. This is the widely adopted 'genome-wide significance threshold'. However, because many SNPs are in linkage disequilibrium (correlated), the effective number of independent tests is much less than 1 million, making strict Bonferroni overly conservative - permutation testing or spectral decomposition methods are used to compute data-adaptive thresholds.
What happens if I apply Bonferroni after ANOVA pairwise comparisons?
If you run a one-way ANOVA with k groups, post-hoc pairwise comparisons require k(k−1)/2 tests. For 5 groups that is 10 tests; for 8 groups it is 28 tests. Applying Bonferroni to these 28 tests at α = 0.05 gives a threshold of 0.05/28 ≈ 0.0018 - very strict. Dedicated post-hoc tests like Tukey's HSD or Scheffé's method are designed for this situation and generally offer better power than Bonferroni because they account for the ANOVA structure. Bonferroni is a reasonable first approximation but dedicated procedures are preferred.
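A quick sketch of the arithmetic (`pairwise_tests` is an illustrative name):

```python
def pairwise_tests(groups):
    # k(k-1)/2 post-hoc pairwise comparisons after a one-way ANOVA
    return groups * (groups - 1) // 2
```

For 8 groups this gives 28 comparisons, and a Bonferroni threshold of 0.05/28 ≈ 0.0018.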