Hypothesis Testing Calculator

Run a complete, guided hypothesis test in 6 structured steps - from stating H₀ to the final conclusion.


📖 What is Hypothesis Testing?

Hypothesis testing is a formal statistical procedure for using sample data to evaluate a claim about a population parameter. It is the backbone of scientific inquiry - from clinical drug trials to manufacturing quality control to psychological experiments. The procedure answers the question: "Given what I observed in my sample, is there enough evidence to reject the assumption that nothing unusual is happening?"

Every hypothesis test involves two competing statements. The null hypothesis (H₀) is the default, conservative position - typically that a population mean equals a reference value, that two groups are the same, or that a proportion equals a claimed value. The alternative hypothesis (H₁) is what the researcher is trying to demonstrate: that there is a real effect, a real difference, or a meaningful departure from the reference.

The test works by computing a test statistic - a number that summarises how far the sample data is from what H₀ predicts. Under H₀, this statistic follows a known distribution (Z, t, F, chi-square). The p-value measures how likely it is to observe a result at least as extreme as yours if H₀ were true. A small p-value (below the chosen significance level α, typically 0.05) is evidence against H₀, and we reject it in favour of H₁.

This calculator supports five major test types: the one-sample Z-test (population σ known), the one-sample t-test (σ estimated from the sample), the one-proportion Z-test, the two-sample Welch's t-test, and the paired t-test. For every test, it walks through all six standard steps and reports Cohen's d as an effect size, so practical significance is quantified alongside statistical significance.

📐 Formulas

One-sample t: t = (x̄ − μ₀) / (s / √n), df = n − 1

One-sample Z: Z = (x̄ − μ₀) / (σ / √n) - use when population σ is known

One-proportion Z: Z = (p̂ − p₀) / √(p₀(1−p₀)/n) - normal approximation; valid when np₀ ≥ 5 and n(1−p₀) ≥ 5

Two-sample Welch's t: t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂), with df by Welch-Satterthwaite equation

Paired t: t = d̄ / (s_d / √n), df = n − 1, where d̄ = mean of pair differences, s_d = their SD

p-value (two-tailed): p = 2 × P(T > |t_obs|) - compare to α; reject H₀ if p < α

Cohen's d (effect size): d = |x̄ − μ₀| / s for one-sample; d = |x̄₁ − x̄₂| / s_pooled for two-sample
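The formulas above translate directly into code. The sketch below (plain Python, standard library only; the function names are ours, not the calculator's) computes each test statistic:

```python
import math

def t_one_sample(xbar, mu0, s, n):
    """One-sample t statistic: t = (x̄ − μ₀) / (s / √n)."""
    return (xbar - mu0) / (s / math.sqrt(n))

def z_one_sample(xbar, mu0, sigma, n):
    """One-sample Z statistic (population σ known)."""
    return (xbar - mu0) / (sigma / math.sqrt(n))

def z_one_proportion(p_hat, p0, n):
    """One-proportion Z statistic (normal approximation)."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

def t_welch(x1, s1, n1, x2, s2, n2):
    """Welch's two-sample t statistic (no equal-variance assumption)."""
    return (x1 - x2) / math.sqrt(s1**2 / n1 + s2**2 / n2)

def cohens_d_one_sample(xbar, mu0, s):
    """Cohen's d effect size for a one-sample mean test."""
    return abs(xbar - mu0) / s
```

Each function returns only the statistic; the p-value then comes from the matching reference distribution (Z or t with the stated degrees of freedom).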

📖 How to Use This Calculator

1. Choose your test type: one-sample t (most common when σ is unknown), one-sample Z (σ known), one-proportion Z, two-sample Welch's t (comparing two groups), or paired t (matched observations).
2. Enter your sample statistics: mean, standard deviation, and sample size for each group. For proportion tests enter p̂ (as a decimal) and n. For paired tests enter the mean and SD of the differences.
3. Set the null hypothesis reference value (μ₀ or p₀), choose the tail type (two-tailed for most research), and set α (default 0.05).
4. Click Run Hypothesis Test. The calculator displays all 6 steps: (1) State hypotheses, (2) Set α, (3) Test statistic, (4) p-value, (5) Critical value, (6) Conclusion with Cohen's d effect size.

📝 Example Calculations

Example 1 - Medical Treatment Effectiveness (One-Sample t-Test)

A cardiologist wants to know if a new drug changes mean systolic blood pressure from the known baseline of 130 mmHg. A sample of 25 patients shows x̄ = 124, s = 10. Test at α = 0.05, two-tailed.

t = (124 − 130) / (10 / √25) = −6 / 2 = −3.000, df = 24

p ≈ 0.006 < 0.05 - Reject H₀. The drug significantly changes blood pressure.

Cohen's d = |124 − 130| / 10 = 0.60 - medium effect size.

Result = t = −3.000, p ≈ 0.006 → Reject H₀ (Cohen's d = 0.60)
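The arithmetic of this example can be checked in a few lines of plain Python:

```python
import math

xbar, mu0, s, n = 124, 130, 10, 25
t = (xbar - mu0) / (s / math.sqrt(n))   # (124 − 130) / 2 = −3.0
df = n - 1                              # 24
d = abs(xbar - mu0) / s                 # Cohen's d = 0.60
print(t, df, d)                         # -3.0 24 0.6
```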

Example 2 - Manufacturing Quality Test (One-Sample Z-Test)

A factory claims its bolts have mean diameter μ = 12.00 mm with known σ = 0.05 mm. A QA inspector measures n = 36 bolts and finds x̄ = 12.008 mm. Is the process off-spec? α = 0.05, two-tailed.

Z = (12.008 − 12.000) / (0.05 / √36) = 0.008 / 0.00833 = 0.96

p ≈ 0.337 > 0.05 - Fail to Reject H₀. No significant evidence the process is off-spec.

Result = Z = 0.96, p ≈ 0.337 → Fail to Reject H₀
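Since σ is known, the p-value comes straight from the standard normal distribution. A quick check using Python's `statistics.NormalDist`:

```python
from math import sqrt
from statistics import NormalDist

xbar, mu0, sigma, n = 12.008, 12.000, 0.05, 36
z = (xbar - mu0) / (sigma / sqrt(n))      # 0.008 / 0.00833 ≈ 0.96
p = 2 * (1 - NormalDist().cdf(abs(z)))    # two-tailed p ≈ 0.337
```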

Example 3 - Election Polling (One-Proportion Z-Test)

An exit poll of 500 voters shows 54% (p̂ = 0.54) supporting Candidate A. Is there significant evidence the candidate leads (> 50%)? α = 0.05, one-tailed right.

SE = √(0.50 × 0.50 / 500) = 0.02236; Z = (0.54 − 0.50) / 0.02236 = 1.789

p ≈ 0.037 < 0.05 - Reject H₀. Statistically significant evidence that Candidate A leads.

Result = Z = 1.789, p ≈ 0.037 → Reject H₀
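The same check for the proportion test, with the right-tailed p-value from the standard normal:

```python
from math import sqrt
from statistics import NormalDist

p_hat, p0, n = 0.54, 0.50, 500
se = sqrt(p0 * (1 - p0) / n)      # 0.02236 — SE uses p₀, not p̂, under H₀
z = (p_hat - p0) / se             # ≈ 1.789
p = 1 - NormalDist().cdf(z)       # right-tailed p ≈ 0.037
```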

Example 4 - A/B Test (Two-Sample Welch's t-Test)

Website A: mean session time 240 s, s = 45, n = 80. Website B: mean 220 s, s = 60, n = 70. Did redesign improve engagement? α = 0.05, two-tailed.

SE = √(45²/80 + 60²/70) = √(25.31 + 51.43) = √76.74 = 8.76; t = (240 − 220) / 8.76 = 2.283

Welch df ≈ 127; p ≈ 0.024 < 0.05 - Reject H₀. Redesign significantly increased session time.

Result = t = 2.283, p ≈ 0.024 → Reject H₀
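Checking Welch's statistic and the Welch-Satterthwaite degrees of freedom in plain Python:

```python
from math import sqrt

x1, s1, n1 = 240, 45, 80   # Website A
x2, s2, n2 = 220, 60, 70   # Website B

v1, v2 = s1**2 / n1, s2**2 / n2           # 25.31 and 51.43
se = sqrt(v1 + v2)                        # ≈ 8.76
t = (x1 - x2) / se                        # ≈ 2.283
# Welch-Satterthwaite degrees of freedom, ≈ 127 here
df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
```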

Example 5 - Before/After Study (Paired t-Test)

20 students take a study skills course. Mean improvement in test score = 8.5 points, SD of differences = 12.0. Did the course help? α = 0.05, one-tailed right.

t = 8.5 / (12.0 / √20) = 8.5 / 2.683 = 3.168, df = 19

p ≈ 0.003 < 0.05 - Reject H₀. Course significantly improved scores. Cohen's d = 8.5 / 12.0 = 0.71 (medium-large effect).

Result = t = 3.168, p ≈ 0.003 → Reject H₀ (Cohen's d = 0.71)
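The paired test reduces to a one-sample t-test on the differences, so the check is short:

```python
from math import sqrt

dbar, sd, n = 8.5, 12.0, 20    # mean and SD of the pair differences
t = dbar / (sd / sqrt(n))      # 8.5 / 2.683 ≈ 3.168
df = n - 1                     # 19
d = dbar / sd                  # Cohen's d ≈ 0.71
```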

❓ Frequently Asked Questions

What are the 6 steps of hypothesis testing?
The 6 standard steps are: (1) State hypotheses - define H₀ (null) and H₁ (alternative); (2) Set the significance level α (e.g., 0.05); (3) Compute the test statistic (Z, t, etc.) from your sample data; (4) Find the p-value - the probability of observing a result at least this extreme if H₀ is true; (5) Determine the critical value - the threshold the test statistic must exceed to reject H₀; (6) State the conclusion - reject or fail to reject H₀, and interpret in context.
What is the difference between H₀ and H₁?
H₀ (null hypothesis) is the default assumption - usually that there is no effect, no difference, or the parameter equals a specific value. H₁ (alternative hypothesis) is what you are trying to show evidence for - that there is an effect, a difference, or the parameter is greater/less/not equal to the reference. You never 'prove' H₁; you only find sufficient evidence to reject H₀ in its favour.
When should I use a Z-test vs a t-test for means?
Use a Z-test when the population standard deviation σ is known (rare in practice). Use a t-test when σ must be estimated from the sample (almost always the case). For large samples (n > 30), the t and Z distributions are very similar, but using t is still correct and conservative. This calculator uses the t-distribution for one-sample t and two-sample tests, and the Z-distribution when σ is explicitly provided.
What does p-value mean in hypothesis testing?
The p-value is the probability of obtaining a test statistic at least as extreme as the observed one, assuming H₀ is true. A small p-value (below α) is evidence against H₀. Crucially, the p-value is NOT the probability that H₀ is true, and it is NOT the probability of making an error. It is a measure of how surprising your data would be under H₀.
What is a Type I and Type II error?
A Type I error (false positive) is rejecting H₀ when it is actually true. Its probability equals α (e.g., 5%). A Type II error (false negative) is failing to reject H₀ when H₁ is actually true. Its probability is β. Statistical power = 1 − β. Increasing sample size reduces both error types simultaneously. Decreasing α (stricter test) reduces Type I errors but increases Type II errors.
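The claim that α controls the Type I error rate can be checked by simulation. This sketch (Python standard library; the population parameters and seed are illustrative) draws many samples under a true H₀ and counts how often a two-tailed Z-test at α = 0.05 rejects:

```python
import random
from math import sqrt

random.seed(42)
mu0, sigma, n, trials = 100.0, 15.0, 20, 2000
z_crit = 1.96                  # two-tailed critical value at α = 0.05

rejections = 0
for _ in range(trials):
    # H₀ is true: the population mean really is mu0
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    z = (xbar - mu0) / (sigma / sqrt(n))
    if abs(z) > z_crit:
        rejections += 1        # a Type I error

rate = rejections / trials     # long-run false-positive rate, close to 0.05
```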
What is Cohen's d and how do I interpret it?
Cohen's d is a standardised effect size for mean tests: d = |μ₁ − μ₂| / σ_pooled. It expresses how many standard deviations apart the means are. Conventional benchmarks (Jacob Cohen, 1988): d < 0.2 = negligible effect; 0.2–0.5 = small; 0.5–0.8 = medium; > 0.8 = large. A study can be statistically significant (low p) with a negligible effect size (small d) when n is very large, which is why both must be reported.
What is the difference between one-tailed and two-tailed tests?
A two-tailed test (H₁: μ ≠ μ₀) detects differences in either direction and is appropriate when you have no strong prior directional hypothesis. A one-tailed test (H₁: μ > μ₀ or H₁: μ < μ₀) only tests one direction and has more power in that direction, but misses effects in the other direction. Most academic journals require two-tailed tests unless a directional hypothesis was pre-specified before data collection.
What is the one-proportion Z-test used for?
The one-proportion Z-test tests whether an observed sample proportion p̂ differs from a hypothesised population proportion p₀. It uses the normal approximation: Z = (p̂ − p₀) / √(p₀(1−p₀)/n). The approximation is valid when np₀ ≥ 5 and n(1−p₀) ≥ 5. Use cases: election polling (is the proportion above 50%?), quality control (is the defect rate below 2%?), A/B testing (did click-through improve?).
What is a two-sample t-test (Welch's test)?
The two-sample t-test compares the means of two independent groups. This calculator uses Welch's version, which does not assume equal population variances - making it more robust than Student's pooled t-test. Welch's test adjusts the degrees of freedom using the Welch-Satterthwaite equation: df ≈ (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)]. This is always at least as good as the pooled test and is recommended as the default.
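The Welch-Satterthwaite equation is easy to implement directly. A minimal sketch in Python (the function name is ours):

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite degrees of freedom for the two-sample t-test."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
```

For Example 4 above (s₁ = 45, n₁ = 80, s₂ = 60, n₂ = 70) this gives df ≈ 126.8. The result always lies between min(n₁ − 1, n₂ − 1) and n₁ + n₂ − 2, reaching the upper bound when the two per-group variance terms are equal.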
What is the paired t-test?
The paired t-test (dependent samples t-test) is used when two measurements are linked - typically before/after measurements on the same subjects, or matched-pair experimental designs. Instead of comparing group means, it computes the difference for each pair and performs a one-sample t-test on those differences (H₀: μ_d = 0). The paired test removes between-subject variability, making it more powerful than an independent two-sample test for the same data.