You ran a Shapiro-Wilk test, your data failed normality (p = .003), and now your advisor is asking why you're still running a t-test. Or maybe your sample is 18 participants per group and you have no idea whether the Central Limit Theorem saves you. These are the moments when knowing when to use the Mann-Whitney test instead of a t-test stops being academic and starts being urgent.
This guide walks through the decision rules, shows a worked example, and explains how to report results in APA 7 format.
The Short Answer: When to Use Mann-Whitney Instead of T-Test
Use the Mann-Whitney U test (also called the Wilcoxon rank-sum test) when you're comparing two independent groups and at least one of these is true:
- Your dependent variable is ordinal (e.g., Likert ratings, pain scales)
- Your data are continuous but not normally distributed, especially with small samples (n < 30 per group)
- You have outliers that you cannot justify removing
- Your data show strong skew or heavy tails
Use the independent samples t-test when:
- Your dependent variable is continuous (interval or ratio)
- Each group is approximately normally distributed (or n ≥ 30 per group)
- Variances are roughly equal (or you use Welch's correction)
The Mann-Whitney test compares ranks rather than means, which is why it's robust to violations of normality. The trade-off: when assumptions for the t-test are met, the t-test has slightly more statistical power.
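To see what comparing ranks rather than means buys you, here is a minimal Python sketch, assuming SciPy and NumPy are installed. The scores are invented for illustration and `compare` is a hypothetical helper name. It runs Welch's t-test and the Mann-Whitney U test on the same two groups, then swaps the largest value for an extreme outlier. Because the outlier keeps the same rank, U and its p-value do not move, while the t-test result does.

```python
import numpy as np
from scipy import stats

# Invented scores for two independent groups (illustration only).
group_a = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 21], dtype=float)
group_b = np.array([16, 18, 19, 21, 22, 23, 24, 25, 27, 28], dtype=float)


def compare(a, b):
    """Run Welch's t-test and the Mann-Whitney U test on the same samples."""
    t_res = stats.ttest_ind(a, b, equal_var=False)            # Welch's correction
    u_res = stats.mannwhitneyu(a, b, alternative="two-sided")
    return t_res.statistic, t_res.pvalue, u_res.statistic, u_res.pvalue


# Baseline comparison.
t0, tp0, u0, up0 = compare(group_a, group_b)

# Replace the largest value in group B with an extreme outlier.
# It is still the largest value overall, so every rank stays the same.
group_b_outlier = group_b.copy()
group_b_outlier[-1] = 280.0
t1, tp1, u1, up1 = compare(group_a, group_b_outlier)

print(f"t-test p:       {tp0:.4f} -> {tp1:.4f}")
print(f"Mann-Whitney p: {up0:.4f} -> {up1:.4f}")
```

The design choice here is deliberate: the outlier inflates the variance the t-test relies on, but it cannot change the rank order, so the rank-based test is unaffected.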
The Normality Question: Don't Just Run a Test
Many researchers default to Shapiro-Wilk to decide. That's not wrong, but it's incomplete. With n = 200, Shapiro-Wilk can flag trivial departures from normality as significant. With n = 15, it often lacks the power to detect real problems.
A better approach
- Visualize first. Look at histograms and Q-Q plots for each group.
- Check skewness and excess kurtosis. As a rule of thumb, values between -1 and +1 are generally fine.
- Run Shapiro-Wilk as a supplement, not as the sole decision.
- Consider sample size. With n ≥ 30 per group, the t-test is reasonably robust to mild non-normality thanks to the Central Limit Theorem.
In StatRyx, you can run all three checks — Shapiro-Wilk, Q-Q plot, and descriptives — in one click before deciding which test to use.
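If you work in Python, the same checks can be sketched with SciPy. This is a minimal example under stated assumptions: `normality_summary` is a hypothetical helper name, the scores are invented, and the Q-Q plot is summarized numerically by the correlation between observed and theoretical quantiles rather than drawn.

```python
import numpy as np
from scipy import stats


def normality_summary(values):
    """Return descriptive and test-based normality checks for one group."""
    x = np.asarray(values, dtype=float)
    w, p = stats.shapiro(x)
    # probplot returns the Q-Q coordinates plus a least-squares fit;
    # r near 1 means the points lie close to a straight line.
    (_, _), (_, _, qq_r) = stats.probplot(x, dist="norm")
    return {
        "n": int(x.size),
        "skewness": float(stats.skew(x, bias=False)),
        # fisher=True gives excess kurtosis, so a normal distribution scores 0.
        "excess_kurtosis": float(stats.kurtosis(x, fisher=True, bias=False)),
        "shapiro_w": float(w),
        "shapiro_p": float(p),
        "qq_r": float(qq_r),
    }


# Invented right-skewed scores for a single group (illustration only).
scores = [3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 9, 11, 14, 19, 31]
summary = normality_summary(scores)

for name, value in summary.items():
    print(f"{name}: {value}")
```

Run it once per group, and read the numbers alongside a histogram rather than instead of one.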
Mann-Whitney vs T-Test: Key Differences
| Feature | Independent t-test | Mann-Whitney U |
|---|---|---|
| What it compares | Means | Distributions (ranks) |
| Data type | Continuous | Ordinal or continuous |
| Normality required | Yes (or n ≥ 30) | No |
| Sensitive to outliers | Yes | Much less |
| Reports | M, SD, t, df, p | Mdn, U, Z, p, r |
| Effect size | Cohen's d | r or rank-biserial correlation |
| Power when assumptions met | Higher | Slightly lower (asymptotic efficiency ≈ 95% of the t-test) |
Worked Example: Comparing Anxiety Scores Across Two Therapy Conditions
Suppose you're comparing post-treatment anxiety scores between Group A (CBT, n = 22) and Group B (waitlist, n = 20). Anxiety is measured on a 0–40 scale.
Step 1: Check assumptions. Histograms show Group B is heavily right-skewed (skewness = 1.84). Shapiro-Wilk for Group B: W = 0.86, p = .008. The t-test is off the table.
Step 2: Run Mann-Whitney U. Results:
- Group A: Mdn = 14.5
- Group B: Mdn = 22.0
- U = 118.5, Z = -2.47, p = .013
- Effect size r = |Z| / √N = 2.47 / √42 = .38
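The effect size arithmetic is easy to reproduce. Here is a minimal Python sketch using the values from this example. The function names are illustrative, and the rank-biserial formula shown, 1 - 2U / (n1 × n2), is one common convention that may differ in sign or form from what your software reports.

```python
import math


def effect_size_r(z, n_total):
    """Effect size r = |Z| / sqrt(N), where N is the total sample size."""
    return abs(z) / math.sqrt(n_total)


def rank_biserial(u, n1, n2):
    """Rank-biserial correlation from U, using 1 - 2U / (n1 * n2).

    Assumes u is the smaller of the two possible U values, which
    yields a non-negative result.
    """
    return 1 - 2 * u / (n1 * n2)


n_cbt, n_waitlist = 22, 20

r = effect_size_r(-2.47, n_cbt + n_waitlist)
r_rb = rank_biserial(118.5, n_cbt, n_waitlist)

print(f"r = {r:.2f}")                 # r = 0.38
print(f"rank-biserial = {r_rb:.2f}")  # rank-biserial = 0.46
```

The two measures are on different scales, so state which one you are reporting.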
Step 3: Interpret.
- U = 118.5 is the test statistic based on rank sums. Its raw value isn't intuitive, but the associated Z and p tell the story.
- p = .013 means that, if the null hypothesis were true, a rank difference at least this extreme would occur about 1.3% of the time. Significant at α = .05.
- r = .38 is a medium effect size by Cohen's conventions (.10 small, .30 medium, .50 large).
Step 4: Report in APA 7 format.
A Mann-Whitney U test indicated that post-treatment anxiety scores were significantly lower for the CBT group (Mdn = 14.5) than the waitlist group (Mdn = 22.0), U = 118.5, z = -2.47, p = .013, r = .38.
Note: report medians, not means, for Mann-Whitney. Reporting means here would imply the test compares means, which it doesn't. Technically it tests whether values from one group tend to exceed values from the other, often described as stochastic superiority.
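If you build results strings in code, APA's number formatting can be automated: no leading zero for values that cannot exceed 1 (p, r), and "p < .001" for very small p-values. The sketch below is one way to do it in Python. The helper names `apa_p`, `apa_bounded`, and `mann_whitney_report` are hypothetical, not part of any library.

```python
def apa_p(p):
    """Format a p-value APA style: three decimals, no leading zero."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")


def apa_bounded(value, decimals=2):
    """Format a statistic bounded by +/-1 (such as r) without a leading zero."""
    text = f"{abs(value):.{decimals}f}".lstrip("0")
    return ("-" if value < 0 else "") + text


def mann_whitney_report(u, z, p, r):
    """Assemble the statistics portion of an APA-style results sentence."""
    return f"U = {u:.1f}, z = {z:.2f}, {apa_p(p)}, r = {apa_bounded(r)}"


report = mann_whitney_report(118.5, -2.47, 0.013, 0.38)
print(report)  # U = 118.5, z = -2.47, p = .013, r = .38
```

Italics for the statistical symbols still need to be applied in your word processor.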
Common Mistakes to Avoid
Reporting means with Mann-Whitney
Reviewers will catch this. The test ranks values; report medians and IQRs.
Using Mann-Whitney to "rescue" a nonsignificant t-test
Switching tests after seeing results is p-hacking. Decide based on assumptions, not outcomes.
Ignoring effect size
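A significant p-value with n = 500 might reflect a trivial difference. For example, with 500 participants per group (N = 1,000), a just-significant Z of 1.96 gives r = 1.96 / √1000 ≈ .06, below Cohen's threshold for a small effect. Always report an effect size alongside p.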
Run this analysis free in StatRyx.