Post hoc Student t-tests inflate the type I error rate. Tukey’s test controls it.
One-way ANOVA determines whether there are any statistically significant differences among the means of two or more independent groups. It is important to realize that the one-way ANOVA is an omnibus test: it cannot tell you which specific groups were significantly different from each other; it only tells you that at least two groups differ. For example, you may want to know whether average height differs among Chinese, Indian, Japanese, and Korean people. If the one-way ANOVA returns a p-value < 0.05, people usually conclude that the average heights in these four countries differ.
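To make this concrete, here is a minimal sketch of a one-way ANOVA in R; the data, and the names height and country, are made up purely for illustration:
set.seed(1)
country = factor(rep(c("Chinese", "Indian", "Japanese", "Korean"), rep(10, 4)))
height = rnorm(length(country), mean = 170, sd = 6) # hypothetical heights
summary(aov(height ~ country)) # the omnibus F test: is ANY mean different?
The p-value from the F test answers only the omnibus question; it does not say which pairs of countries differ.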
After detecting some differences among the groups, interest centers on which specific groups differ. If you have no prior hypotheses about which specific groups might differ, you can compare every pair of groups. Can we use the Student t-test, as we typically do when comparing two independent groups? The answer is no. Let’s do a simulation study.
Suppose you have a factor with 6 groups, A, B, …, F, and 4 observations per group:
groups <- factor(rep(LETTERS[1:6], rep(4,6)))
print(groups)
[1] A A A A B B B B C C C C D D D D E E E E F F F F
Levels: A B C D E F
and suppose the response has no relationship to the groups (i.e. the null hypothesis holds)
response = rnorm(length(groups)) # randomly generate 24 (i.e. length(groups)) values as the response.
Since the data are randomly generated, we expect no significant results when we perform pairwise comparisons using the Student t-test.
pairwise_t_tests = pairwise.t.test(response, groups, p.adjust.method = 'none')
print(pairwise_t_tests)
Pairwise comparisons using t tests with pooled SD
data: response and groups
A B C D E
B 0.89 - - - -
C 0.53 0.62 - - -
D 0.48 0.40 0.19 - -
E 0.49 0.58 0.95 0.17 -
F 0.86 0.74 0.42 0.60 0.38
P value adjustment method: none
Good! No p-value in the comparison table above is less than 0.05.
Now let’s repeat the experiment 1000 times and see how many of them incorrectly report at least two significantly different groups. Because our significance criterion is 5%, we expect a type I error rate of 5%, i.e. 5% of the 1000 simulations returning significance. Unfortunately, this is not true. See below:
significant_find = c() # TRUE if any p-value in a simulation is < 0.05, otherwise FALSE
for(i in 1:1000){
  response = rnorm(length(groups))
  pairwise_t_tests = pairwise.t.test(response, groups, p.adjust.method = 'none')
  significant_find[i] = any(pairwise_t_tests$p.value < 0.05, na.rm = TRUE)
}
sum(significant_find==TRUE)/1000
[1] 0.332
The actual overall (familywise) type I error rate is 33.2%, not 5%. And as the number of groups increases, the situation gets even worse.

| Number of groups | two | three | four | five | six |
|---|---|---|---|---|---|
| Nominal type I error | 5.00% | 5.00% | 5.00% | 5.00% | 5.00% |
| Actual overall type I error | 5.00% | 12.20% | 20.30% | 28.60% | 36.60% |
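A back-of-the-envelope calculation explains this trend. With k groups there are choose(k, 2) pairwise comparisons; if these tests were independent, the chance of at least one false positive would be 1 - (1 - 0.05)^choose(k, 2). A sketch of that bound (the simulated rates in the table are somewhat lower because the t-tests share a pooled SD and are therefore positively correlated):
k = 2:6 # number of groups
m = choose(k, 2) # number of pairwise comparisons: 1, 3, 6, 10, 15
round(1 - (1 - 0.05)^m, 3) # 0.050 0.143 0.265 0.401 0.537 under independence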
In fact, to control the overall type I error rate at 5%, we need a critical value larger than the usual t critical value qt(0.975, 24-6) = 2.1009. It turns out that the right value can be calculated from the “Studentized Range distribution”, as qtukey(0.95,6,24-6)/sqrt(2). This is where Tukey’s test comes from!
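Both critical values can be computed directly in R (the second value is approximate):
qt(0.975, df = 24 - 6) # the usual t critical value, 2.1009
qtukey(0.95, nmeans = 6, df = 24 - 6) / sqrt(2) # Tukey's critical value, roughly 3.18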
significant_find = c() # TRUE if any adjusted p-value in a simulation is < 0.05, otherwise FALSE
for(i in 1:1000){
  response = rnorm(length(groups))
  fit = aov(response ~ groups) # fit the one-way ANOVA
  tukey_test = TukeyHSD(fit) # Tukey's honest significant difference on all pairs
  significant_find[i] = any(tukey_test$groups[, "p adj"] < 0.05, na.rm = TRUE)
}
sum(significant_find==TRUE)/1000
[1] 0.049
In this simulation, Tukey’s test returned a type I error rate of 4.9%, close to the nominal 5%.
Using Student t-tests as the post hoc test after ANOVA dramatically inflates the type I error rate, and the more groups we compare, the worse it gets. In conclusion, Tukey’s test is recommended, as it controls the type I error rate at 5% regardless of the number of groups.
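(As a side note, the pairwise.t.test function used earlier can also control the familywise error rate through its p.adjust.method argument, e.g. 'holm' or 'bonferroni', though for all-pairs comparisons these corrections are typically more conservative than Tukey’s test.)
pairwise.t.test(response, groups, p.adjust.method = 'holm') # adjusted p-values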
It is not uncommon to find an apparent conflict between the results of the one-way ANOVA and a post hoc test such as Tukey’s, where one is statistically significant but the other is not. For example, the one-way ANOVA may be statistically significant while no pairwise Tukey comparison is. There can be different reasons for this, such as the conservative or liberal nature of a particular test, but fundamentally it is due to the differences between the distributions used in the one-way ANOVA and in the Tukey post hoc test (Hsu, 1996). Conversely, you can have a statistically significant Tukey post hoc test but a non-significant one-way ANOVA. Whether the conclusions from the two tests agree depends on the distribution of the means (Kirk, 2013).
For attribution, please cite this work as
Fan (2018, Dec. 6). Metabox-Blog: Why Tukey's test but not t-test after ANOVA?. Retrieved from https://hushuli.github.io/Metabox-Blog.github.io/posts/2018-12-06-why-tukeys-test-but-not-t-test-after-anova/
BibTeX citation
@misc{fan2018why,
author = {Fan, Sili},
title = {Metabox-Blog: Why Tukey's test but not t-test after ANOVA?},
url = {https://hushuli.github.io/Metabox-Blog.github.io/posts/2018-12-06-why-tukeys-test-but-not-t-test-after-anova/},
year = {2018}
}