Multiple t-tests - Bonferroni
Let's assume your hypothesis is "this instrument does not exhibit DIF", and you are going to test the hypothesis by looking at the statistical significance probabilities reported for each t-test in a list of t-tests. Then, by chance alone, we would expect about 1 out of every 20 t-tests to report p<.05, even if the instrument exhibits no DIF. So, if there are more than 20 t-tests in the list, then p<.05 for an individual t-test is not meaningful evidence against the hypothesis. In fact, if we don't see at least one p<.05, we may be surprised!
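The arithmetic behind this can be sketched as follows. The sketch adds two assumptions that are not stated above: the t-tests are independent of each other, and the no-DIF hypothesis is true for every test. The function name is illustrative only.

```python
def chance_of_any_false_alarm(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent t-tests
    reports p < alpha when every null hypothesis is true (assumed)."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    # Expected count of chance "significant" results is n_tests * alpha.
    expected_count = n_tests * 0.05
    print(n_tests, expected_count, round(chance_of_any_false_alarm(n_tests), 3))
```

Under these assumptions, a list of 20 t-tests has roughly a 64% chance of showing at least one p<.05 purely by chance.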
The Bonferroni correction says, "if any of the t-tests in the list has p<.05/(number of t-tests in the list), then the hypothesis is rejected".
What is important is the number of tests, not how many of them are reported to have p<.05.
If you wish to make a Bonferroni multiple-significance-test correction, compare the reported significance probability with your chosen significance level, e.g., .05, divided by the number of t-tests in the Table. According to Bonferroni, if you are testing the null hypothesis "There is no effect in this test" at the p<.05 level, then the most significant effect must have p<.05 / (number of item DIF contrasts) for the null hypothesis of no-effect to be rejected.
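A minimal sketch of this comparison, assuming the significance probabilities have been copied out of the Table into a list by hand. The function name and the five p-values are hypothetical, made up purely for illustration.

```python
def bonferroni_rejects(p_values, alpha=0.05):
    """Return (per-test threshold, whether the set-level null hypothesis
    of no effect is rejected) under a strict Bonferroni correction."""
    if not p_values:
        raise ValueError("need at least one p-value")
    # Chosen significance level divided by the number of t-tests.
    threshold = alpha / len(p_values)
    # Rejected only if the most significant test beats the threshold.
    return threshold, min(p_values) < threshold


# Hypothetical significance probabilities for 5 item DIF contrasts.
p_values = [0.20, 0.04, 0.008, 0.50, 0.03]
threshold, rejected = bonferroni_rejects(p_values)
print("per-test threshold:", threshold)
print("no-effect hypothesis rejected:", rejected)
```

Here three contrasts have p<.05, but only the count of tests (5) enters the threshold. The hypothesis is rejected because the smallest probability, .008, is below .05/5.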
Question: Winsteps Tables report many t-tests. Should Bonferroni adjustments for multiple comparisons be made?
Reply: It depends on how you are conducting the t-tests. For instance, in Table 30.1, if your hypothesis (before examining any data) is "there is no DIF for this CLASS in comparison to that CLASS on this item", then the reported probabilities are correct.
If you have 20 items, then one is expected to fail the p ≤ .05 criterion. So if your hypothesis (before examining any data) is "there is no DIF in this set of items for any CLASS", then adjust individual t-test probabilities accordingly.
In general, we do not consider the rejection of a hypothesis test to be "substantively significant", unless it is both very unlikely (i.e., statistically significant) and reflects a discrepancy large enough to matter (i.e., to change some decision). If so, even if there is only one such result in a large data set, we may want to take action. This is much like sitting on the proverbial needle in a haystack. We take action to remove the needle from the haystack, even though statistical theory says, "given a big enough haystack, there will probably always be a needle in it somewhere."
A strict Bonferroni correction for n multiple significance tests at joint level α is α/n for each single test. This accepts or rejects the entire set of multiple tests. In an example of a 100 item test with 20 bad items (.005 < p < .01), the threshold value for cut-off at joint level α = .05 would be .05/100 = 0.0005. No item has p < 0.0005, so the entire set of items is accepted.
Benjamini and Hochberg (1995) suggest that an incremental application of Bonferroni correction overcomes some of its drawbacks. Here is their procedure:
i) Perform the n single significance tests.
ii) Number them in ascending order by probability P(i), where i = 1, ..., n.
iii) Identify k, the largest value of i for which P(i) ≤ α * i/n.
iv) Reject the null hypothesis for i = 1, ..., k.
In an example of a 100 item test with 20 bad items (.005 < p < .01), the threshold values for cut-off with p <= .05 would be: 0.0005 for the 1st item, .005 for the 10th item, .01 for the 20th item, .015 for the 30th item. Every bad item has p < .01, the threshold for the 20th item, so k would be at least 20 and perhaps more. All 20 bad items are flagged for rejection.
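The Benjamini-Hochberg procedure applied to this example can be sketched as below. The individual p-values are invented to match the description (20 items with .005 < p < .01, and, as an added assumption, 80 unremarkable items with p of .30 or more); they are not taken from any real analysis, and the function name is illustrative.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the positions (in the input list) of the tests whose null
    hypothesis is rejected by the Benjamini-Hochberg procedure."""
    n = len(p_values)
    # Step ii: order the tests by ascending probability.
    order = sorted(range(n), key=lambda j: p_values[j])
    # Step iii: k is the largest rank i with P(i) <= alpha * i / n.
    k = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= alpha * rank / n:
            k = rank
    # Step iv: reject the null hypothesis for ranks 1..k.
    return sorted(order[:k])


# Invented p-values: 20 "bad" items, then 80 unremarkable items.
bad = [0.005 + 0.0002 * (i + 1) for i in range(20)]   # .0052 to .0090
good = [0.30 + 0.008 * i for i in range(80)]          # .30 and above
p_values = bad + good

flagged = benjamini_hochberg(p_values)
print("items flagged by Benjamini-Hochberg:", len(flagged))
print("smallest p below strict Bonferroni .05/100:", min(p_values) < 0.05 / 100)
```

Note that the smallest invented probability, .0052, does not meet the 1st threshold of .0005. The items are still flagged because k is the largest rank that meets its threshold, and the 20th probability is below the 20th threshold of .01.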
Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57(1), 289-300.
Help for WINSTEPS® Rasch Measurement Software: www.winsteps.com.