Equating and linking tests
Test equating and linking are usually straightforward with Winsteps, but they do require clerical care. The more thought that is put into test construction and data collection, the easier the equating will be.
Imagine that Test A (the more definitive test, if there is one) has been given to one sample of persons, and Test B to another. It is now desired to put all the items together into one item hierarchy, and to produce one set of measures encompassing all the persons.
Initially, analyze each test separately. Go down the "Diagnosis" pull-down menu. If the tests don't make sense separately, they won't make sense together.
There are several equating methods which work well with Winsteps. Test equating is discussed in Bond & Fox, "Applying the Rasch Model", and earlier in Wright & Stone, "Best Test Design", and George Ingebo, "Probability in the Measure of Achievement" - all available from www.rasch.org/books.htm
Concurrent or One-step Equating
All the data are entered into one big array. This is convenient but has its hazards. Off-target items can introduce noise and skew the equating process; CUTLO= and CUTHI= may remedy targeting deficiencies. Linking designs forming long chains require much tighter than usual convergence criteria. Always cross-check results with those obtained by one of the other equating methods.
Common Item Equating
This is the best and easiest equating method. The two tests share items in common, preferably at least 5 spread out across the difficulty continuum.
Step 1. From the separate analyses, crossplot the difficulties of the common items, with Test B on the y-axis and Test A on the x-axis. The best-fit line is the line through the point at the means of the common items and through the (mean + 1 S.D.) point. Its slope is:
   slope = (S.D. of Test B common items) / (S.D. of Test A common items)
This should have a value near 1.0. If it does, then a first approximation to the Test B measures in the Test A frame of reference is:
   Measure (B) - Mean(B common items) + Mean(A common items) => Measure (A)
Step 2. Examine the scatterplot. Points far away from the best fit line indicate items that have behaved differently on the two occasions. You may wish to consider these to be no longer common items. Drop the items from the plot and redraw the best fit line. Items may be off the diagonal, or may exhibit large misfit, because they are off-target to the current sample. This is a hazard of vertical equating. CUTLO= and CUTHI= may remedy targeting deficiencies.
Step 3a. If the best-fit slope remains far from 1.0, then there is something systematically different about Test A and Test B. You must do "Celsius - Fahrenheit" equating. Test A remains as it stands. Include in the Test B control file:
   USCALE = S.D.(A common items) / S.D.(B common items)
   UMEAN = Mean(A common items) - Mean(B common items) * USCALE
and reanalyze Test B. The Test B mean is multiplied by USCALE because the shift is applied to the rescaled measures. Test B is now in the Test A frame of reference, and the person measures from Test A and Test B can be reported together.
Step 3b. The best-fit slope is near to 1.0. Suppose that Test A is the "benchmark" test. Then we do not want responses to Test B to change the results of Test A. From a Test A analysis, produce IFILE= and SFILE= (if there are rating or partial credit scales). Edit the IFILE= and SFILE= to match the Test B item numbers and rating (or partial credit) scale. Use them as an IAFILE= and SAFILE= in a Test B analysis. Test B is now in the same frame of reference as Test A, so the person measures and item difficulties can be reported together.
Step 3c. The best-fit slope is near to 1.0. Test A and Test B have equal status, and you want to use both to define the common items. Use the MFORMS= command to combine the data files for Test A and Test B into one analysis. The results of that analysis will have Test A and Test B items and persons reported together.
Items
------------
||||||||||||
||||||||||||   Test A Persons
||||||||||||
||||||||||||
||||||||||||
||||||||||||
------------
- - - ------------
| | | ||||||||||||
| | | ||||||||||||   Test B Persons
| | | ||||||||||||
| | | ||||||||||||
| | | ||||||||||||
| | | ||||||||||||
- - - ------------
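The arithmetic of Steps 1 to 3a can be sketched in a few lines of Python. This is only an illustration, not part of Winsteps: the function names, the 0.5-logit tolerance, and the sample difficulties are all assumptions made for the example. The shift is computed so that the rescaled Test B common-item mean lands on the Test A common-item mean; when the slope is 1.0 this reduces to Mean(A common items) - Mean(B common items).

```python
from statistics import mean, stdev


def common_item_constants(a_items, b_items):
    """Return (slope, uscale, umean) from paired common-item difficulties.

    a_items[i] and b_items[i] are the logit difficulties of the same common
    item in the separate Test A and Test B analyses.
    """
    if len(a_items) != len(b_items) or len(a_items) < 2:
        raise ValueError("need at least two paired common items")
    slope = stdev(b_items) / stdev(a_items)   # Test B on y-axis, Test A on x-axis
    uscale = 1.0 / slope                      # S.D.(A) / S.D.(B)
    # Shift applied after rescaling, so the common-item means coincide.
    umean = mean(a_items) - mean(b_items) * uscale
    return slope, uscale, umean


def off_line_items(a_items, b_items, tolerance=0.5):
    """Indexes of items whose rescaled Test B difficulty misses Test A by
    more than `tolerance` logits (a hypothetical screening value)."""
    _, uscale, umean = common_item_constants(a_items, b_items)
    return [i for i, (a, b) in enumerate(zip(a_items, b_items))
            if abs(a - (b * uscale + umean)) > tolerance]


# Made-up difficulties for five common items, for illustration only.
test_a = [-1.0, 0.0, 1.0, 2.0, 3.0]
test_b = [-2.0, -1.0, 0.0, 1.0, 2.0]
slope, uscale, umean = common_item_constants(test_a, test_b)
print(f"slope={slope:.2f}  USCALE={uscale:.2f}  UMEAN={umean:.2f}")
print("items to inspect:", off_line_items(test_a, test_b))
```

Items returned by `off_line_items` are only candidates for inspection on the scatterplot; whether they stop being common items remains a judgment call, as in Step 2.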
Partial Credit items
"Partial credit" values are much less stable than dichotomies. Rather than trying to equate across the whole partial credit structure, one usually needs to assert that, for each item, a particular "threshold" or "step" is the critical one for equating purposes. Then use the difficulties of those thresholds for equating. The relevant threshold for an item is usually the transition point between the two most frequently observed categories - the Rasch-Andrich threshold - and so the most stable point in the partial credit structure.
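One way to pick the critical threshold from category frequencies is sketched below. It is an assumption-laden illustration: `critical_threshold` is a hypothetical helper, the counts and threshold values are invented, and the function deliberately declines to choose when the two most frequent categories are not adjacent, since the text does not say what to do in that case.

```python
def critical_threshold(counts, thresholds):
    """Return the threshold between the two most frequently observed categories.

    counts[k]     = number of observations in category k (k = 0..m)
    thresholds[k] = difficulty of the transition from category k to k+1
    Returns None when the two most frequent categories are not adjacent.
    """
    if len(counts) < 2 or len(thresholds) != len(counts) - 1:
        raise ValueError("need one threshold between each pair of adjacent categories")
    # Rank category indexes by observed frequency, most frequent first.
    ranked = sorted(range(len(counts)), key=lambda k: counts[k], reverse=True)
    low, high = sorted(ranked[:2])
    if high - low != 1:
        return None
    return thresholds[low]


# Made-up example: a 4-category item whose middle categories dominate.
print(critical_threshold([5, 40, 30, 10], [-1.2, 0.3, 1.5]))
```

The selected threshold difficulties, one per common item, would then take the place of the item difficulties in the crossplot.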
Stocking and Lord iterative procedure
Stocking and Lord (1983) present an iterative common-item procedure in which items exhibiting DIF across tests are dropped from the link until no items exhibiting inter-test DIF remain. A known hazard is that if the DIF distribution is skewed, the procedure trims the longer tail and the equating will be biased. To implement the Stocking and Lord procedure in Winsteps, code each person (in the person id label) according to which test form was taken. Then request a DIF analysis of item x person-test-code (Table 30). Drop items exhibiting DIF from the link by coding them as different items in different tests.
Stocking and Lord (1983) Developing a common metric in item response theory. Applied Psychological Measurement 7:201-210.
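The iterative trimming idea can be illustrated outside Winsteps with a simplified sketch. This is not the Table 30 DIF analysis: it uses the raw difference in logits after a mean shift instead of a DIF significance test, it assumes the best-fit slope is near 1.0, and the names `trim_link`, `max_diff`, and `min_items` are hypothetical.

```python
from statistics import mean


def trim_link(a, b, max_diff=0.5, min_items=5):
    """Iteratively drop the worst-fitting link item.

    a, b: dicts mapping item label -> difficulty (logits) from the separate
    Test A and Test B analyses. Returns (remaining link labels, shift), where
    shift is added to Test B measures to express them in the Test A frame.
    """
    link = sorted(set(a) & set(b))
    if not link:
        raise ValueError("the two tests share no item labels")

    def shift_for(items):
        return mean(a[i] for i in items) - mean(b[i] for i in items)

    while len(link) > min_items:
        shift = shift_for(link)
        # Item whose Test B difficulty, once shifted, is furthest from Test A.
        worst = max(link, key=lambda i: abs(a[i] - (b[i] + shift)))
        if abs(a[worst] - (b[worst] + shift)) <= max_diff:
            break                      # no remaining item exceeds the limit
        link.remove(worst)             # drop it and recompute the shift
    return link, shift_for(link)


# Made-up difficulties: item "i7" has drifted by 2 logits on Test B.
test_a = {"i1": -1.0, "i2": 0.0, "i3": 1.0, "i4": 2.0,
          "i5": 0.5, "i6": -0.5, "i7": 1.5}
test_b = {k: v - 1.0 for k, v in test_a.items()}
test_b["i7"] = -1.5
print(trim_link(test_a, test_b))
```

Because the shift is recomputed after every drop, a skewed set of differences pulls the link toward the shorter tail, which is the bias hazard noted above.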
Common Person Equating
Some persons have taken both tests, preferably at least 5 spread out across the ability continuum.
Step 1. From the separate analyses, crossplot the abilities of the common persons, with Test B on the y-axis and Test A on the x-axis. The best-fit line, i.e., the line through the point at the means of the common persons and through the (mean + 1 S.D.) point, should have a slope near 1.0. If it does, then the intercept of the line with the x-axis is the equating constant.
First approximation: Test B measures in the Test A frame of reference = Test B measure + x-axis intercept.
Step 2. Examine the scatterplot. Points far away from the best fit line indicate persons that have behaved differently on the two occasions. You may wish to consider these to be no longer common persons. Drop the persons from the plot and redraw the best fit line.
Step 3a. If the best-fit slope remains far from 1.0, then there is something systematically different about Test A and Test B. You must do "Celsius - Fahrenheit" equating. Test A remains as it stands. The slope of the best fit is:
   slope = (S.D. of Test B common persons) / (S.D. of Test A common persons)
Include in the Test B control file:
   USCALE = the value of 1/slope
   UMEAN = the value of the x-intercept
and reanalyze Test B. Test B is now in the Test A frame of reference, and the person measures from Test A and Test B can be reported together.
Step 3b. The best-fit slope is near to 1.0. Suppose that Test A is the "benchmark" test. Then we do not want responses to Test B to change the results of Test A. From a Test A analysis, produce PFILE=. Edit the PFILE= to match the Test B person numbers. Use it as a PAFILE= in a Test B analysis. Test B is now in the same frame of reference as Test A, so the person measures and item difficulties can be reported together.
Step 3c. The best-fit slope is near to 1.0. Test A and Test B have equal status, and you want to use both to define the common persons. Use your text editor or word processor to append the common persons' Test B responses after their Test A ones, as in the design below. Then put the rest of the Test B responses after the Test A responses, but aligned in columns with the common persons' Test B responses. Perform an analysis of the combined data set. The results of that analysis will have Test A and Test B items and persons reported together.
Test A items                   Test B items
|-----------------------------|
|-----------------------------|
|-----------------------------|
|-----------------------------||-----------------------------| Common Person
|-----------------------------|
|-----------------------------||-----------------------------| Common Person
|-----------------------------|
|-----------------------------|
|-----------------------------||-----------------------------| Common Person
|-----------------------------|
|-----------------------------|
|-----------------------------||-----------------------------| Common Person
|-----------------------------|
|-----------------------------|
                               |-----------------------------|
                               |-----------------------------|
                               |-----------------------------|
                               |-----------------------------|
                               |-----------------------------|
                               |-----------------------------|
                               |-----------------------------|
                               |-----------------------------|
                               |-----------------------------|
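For the "Celsius - Fahrenheit" case in Step 3a, the x-intercept and the two control-file values follow directly from the means and standard deviations of the common persons. The sketch below is illustrative only; the function name and the ability values are invented, and the best-fit line is taken to be the one defined in Step 1.

```python
from statistics import mean, stdev


def common_person_control_lines(a_abilities, b_abilities):
    """Return the USCALE= and UMEAN= lines for the Test B control file.

    a_abilities[i] and b_abilities[i] are the measures (logits) of the same
    common person in the separate Test A and Test B analyses.
    """
    if len(a_abilities) != len(b_abilities) or len(a_abilities) < 2:
        raise ValueError("need at least two paired common persons")
    slope = stdev(b_abilities) / stdev(a_abilities)   # B on y-axis, A on x-axis
    # The line passes through (mean A, mean B); it crosses y = 0 at:
    x_intercept = mean(a_abilities) - mean(b_abilities) / slope
    return [f"USCALE = {1.0 / slope:.4f}", f"UMEAN = {x_intercept:.4f}"]


# Made-up abilities for three common persons; Test B spreads them half as far.
for line in common_person_control_lines([0.0, 2.0, 4.0], [0.0, 1.0, 2.0]):
    print(line)
```

When the slope is near 1.0, only the x-intercept matters, which is the first approximation given after Step 1.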
Virtual Equating of Test Forms
The two tests share no items or persons in common, but the items cover similar material.
Step 1. Identify pairs of items of similar content and difficulty in the two tests. Be generous about interpreting "similar" at this stage. These are the pseudo-common items.
Steps 2-4: simple: The two item hierarchies (Table 1 using short clear item labels) are printed and compared, and equivalent items are identified. The sheets of paper are moved vertically relative to each other until the overall hierarchy makes the most sense. The value on Test A corresponding to the zero on Test B is the UMEAN= value to use for Test B. If the item spacing on one test appears expanded or compressed relative to the other test, use USCALE= to compensate.
Or:
Step 2. From the separate analyses, crossplot the difficulties of the pairs of items, with Test B on the y-axis and Test A on the x-axis. The best-fit line, i.e., the line through the point at the means of the paired items and through the (mean + 1 S.D.) point, should have a slope near 1.0. If it does, then the intercept of the line with the x-axis is the equating constant.
First approximation: Test B measures in the Test A frame of reference = Test B measure + x-axis intercept.
Step 3. Examine the scatterplot. Points far away from the best fit line indicate items that are not good pairs. You may wish to consider these to be no longer paired. Drop the items from the plot and redraw the best fit line.
Step 4. The slope of the best fit is:
   slope = (S.D. of Test B paired items) / (S.D. of Test A paired items)
Include in the Test B control file:
   USCALE = the value of 1/slope
   UMEAN = the value of the x-intercept
and reanalyze Test B. Test B is now in the Test A frame of reference, and the person and item measures from Test A and Test B can be reported together.
Random Equivalence Equating
The samples of persons who took both tests are believed to be randomly equivalent. Or, less commonly, the samples of items in the tests are believed to be randomly equivalent.
Step 1. From the separate analyses of Test A and Test B, obtain the means and sample standard deviations of the two person samples (including extreme scores).
Step 2. To bring Test B into the frame of reference of Test A, adjust by the difference between the means of the person samples and user-rescale by the ratio of their sample standard deviations.
Include in the Test B control file:
   USCALE = value of (S.D. person sample for Test A) / (S.D. person sample for Test B)
   UMEAN = value of (mean for Test A) - (mean for Test B * USCALE)
and reanalyze Test B.
Check: Test B should now report the same sample mean and sample standard deviation as Test A.
Test B is now in the Test A frame of reference, and the person measures from Test A and Test B can be reported together.
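The two constants and the check can be worked through numerically. The sketch below stands in for the reanalysis by applying the same linear rescaling directly to the Test B person measures. The function name and the person measures are invented for illustration.

```python
from statistics import mean, stdev


def random_equivalence_constants(a_persons, b_persons):
    """Return (uscale, umean) that bring Test B into the Test A frame."""
    uscale = stdev(a_persons) / stdev(b_persons)
    umean = mean(a_persons) - mean(b_persons) * uscale
    return uscale, umean


# Made-up person measures (logits) from the two separate analyses.
test_a = [-1.0, 0.5, 2.0, 3.5]
test_b = [-2.0, -1.0, 0.0, 1.0]

uscale, umean = random_equivalence_constants(test_a, test_b)
rescaled_b = [m * uscale + umean for m in test_b]

# Check: the rescaled Test B sample should match Test A's mean and S.D.
print(f"USCALE = {uscale:.4f}   UMEAN = {umean:.4f}")
print(f"Test A:          mean={mean(test_a):.4f}  S.D.={stdev(test_a):.4f}")
print(f"Test B rescaled: mean={mean(rescaled_b):.4f}  S.D.={stdev(rescaled_b):.4f}")
```

Matching means and standard deviations confirm only that the arithmetic was applied correctly; they say nothing about whether the two samples really are randomly equivalent.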
Linking Tests with Common Items
Here is an example:
A. The first test (50 items, 1,000 students)
B. The second test (60 items, 1,000 students)
C. A linking test (20 items from the first test, 25 from the second test, 250 students)
Here is a typical Rasch approach. It is equivalent to applying the "common item" linking method twice.
(a) Rasch analyze each test separately to verify that all is correct.
(b) Cross-plot the item difficulties for the 20 common items between the first test and the linking test. Verify that the link items are on a statistical trend line parallel to the identity line. Omit from the list of linking items any items that have clearly changed relative difficulty. If the trend line is not parallel to the identity line (45 degrees), then the test discrimination has changed. The test linking will use a "best fit to trend line" conversion:
   Corrected measure on test 2 in test 1 frame-of-reference = ((observed measure on test 2 - mean measure of test 2 link items) * (SD of test 1 link items) / (SD of test 2 link items)) + mean measure of test 1 link items
(c) Cross-plot the item difficulties for the 25 common items between the second test and the linking test. Repeat (b).
(d1) If both trend lines are approximately parallel to the identity line, then all three tests are equally discriminating, and the simplest equating is "concurrent". Put all 3 tests in one analysis. You can use the MFORMS= command to put all items into one analysis. You can also selectively delete items using the Specification pull-down menu in order to construct measure-to-raw score conversion tables for each test, if necessary. Or you can use a direct arithmetical adjustment to the measures based on the mean differences of the common items: www.rasch.org/memo42.htm "Linking tests".
(d2) If best-fit trend lines are not parallel to the identity line, then tests have different discriminations. Equate the first test to the linking test, and then the linking test to the second test, using the "best fit to trend line" conversion, shown in (b) above. You can also apply the "best fit to trend" conversion to Table 20 to convert every possible raw score.
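Applying the "best fit to trend line" conversion twice, as in (d2), amounts to composing two linear conversions. The sketch below shows that composition with invented link-item difficulties. The function names are hypothetical, and each conversion is written as target = source * scale + offset, which is the conversion in (b) with the ratio of link-item S.D.s as the scale.

```python
from statistics import mean, stdev


def trend_line_conversion(source_link, target_link):
    """Return (scale, offset) with target = source * scale + offset.

    source_link[i] and target_link[i] are the difficulties of the same link
    item in the source-test and target-test analyses.
    """
    scale = stdev(target_link) / stdev(source_link)
    offset = mean(target_link) - mean(source_link) * scale
    return scale, offset


def chain(first, second):
    """Compose two conversions: apply `first`, then `second`."""
    (s1, o1), (s2, o2) = first, second
    return s1 * s2, o1 * s2 + o2


# Made-up link-item difficulties.
first_to_link = trend_line_conversion([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
link_to_second = trend_line_conversion([0.0, 2.0, 4.0], [0.0, 1.0, 2.0])
scale, offset = chain(first_to_link, link_to_second)

measure_on_first = 2.0
print("in second-test frame:", measure_on_first * scale + offset)
```

The same composed (scale, offset) pair can be applied to each measure in a raw-score table to convert every possible raw score.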
Help for WINSTEPS® Rasch Measurement Software: www.winsteps.com.