Table 7 Agreement Statistics

When raters rate the same performance under identical conditions, the percent of times their ratings are identical is reported, along with its expected value. This supports an investigation as to whether raters are rating as "independent experts" or as "rating machines". The report is:

Table 7.3.1  Reader Measurement Report  (arranged by MN).

------------------------------------------------------------------------------------------------
| Obsvd   Obsvd  Obsvd  Fair-M|        Model | Infit      Outfit   | Exact Agree. |            |
| Score   Count Average Avrage|Measure  S.E. |MnSq ZStd  MnSq ZStd | Obs %  Exp % | Nu Reader  |
------------------------------------------------------------------------------------------------
|   1524    288     5.3   5.26|   -.30   .05 | 1.2   2    1.2   2  |  28.2   20.9 |  8 8       |
|   1455    288     5.1   5.00|   -.16   .05 |  .5  -7     .5  -7  |  30.8   21.6 |  4 4       |
....
------------------------------------------------------------------------------------------------
RMSE (Model)  .05 Adj S.D.  .19  Separation  4.02  Strata  5.69  Reliability  .94
......
Inter-Rater agreement opportunities: 60480  Exact agreements: 17838 = 29.5%  Expected: 13063.2 = 21.6%
------------------------------------------------------------------------------------------------

Exact Agree. reports exact agreements under identical rating conditions. Agreement is assessed on qualitative levels relative to the lowest observed qualitative level.

So, imagine all your ratings are 4,5,6 and all my ratings are 1,2,3.

If we use the (shared) Rating Scale model, then we will have no exact agreements.

But if we use the (individual) Partial Credit model, "#", then we agree when you rate a 4 (your bottom observed category) and I rate a 1 (my bottom observed category). Similarly, your 5 agrees with my 2, and your 6 agrees with my 3.

If you want "exact agreement" to mean "exact agreement of data values", then please use the Rating Scale model statistics.
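The difference between the two definitions can be illustrated with a small Python sketch. The function names and the sample ratings are hypothetical, and the "subtract the lowest observed category" rule is a simplifying assumption that mimics the counting idea only, not Facets' internal computation.

```python
def exact_agreements_shared(ratings_a, ratings_b):
    """Count agreements on raw data values (shared Rating Scale view)."""
    return sum(a == b for a, b in zip(ratings_a, ratings_b))


def exact_agreements_relative(ratings_a, ratings_b):
    """Count agreements on levels relative to each rater's lowest
    observed category (individual Partial Credit view, simplified)."""
    low_a, low_b = min(ratings_a), min(ratings_b)
    return sum((a - low_a) == (b - low_b) for a, b in zip(ratings_a, ratings_b))


# Hypothetical paired ratings of the same five performances.
yours = [4, 5, 6, 5, 4]
mine = [1, 2, 3, 3, 1]

shared = exact_agreements_shared(yours, mine)
relative = exact_agreements_relative(yours, mine)
print(shared, relative)
```

With these made-up ratings there are no agreements on data values, but four of the five pairs agree once each rating is expressed relative to its rater's bottom observed category.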

Obs % = Observed % of exact agreements between raters on ratings under identical conditions.

Exp % = Expected % of exact agreements between raters on ratings under identical conditions, based on Rasch measures.
If Obs % ≈  Exp % then the raters may be behaving like independent experts.
If Obs % » Exp % then the raters may be behaving like "rating machines".

Here is the computation for "Expected Agreement %". We pair up another rater with the target rater who rated the same ratee on the same item of the same task of the same ......, so the raters rated the same performance under identical circumstances.

Then, for each rater we have an observed rating. They agree or not. The percentage of times raters agree with the target rater is the "Observed Agreement %".
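The pairing step can be sketched in Python as follows. The data layout, the function name and the choice to credit each pair to both raters are assumptions made for illustration; how Facets itself counts agreement opportunities is not specified here.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical observations: (ratee, item, rater, rating).
observations = [
    ("p1", "i1", "A", 2), ("p1", "i1", "B", 2), ("p1", "i1", "C", 3),
    ("p2", "i1", "A", 1), ("p2", "i1", "B", 2),
]


def observed_agreement_by_rater(observations):
    """Percent of paired ratings under identical conditions that are identical."""
    # Group ratings by rating condition (here: same ratee and same item).
    by_condition = defaultdict(list)
    for ratee, item, rater, rating in observations:
        by_condition[(ratee, item)].append((rater, rating))

    opportunities = defaultdict(int)
    agreements = defaultdict(int)
    for rated in by_condition.values():
        # Every pair of raters in the same condition is one opportunity,
        # credited to both raters in the pair (an assumption).
        for (r1, x1), (r2, x2) in combinations(rated, 2):
            for rater in (r1, r2):
                opportunities[rater] += 1
                agreements[rater] += int(x1 == x2)
    return {r: 100.0 * agreements[r] / opportunities[r] for r in opportunities}


result = observed_agreement_by_rater(observations)
print(result)
```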

For each rater we also have an (average) expected rating based on the Rasch measures. The (average) expected ratings will not agree unless the raters have the same leniency/severity measure.

But we also have the Rasch-model-based probabilities for each category of the rating scale for each rater. Suppose this is a 1,2,3 (3-category) rating scale.

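---------------------------------------------------------------------------------------------
|                           | Rater A | Rater B | Expected agreement between Raters A and B |
|                           |         |         | (assuming they are rating independently)  |
---------------------------------------------------------------------------------------------
| probability of category 1 |   10%   |   20%   | Category 1: 10% * 20% =  2%               |
| probability of category 2 |   40%   |   60%   | Category 2: 40% * 60% = 24%               |
| probability of category 3 |   50%   |   20%   | Category 3: 50% * 20% = 10%               |
---------------------------------------------------------------------------------------------
Expected agreement in any category = 2% + 24% + 10% = 36%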

This expected-agreement computation is performed over all pairs of raters and averaged to obtain the reported "Expected Agreement %".
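The pairwise computation for Raters A and B can be reproduced with a short Python sketch. The function name is hypothetical, and the category probabilities are the illustrative values from the example, not output from a Rasch estimation.

```python
def expected_agreement(probs_a, probs_b):
    """Probability that two independent raters choose the same category.

    probs_a and probs_b hold each rater's model-based probability for
    categories 1, 2, 3, ... in the same order.
    """
    return sum(pa * pb for pa, pb in zip(probs_a, probs_b))


rater_a = [0.10, 0.40, 0.50]  # categories 1, 2, 3
rater_b = [0.20, 0.60, 0.20]

pair_expected = expected_agreement(rater_a, rater_b)
print(f"{100 * pair_expected:.0f}%")  # 36%
```

Averaging this quantity over all rater pairs that share a rating condition would then give an overall expected agreement.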

Higher than expected agreement indicates statistical local dependence among the raters. This biases all the standard errors towards zero. An approximate guideline is:
"True" Standard error = "Reported Standard Error" * Maximum( 1, sqrt (Exact agreements / Expected)) for all elements.

In this example, the inflator for the S.E.'s of all elements of all facets approximates sqrt( 17838/13063.2) = 1.17.
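As a rough check of this arithmetic, the guideline can be written as a small Python function. The function name is hypothetical, and applying the inflator to the reported S.E. of .05 is shown only as an illustration of the guideline.

```python
import math


def se_inflator(exact_agreements, expected_agreements):
    """Approximate multiplier for reported standard errors.

    Never less than 1, so standard errors are not shrunk when observed
    agreement is at or below its expected value.
    """
    return max(1.0, math.sqrt(exact_agreements / expected_agreements))


inflator = se_inflator(17838, 13063.2)
adjusted_se = 0.05 * inflator  # reported S.E. of .05 from the table
print(round(inflator, 2), round(adjusted_se, 3))
```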

Alternatively, deflate the reported person-facet reliability, R, in accordance with the extent to which the raters are not independent. Based on the Spearman-Brown prophecy formula, an approximation is:
T = (100 - observed exact agreement%) / (100 - expected exact agreement%)
deflated reliability = T * R / ( (1-R) + T * R)
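A minimal Python sketch of this approximation follows. The function name and the reliability of 0.90 are hypothetical; the agreement percentages are those of the report above. The formula is applied exactly as stated, so the result would exceed R if observed agreement fell below expected agreement.

```python
def deflated_reliability(reliability, observed_pct, expected_pct):
    """Deflate a reported reliability R for non-independent raters."""
    t = (100.0 - observed_pct) / (100.0 - expected_pct)
    return t * reliability / ((1.0 - reliability) + t * reliability)


# Hypothetical person-facet reliability of 0.90, with the observed (29.5%)
# and expected (21.6%) exact agreement from the report above.
deflated = deflated_reliability(0.90, 29.5, 21.6)
print(round(deflated, 2))
```

Under these assumptions the reliability drops only slightly, from 0.90 to about 0.89.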

Example: 100 raters with a wide range of rater severity/leniency:

Exact agreements: 781 = 18.8%  Expected: 577.5 = 13.9%

With this large spread of rater severities, the prediction is that only 13.9% of the observations will show the raters giving the same rating under the same conditions. This accords with the wide range of severities.

There is somewhat more agreement than this in the data, 18.8%. This is typical of the psychology of rater behavior. We are conditioned from babyhood to agree with what we conceive to be the expectations of others. This behavior continues even for expert raters. Subconsciously they continue to feel a mental pressure to agree with the expectations of others. In this case, that pressure has increased observed agreement from 13.9% to 18.8%.

Whether you report this depends on the purpose of your paper. If it is an investigation into rater behavior, then this provides empirical evidence for a psychological conjecture. If your paper is a validity study of the instrument, then this aspect is probably too obscure to be meaningful for your audience.

See more at Inter-rater Reliability and Inter-rater correlations

Help for Facets Rasch Measurement Software: www.winsteps.com Author: John Michael Linacre.
