Inter-rater and intra-rater Reliability


Table 7.3.1  Reader Measurement Report  (arranged by MN).
------------------------------------------------------------------------------------------------
| Obsvd   Obsvd  Obsvd  Fair-M|        Model | Infit      Outfit   | Exact Agree. |            |
| Score   Count Average Avrage|Measure  S.E. |MnSq ZStd  MnSq ZStd | Obs %  Exp % | Nu Reader  |
------------------------------------------------------------------------------------------------
|   1524    288     5.3   5.26|   -.30   .05 | 1.2   2    1.2   2  |  28.2   20.9 |  8 8       |
|   1455    288     5.1   5.00|   -.16   .05 |  .5  -7     .5  -7  |  30.8   21.6 |  4 4       |
....
------------------------------------------------------------------------------------------------
RMSE (Model)  .05 Adj S.D.  .19  Separation  4.02  Strata  5.69  Reliability  .94 <-- Spearman
......
Inter-Rater agreement opportunities: 60480  Exact agreements: 17838 = 29.5%  Expected: 13063.2 = 21.6%
------------------------------------------------------------------------------------------------

 

Inter-rater= facet-number to report inter-rater agreement statistics.

 

In Facets Table 7, there are "Reliability" indexes reported for every facet. This reliability is the Spearman Reliability. Cronbach Alpha is also an estimate of Spearman Reliability. This reliability "distinguishes between different levels of severity among" the elements of the facet. Higher reliability = more levels.
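The summary line at the foot of the table (RMSE, Adj S.D., Separation, Strata, Reliability) follows the standard Rasch separation relationships. Here is a minimal sketch in Python, using the rounded values printed above (so the results differ slightly from the table's 4.02, 5.69 and .94, which Facets computes from unrounded values):

rmse = 0.05        # "RMSE (Model)": root-mean-square of the element standard errors
adj_sd = 0.19      # "Adj S.D.": S.D. of the measures, adjusted for measurement error

separation = adj_sd / rmse                           # 3.8 with these rounded inputs
strata = (4 * separation + 1) / 3                    # statistically distinct levels
reliability = separation**2 / (1 + separation**2)    # Spearman (separation) reliability

print(separation, strata, reliability)               # 3.8 5.4 0.93...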

 

Inter-rater reliability is not the same as Spearman Reliability. For inter-rater reliability, higher reliability = more similarity. There are three families of inter-rater reliability statistics. (i) Do the raters agree with each other about the examinee's rating? (ii) Do the raters agree with each other about which examinees are better and which examinees are worse? (iii) Do the raters give the correct rating to the performance?

 

Inter-rater reliability (i) is used for pass-fail decisions about the examinees. Inter-rater reliability (ii) is used when the rank-order of the examinees is crucial. Inter-rater reliability (iii) is used when certifying raters.

 


 

Intra-rater reliability can be deduced from the rater's fit statistics. The lower the mean-square fit, the higher the intra-rater reliability. This is because high intra-rater reliability implies that the ratings given by the rater can be accurately predicted from each other.

 

Intra-rater consistency would be maximized when a rater rates every performance in the same category. This would be equivalent to the "attenuation paradox" in Classical Test Theory. There is an optimal consistency beyond which validity drops.

 

So here is an approach:

1. Model each rater to have a unique rating scale. e.g., Models=?,#,?,?,R5

2. We want the category frequencies to be a smooth unimodal distribution with all categories well represented. This will also force the Andrich thresholds to be ordered.

3. We want the average measures for each category to be close to their expectations.

4. We want the mean-square fit statistics for each category to be close to 1.0.

 

A rater who meets these requirements is consistent from a Rasch perspective.
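As a rough illustration of checks 2-4 (not a Facets feature), here is a small Python sketch applied to one rater's category statistics. The numbers and the tolerances are hypothetical; in practice the values would come from Facets Table 8 for that rater:

counts       = [4, 12, 30, 25, 9]              # observed category frequencies, categories 1-5
thresholds   = [-2.1, -0.8, 0.9, 2.0]          # Andrich thresholds
avg_measures = [-1.5, -0.6, 0.2, 1.1, 1.9]     # average measures by category
exp_measures = [-1.4, -0.7, 0.3, 1.0, 2.0]     # their model expectations
mnsq         = [1.1, 0.9, 1.0, 1.2, 0.8]       # mean-square fit by category

peak = counts.index(max(counts))
unimodal = all(a <= b for a, b in zip(counts[:peak], counts[1:peak + 1])) and \
           all(a >= b for a, b in zip(counts[peak:], counts[peak + 1:]))
ordered_thresholds = all(a < b for a, b in zip(thresholds, thresholds[1:]))
measures_close = all(abs(a - e) <= 0.3 for a, e in zip(avg_measures, exp_measures))  # illustrative tolerance
fit_close_to_1 = all(0.7 <= m <= 1.3 for m in mnsq)                                  # illustrative tolerance

print(unimodal, ordered_thresholds, measures_close, fit_close_to_1)   # True True True True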

 


 

There is no generally-agreed index of inter-rater reliability (IRR). The choice of IRR depends on the purpose for which the ratings are being collected, and the philosophy underlying the rating process.
 
For raters, there are a number of steps in deciding what quality-indexes to report:
1. Are the raters intended to act as independent experts or as "rating machines"?
2. Are the ratings reflective of criterion-levels or of relative performance?
3. How are differences in rater leniency to be managed?
4. How are rater disagreements to be managed?

 

First, you have to decide what type of rater agreement you want.

 

Do you want the raters to agree exactly with each other on the ratings awarded? The "rater agreement %".

 

Do you want the raters to agree about which performances are better and which are worse? Correlations

 

Do you want the raters to have the same leniency/severity? "1 - Separation Reliability" or "Fixed Chi-squared"

 

Do you want the raters to behave like independent experts? Rasch fit statistics
 
Typical indexes include: proportion of exact agreements (Cohen's kappa), correlations, variances (G-Theory).

 

In the literature there is no clear definition of inter-rater reliability, so you must decide what the term means for your situation.

A. It can mean "to what extent do pairs of raters agree on the same rating?". This is the "exact observed agreement" statistic. If you want your raters to act like "rating machines" (human optical scanners), then you expect to see agreement of 90%+. Raters are often trained to act like this.

B. It can mean "are the ratings of pairs of raters highly correlated?". Facets does not report this directly.

C. Are pairs of raters acting like independent experts (the ideal for Facets)? If so the "observed agreements" will be close to the "expected agreements".

D. Do raters have the same level of leniency/severity? This is reported by the "Reliability (not inter-rater)" statistic. For raters we like to see this close to 0, so that the rater measures are not reliably different. We also like to see the "Fixed all-same" chi-squared test not be rejected.

 

Rasch agreement             | Interpretation
----------------------------+-----------------------------------------------------------------------------
Observed < Expected         | Indicates disagreement; this normally happens with untrained raters.
Observed ≈ Expected         | Raters act independently. Needs verification with fit statistics.
Observed somewhat > Expected| Normal for trained raters. Training emphasizes agreement with others, but
                            | the rating process requires raters to rate independently.
Observed >> Expected        | Raters do not rate independently. There may be pressure to agree with other
                            | raters.
Observed > 90%              | Raters behave like a rating machine. Seriously consider excluding them from
                            | the measurement model: specify the rater facet as a demographic facet, ",D".

 

There is not a generally agreed definition of "inter-rater reliability". Do you intend your raters to act like "rating machines" or as "independent experts"? "Rating machines" are expected to give identical ratings under identical conditions. "Independent experts" are expected to show variation under identical conditions. Facets models raters to be "independent experts". An inter-rater reliability coefficient (IRR) is not computed. But, from one perspective, it is the reverse of the Separation Reliability, i.e., 1 - Separation Reliability.

 

For "rating machines", there are several inter-rater approaches. For these you need to use other software:

 

1. Raters must agree on the exact value of the ratings: use a Cohen's-Kappa type of inter-rater reliability index. Cohen's Kappa is (Observed Agreement% - Chance Agreement%)/(100-Chance Agreement%) where chance is determined by the marginal category frequencies. A Rasch version of this would use the "Expected Agreement%" for an adjustment based on "chance + rater leniency + rating scale structure". Then the Rasch-Cohen's Kappa would be: (Observed%-Expected%)/(100-Expected%). Under Rasch-model conditions this would be close to 0.
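As a small worked sketch of this Rasch-Cohen's Kappa (Python; the percentages are taken from the Guilford example later on this page):

def rasch_cohen_kappa(observed_pct, expected_pct):
    # (Observed% - Expected%) / (100 - Expected%): close to 0 under Rasch-model conditions
    return (observed_pct - expected_pct) / (100.0 - expected_pct)

# Summary percentages from the Guilford example below: Obs 31.4%, Exp 25.4%
print(round(rasch_cohen_kappa(31.4, 25.4), 2))   # 0.08: about the model-expected level of agreement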

 

For Rasch-Kappa, Mojtaba Taghvafard's investigations suggest that:

 

Rasch-Kappa value             | Meaning
------------------------------+-----------------------------------------------
-0.2 to +0.2                  | model-expected level of agreement
-0.2 to -0.4 or +0.2 to +0.4  | a little more (or less) agreement than modeled
≥ +0.5 or ≤ -0.5              | high agreement (or disagreement)

 

2. Raters must agree on higher and lower performance ratings: use a correlation-based inter-rater reliability index.

 

3. Inter-rater variance must be much less than inter-examinee variance: compare the Rater facet S.D. with the Examinee facet S.D.

 

When the raters are behaving like rating machines, alternative analytical approaches should be considered, such as Wilson M. & Hoskens M. (2001) The Rater Bundle Model, Journal of Educational and Behavioral Statistics, 26, 3, 283-306, or consider specifying the rater facet in your analysis as a demographic facet (,D), which has all elements anchored at zero.

 

The computation

How many ratings are made under identical conditions (usually by different raters) and how often are those ratings in exact agreement? This investigation is done pairwise across all raters. All facets except the Inter-rater= facet participate in the matching. If the inter-rater facet is Entered= more than once, only the first entry is active for this comparison.

 

To exclude dummy and otherwise irrelevant facets from the matching, do a special run with those facets marked by X in the Models= statements. For example, facet 1 is persons, facet 2 is gender (sex) (dummy, anchored at zero), facet 3 is rater, facet 4 is item, facet 5 is rating day (dummy, anchored at 0). Then Gender and Rating Day are irrelevant to the pairing of raters:

Inter-rater=3

Model = ?,X,?,?,X, R6
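Conceptually, the pairwise matching works like this illustrative Python sketch (not Facets' internal code): observations that share the same elements on every participating facet except the Inter-rater= facet are grouped, and every pair of ratings within a group is one comparison.

from collections import defaultdict
from itertools import combinations

def exact_agreement(observations):
    # observations: (rater, context, rating); "context" = elements of all other participating facets
    by_context = defaultdict(list)
    for rater, context, rating in observations:
        by_context[context].append(rating)
    agreements = opportunities = 0.0
    for ratings in by_context.values():
        for a, b in combinations(ratings, 2):    # each pair of raters in the same context
            opportunities += 1
            agreements += (a == b)
    return agreements, opportunities

obs = [("Avogadro", ("Anne", "Attack"), 5),
       ("Cavendish", ("Anne", "Attack"), 5),
       ("Brahe", ("Anne", "Attack"), 6)]
print(exact_agreement(obs))                      # (1.0, 3.0): one agreeing pair out of three pairs

Facets itself counts one "agreement opportunity" per observation and accumulates fractional agreements per observation (see the worked example below), but when every context has the same number of raters the two ways of counting give the same percentage.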

 

1. Inter-Rater Agreement Opportunities

 

Raters:           | Junior     |         |             | Inter-Rater Agreement | Observed Exact
Senior scientists | Scientists | Traits  | Observation | Opportunities         | Agreement
------------------+------------+---------+-------------+-----------------------+-------------------------------------------
Avogadro          | Anne       | Attack  |      5      |           1           | 0.5 (agrees with Cavendish but not Brahe)
Cavendish         | Anne       | Attack  |      5      |           1           | 0.5 (agrees with Avogadro but not Brahe)
Brahe             | Anne       | Attack  |      6      |           1           | 0 (disagrees)
Avogadro          | Anne       | Basis   |      5      |           1           | 1 (agrees)
Brahe             | Anne       | Basis   |      5      |           1           | 1
Cavendish         | Anne       | Basis   |      5      |           1           | 1
Avogadro          | Anne       | Clarity |      3      |           1           | 0 (disagrees)
Brahe             | Anne       | Clarity |      4      |           1           | 0
Cavendish         | Anne       | Clarity |      5      |           1           | 0

 

In the Table above, "Inter-Rater Agreement Opportunities" are computed for each rater. There is one opportunity for each observation awarded by a rater under the same circumstances (i.e., same person, same item, same task, ....) as another observation. In the Guilford.txt example, there are 105 observations, all in situations where there are multiple raters, so there are 105 agreement opportunities.

 

"Observed Exact Agreement" is the proportion of times one observation is exactly the same as one of the other observations for which there are the same circumstances. If, under the same circumstances, the raters all agree, then the Exact Agreement is 1 for each observation. If the raters all disagree, then the Exact Agreement is 0 for each observation. If some raters agree, then the Exact agreement for each observation is the fraction of opportunities to agree with other raters. In the Guilford data, there are 35 sets of 3 ratings: 5 sets of complete agreement = 5 *3 =15. There are 18 sets of partial agreement = 18 * 2 * 0.5 = 18. There are 12 sets of no agreement = 12 * 0 = 0. The agreements sum to 33.

 

2. Fleiss' Kappa

 

By contrast, Fleiss' kappa has the formula: Kappa = (Pobserved - Pchance) / (1 - Pchance) computed across all raters and rating-scale categories.

 

The proportion of observations in each category j is reported in Table 8 as the "Count %".

 

+----------------------+
|      DATA            |
| Category Counts  Cum.|
|Score   Used   %    % |
|----------------------+
|  1        4   4%   4%|
|  2        4   4%   8%|
|  3       25  24%  31%|
|  4        8   8%  39%|
|  5       31  30%  69%|
|  6        6   6%  74%|
|  7       21  20%  94%|
|  8        3   3%  97%|
|  9        3   3% 100%|
+----------------------+

 

Pchance = Σ(Count %/100)² = .04² + .04² + .24² + .08² + .30² + .06² + .20² + .03² + .03² = 0.20
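As a quick check of Pchance (Python, using the "Count %" column above):

count_pct = [4, 4, 24, 8, 30, 6, 20, 3, 3]       # Table 8 "Count %" for categories 1-9
p_chance = sum((p / 100.0) ** 2 for p in count_pct)
print(p_chance)                                  # about 0.20 (0.2026)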

 

Then, consider the "7 Junior Scientists × 5 Traits" as 35 "subjects", each rated by the three raters, so that n = 3 observations for each subject.

 

Table of Inter-rater Agreements

 

      Subject              |  Senior Scientist = Rater  |
Junior Scientist | Trait   |    1    |    2    |   3    | Agreements
-----------------+---------+---------+---------+--------+-----------
        1        |    2    |    5    |    5    |   5    | Complete
        1        |    5    |    3    |    3    |   3    | Complete
        2        |    2    |    7    |    7    |   7    | Complete
        2        |    3    |    5    |    5    |   5    | Complete
        7        |    4    |    5    |    5    |   5    | Complete
        1        |    3    |    3    |    4    |   5    | None
        1        |    4    |    5    |    6    |   7    | None
        2        |    1    |    9    |    8    |   7    | None
        3        |    4    |    7    |    6    |   5    | None
        3        |    5    |    1    |    6    |   5    | None
        4        |    3    |    1    |    4    |   3    | None
        4        |    5    |    3    |    5    |   1    | None
        5        |    4    |    8    |    2    |   7    | None
        5        |    5    |    5    |    3    |   7    | None
        6        |    2    |    5    |    4    |   3    | None
        6        |    5    |    1    |    2    |   3    | None
        7        |    5    |    5    |    4    |   7    | None
        1        |    1    |    5    |    6    |   5    | Partial
        2        |    4    |    8    |    7    |   7    | Partial
        2        |    5    |    5    |    2    |   5    | Partial
        3        |    1    |    3    |    4    |   3    | Partial
        3        |    2    |    3    |    5    |   5    | Partial
        3        |    3    |    3    |    3    |   5    | Partial
        4        |    1    |    7    |    5    |   5    | Partial
        4        |    2    |    3    |    6    |   3    | Partial
        4        |    4    |    3    |    5    |   3    | Partial
        5        |    1    |    9    |    2    |   9    | Partial
        5        |    2    |    7    |    4    |   7    | Partial
        5        |    3    |    7    |    3    |   7    | Partial
        6        |    1    |    3    |    4    |   3    | Partial
        6        |    3    |    3    |    6    |   3    | Partial
        6        |    4    |    5    |    4    |   5    | Partial
        7        |    1    |    7    |    3    |   7    | Partial
        7        |    2    |    7    |    3    |   7    | Partial
        7        |    3    |    5    |    5    |   7    | Partial

 

Pi = extent of agreement for subject i = (Σ(count of observations for subject i in category j)² - n)/(n(n-1))

 

For 5 subjects where all 3 raters rate in the same category, Pi = (3² - 3) / (3*2) = 1

For 18 subjects where 2 raters rate in the same category, and 1 rater in a different category, Pi = (2² + 1² - 3)/(3*2) = 0.33

For 12 subjects where all 3 raters rate in different categories, Pi = (1² + 1² + 1² - 3)/(3*2) = 0

 

Pobserved = Mean (Pi) = (1*5 + 18*0.33 + 12*0)/35 = 0.31

 

Fleiss kappa = (0.31 - 0.20) / (1 - 0.20) = 0.11 / 0.80 = 0.14, which is considered "slight agreement" in Wikipedia - Fleiss Kappa
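A compact check of this computation (Python), using the agreement pattern from the table above and Pchance = 0.20 from Table 8:

n = 3                                    # ratings per subject
def P_i(category_counts):                # counts of the 3 ratings per category, for one subject
    return (sum(c * c for c in category_counts) - n) / float(n * (n - 1))

p_complete = P_i([3])                    # all three ratings identical -> 1.0
p_partial  = P_i([2, 1])                 # two agree, one differs      -> 0.33
p_none     = P_i([1, 1, 1])              # all three differ            -> 0.0

p_observed = (5 * p_complete + 18 * p_partial + 12 * p_none) / 35.0
p_chance   = 0.20                        # from Table 8 above
kappa      = (p_observed - p_chance) / (1 - p_chance)
print(round(p_observed, 2), round(kappa, 2))   # 0.31 0.14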

 

Here are the results from Guilford's data. Note that Avogadro and Cavendish show much higher agreement rates than the model predicts. It seems that they share something which contrasts with Brahe:

-------------------------  -------------------------------------
| Obsvd   Obsvd  Obsvd     | Exact Agree. |                    |
| Score   Count Average    | Obs %  Exp % | N Senior scientists|
-------------------------  -------------------------------------
|    156     35     4.5    |  21.4   25.2 | 2 Brahe            |
|    171     35     4.9    |  35.7   25.8 | 1 Avogadro         |
|    181     35     5.2    |  37.1   25.3 | 3 Cavendish        |
----------------------------------------------------------------------------------------
Rater agreement opportunities: 105  Exact agreements: 33 = 31.4%  Expected: 26.7 = 25.4%
----------------------------------------------------------------------------------------

 

When an anchorfile= is produced and used for a subsequent analysis with Brahe commented out, the agreement between Avogadro and Cavendish is twice what is expected!

-------------------------  -------------------------------------
| Obsvd   Obsvd  Obsvd     | Exact Agree. |                    |
| Score   Count Average    | Obs %  Exp % | N Senior scientists|
-------------------------  -------------------------------------
|    171     35     4.9    |  51.4   25.9 | 1 Avogadro         |
|    181     35     5.2    |  51.4   25.9 | 3 Cavendish        |
-------------------------  -------------------------------------

 

3. Krippendorff's Alpha

 

"Krippendorff’s alpha (Kα) is a reliability coefficient developed to measure the agreement among

observers, coders, judges, raters, or measuring instruments drawing distinctions among typically

unstructured phenomena or assign computable values to them. Kα emerged in content analysis but

is widely applicable wherever two or more methods of generating data are applied to the same

set of objects, units of analysis, or items and the question is how much the resulting data can be

trusted to represent something real." K. Krippendorff (2011) Computing Krippendorff’s Alpha-Reliability.

 

Kα = 1 - Observed Disagreement / Expected Disagreement

where Kα =1 is perfect agreement, Kα=0 is agreement by chance or worse.

 

1,m = rating-scale categories

A = Agreement opportunities = count of situations where a pair of raters have the same elements for all other facets

nj = count of observations of category j across all raters

N = sum (nj) for j=1,m

C = Agreement by chance = sum (nj*nj)/N for j=1,m - N

 

a) assuming all raters have the same leniency/severity!

O = Observed agreement = count of situations where pairs of raters have given the same rating

E = Expected agreement = sum(pj*pj) for j=1 to m across A, with measures of each situation (except raters)

 

b) assuming all raters have their own severity

O = Observed agreement = count of situations where pairs of raters have rating residuals that differ by 0.5 or less.

E = Expected agreement = sum(pj*pk) for j=1 to m, k=1 to m, and residual difference<=0.5, across A, with measures of each situation (with raters)
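For reference, here is a sketch of the standard nominal-data Kα in its coincidence-matrix form (as in Krippendorff 2011), not the Facets residual-based variants (a) and (b) above; the three toy units are invented for illustration:

from collections import Counter, defaultdict

def krippendorff_alpha_nominal(units):
    # units: one list of ratings per unit of analysis (same examinee, item, ...; different raters)
    coincidence = defaultdict(float)     # (c, k) -> weighted count of c-k pairs within units
    totals = Counter()                   # nj: how often each category is used (pairable values only)
    n = 0
    for values in units:
        m = len(values)
        if m < 2:
            continue                     # a lone rating has no one to agree or disagree with
        n += m
        for i, c in enumerate(values):
            totals[c] += 1
            for j, k in enumerate(values):
                if i != j:
                    coincidence[(c, k)] += 1.0 / (m - 1)
    observed_matches = sum(coincidence[(c, c)] for c in totals)
    D_o = 1 - observed_matches / n                                                 # observed disagreement
    D_e = 1 - sum(nc * (nc - 1) for nc in totals.values()) / float(n * (n - 1))    # expected disagreement
    return 1 - D_o / D_e

print(krippendorff_alpha_nominal([[5, 5, 6], [5, 5, 5], [3, 4, 5]]))   # about 0.05 for these toy units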

