Interrater and intrarater Reliability 
In Facets Table 7, there are "Reliability" indexes reported for every facet. This reliability is the Spearman Reliability. Cronbach Alpha is also an estimate of Spearman Reliability. This reliability "distinguishes between different levels of severity among" the elements of the facet. Higher reliability = more levels.
Interrater reliability is not the same as Spearman Reliability. For interrater reliability, higher reliability = more similarity. There are three families of interrater reliability statistics. (i) Do the raters agree with each other about the examinee's rating? (ii) Do the raters agree with each other about which examinees are better and which examinees are worse? (iii) Do the raters give the correct rating to the performance?
Interrater reliability (i) is used for pass-fail decisions about the examinees. Interrater reliability (ii) is used when the rank-order of the examinees is crucial. Interrater reliability (iii) is used when certifying raters.
Intrarater reliability can be deduced from the rater's fit statistics. The lower the mean-square fit, the higher the intrarater reliability. This is because high intrarater reliability implies that the ratings given by the rater can be accurately predicted from each other.
There is no generally-agreed index of interrater reliability (IRR). The choice of IRR depends on the purpose for which the ratings are being collected, and the philosophy underlying the rating process.
For raters, there are a number of steps in deciding what quality indexes to report:
1. Are the raters intended to act as independent experts or as "rating machines"?
2. Are the ratings reflective of criterion-levels or of relative performance?
3. How are differences in rater leniency to be managed?
4. How are rater disagreements to be managed?
First, you have to decide what type of rater agreement you want.
Do you want the raters to agree exactly with each other on the ratings awarded? The "rater agreement %".
Do you want the raters to agree about which performances are better and which are worse? Correlations.
Do you want the raters to have the same leniency/severity? "1 - Separation Reliability" or "Fixed Chi-square".
Do you want the raters to behave like independent experts? Rasch fit statistics.
Typical indexes include: proportion of exact agreements (Cohen's kappa), correlations, variances (G-Theory).
In the literature there is no clear definition of "interrater reliability", so you must decide what the term means for your situation.
A. It can mean "to what extent do pairs of raters agree on the same rating?". This is the "exact observed agreement" statistic. If you want your raters to act like "rating machines" (human optical scanners), then you expect to see agreement of 90%+. Raters are often trained to act like this.
B. It can mean "are the ratings of pairs of raters highly correlated?". Facets does not report this directly.
C. Are pairs of raters acting like independent experts (the ideal for Facets)? If so the "observed agreements" will be close to the "expected agreements".
D. Do raters have the same level of leniency/severity? This is reported by the "Reliability (not interrater)" statistic. For raters we like to see this close to 0, so that the rater measures are not reliably different. We also like to see the "Fixed all-same" chi-square test not be rejected.
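Approach D can be explored numerically with a short Python sketch. It assumes conventional definitions that are not spelled out on this page: separation reliability taken as (observed variance of the measures - mean error variance) / observed variance, and the fixed "all same" chi-square taken as the usual information-weighted homogeneity statistic with d.f. = number of raters - 1. The rater measures and standard errors are made-up values, not Facets output, and the function names are hypothetical.

```python
def separation_reliability(measures, std_errors):
    """Assumed definition: 'true' variance / observed variance of the measures."""
    n = len(measures)
    mean_measure = sum(measures) / n
    observed_var = sum((d - mean_measure) ** 2 for d in measures) / n
    error_var = sum(se ** 2 for se in std_errors) / n  # mean-square standard error
    if observed_var == 0:
        return 0.0
    # Clamp at zero when measurement error exceeds the observed spread.
    return max(observed_var - error_var, 0.0) / observed_var


def fixed_all_same_chi_square(measures, std_errors):
    """Assumed form: information-weighted test that all measures are equal."""
    weights = [1.0 / se ** 2 for se in std_errors]
    weighted_sum = sum(w * d for w, d in zip(weights, measures))
    weighted_sq = sum(w * d * d for w, d in zip(weights, measures))
    chi_square = weighted_sq - weighted_sum ** 2 / sum(weights)
    return chi_square, len(measures) - 1  # statistic and degrees of freedom


# Hypothetical rater severities (logits) and standard errors, for illustration only.
rater_measures = [0.10, -0.05, -0.05]
rater_std_errors = [0.20, 0.20, 0.20]

reliability = separation_reliability(rater_measures, rater_std_errors)
chi_square, dof = fixed_all_same_chi_square(rater_measures, rater_std_errors)
print(f"Separation reliability: {reliability:.2f}")
print(f"Fixed all-same chi-square: {chi_square:.3f} with {dof} d.f.")
```

In this made-up case the raters differ by less than their measurement error, so the reliability is 0 and the chi-square is small relative to its degrees of freedom, which is the pattern described as desirable for raters.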
Rasch Agreement               Interpretation
----------------------------  ------------------------------------------------------------
Observed < Expected           Indicates disagreement. Normally happens with untrained raters.
Observed ≈ Expected           Raters act independently. Needs verification with fit statistics.
Observed somewhat > Expected  Normal for trained raters. Training emphasizes agreement with others, but rating requires raters to rate independently.
Observed >> Expected          Raters do not rate independently. There may be pressure to agree with other raters.
Observed > 90%                Raters behave like rating machines. Seriously consider excluding the rater facet from the measurement model.
There is not a generally agreed definition of "interrater reliability". Do you intend your raters to act like "rating machines" or as "independent experts"? "Rating machines" are expected to give identical ratings under identical conditions. "Independent experts" are expected to show variation under identical conditions. Facets models raters to be "independent experts". An interrater reliability coefficient, IRR, is not computed. But, from one perspective, it is the reverse of the Separation Reliability, i.e., 1 - Separation Reliability.
For "rating machines", there are several interrater approaches. For these you need to use other software:
1. Raters must agree on the exact value of the ratings: use a Cohen's-Kappa type of interrater reliability index. Cohen's Kappa is (Observed Agreement% - Chance Agreement%)/(100 - Chance Agreement%), where chance is determined by the marginal category frequencies. A Rasch version of this would use the "Expected Agreement%" for an adjustment based on "chance + rater leniency + rating scale structure". Then the Rasch-Cohen's Kappa would be: (Observed% - Expected%)/(100 - Expected%). Under Rasch-model conditions this would be close to 0.
2. Raters must agree on higher and lower performance ratings: use a correlation-based interrater reliability index.
3. Interrater variance must be much less than interexaminee variance: compare the Rater facet S.D. with the Examinee facet S.D.
When the raters are behaving like rating machines, alternative analytical approaches should be considered, such as Wilson M. & Hoskens M. (2001) The Rater Bundle Model, Journal of Educational and Behavioral Statistics, 26, 3, 283-306, or consider omitting the rater facet from your analysis.
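The Kappa formula in approach 1 can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and the two short rating lists are Avogadro's and Brahe's ratings of Anne taken from the worked table later on this page. No Expected Agreement% is computed here, because that value would have to be read from the Facets output.

```python
from collections import Counter


def kappa_from_percent(observed_pct, baseline_pct):
    """(Observed% - Baseline%) / (100 - Baseline%).

    Baseline is Chance Agreement% for Cohen's Kappa, or the
    Expected Agreement% for the Rasch-Cohen's Kappa.
    """
    if baseline_pct >= 100:
        raise ValueError("baseline agreement must be below 100%")
    return (observed_pct - baseline_pct) / (100.0 - baseline_pct)


def observed_agreement_pct(ratings_a, ratings_b):
    """Percent of paired ratings that are exactly equal."""
    agreements = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * agreements / len(ratings_a)


def chance_agreement_pct(ratings_a, ratings_b):
    """Chance agreement from each rater's marginal category frequencies."""
    n = len(ratings_a)
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    return 100.0 * sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)


# Ratings of Anne on Attack, Basis, Clarity (from the worked table on this page).
avogadro = [5, 5, 3]
brahe = [6, 5, 4]

observed = observed_agreement_pct(avogadro, brahe)
chance = chance_agreement_pct(avogadro, brahe)
cohen_kappa = kappa_from_percent(observed, chance)
print(f"Observed {observed:.1f}%, chance {chance:.1f}%, kappa {cohen_kappa:.2f}")
```

With only three ratings per rater, the result is no more than a check on the arithmetic. For the Rasch-Cohen's Kappa, the same function would be called as kappa_from_percent(observed_pct, expected_pct) using the percentages that Facets reports.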
The computation
How many ratings are made under identical conditions (usually by different raters) and how often are those ratings in exact agreement? This investigation is done pairwise across all raters. All facets except the Interrater= facet participate in the matching. If the interrater facet is Entered= more than once, only the first entry is active for this comparison.
To exclude dummy facets and irrelevant ones, do a special run with those marked by X in the model statements. For example, facet 1 is persons, facet 2 is gender (sex) (dummy, anchored at zero), facet 3 is rater, facet 4 is item, facet 5 is rating day (dummy, anchored at 0). Then Gender and Rating Day are irrelevant to the pairing of raters:
Interrater=3
Model = ?,X,?,?,X, R6
Raters: Senior scientists  Junior scientists  Traits   Observation  Inter-Rater Agreement Opportunities  Observed Exact Agreement
-------------------------  -----------------  -------  -----------  -----------------------------------  -----------------------------------------
Avogadro                   Anne               Attack   5            1                                    0.5 (agrees with Cavendish but not Brahe)
Cavendish                  Anne               Attack   5            1                                    0.5 (agrees with Avogadro but not Brahe)
Brahe                      Anne               Attack   6            1                                    0 (disagrees)

Avogadro                   Anne               Basis    5            1                                    1 (agrees)
Brahe                      Anne               Basis    5            1                                    1
Cavendish                  Anne               Basis    5            1                                    1

Avogadro                   Anne               Clarity  3            1                                    0 (disagrees)
Brahe                      Anne               Clarity  4            1                                    0
Cavendish                  Anne               Clarity  5            1                                    0
In the Table above, "Inter-Rater Agreement Opportunities" are computed for each rater. There is one opportunity for each observation awarded by a rater under the same circumstances (i.e., same person, same item, same task, ....) as another observation. In the Guilford.txt example, there are 105 observations, all in situations where there are multiple raters, so there are 105 agreement opportunities.
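The pairing can be sketched in Python using the nine observations for Anne in the Table. The sketch treats "same circumstances" as same junior scientist and same trait, and scores each observation by the share of the other ratings in its group that match it exactly. The function name and data layout are hypothetical and are not part of Facets.

```python
from collections import defaultdict

# (rater, junior scientist, trait, rating) for Anne, copied from the Table.
observations = [
    ("Avogadro", "Anne", "Attack", 5),
    ("Cavendish", "Anne", "Attack", 5),
    ("Brahe", "Anne", "Attack", 6),
    ("Avogadro", "Anne", "Basis", 5),
    ("Brahe", "Anne", "Basis", 5),
    ("Cavendish", "Anne", "Basis", 5),
    ("Avogadro", "Anne", "Clarity", 3),
    ("Brahe", "Anne", "Clarity", 4),
    ("Cavendish", "Anne", "Clarity", 5),
]


def exact_agreement(obs):
    """Return (rater, trait, agreement) for every agreement opportunity."""
    groups = defaultdict(list)
    for rater, examinee, trait, rating in obs:
        groups[(examinee, trait)].append((rater, rating))

    rows = []
    for (examinee, trait), rated in groups.items():
        if len(rated) < 2:
            continue  # a lone rating has no partner, so no opportunity
        for i, (rater, rating) in enumerate(rated):
            others = [r for j, (_, r) in enumerate(rated) if j != i]
            share = sum(r == rating for r in others) / len(others)
            rows.append((rater, trait, share))
    return rows


rows = exact_agreement(observations)
opportunities = len(rows)
agreement_sum = sum(share for _, _, share in rows)
for rater, trait, share in rows:
    print(f"{rater:10s} {trait:8s} {share}")
print(f"{agreement_sum} exact agreements in {opportunities} opportunities")
```

For these nine observations the agreements sum to 4 out of 9 opportunities: 1 from Attack, 3 from Basis, and 0 from Clarity.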
"Observed Exact Agreement" is the proportion of times one observation is exactly the same as one of the other observations for which there are the same circumstances. If, under the same circumstances, the raters all agree, then the Exact Agreement is 1 for each observation. If the raters all disagree, then the Exact Agreement is 0 for each observation. If some raters agree, then the Exact agreement for each observation is the fraction of opportunities to agree with other raters. In the Guilford data, there are 35 sets of 3 ratings: 5 sets of complete agreement = 5 *3 =15. There are 18 sets of partial agreement = 18 * 2 * 0.5 = 18. There are 12 sets of no agreement = 12 * 0 = 0. The agreements sum to 33.
By contrast, Fleiss' kappa has the formula: Kappa = (Pobserved - Pchance) / (1 - Pchance)
Proportion of observations in category j is reported in Table 8 as the "Count %".
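+------------------------+
|          DATA          |
|    Category Counts Cum.|
| Score    Used    %   % |
|------------------------|
|   1        4    4%   4%|
|   2        4    4%   8%|
|   3       25   24%  31%|
|   4        8    8%  39%|
|   5       31   30%  69%|
|   6        6    6%  74%|
|   7       21   20%  94%|
|   8        3    3%  97%|
|   9        3    3% 100%|
+------------------------+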
Pchance = Σ(Count %/100)² = .04² + .04² + .24² + .08² + .30² + .06² + .20² + .03² + .03² = 0.20
Consider the "7 Junior scientists x 5 Traits" as 35 "subjects". Each subject is rated by three raters, so there are n = 3 observations for each subject.
Pi = extent of agreement for subject i = (Σ(count of observations for subject i in category j)² - n)/(n(n-1))
For 5 sets of complete agreement with 3 observations, Pi = (3² - 3)/(3*2) = 1
For 18 sets of partial agreement with 3 observations, Pi = (2² + 1² - 3)/(3*2) = 0.33
For 12 sets of disagreement with 3 observations, Pi = (1² + 1² + 1² - 3)/(3*2) = 0
Pobserved = Mean (Pi) = (5*1 + 18*0.33 + 12*0)/35 = 0.31. This matches the proportion of exact agreements, 33/105 = 0.31.
Fleiss kappa = (0.31 - 0.20) / (1 - 0.20) = 0.11 / 0.80 = 0.14, which is considered "slight agreement".
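The Fleiss kappa arithmetic can be checked with a short Python sketch. It uses the category counts from Table 8 and the set counts stated earlier for the Guilford data (5 sets of complete agreement, 18 of partial agreement, 12 of no agreement, 35 sets in all). The variable and function names are hypothetical. Because the sketch uses unrounded proportions, its result differs slightly from a hand calculation with rounded percentages.

```python
# "Counts Used" for categories 1-9, from Table 8.
category_counts = [4, 4, 25, 8, 31, 6, 21, 3, 3]
total = sum(category_counts)  # 105 observations

# Pchance: sum of squared category proportions.
p_chance = sum((count / total) ** 2 for count in category_counts)


def subject_agreement(counts):
    """Pi for one subject; counts = ratings per category used for that subject."""
    n = sum(counts)
    return (sum(c * c for c in counts) - n) / (n * (n - 1))


# Agreement pattern -> number of sets with that pattern (three raters per set).
patterns = {
    (3,): 5,        # all three raters agree
    (2, 1): 18,     # two agree, one differs
    (1, 1, 1): 12,  # all three differ
}
n_subjects = sum(patterns.values())  # 35 sets

p_observed = sum(
    subject_agreement(pattern) * n_sets for pattern, n_sets in patterns.items()
) / n_subjects

fleiss_kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"Pchance   = {p_chance:.3f}")
print(f"Pobserved = {p_observed:.3f}")
print(f"Kappa     = {fleiss_kappa:.3f}")
```

With these inputs, Pobserved is 11/35 (about 0.31), Pchance is about 0.20, and kappa comes out a little under 0.15.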
Here are the results from Guilford's data. Note that Avogadro and Cavendish show much higher agreement rates than the model predicts. It seems that they share something which contrasts with Brahe:
When an anchorfile= is produced, and used for a subsequent analysis with Brahe commented out, then the agreement between Avogadro and Cavendish is twice what is expected!
Help for Facets Rasch Measurement Software: www.winsteps.com Author: John Michael Linacre.