Fit diagnosis: infit outfit mean-square standardized
Remember that our purpose is to measure the persons, not to optimize the items and raters. A good approach is to compute the person measures based on all the different item selections that you think are reasonable. Start with all the items, and then reduce to smaller sets of items. Crossplot the person measures. If the person measures are collinear, use the larger set of items. If the person measures are not collinear, use the set of items which produces the more meaningful set of person measures.
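One rough numerical companion to the cross-plot is the correlation between the two sets of person measures. The sketch below assumes the measures have already been exported from two analyses of the same persons; the function name and the example values are hypothetical, and a high correlation is only an indication of collinearity, not a substitute for inspecting the plot.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of person measures."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical person measures (logits) for the same six persons,
# estimated once from all items and once from a reduced item set.
measures_all_items = [-1.2, -0.4, 0.1, 0.8, 1.5, 2.3]
measures_reduced = [-1.0, -0.5, 0.2, 0.9, 1.4, 2.5]

r = pearson_r(measures_all_items, measures_reduced)
print(f"correlation between the two sets of person measures: {r:.3f}")
```

If the points fall along a line, the larger item set can be kept; if not, the choice rests on which set of measures is more meaningful.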
What do Infit Mean-square, Outfit Mean-square, Infit Zstd (z-standardized), Outfit Zstd (z-standardized) mean?
Every observation contributes to both infit and outfit. But the weighting of the observations differs. On-target observations contribute less to outfit than to infit.
Outfit: outlier-sensitive fit statistic. This is based on the conventional chi-square statistic. This is more sensitive to unexpected observations by persons on items that are relatively very easy or very hard for them (and vice-versa).
Infit: inlier-pattern-sensitive fit statistic. This is based on the chi-square statistic with each observation weighted by its statistical information (model variance). This is more sensitive to unexpected patterns of observations by persons on items that are roughly targeted on them (and vice-versa).
Outfit = sum ( residual² / information ) / (count of residuals) = average ( (standardized residuals)² ) = chi-square/d.f. = mean-square
Infit = sum ( (residual² / information) * information ) / sum(information) = average ( (standardized residuals)² * information ) = information-weighted mean-square
Mean-square: this is the chi-square statistic divided by its degrees of freedom. Consequently its expected value is close to 1.0. Values greater than 1.0 (underfit) indicate unmodeled noise or other source of variance in the data; these degrade measurement. Values less than 1.0 (overfit) indicate that the model predicts the data too well, causing summary statistics, such as reliability statistics, to be inflated. See further dichotomous and polytomous mean-square statistics.
Z-Standardized: these report the statistical significance (probability) of the chi-square (mean-square) statistics occurring by chance when the data fit the Rasch model. "Standardized" means "transformed to conform to a unit-normal distribution". The values reported are unit-normal deviates, in which .05 (5%) 2-sided significance corresponds to 1.96. Overfit is reported with negative values. These are also called t-statistics reported with infinite degrees of freedom.
ZSTD probabilities (2-sided):

ZSTD   1.00   1.96   2.00   2.58   3.00    4.00     5.00
p=     .317   .050   .045   .01    .0027   .00006   .0000006
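The probabilities in this table can be reproduced from the unit-normal distribution. The following is a minimal sketch using only the Python standard library; the function name `two_sided_p` is an illustrative choice, not part of any Rasch software.

```python
from math import erfc, sqrt

def two_sided_p(z):
    """Two-sided probability of a unit-normal deviate at least as extreme as z."""
    return erfc(abs(z) / sqrt(2.0))

# Reproduce the tabled ZSTD values.
for z in (1.00, 1.96, 2.00, 2.58, 3.00, 4.00, 5.00):
    print(f"ZSTD {z:4.2f}   p = {two_sided_p(z):.7f}")
```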
Infit was an innovation of Ben Wright's (G. Rasch, 1980, Afterword). Ben noticed that the standard statistical fit statistic (that we now call Outfit) was highly influenced by a few outliers (very unexpected observations). Ben needed a fit statistic that was more sensitive to the overall pattern of responses, so he devised Infit. Infit weights the observations by their statistical information (model variance), which is higher in the center of the test and lower at the extremes. The effect is to make Infit less influenced by outliers, and more sensitive to patterns of inlying observations.
Ben Wright's Infit and Outfit statistics (e.g., RSA, p. 100) are initially computed as mean-square statistics (i.e., chi-square statistics divided by their degrees of freedom). Their likelihood (significance) is then computed. This could be done directly from chi-square tables, but the convention is to report them as unit normal deviates (i.e., t-statistics corrected for their degrees of freedom). I prefer to call them z-statistics, but the Rasch literature has come to call them t-statistics, so now I do too. It is confusing because they are not strictly Student t-statistics (for which one needs to know the degrees of freedom) but are random normal deviates.
General guidelines:
First, investigate negative point-measure or point-biserial correlations. Look at the Distractor Tables, e.g., 10.3. Remedy miskeys, data entry errors, etc.
Then, the general principle is:
Investigate outfit before infit,
mean-square before t standardized,
high values before low or negative values.
There is an asymmetry in the implications of out-of-range high and low mean-squares (or positive and negative t-statistics). High mean-squares (or positive t-statistics) are a much greater threat to validity than low mean-squares (or negative fit statistics).
Poor fit does not mean that the Rasch measures (parameter estimates) aren't additive. The Rasch model forces its estimates to be additive. Misfit means that the reported estimates, though effectively additive, provide a distorted picture of the data.
The fit analysis is a report of how well the data accord with those additive measures. So a MnSq >1.5 suggests a deviation from unidimensionality in the data, not in the measures. So the unidimensional, additive measures present a distorted picture of the data.
High outfit mean-squares may be the result of a few random responses by low performers. If so, drop these performers with PDFILE= when doing item analysis, or use EDFILE= to change those responses to missing.
High infit mean-squares indicate that the items are mis-performing for the people on whom the items are targeted. This is a bigger threat to validity, but more difficult to diagnose than high outfit.
Mean-squares show the size of the randomness, i.e., the amount of distortion of the measurement system. 1.0 are their expected values. Values less than 1.0 indicate observations are too predictable (redundancy, model overfit). Values greater than 1.0 indicate unpredictability (unmodeled noise, model underfit). Mean-squares usually average to 1.0, so if there are high values, there must also be low ones. Examine the high ones first, and temporarily remove them from the analysis if necessary, before investigating the low ones.
Zstd are t-tests of the hypothesis "do the data fit the model (perfectly)?" ZSTD (standardized as a z-score) is used of a t-test result when either the t-test value has effectively infinite degrees of freedom (i.e., approximates a unit normal value) or the Student's t-statistic value has been adjusted to a unit normal value. They show the improbability (significance). 0.0 are their expected values. Values less than 0.0 indicate data that are too predictable. Values more than 0.0 indicate lack of predictability. If mean-squares are acceptable, then Zstd can be ignored. They are truncated towards 0, so that 1.00 to 1.99 is reported as 1. So a value of 2 means 2.00 to 2.99, i.e., at least 2. For exact values, see Output Files. If the test involves fewer than 30 observations, it is probably too insensitive, i.e., "everything fits". If there are more than 300 observations, it is probably too sensitive, i.e., "everything misfits".
Interpretation of parameter-level mean-square fit statistics:

Mean-square   Interpretation
>2.0          Distorts or degrades the measurement system.
1.5 - 2.0     Unproductive for construction of measurement, but not degrading.
0.5 - 1.5     Productive for measurement.
<0.5          Less productive for measurement, but not degrading. May produce misleadingly good reliabilities and separations.
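As a convenience when screening many items or persons, the table can be expressed as a small lookup. This is a sketch only: the function name is hypothetical, and the handling of values exactly on a boundary (1.5 is treated as productive, 2.0 as unproductive) is an assumption, since the table does not say which side the boundaries belong to.

```python
def interpret_mean_square(mnsq):
    """Map a mean-square fit value onto the interpretation table above.

    Boundary handling is an assumption: 0.5 and 1.5 count as productive,
    2.0 counts as unproductive but not degrading.
    """
    if mnsq > 2.0:
        return "distorts or degrades the measurement system"
    if mnsq > 1.5:
        return "unproductive for construction of measurement, but not degrading"
    if mnsq >= 0.5:
        return "productive for measurement"
    return "less productive for measurement, but not degrading"

# Hypothetical outfit mean-squares for four items.
for name, value in [("item1", 0.4), ("item2", 1.0), ("item3", 1.7), ("item4", 2.6)]:
    print(name, value, "->", interpret_mean_square(value))
```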
In general, mean-squares near 1.0 indicate little distortion of the measurement system, regardless of the Zstd value.
Evaluate high mean-squares before low ones, because the average mean-square is usually forced to be near 1.0.
Mean-square fit statistics will average about 1.0, so, if you accept items (or persons) with large mean-squares (low discrimination), then you must also accept the counterbalancing items (or persons) with low mean-squares (high discrimination).
Outfit mean-squares: influenced by outliers. Usually easy to diagnose and remedy. Less threat to measurement.
Infit mean-squares: influenced by response patterns. Usually hard to diagnose and remedy. Greater threat to measurement.
Extreme scores always fit the Rasch model exactly, so they are omitted from the computation of fit statistics. If an extreme score has an anchored measure, then that measure is included in the fit statistic computations.
Anchored runs: Anchor values may not exactly accord with the current data. To the extent that they don't, the fit statistics may be misleading. Anchor values that are too central for the current data tend to make the data appear to fit too well. Anchor values that are too extreme for the current data tend to make the data appear noisy.
Question: Are you contradicting the usual statistical advice about model-data fit?
Statisticians are usually concerned with "how likely are these data to be observed, assuming they accord with the model?" If it is too unlikely (i.e., significant misfit), then the verdict is "these data don't accord with the model." The practical concern is: "In the imperfect empirical world, data never exactly accord with the Rasch model, but do these data deviate seriously enough for the Rasch measures to be problematic?" The builder of my house followed the same approach (regarding Pythagoras' theorem) when building my bathroom. It looked like the walls were square enough for his practical purposes. Some years later, I installed a full-length rectangular mirror. Then I discovered that the walls were not quite square enough for my purposes (so I had to make some adjustments), so there is always a judgment call. The table of mean-squares is my judgment call as a "builder of Rasch measures".
Question: My data contains misfitting items and persons, what should I do?
Let us clarify the objectives here.
A. www.rasch.org/rmt/rmt234g.htm is aimed at the usual situation where someone has administered a test from somewhere to a sample of people, and we, the analysts, are trying to rescue as much of that data as is meaningful. We conservatively remove misfitting items and persons until the data makes reasonable sense. We then anchor those persons and items to their good measures. After reinstating whatever misfitting items and persons we must report, we do the final analysis.
B. A pilot study is wonderfully different. We want to optimize the subset of items. The person sample and the data can be tailored to optimize item selection. Accordingly,
First, even before data analysis, we need to arrange the items into their approximately intended order along the latent variable. With 89 items, the items can be conceptually grouped into clusters located at 5 or more levels of the latent variable, probably more than 5. This defines what we want to measure. If we don't know this order, then we will not know whether we have succeeded in measuring what we intended to measure. We may accidentally construct a test that measures a related variable. This happened in one edition of the MMPI where the test constructors intended to measure depression, but produced a scale that measured "depression+lethargy".
Second, we analyze the data and inspect the item hierarchy. Omit any items that are located in the wrong place on the latent variable. By "omit", I mean give a weight of zero with IWEIGHT=. Then the item stays in the analysis, but does not influence other numbers. This way we can easily reinstate items, if necessary, knowing where they would go if they had been given the usual weight of 1.
Third, reanalyze the data with the pruned item hierarchy. Omit all persons who severely underfit the items; these persons are contradicting the latent variable. Again, "omit" means PWEIGHT= 0. Also omit persons whose "cooperation" is because they have an overfitting response set, such as the middle category of every item.
Fourth, analyze again. The data should be coherent. Items in the correct order. Persons cooperating. So apply all the other selection criteria, such as content balancing, DIF detection, to this coherent dataset.
Question: Should I report Outfit or Infit?
A chi-square statistic is the sum of squares of standard normal variables. Outfit is a chi-square statistic. It is the sum of squared standardized residuals (which are modeled to be standard normal variables). So it is a conventional chi-square, familiar to most statisticians. Chi-squares (including outfit) are sensitive to outliers. For ease of interpretation, this chi-square is divided by its degrees of freedom to have a mean-square form and reported as "Outfit". Consequently I recommend that the Outfit be reported unless there is a strong reason for reporting infit.
In the Rasch context, outliers are often lucky guesses and careless mistakes, so these outlying characteristics of respondent behavior can make a "good" item look "bad". Consequently, Infit was devised as a statistic that down-weights outliers and focuses more on the response string close to the item difficulty (or person ability). Infit is the sum of (squares of standard normal variables multiplied by their statistical information). For ease of interpretation, Infit is reported in mean-square form by dividing the weighted chi-square by the sum of the weights. This formulation is unfamiliar to most statisticians, so I recommend against reporting Infit unless the data are heavily contaminated with irrelevant outliers.
Question: Are mean-square values, >2 etc., sample-size dependent?
The mean-squares are corrected for sample size: they are the chi-squares divided by their degrees of freedom, i.e., sample size. The mean-squares answer "how big is the impact of the misfit". The t-statistics answer "how likely are data like these to be observed when the data fit the model (exactly)." In general, the bigger the sample the less likely, so that t-statistics are highly sample-size dependent. We eagerly await the theoretician who devises a statistical test for the hypothesis "the data fit the Rasch model usefully" (as opposed to the current tests for perfectly).
The relationship between mean-square and z-standardized t-statistics is shown in this plot. Basically, the standardized statistics are insensitive to misfit with less than 30 observations and overly sensitive to misfit when there are more than 300 observations.
Question: For my sample of 2400 people, the mean-square fit statistics are close to 1.0, but the Z-values associated with the INFIT/OUTFIT values are huge (over 4 to 9.9). What could be causing such high values?
Your results make sense. Here is what has happened. You have a sample of 2,400 people. This gives huge statistical power to your test of the null hypothesis: "These data fit the Rasch model (exactly)." In the nomographs above, a sample size of 2,400 (on the right-hand side of the plot) indicates that even a mean-square of 1.2 (and perhaps 1.1) would be reported as misfitting highly significantly. So your mean-squares tell us: "these data fit the Rasch model usefully", and the Z-values tell us: "but not exactly". This situation is often encountered where we know, in advance, that the null hypothesis will be rejected. The Rasch model is a theoretical ideal. Empirical observations never fit the ideal of the Rasch model if we have enough of them. You have more than enough observations, so the null hypothesis of exact model-fit is rejected. It is the same situation with Pythagoras' theorem. No empirical right-angled triangle fits Pythagoras' theorem if we measure it precisely enough. So we would reject the null hypothesis "this is a right-angled triangle" for all triangles that have actually been drawn. But obviously billions of triangles are usefully right-angled.
Example of computation:
Imagine an item with categories j=0 to m. According to the Rasch model, every category has a probability of being observed, Pj.
Then the expected value of the observation is E = sum ( j * Pj )
The model variance (sum of squares) of the probable observations around the expectation is V = sum ( Pj * ( j - E )**2 ). This is also the statistical information in the observation.
For dichotomies, these simplify to E = P1 and V = P1 * P0 = P1*(1-P1).
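The expectation and model variance can be computed directly from the category probabilities. The sketch below follows the two formulas above; the function name and the polytomous probabilities are hypothetical illustrations, and the probabilities are assumed to have already been obtained from the Rasch model.

```python
def expectation_and_variance(probs):
    """Return (E, V) for one observation.

    probs[j] is the model probability Pj of category j, for j = 0..m.
    E = sum(j * Pj); V = sum(Pj * (j - E)**2), the statistical information.
    """
    expected = sum(j * p for j, p in enumerate(probs))
    variance = sum(p * (j - expected) ** 2 for j, p in enumerate(probs))
    return expected, variance

# Dichotomy with P1 = 0.25: E should equal P1 and V should equal P1*(1-P1).
e_dich, v_dich = expectation_and_variance([0.75, 0.25])

# Hypothetical three-category item (categories 0, 1, 2).
e_poly, v_poly = expectation_and_variance([0.2, 0.5, 0.3])
print(e_dich, v_dich, e_poly, v_poly)
```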
For each observation, there is an expectation and a model variance of the observation around that expectation.
residual = observation - expectation
Outfit mean-square = sum (residual**2 / model variance ) / (count of observations)
Infit mean-square = sum (residual**2) / sum (model variance)
Thus the outfit mean-square is the accumulation of squared-standardized-residuals divided by their count (their expectation). The infit mean-square is the accumulation of squared residuals divided by their expectation.
Outlying observations have smaller information (model variance), and so carry less weight in the infit mean-square than on-target observations. If all observations have the same amount of information, the information cancels out. Then Infit mean-square = Outfit mean-square.
For dichotomous data. Two observations: Model p=0.5, observed=1. Model p=0.25, observed=1.
Outfit mean-square = sum ( (obs-exp)**2 / model variance ) / (count of observations) = ((1-0.5)**2/(0.5*0.5) + (1-0.25)**2/(0.25*0.75))/2 = (1 + 3)/2 = 2
Infit mean-square = sum ( (obs-exp)**2 ) / sum(model variance ) = ((1-0.5)**2 + (1-0.25)**2) /((0.5*0.5) + (0.25*0.75)) = (0.25 + 0.5625)/(0.25 + 0.1875) = 1.86. The off-target observation has less influence.
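The two-observation example can be checked with a few lines of code. This is a minimal sketch of the mean-square formulas as stated in this section, not the software's own routine; the function name is hypothetical. Computed without intermediate rounding, the infit comes to about 1.86.

```python
def fit_mean_squares(observations, expectations, variances):
    """Return (infit, outfit) mean-squares for a set of observations.

    outfit = average of squared residuals divided by their model variances
    infit  = sum of squared residuals / sum of model variances
    """
    squared_residuals = [(x - e) ** 2 for x, e in zip(observations, expectations)]
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(squared_residuals)
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

# Dichotomous data: model probabilities of success 0.5 and 0.25, both observed as 1.
p = [0.5, 0.25]
observed = [1, 1]
model_variance = [q * (1 - q) for q in p]  # V = P1 * (1 - P1)

infit, outfit = fit_mean_squares(observed, p, model_variance)
print(f"outfit = {outfit:.2f}, infit = {infit:.2f}")
```

The unexpected success on the off-target item (p=0.25) contributes 3 of the 4 chi-square units to outfit, but is down-weighted in infit.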
The Wilson-Hilferty cube root transformation converts the mean-square statistics to the normally-distributed z-standardized ones. For more information, please see Patel's "Handbook of the Normal Distribution" or www.rasch.org/rmt/rmt162g.htm.
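A sketch of the cube-root transformation, and of why the standardized values depend so strongly on sample size, is given below. It uses the textbook Wilson-Hilferty approximation, z = (MnSq^(1/3) - 1)(3/q) + q/3, where q is the model standard deviation of the mean-square. As a simplifying assumption, q is taken here as sqrt(2/d.f.), which holds for a plain chi-square divided by its degrees of freedom; the q used by the software for a particular item or person may differ, and the function names are hypothetical.

```python
from math import sqrt

def zstd_from_mean_square(mean_square, q):
    """Wilson-Hilferty cube-root approximation to a unit-normal deviate.

    q is the model standard deviation of the mean-square.
    """
    return (mean_square ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0

def q_for_chi_square(df):
    """Assumption: the mean-square is a plain chi-square/d.f., so its s.d. is sqrt(2/df)."""
    return sqrt(2.0 / df)

# The same mean-square of 1.2 at different numbers of observations.
for df in (30, 300, 2400):
    z = zstd_from_mean_square(1.2, q_for_chi_square(df))
    print(f"d.f. = {df:5d}   mean-square = 1.2   z = {z:.2f}")
```

Under these assumptions, a mean-square of 1.2 is unremarkable with 30 observations but highly significant with 2,400, even though the size of the misfit is the same.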
Diagnosing Misfit: Noisy = Underfit. Muted = Overfit

Classification      | INFIT | OUTFIT | Explanation | Investigation
                    | Noisy | Noisy  | Lack of convergence; Loss of precision; Anchoring | Final values in Table 0 large? Many categories? Large logit range? Displacements reported?
Hard Item           | Noisy | Noisy  | Bad item | Ambiguous or negative wording? Debatable or misleading options?
                    | Muted | Muted  | Only answered by top people | At end of test?
Item                | Noisy | Noisy  | Qualitatively different item; Incompatible anchor value | Different process or content? Anchor value incorrectly applied?
                    |       | ?      | Biased (DIF) item | Stratify residuals by person group?
                    |       | Muted  | Curriculum interaction | Are there alternative curricula?
                    | Muted | ?      | Redundant item | Similar items? One item answers another? Item correlated with other variable?
Rating scale        | Noisy | Noisy  | Extreme category overuse | Poor category wording? Combine or omit categories? Wrong model for scale?
                    | Muted | Muted  | Middle category overuse |
Person              | Noisy | ?      | Processing error; Clerical error; Idiosyncratic person | Scanner failure? Form markings misaligned? Qualitatively different person?
High Person         | ?     | Noisy  | Careless; Sleeping; Rushing | Unexpected wrong answers? Unexpected errors at start? Unexpected errors at end?
Low Person          | ?     | Noisy  | Guessing; "Special" knowledge | Unexpected right answers? Systematic response pattern? Content of unexpected answers?
                    | Muted | ?      | Plodding; Caution | Did not reach end of test? Only answered easy items?
Person/Judge Rating | Noisy | Noisy  | Extreme category overuse | Extremism? Defiance? Misunderstanding the rating scale?
                    | Muted | Muted  | Middle category overuse | Conservatism? Resistance?
Judge Rating        |       |        | Apparent unanimity | Collusion? Hidden constraints?

INFIT: information-weighted mean-square, sensitive to irregular inlying patterns
OUTFIT: usual unweighted mean-square, sensitive to unexpected rare extremes
Muted: overfit, un-modeled dependence, redundancy, the data are too predictable
Noisy: underfit, unexpected unrelated irregularities, the data are too unpredictable
Guessing and Carelessness
Both guessing and carelessness can cause high outfit mean-square statistics. Sometimes it is not difficult to identify which is the cause of the problem. Here is a procedure:
1) Analyze all the data: output PFILE=pf.txt IFILE=if.txt SFILE=sf.txt
2) Analyze all the data with anchor values: PAFILE=pf.txt IAFILE=if.txt SAFILE=sf.txt
2A) CUTLO= -2: this eliminates responses to very hard items, so person misfit would be due to unexpected responses to easy items
2B) CUTHI= 2: this eliminates responses to very easy items, so person misfit would be due to unexpected responses to hard items.
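The logic of steps 2A and 2B can be illustrated outside the software. The sketch below is not how Winsteps implements CUTLO= and CUTHI=; it only assumes that a response is set aside when the person measure minus the item measure falls at or beyond the cut value (in logits). The function name, the boundary handling, and the data are all hypothetical.

```python
def apply_cuts(responses, cutlo=None, cuthi=None):
    """Keep only responses whose (person measure - item measure) lies inside the cuts.

    responses: list of (person_measure, item_measure, observation) tuples, in logits.
    cutlo: drop responses where the item is far too hard for the person (gap <= cutlo).
    cuthi: drop responses where the item is far too easy for the person (gap >= cuthi).
    """
    kept = []
    for person_measure, item_measure, observation in responses:
        gap = person_measure - item_measure
        if cutlo is not None and gap <= cutlo:
            continue
        if cuthi is not None and gap >= cuthi:
            continue
        kept.append((person_measure, item_measure, observation))
    return kept

# Hypothetical responses by one person with measure 0.0 logits.
responses = [
    (0.0, 2.5, 1),   # success on a very hard item: possible lucky guess
    (0.0, 0.3, 1),   # on-target item
    (0.0, -2.4, 0),  # failure on a very easy item: possible careless slip
]

without_hard = apply_cuts(responses, cutlo=-2)  # as in step 2A
without_easy = apply_cuts(responses, cuthi=2)   # as in step 2B
print(without_hard)
print(without_easy)
```

If the person's misfit remains after the very hard items are set aside, the unexpected responses are on easy items (carelessness); if it remains after the very easy items are set aside, they are on hard items (guessing).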
What is your primary concern? Statistical fit or productive measurement?
Statistical fit is like "beauty". Productive measurement is like "utility".
Statistical fit is dominated by sample size. It is like looking at the data through a microscope. The more powerful the microscope (= the bigger the sample), the more flaws we can see in each item. For the purposes of beauty, we may well scrutinize our possessions with a microscope. Is that a flaw in the diamond? Is that a crack in the crystal? There is no upper limit to the magnification we might use, so there is no limit to the strictness of the statistical criteria we might employ. The nearer to 1.0 for mean-squares, the more "beautiful" the data.
In practical situations, we don't look at our possessions through a microscope. For the purposes of utility, we are only concerned about cracks, chips and flaws that will impact the usefulness of the items, and these must be reasonably obvious. In terms of mean-squares, the range 0.5 to 1.5 supports productive measurement.
However, life requires compromise between beauty and utility. We want our cups and saucers to be functional, but also to look reasonably nice. So a reasonable compromise for high-stakes data is mean-squares in the range 0.8 to 1.2.
Fit statistics for item banks are awkward. They depend on the manner in which the items in the item banks are used, and also the manner in which the item difficulties are to be verified and updated. Initial values for the items often originate in conventional paper-and-pencil tests, but if the item bank is used to support other styles of testing, these initial values will be superseded by more relevant values. Constructing fit statistics for other styles of testing has proved challenging for theoreticians. This has forced the relaxation of the strict rules of conventional statistical analysis. This is probably why you are having difficulty finding appropriate literature.
My person sample size is 2,000. Many items have significant misfit. What shall I do?
Don't despair! Your items may not be as bad as those statistics say.
If the noise in the data is homogeneous, then the noise-level is independent of sample size. The mean-square statistics will also be independent of sample size.
t-statistics are sensitive to the power of the statistical test (sample size). The relationship between mean-squares and t-statistics is shown in the Figures above which suggests that, for practical purposes, t-statistics are under-powered for sample sizes less than 100 and over-powered for sample sizes greater than 300.
So my recommendation (not accepted by conventional statisticians) is that mean-squares be used in preference to t-statistics. In my view, the standard t-tests are testing the wrong hypothesis. Wrong hypothesis = "The data fit the model (perfectly)". Right hypothesis = "The data fit the model (usefully)". Unfortunately conventional statisticians are not interested in usefulness and so have not formulated t-tests for it.
Alternatively, random-sample 300 from your 2,000 test-takers. Perform your t-tests with this reduced sample. Confirm your findings with another random-sample of 300.
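Drawing the two random samples is straightforward. The sketch below assumes the test-takers are identified by sequence numbers 1 to 2,000; the function name and the seeds are hypothetical, and the selected persons would then be analyzed separately in the measurement software.

```python
import random

def draw_subsample(person_ids, size=300, seed=None):
    """Draw a simple random sample of persons, without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(person_ids, size))

# Hypothetical person identifiers: entry numbers 1..2000.
persons = list(range(1, 2001))

first_sample = draw_subsample(persons, seed=1)    # sample for the initial t-tests
second_sample = draw_subsample(persons, seed=2)   # independent sample for confirmation

print(len(first_sample), len(second_sample))
```

The two samples are drawn independently, so some persons may appear in both; drawing the second sample from the persons not in the first is an alternative if disjoint samples are preferred.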
Help for Winsteps Rasch Measurement Software: www.winsteps.com. Author: John Michael Linacre