A many-facet Rasch analysis of rater effects on an Oral English Proficiency Test
This study investigates how rater severity, and the stability of rater severity over time, affect the scores examinees receive on an Oral English Proficiency Test used to certify International Teaching Assistants at a North American university. Ratings of 434 examinees by 9 raters from the August 2007 test administration and 10 raters from the August 2008 test administration were analyzed using FACETS, a many-facet Rasch analysis program (Linacre, 2008). The study found that the raters differed in severity. However, the impact of rater severity on test scores was small. About 4% of examinees from the two administrations tested out with observed averages higher than their fair averages. The majority of raters used the scale consistently. One rater, however, showed inconsistency, with an infit statistic slightly above the upper control limit of 1.2. For most raters, severity was not invariant across the two sessions held a year apart, but the change was acceptable. Two raters showed more drift in severity than the Rasch model expects. New raters did not show more variation in severity, or in the stability of severity, than experienced raters. Slightly larger gaps between adjacent rating categories were identified, which suggests an opportunity to revise the scale so that raters can better distinguish the oral proficiency levels of the examinees. The study shows that FACETS is a useful tool for studying rater performance. The FACETS results can be used to target individual raters in follow-up rater training to help improve rater accuracy. In this way, examinees, as stakeholders, are protected against errors introduced by human raters.
April Ginther, Purdue University.
Education, Tests and Measurements|Language, Linguistics