Rater discrepancy in the Spanish University Entrance Examination

Authors

  • Marian Amengual Pizarro, University of the Balearic Islands

DOI:

https://doi.org/10.18172/jes.85

Abstract

This study investigated the level of inter-rater and intra-rater reliability of thirty-two raters in the evaluation of the composition subtest of the English Test (ET) in the Spanish University Entrance Examination (SUEE). Raters were asked to evaluate ten compositions holistically in two scheduled data collection sessions (PRE and POST). The results show that, although there are no significant differences between the holistic PRE and POST scores, there is substantial discrepancy across raters in both the consistency and the harshness of their scoring. By contrast, the results reveal that, in general, raters are self-consistent in their evaluation of the compositions. On the basis of these results, and given the impact that SUEE scores have on a candidate's academic career, it is argued that far greater emphasis should be placed on establishing an acceptable level of inter-rater reliability, so as to ensure that SUEE results are as fair and consistent as possible.
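The abstract does not reproduce the statistical procedures used in the study. Purely as an illustrative sketch, with simulated scores standing in for the study's real data, the following Python snippet shows how intra-rater consistency (each rater's agreement between the PRE and POST sessions) and inter-rater agreement (average pairwise correlation among raters within one session) are commonly quantified; the rater and composition counts match those reported above, but everything else is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_raters, n_comps = 32, 10  # thirty-two raters, ten compositions (as in the study)

# Simulated holistic scores on a 0-10 scale for the PRE and POST sessions.
# These are placeholders only; the study's actual scores are not reproduced here.
pre = rng.integers(0, 11, size=(n_raters, n_comps)).astype(float)
post = np.clip(pre + rng.normal(0.0, 1.0, size=pre.shape), 0, 10)

# Intra-rater reliability: correlate each rater's PRE scores with their own POST scores.
intra = [np.corrcoef(pre[r], post[r])[0, 1] for r in range(n_raters)]
print(f"mean intra-rater r = {np.mean(intra):.2f}")

# Inter-rater reliability: average pairwise correlation between raters within one session.
inter = [np.corrcoef(pre[i], pre[j])[0, 1]
         for i in range(n_raters) for j in range(i + 1, n_raters)]
print(f"mean inter-rater r (PRE session) = {np.mean(inter):.2f}")
```

Other indices, such as intraclass correlation coefficients or many-facet Rasch (FACETS) analyses, are also widely used to examine rater consistency and severity.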

Published

29-05-2004

How to Cite

Amengual Pizarro, M. (2004). Rater discrepancy in the Spanish University Entrance Examination. Journal of English Studies, 4, 23–36. https://doi.org/10.18172/jes.85

Section

Articles