Comparative judgement for L2 writing assessment: are expert judges necessary?

Thwaites, Peter; Vandeweerd, Nathan; Paquot, Magali
(2023) BAAHE 2023, Leuven, Belgium (17 November 2023)

Abstract
Comparative Judgement (CJ) is a method of assessment in which judges decide which of two student productions is “better”. CJ has been tested in a wide range of contexts and has generally been shown to provide reliable and valid evaluations, in particular of complex constructs that are difficult to assess using traditional methods (e.g. Jones et al., 2019), while also offering gains in efficiency. More recently, several studies have suggested that these benefits might also apply to the use of CJ for L2 writing assessment. For example, Sims et al. (2020) found that CJ provided levels of reliability and validity nearly identical to those of rubric-based assessment of a set of L2 argumentative essays, while also saving almost one minute per essay.

Among the complexities around CJ is the role of judge expertise. Sims et al.’s study made use of both novice judges (undergraduate TESOL students) and experts (experienced teachers and essay graders), and reported that the two sets of judges returned similarly reliable and valid judgements. Findings such as these raise the possibility that novice judges, by virtue of their greater availability and lower cost relative to experts, might be seen as a tempting way for users of CJ to speed up their studies and cut their costs. However, since the wider research on novice judges remains inconclusive (Bartholomew & Jones, 2022), more studies are needed to ensure that this attractive shortcut does not become a kind of complicity: an agreement to compromise validity in favour of economy.

In this presentation, we explore the efficacy of novice judges by partially replicating a study by Paquot et al. (2022) in which experts (members of the linguistic community recruited through crowdsourcing) were asked to provide comparative judgements of 50 L2 English argumentative essays from the ETS corpus. Paquot et al. found that these judges were able to evaluate the ETS texts to a very high level of reliability, and with significant overlap with prior rubric-based evaluations. In our replication, we recruited a diverse group of participants, largely lacking in experience and expertise relevant to the assessment of L2 writing, through the Prolific crowdsourcing platform. We then asked these judges to provide comparisons of the same set of texts used in Paquot et al.’s study. Contrasting the results of the two grading sessions in terms of their reliability and overlap, we found that the novice judges provided high-quality assessments, with a very high level of reliability and strong correlations both with the judgements made by Paquot et al.’s experts and with prior rubric-based scores. The results of the study will be of primary interest to the field of learner corpus research, where crowdsourced CJ offers potential for reliable text assessment (Paquot et al., 2022). More generally, by exploring the role of judge expertise, they contribute to discussions regarding the feasibility of CJ for second language academic writing assessment.

Citations

Thwaites, P., Vandeweerd, N., & Paquot, M. (2023). Comparative judgement for L2 writing assessment: are expert judges necessary? BAAHE 2023, Leuven, Belgium. https://hdl.handle.net/2078.5/269132