Abstract
This study examines the effectiveness of three rater training modalities for assessing L2 writing: (1) asynchronous video training, (2) synchronous discussion and negotiation, and (3) a hybrid approach combining both. The research explores how these methods affect inter-rater reliability, score consistency, and accuracy in assessing essays written by students in Japanese EFL secondary classes. Employing a mixed-methods design, the study combines quantitative measures with semi-structured interviews and also examines the effect of teaching experience on raters’ assessments. Six Japanese secondary English teachers (experienced raters) and six Japanese university students (novice raters) assessed 40 writing samples from Japanese students using a 3-point holistic and analytic scale across three stages: pre-training (10 essays), during training (20 essays), and post-training (10 essays). Raters were divided into three training groups: asynchronous video, synchronous discussion and negotiation, and hybrid. We used Krippendorff’s alpha to evaluate inter-rater reliability and repeated-measures ANOVA to analyze score differences and accuracy across training methods and experience levels. Our findings suggest that hybrid training produced higher inter-rater reliability and that less experienced raters achieved greater gains in scoring accuracy. Interview data revealed that both teachers and students appreciated receiving benchmark scores in advance, which facilitated more practical discussions rather than a reliance on instinct or experience alone. However, feedback highlighted limitations of the video training, particularly raters’ inability to address specific concerns and clarify interpretations of the rubric descriptions. These findings highlight the need to revise rater training programs by incorporating more personalized and interactive methods to optimize educational assessment outcomes.
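A minimal, illustrative sketch of this analysis pipeline is given below, assuming Python with the third-party krippendorff and statsmodels packages. It is not the study's actual code: the file name, column names, and the accuracy measure (mean absolute deviation from benchmark scores) are assumptions, and the repeated-measures ANOVA shown covers only the within-rater stage factor; comparisons across training groups or experience levels would require a mixed-design analysis.

```python
# Illustrative sketch only; file name, column names, and accuracy measure are hypothetical.
import pandas as pd
import krippendorff
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format ratings: one row per rater x essay x stage,
# columns: rater, essay, stage, group, experience, score, benchmark
ratings = pd.read_csv("ratings_long.csv")

# Inter-rater reliability (Krippendorff's alpha) for one training group at one stage.
# krippendorff.alpha expects a (raters x items) matrix; the ordinal level suits a 3-point scale.
subset = ratings.query("group == 'hybrid' and stage == 'post'")
matrix = subset.pivot(index="rater", columns="essay", values="score").to_numpy()
alpha = krippendorff.alpha(reliability_data=matrix, level_of_measurement="ordinal")
print(f"Krippendorff's alpha (hybrid, post-training): {alpha:.3f}")

# Repeated-measures ANOVA on per-rater scoring accuracy across stages,
# where accuracy is taken here as mean absolute deviation from benchmark scores.
accuracy = (
    ratings.assign(abs_dev=(ratings["score"] - ratings["benchmark"]).abs())
    .groupby(["rater", "stage"], as_index=False)["abs_dev"]
    .mean()
)
anova = AnovaRM(data=accuracy, depvar="abs_dev", subject="rater", within=["stage"]).fit()
print(anova.summary())
```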