• André A. Rupp
• Jodi M. Casabianca
• Maleika Krüger
• Stefan Keller
• Olaf Köller
In this research report, we describe the design and empirical findings of a large‐scale study of essay writing ability involving approximately 2,500 high school students in Germany and Switzerland. The study was based on two tasks, each with two associated prompts, from a standardized writing assessment whose scoring involved both human and automated components. For the human scoring aspect, we describe the methodology for training and monitoring human raters as well as for collecting their ratings within a customized platform. For the automated scoring aspect, we describe the methodology for training, evaluating, and selecting appropriate automated scoring models, along with correlational patterns between the resulting task scores and scores from secondary measures. Analyses show that the human ratings were highly reliable and that effective prompt‐specific automated scoring models could be built with state‐of‐the‐art features and machine learning methods, yielding correlational patterns with secondary measures that were in line with general expectations. In closing, we discuss the methodological implications for conducting this kind of work at scale in the future.
Original language: English
Title of host publication: TOEFL Research Report No. RR-86 and ETS Research Report Series No. RR-19-12
Number of pages: 21
Place of publication: Princeton, NJ
Publisher: Educational Testing Service
Publication date: 03.2019
Publication status: Published - 03.2019