

















Abstract: Despite the growing promise of large language models (LLMs) in automated essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLM-generated scores and human ratings. Agreement levels varied substantially both across and within studies, with reported values spanning a wide range. Overall, the findings suggest that LLM-human agreement is highly context-dependent. Implications, challenges, and directions for future research are discussed.
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2512.14561 [cs.CL] |
| | (or arXiv:2512.14561v2 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2512.14561 (arXiv-issued DOI via DataCite) |
From: Hongli Li
[v1] Tue, 16 Dec 2025 16:33:07 UTC (506 KB)
[v2] Mon, 25 May 2026 21:01:24 UTC (806 KB)