Computer Science > Computers and Society
arXiv:2606.18158 (cs)
[Submitted on 16 Jun 2026]
Abstract: Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning, which forms the interpretive core of legal work, rather than the ancillary, paralegal tasks that most current legal-AI evaluations measure. This measurement gap is not only methodological but legal: the EU AI Act makes "appropriate accuracy" a binding requirement for high-risk AI used in the judicial domain, yet that requirement cannot acquire operational content without the very doctrinal-reasoning benchmark the field lacks.
Submission history
From: Michèle Finck
[v1] Tue, 16 Jun 2026 16:57:12 UTC (402 KB)




















