
























Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse applications, from open-domain question answering to scientific writing, medical decision support, and legal analysis. However, their tendency to generate incorrect or fabricated content, commonly known as hallucinations, represents a critical barrier to reliable deployment in high-stakes domains. Current hallucination detection benchmarks are limited in scope, focusing primarily on general-knowledge domains while neglecting specialised fields where accuracy is paramount. To address this gap, we introduce the AIME Math Hallucination dataset, the first comprehensive benchmark specifically designed for evaluating mathematical reasoning hallucinations. Additionally, we propose SelfCheck-Eval, a LLM-agnostic, black-box hallucination detection framework applicable to both open and closed-source LLMs. Our approach leverages a novel multi-module architecture that integrates three independent detection strategies: the Semantic module, the Specialised Detection module, and the Contextual Consistency module. Our evaluation reveals systematic performance disparities across domains: existing methods perform well on biographical content but struggle significantly with mathematical reasoning, a challenge that persists across NLI fine-tuning, preference learning, and process supervision approaches. These findings highlight the fundamental limitations of current detection methods in mathematical domains and underscore the critical need for specialised, black-box compatible approaches to ensure reliable LLM deployment.
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。