Authors: Chen Zhan, Xihe Qiu, Xiaoyu Tan, Xibing Zhuang, Gengchen Ma, Yue Zhang, Shuo Li, Peifeng Liu, Xiaoxiao Ge, Liang Liu, Lu Gan
Abstract: Large language models perform well on static medical examinations, yet clinical diagnosis often requires iterative evidence gathering under uncertainty. Building on prior interactive evaluation efforts, we introduce an OSCE-inspired standardized patient simulator and a controlled, reproducible benchmark for active diagnostic inquiry. Across 468 cases and 15 models in our protocol, we observe that multi-turn evidence seeking reduces diagnostic accuracy by 12.75% and lowers supporting-evidence quality by 24.36% relative to full-context evaluation; error analyses associate these drops with premature diagnostic closure and inefficient questioning. Together, these results suggest that static full-context benchmarks may overestimate performance in interactive evidence-seeking settings, motivating complementary interactive assessment for safer clinical decision support.
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2605.22047 [cs.AI] (or arXiv:2605.22047v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.22047 (arXiv-issued DOI via DataCite, pending registration)
From: Gengchen Ma
[v1] Thu, 21 May 2026 06:34:50 UTC (3,427 KB)