























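MTCMB is a multi-task benchmark developed in collaboration with certified TCM experts. It comprises 12 sub-datasets spanning five categories: knowledge QA, language understanding, diagnostic reasoning, prescription generation, and safety evaluation. The benchmark integrates real-world case records, national licensing exams, and classical texts, giving large language models in the TCM domain an authentic and comprehensive test environment. All datasets, code, and evaluation tools are publicly available on GitHub.
Current LLMs perform well on foundational TCM knowledge QA but show clear weaknesses in clinical reasoning, prescription planning, and safety compliance. The pronounced gap between foundational knowledge and advanced application indicates that domain-aligned benchmarks are needed to guide the development of more trustworthy medical AI systems. MTCMB fills the gap left by existing benchmarks, which lack domain-specific tasks and clinical realism.
TCM poses major challenges for computational modeling and evaluation because of its implicit reasoning, diverse textual forms, and lack of standardization. Although LLMs have shown potential in general medicine, their systematic evaluation in the TCM domain remains underdeveloped. Through its multi-task design and expert involvement, MTCMB addresses the problem of single-dimension evaluation and offers concrete directions for improving model capabilities.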
Traditional Chinese Medicine (TCM) is a holistic medical system with millennia of accumulated clinical experience, playing a vital role in global healthcare, particularly across East Asia. However, the implicit reasoning, diverse textual forms, and lack of standardization in TCM pose major challenges for computational modeling and evaluation. Large Language Models (LLMs) have demonstrated remarkable potential in processing natural language across diverse domains, including general medicine. Yet their systematic evaluation in the TCM domain remains underdeveloped. Existing benchmarks either focus narrowly on factual question answering or lack domain-specific tasks and clinical realism. To fill this gap, we introduce MTCMB, a Multi-Task Benchmark for Evaluating LLMs on TCM Knowledge, Reasoning, and Safety. Developed in collaboration with certified TCM experts, MTCMB comprises 12 sub-datasets spanning five major categories: knowledge QA, language understanding, diagnostic reasoning, prescription generation, and safety evaluation. The benchmark integrates real-world case records, national licensing exams, and classical texts, providing an authentic and comprehensive testbed for TCM-capable models. Preliminary results indicate that current LLMs perform well on foundational knowledge but fall short in clinical reasoning, prescription planning, and safety compliance. These findings highlight the urgent need for domain-aligned benchmarks like MTCMB to guide the development of more competent and trustworthy medical AI systems. All datasets, code, and evaluation tools are publicly available at: https://github.com/Wayyuanyuan/MTCMB.
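To make the five-category structure concrete, the following is a minimal sketch of how per-sub-dataset scores might be rolled up into per-category averages, so that a gap between knowledge QA and the other categories becomes visible. It is not taken from the MTCMB repository: the category keys, the function name `category_scores`, the sub-dataset names, and the assumption that every sub-dataset yields a single score in [0, 1] are all hypothetical, and the scores in the usage lines are made-up placeholders, not reported results.

```python
from collections import defaultdict

# Hypothetical keys for the five categories named in the abstract.
CATEGORIES = (
    "knowledge_qa",
    "language_understanding",
    "diagnostic_reasoning",
    "prescription_generation",
    "safety_evaluation",
)


def category_scores(results):
    """Average sub-dataset scores within each category.

    `results` is an iterable of (category, sub_dataset, score) tuples,
    where score is assumed to be a float in [0, 1].
    """
    by_category = defaultdict(list)
    for category, _sub_dataset, score in results:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        by_category[category].append(score)
    # Unweighted mean over the sub-datasets of each category.
    return {c: sum(s) / len(s) for c, s in by_category.items()}


# Placeholder scores for illustration only (not results from the paper).
example = [
    ("knowledge_qa", "subset_a", 1.0),
    ("knowledge_qa", "subset_b", 0.5),
    ("safety_evaluation", "subset_c", 0.25),
]
summary = category_scores(example)
```

An unweighted mean treats every sub-dataset equally regardless of size; weighting by item count would be an equally reasonable choice, depending on how the benchmark's official tooling reports results.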