General-purpose large language models outperform specialized clinical AI tools on medical benchmarks
Oermann, Eric Karl · 2026-06-14 · via Hacker News - Newest: "AI"
  • Brief Communication
  • Open access

Nature Medicine (2026)

Abstract

Specialized clinical artificial intelligence (AI) tools are entering medical practice despite scarce independent evaluation. We quantitatively evaluate two clinical AI tools, OpenEvidence and UpToDate Expert AI, built on large language models (LLMs) against three frontier LLMs: GPT-5.2, Gemini 3.1 Pro and Claude Opus 4.6. Our evaluation has three stages: (1) 500 MedQA questions testing medical knowledge, (2) 500 HealthBench items measuring alignment with clinicians and (3) the real clinical queries (RCQ) benchmark, built from 100 de-identified queries from physicians to a general-purpose language model in a live clinical environment. For the RCQ benchmark, 12 US clinicians performed randomized, blinded review of model outputs, producing 1,800 model–question annotations. Frontier LLMs outperformed clinical AI tools in all three evaluations. Clinical AI tools performed comparably to auto-enabled Google Search AI Overview on the RCQ. These findings highlight the need for independent, real-world evaluation of AI tools before they enter clinical settings.

Main

Specialized clinical artificial intelligence (AI) tools are entering medical practice at scale1,2. These proprietary large language model (LLM)-based tools promise superior clinical performance to general-purpose frontier LLMs as a result of domain-specific training or retrieval-augmented generation (RAG)3. Yet, their architectures, base models and training pipelines are not public. Clinicians and health systems must therefore assess their value and safety without independent evidence. Conversely, large training corpora and extensive alignment of frontier LLMs may enable them to challenge clinical AI tools without domain-specific modification. We test this hypothesis by comparing clinical AI tools (OpenEvidence1 and UpToDate Expert AI2) to leading general-purpose LLMs (OpenAI GPT-5.2, Google Gemini 3.1 Pro Preview and Anthropic Claude Opus 4.6). Later, we include auto-enabled Google Search AI Overview as a real-world control frequently encountered by physicians.

Our evaluation (Fig. 1) has three stages: (1) 500 US Medical Licensing Examination-style MedQA4 questions assessing medical knowledge, (2) 500 HealthBench5 items evaluating agreement with expert clinicians and (3) 100 real clinical queries (RCQ) drawn from physician LLM queries during live clinical deployment. The RCQ stage underwent randomized, blinded review by 12 US clinicians, producing 1,800 model–question annotations. The combined analysis spans multiple-choice reasoning, expert clinical judgment and everyday clinician use.

Fig. 1: Clinical LLM evaluation pipeline.

Schematic overview of the comparative analysis between frontier, clinical-specific and search-embedded AI models. The framework integrates automated scoring (MedQA and HealthBench) with high-fidelity blinded and randomized clinician reviews (RCQ) to assess model performance across accuracy, safety and reliability metrics. N, no; USMLE, US Medical Licensing Examination; Y, yes. Blindfold/blinded icon from Tailwind Labs under an MIT license (©Tailwind Labs); other icons from React Icons under a CC BY 4.0 (brain, stethoscope, book, bar chart, doctor, justice scale, chat, shield) or Apache 2.0 (browser, API badge).

General-purpose LLMs outperformed clinical AI tools on the MedQA questions (Fig. 2a and Extended Data Fig. 1a,b). Among frontier LLMs, Gemini achieved the highest accuracy at 97.4% (95% confidence interval (CI) 95.6%–98.5%), followed by GPT at 94.2% (91.8%–95.9%) and Claude at 90.2% (87.3%–92.5%). Clinical tools scored lower, with OpenEvidence achieving an accuracy of 89.6% (86.6%–92.0%) and UpToDate achieving 88.4% (85.3%–90.9%). Gemini outperformed all other models (McNemar P < 1 × 10⁻⁴ versus OpenEvidence, UpToDate and Claude; P = 0.02 versus GPT). GPT outperformed OpenEvidence (P = 0.008), UpToDate (P = 0.0004) and Claude (P = 0.04).
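The accuracy intervals can be checked by hand. The following is a minimal Python sketch, assuming the Wilson score interval named in the Methods and assuming that 97.4% of 500 questions corresponds to 487 correct answers (the count is inferred from the reported percentage, not stated in the text).

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Assumed count: 487 of 500 correct (97.4%).
low, high = wilson_interval(487, 500)
print(f"{low:.1%} to {high:.1%}")  # 95.6% to 98.5%
```

Under these assumptions the sketch reproduces the 95.6%–98.5% interval reported for Gemini.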

Fig. 2: Comparative evaluation of AI systems across benchmark performance and real-world clinical use.

a, MedQA accuracy (n = 500 questions per model). b, HealthBench score (n = 500 questions per model). c, RCQ mean aggregate clinician rating (1–4-point scale). Each of n = 100 clinical questions was answered by all 6 models and independently rated by 3 of 12 clinicians; 32 question–model pairs identified as refusals were excluded, yielding n = 98 (Gemini), 97 (GPT-5.2), 99 (Claude), 99 (OpenEvidence), 81 (UpToDate) and 94 (Google AI) non-refusal items per model. d, RCQ scores disaggregated by evaluation dimension; sample sizes as in c. e, Refusal rate (n = 100 questions per model). f,g, Proportion flagged for harmful content (f) or hallucination by majority vote (≥2 of 3 raters) (g); sample sizes as in c. The unit of observation is the individual question (a,b,e) or the question–model–rater evaluation (c,d,f,g); each question represents an independent observation and individual raters serve as independent evaluators (biological replicates). No technical replicates were used. No control group was designated, as the study compares all models against one another. In a–d, data are presented as mean ± 95% CI (Wilson score interval in a; mean ± 1.96 × s.e.m. in b–d). Italic letters denote compact letter display (CLD) significance groups. Models sharing a letter do not differ significantly (adjusted P > 0.05). CLD groups were derived from two-sided pairwise McNemar tests with Holm–Bonferroni correction in a, two-sided Wilcoxon signed-rank tests with Holm–Bonferroni correction in b and two-sided Nemenyi post hoc tests following Friedman test (χ² = 93.65, d.f. = 5, P = 1.15 × 10⁻¹⁸) in c. Higher values indicate better performance in a–d; lower values indicate better performance in e–g.

HealthBench (Fig. 2b) was graded by a panel of LLM judges to mitigate single-model bias. Scores reflect the proportion of rubric points achieved, scaled 0–100. GPT scored highest at 88.0 (95% CI 85.9–90.1), followed by Gemini at 79.3 (76.6–81.9) and Claude at 77.0 (74.2–79.9); both clinical tools scored lower, with OpenEvidence at 62.6 (59.3–65.9) and UpToDate at 61.3 (58.0–64.6). GPT outperformed all other models (Wilcoxon P < 10⁻⁹), and the two clinical tools did not differ (P = 0.6). In theme-level analysis (Extended Data Fig. 1c,d), GPT ranked first or tied for first in all seven categories, while OpenEvidence and UpToDate ranked lowest or tied for lowest in all seven categories, with differences from GPT significant in six of the categories (P ≤ 0.004; exception: responding under uncertainty, P = 1.00).

To develop the RCQ benchmark, we sampled 100 anonymous clinician queries to NYU Langone's Health Insurance Portability and Accountability Act (HIPAA)-compliant GPT instance. Twelve blinded clinicians scored six models’ responses across four dimensions (clinical correctness, completeness, safety/harm avoidance and clarity) on a 1–4-point scale (Extended Data Fig. 2), with three raters randomly assigned to each response. We included Google Search AI Overview in the RCQ evaluation because it is routinely encountered by clinicians. After excluding 32 refusals, 568 responses remained.

The six models differed significantly (Friedman P < 10⁻⁹), with two performance tiers emerging (Fig. 2c). Frontier LLMs formed the first: Gemini (mean aggregate 3.62; 95% CI 3.56–3.68), GPT (3.54; 3.47–3.61) and Claude (3.52; 3.44–3.59), with no significant differences between them. Clinical tools and Google AI Overview followed: OpenEvidence (3.24; 3.17–3.32), UpToDate AI (3.17; 3.09–3.25) and Google AI Overview (3.27; 3.18–3.35), also without significant differences. All nine significant pairwise comparisons were between tiers (rank-biserial r = 0.5–0.9), meaning frontier models outperformed clinical tools on most individual questions, not just on average. After adjusting for rater leniency, clinical AI tools (including Google AI) had 49–87% lower odds of receiving a higher rating than Gemini (odds ratio 0.13–0.51; all P < 0.0001). In a sensitivity linear mixed model, this corresponded to 0.36–0.44 points lower on the 1–4-point scale (all P < 0.0001). Google AI Overview scored as well as or better than OpenEvidence and UpToDate AI across all dimensions (Extended Data Fig. 3).
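The "lower odds" percentages follow directly from the odds ratios: an odds ratio below 1 corresponds to a reduction in odds of (1 − odds ratio) × 100%. A minimal Python sketch of that conversion (the function name is illustrative only):

```python
def percent_lower_odds(odds_ratio: float) -> float:
    """Convert an odds ratio below 1 into a percentage reduction in odds."""
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    return (1.0 - odds_ratio) * 100.0


# The two ends of the reported odds-ratio range.
for or_value in (0.13, 0.51):
    print(f"OR {or_value:.2f} -> {percent_lower_odds(or_value):.0f}% lower odds")
```

The two ends of the reported range, 0.13 and 0.51, map to 87% and 49% lower odds, respectively.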

The tier structure held across all four dimensions (Fig. 2d). Models differed most on clarity (Kendall’s W = 0.292) and least on clinical correctness (W = 0.141). OpenEvidence scored lowest on clarity (mean 2.84), suggesting its weakness was communication, not knowledge. Qualitatively, incomplete clinical content, safety-critical omissions and disorganized responses were common, particularly for OpenEvidence and Google AI Overview (Extended Data Table 1). UpToDate AI refused 19% of queries (Fig. 2e), more than all other models (1–3%; P < 0.01) except Google AI Overview (6%; P = 0.10). Safety outcomes (Fig. 2f,g) did not differ across models: none of the models produced more harmful content (Cochran’s Q = 4.00, P = 0.55) or hallucinations (Q = 5.00, P = 0.42) than any of the others. All 12 clinicians ranked the models similarly (Kendall’s W = 0.651, P = 2.3 × 10⁻⁷), placing frontier LLMs above clinical tools (Extended Data Fig. 4).
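Kendall's W is tied to the Friedman statistic by W = χ² / (n(k − 1)) for n questions and k models, so the two can be computed together. The sketch below is a plain-Python illustration on toy data, not the study's code; it omits the tie correction that a full analysis would normally apply, and the function names are hypothetical.

```python
def average_ranks(row):
    """Rank values within one row (1 = lowest); tied values share their mean rank."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_and_kendall_w(scores):
    """scores[i][j] is the rating of model j on question i.

    Returns (Friedman chi-square without tie correction, Kendall's W).
    """
    n = len(scores)
    k = len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    w = chi2 / (n * (k - 1))
    return chi2, w


# Toy data: three questions on which three models are always ordered the same way.
chi2, w = friedman_and_kendall_w([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
print(round(chi2, 6), round(w, 6))  # 6.0 1.0
```

Perfect agreement in the ordering gives W = 1, whereas values near 0 indicate that the ordering of models varies freely from question to question.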

This study is an independent, quantitative comparison of clinical AI tools against frontier LLMs using real-world physician queries from the course of care. Clinical AI tools lagged behind frontier models on every evaluation: knowledge, expert alignment and real-world clinical use across multiple dimensions. Google AI Overview, an auto-enabled search feature, matched clinical AI tools in this benchmark.

Because the architecture of proprietary clinical AI tools is inaccessible, it is impossible to definitively establish a mechanistic explanation for their underperformance against general-purpose models. Evidence shows that RAG, which is likely employed by both OpenEvidence1 and UpToDate Expert AI2, may actually negatively affect model performance when irrelevant material is retrieved or poorly integrated by the base model6,7,8. Frontier LLMs may simply be better at the knowledge retrieval and reasoning that characterize most medical questions9. They also benefit from faster iteration cycles, larger training corpora and greater alignment than specialist systems. The observed advantages of frontier general-purpose models may reflect the accelerated development of, and investment in, these systems. Should scaling returns diminish, the relative value of domain-specific tuning, curated retrieval and clinician-in-the-loop optimization may increase. Our results should therefore be interpreted as a snapshot of a rapidly evolving landscape rather than a permanent ordering of approaches. In particular, deeply subspecialized medical tasks may favor more sophisticated, domain-specific adaptation9,10,11.

This study has several limitations. Clinical tools lack public application programming interfaces (APIs), so they were queried through browser interfaces, which limited sample size and may have introduced differences in hidden prompts, retrieval behavior and output formatting. Standardized benchmarks have known issues such as data leakage7; models may have been exposed to MedQA or HealthBench during training, though our RCQ benchmark is free from this contamination. HealthBench is an OpenAI-developed benchmark that relies on a small number of physicians for each rubric, and public documentation provides limited detail on its construction and evaluation5. Evaluation of OpenAI models, including the highest-scoring model on HealthBench, GPT-5.2, may be influenced by potential benchmark–developer overlap, including potential similarities in training data, optimization objectives or rubric design. Grading bias is also possible, as frontier models served as both evaluated systems and judges, although we used a multimodel panel to mitigate this effect. Accordingly, we view the blinded clinician evaluation on the RCQ benchmark as the primary evidence in this study, while HealthBench should be interpreted as supplementary.

More broadly, industry-created benchmarks may systematically favor the systems developed by their creators, reinforcing the need for independently constructed evaluation instruments. The RCQ benchmark partially addresses this concern: it is derived from real clinical queries, evaluated by blinded clinicians and free from training-set contamination. Additionally, recently proposed safety-focused evaluations of LLM medical recommendations such as the NOHARM12 framework suggest that knowledge and communication benchmarks may not fully capture clinical risk. Related work also points to health-system-grounded evaluation frameworks, such as institution-specific operational tasks and prediction settings embedded in local clinical workflows, as an important complement to public, industry-authored benchmarks, because they may better capture whether a model is clinically useful in a given care environment13,14.

Finally, our evaluation did not assess response latency or citation quality. These factors are important for real-world clinical deployment and workflow integration, and may differ substantially between API-accessed frontier models and subscription-based clinical tools (Extended Data Table 2). Future work should systematically compare these practical dimensions alongside accuracy and safety.

Clinical AI tools may carry institutional legitimacy and are likely safe for routine use, but our results show that they are not superior to frontier models on knowledge, communication or clinical alignment. The superior performance of frontier models in our study suggests that scale, alignment and cross-domain reasoning may outweigh domain-specific tuning as determinants of medical competency for particular tasks, a finding with implications for procurement, reimbursement and regulatory oversight. The path forward may ultimately lie with hospital-specific LLMs that leverage institutional data13,14 to mitigate external harm15, along with careful use of frontier models for less-sensitive tasks16. As generative LLMs become integrated into healthcare at the enterprise, individual clinician and consumer levels, there is an increasing need for rigorous, independent evaluation on real-world tasks.

Methods

Study approval and benchmark construction

This study was approved by the NYU Langone Institutional Review Board (i23-00510). We randomly sampled (seed = 62) 500 US Medical Licensing Examination-style questions from MedQA4 and 500 single-turn prompts from HealthBench5. The 1,000 items were used to evaluate three frontier LLMs via API: GPT-5.2 (accessed February 2026, GPT-5.2-2025-12-11), Gemini 3.1 Pro Preview (accessed February 2026) and Claude Opus 4.6 (accessed February 2026). Two clinical tools, OpenEvidence (accessed September 2025 and February 2026) and UpToDate Expert AI (accessed November 2025 and February 2026), were queried manually through browser interfaces.
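Seeded sampling of this kind can be sketched as follows. This is an illustration only: the text does not specify which sampling routine or random number generator was used, so the sketch shows the principle of a reproducible draw and would not be expected to reproduce the exact study subsets (the linked code repository is the authority for that). The pool of identifiers and the function name are hypothetical.

```python
import random


def sample_subset(item_ids, k=500, seed=62):
    """Draw a reproducible random subset of k item identifiers."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return rng.sample(list(item_ids), k)


# Hypothetical pool of identifiers standing in for a benchmark's question IDs.
pool = range(5000)
subset_a = sample_subset(pool)
subset_b = sample_subset(pool)
print(subset_a == subset_b, len(subset_a))  # True 500
```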

Generation parameters

Frontier model generations were conducted using fixed, deterministic parameters. Search tools were enabled for all runs. The temperature was set to 0.0 to eliminate sampling variability, and a fixed generation seed of 62 was used. Although matching response length or format across systems would reduce one source of variability, we opted against this approach because output structure, verbosity and formatting are integral to each system’s clinical interface and directly influence dimensions such as clarity; normalizing these features would obscure real usability differences that clinicians encounter in practice.

MedQA scoring

GPT-4.1 (gpt-4.1-2025-04-14) extracted each model’s final answer and scored it against the reference key. Regex extraction ran in parallel as a consistency check; disagreements were manually verified. Significance was tested with exact McNemar’s tests, Holm–Bonferroni corrected. Results are reported as accuracies with Wilson’s 95% CI.
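For illustration, an exact McNemar test and the Holm–Bonferroni adjustment can be written in a few lines of standard-library Python. This is a sketch under stated assumptions, not the study's implementation: the discordant counts passed in are hypothetical, and the function names are illustrative.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value.

    b = questions model A answered correctly and model B incorrectly;
    c = questions model B answered correctly and model A incorrectly.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def holm_bonferroni(p_values):
    """Holm step-down adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the (rank+1)-th smallest P value by (m - rank), keep monotone.
        running_max = max(running_max, (m - rank) * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Hypothetical discordant counts for one pair of models.
print(mcnemar_exact(0, 5))  # 0.0625
```

Only the discordant questions (those on which exactly one of the two models is correct) enter the test; questions that both models answer correctly or both answer incorrectly carry no information about the difference.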

HealthBench scoring

Responses were graded on the proportion of rubric points achieved across five axes: accuracy, completeness, communication quality, context awareness and instruction following. Proportions of axes met are reported with Wilson’s 95% CI. Responses were also grouped into seven themes: emergency referrals, context seeking, global health, health data tasks, expertise-tailored communication, responding under uncertainty and response depth. Grading used panel-majority voting across three judges (Claude Opus 4.6, Gemini 3.1 Pro Preview and GPT-5.2); the judge prompt is provided in Extended Data Fig. 5 (ref. 5). Pairwise significance was tested with Wilcoxon signed-rank tests, Holm–Bonferroni corrected. Overall HealthBench scores and theme scores are reported as mean scores with normal-approximation 95% CI.
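Panel-majority grading of a rubric can be sketched as below. The scoring convention here is an assumption made for illustration (earned points divided by the total of positive points, clipped to 0–1 and scaled to 0–100, with negative-point criteria acting as penalties); the rubric, votes and function names are hypothetical and are not taken from the study.

```python
def majority_met(votes):
    """True if a strict majority of judges marked the criterion as met."""
    return sum(bool(v) for v in votes) * 2 > len(votes)


def rubric_score(criteria):
    """criteria: list of (points, votes) pairs for one response.

    Assumed convention: earned points over the total of positive points,
    clipped to [0, 1] and scaled to 0-100.
    """
    possible = sum(points for points, _ in criteria if points > 0)
    if possible == 0:
        raise ValueError("rubric needs at least one positive-point criterion")
    earned = sum(points for points, votes in criteria if majority_met(votes))
    return 100.0 * min(1.0, max(0.0, earned / possible))


# Hypothetical rubric: three criteria, each judged by three LLM judges.
example = [
    (5, [True, True, False]),     # met by majority
    (3, [False, False, True]),    # not met
    (-2, [False, False, False]),  # penalty criterion, not triggered
]
print(rubric_score(example))  # 62.5
```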

Real clinical queries

We sampled 100 de-identified queries from the NYU Langone Health Insurance Portability and Accountability Act-compliant GPT instance. Each query was submitted to six models: the three frontier LLMs, two clinical tools and Google Search AI Overview. Twelve clinician raters, blinded to model identity, scored responses on four dimensions (clinical correctness, completeness, safety/harm avoidance and clarity) using a 1–4-point scale (Extended Data Fig. 2), with binary flags for harmful content and hallucination. Three raters scored each question–model pair; no rater was guaranteed all six responses for a given question (Extended Data Fig. 6). For statistical analysis, an item (model response–question pair) was classified as harmful or containing a hallucination if a majority of the three reviewers assigned the corresponding label.

RCQ statistical analysis

Refusals were flagged and manually verified; if any rater flagged a question–model pair as a refusal, all ratings for that pair were excluded, yielding 1,704 ratings across 568 items (32 refusals discarded). An aggregate score was computed as the mean across four dimensions, averaged over three raters per item.
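The exclusion rule, the aggregate score and the majority-vote flags can be sketched as follows. The field names and the example ratings are hypothetical; only the rules themselves (any refusal flag excludes the pair, the aggregate is the mean of four dimensions averaged over three raters, and a binary flag needs at least two of three raters) come from the text.

```python
from statistics import mean

DIMENSIONS = ("correctness", "completeness", "safety", "clarity")


def aggregate_item(ratings):
    """ratings: list of per-rater dicts for one question-model pair.

    Returns None if any rater flagged a refusal; otherwise the mean over
    the four dimensions, averaged over raters.
    """
    if any(r.get("refusal", False) for r in ratings):
        return None
    return mean(mean(r[d] for d in DIMENSIONS) for r in ratings)


def majority_flag(ratings, key):
    """True if at least two of the three raters set the binary flag."""
    return sum(bool(r.get(key, False)) for r in ratings) >= 2


# Hypothetical ratings from three raters for one question-model pair.
ratings = [
    {"correctness": 4, "completeness": 4, "safety": 3, "clarity": 3},
    {"correctness": 3, "completeness": 3, "safety": 3, "clarity": 3},
    {"correctness": 4, "completeness": 4, "safety": 4, "clarity": 4, "hallucination": True},
]
print(aggregate_item(ratings))                  # 3.5
print(majority_flag(ratings, "hallucination"))  # False

# Bookkeeping check: 100 questions x 6 models, 32 refusals, 3 raters per item.
assert (100 * 6 - 32) * 3 == 1704
```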

Inter-rater reliability was assessed with Krippendorff’s alpha (ordinal). Item-level agreement was fair (α = 0.10–0.20), though disagreements mostly fell between adjacent scores (agreement within ±1 point, 89–95%). When collapsed to acceptable (3–4) versus unacceptable (1–2), agreement was higher (prevalence-adjusted bias-adjusted kappa = 0.55–0.83). Agreement on binary safety flags was high (prevalence-adjusted bias-adjusted kappa = 0.86 for harm, 0.95 for hallucination).
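For a two-category rating, the prevalence-adjusted bias-adjusted kappa reduces to 2 × (observed agreement) − 1. The sketch below computes it, together with within-±1 agreement, from mean pairwise agreement among the raters of each item. Averaging over rater pairs is an assumption made here for illustration, as the text does not state how agreement among three raters was pooled; the ratings are toy values.

```python
from itertools import combinations


def pairwise_agreement(items, agree):
    """Share of rater pairs (within each item) for which agree(a, b) is true."""
    hits = total = 0
    for ratings in items:
        for a, b in combinations(ratings, 2):
            hits += agree(a, b)
            total += 1
    return hits / total


def pabak_binary(items, threshold=3):
    """PABAK for ratings collapsed to acceptable (>= threshold) versus not.

    For two categories, PABAK = 2 * observed agreement - 1.
    """
    p_o = pairwise_agreement(items, lambda a, b: (a >= threshold) == (b >= threshold))
    return 2 * p_o - 1


def within_one(items):
    """Share of rater pairs whose scores differ by at most one point."""
    return pairwise_agreement(items, lambda a, b: abs(a - b) <= 1)


# Toy data: three items, each scored 1-4 by three raters.
toy = [[4, 3, 3], [2, 4, 4], [1, 1, 2]]
print(round(pabak_binary(toy), 3), round(within_one(toy), 3))  # 0.556 0.778
```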

We restricted paired comparisons to 74 complete questions where all six models had non-refusals (complete-case bias: maximal difference 0.034 points). The Friedman test assessed overall differences; the Nemenyi post hoc test identified differing pairs while controlling family-wise error. Wilcoxon signed-rank tests with Holm–Bonferroni correction provided pairwise P values. Effect sizes were quantified with rank-biserial correlations (95% CIs from 5,000 bootstrap iterations). Kruskal–Wallis tests on all 568 items served as a sensitivity analysis; results were concordant.
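A matched-pairs rank-biserial correlation with a bootstrap interval can be sketched in plain Python. Several details are assumptions for illustration: zero differences are dropped, a percentile bootstrap over questions is used, and the seed is arbitrary; the text does not specify these choices.

```python
import random


def rank_biserial(x, y):
    """Matched-pairs rank-biserial correlation for paired scores x and y.

    r = (W+ - W-) / (W+ + W-), where W+ and W- are the sums of the ranks of
    the positive and negative differences; zero differences are dropped.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        return 0.0
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1  # mean rank for ties
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return (w_plus - w_minus) / (w_plus + w_minus)


def bootstrap_ci(x, y, iterations=5000, seed=0, level=0.95):
    """Percentile bootstrap CI, resampling questions (pairs) with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(rank_biserial([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lo = stats[int((1 - level) / 2 * iterations)]
    hi = stats[int((1 + level) / 2 * iterations) - 1]
    return lo, hi


# Toy example: model x is rated higher than model y on every question.
print(rank_biserial([3, 4, 5], [1, 2, 3]))  # 1.0
```

A value of r near 1 means that one model is rated higher on nearly every question, which is the sense in which between-tier differences hold for individual questions and not just on average.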

Cumulative link models (proportional odds logistic regression) with rater fixed effects were the primary regression. Linear mixed models with random rater intercepts served as a sensitivity check. Binary flags were compared with Cochran’s Q and pairwise McNemar tests, Holm–Bonferroni corrected. Refusal rates were compared with Fisher’s exact test. All tests were two-sided at α = 0.05.
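As an illustration of the refusal-rate comparison, a two-sided Fisher's exact test for a 2 × 2 table can be written with the standard library alone. This sketch returns an unadjusted P value; whether the reported refusal P values include a multiplicity correction is not stated, so its output should not be expected to match them. The refusal counts in the usage line are inferred from the reported non-refusal counts (81 of 100 for UpToDate and 99 of 100 for Claude).

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Inferred counts: 19 refusals of 100 (UpToDate) versus 1 of 100 (Claude).
p_value = fisher_exact_two_sided(19, 81, 1, 99)
print(f"{p_value:.2e}")
```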

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The MedQA (https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options) and HealthBench (https://huggingface.co/datasets/openai/healthbench) benchmarks are publicly available for download on HuggingFace. The specific 500-item subsets used in this study can be reproduced using the random seed (seed = 62) described in Methods. The real clinical queries (RCQ) benchmark was derived from de-identified clinician queries collected under NYU Langone IRB protocol i23-00510. Because these queries originated in a clinical environment, the RCQ dataset is not publicly available owing to institutional review and data use agreement restrictions.

Code availability

The code supporting this study is publicly available at https://github.com/nyuolab/clinical-llm-benchmarks.

References

  1. OpenEvidence, the fastest-growing application for physicians in history, announces 210 million round at 3.5 billion valuation. Cision PR Newswire https://www.prnewswire.com/news-releases/openevidence-the-fastest-growing-application-for-physicians-in-history-announces-210-million-round-at-3-5-billion-valuation-302505806.html (2025).

  2. HLTH 2025: Wolters Kluwer showcases UpToDate Expert AI and workflow innovations. Wolterskluwer https://www.wolterskluwer.com/en/news/uptodate-expert-ai-workflow-hlth-2025 (2025).

  3. Amugongo, L. M., Mascheroni, P., Brooks, S., Doering, S. & Seidel, J. Retrieval augmented generation for large language models in healthcare: a systematic review. PLOS Digit. Health 4, e0000877 (2025).

  4. Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. (Basel) 11, 6421 (2021).

  5. Arora, R. K. et al. HealthBench: evaluating large language models towards improved human health. Preprint at https://doi.org/10.48550/arXiv.2505.08775 (2025).

  6. Vishwanath, K. et al. Medical large language models are easily distracted. Preprint at https://doi.org/10.48550/arXiv.2504.01201 (2025).

  7. Vishwanath, K., Stryker, J., Alyakin, A., Alber, D. A. & Oermann, E. K. MedMobile: a mobile-sized language model with clinical capabilities. BMJ Digit. Health AI 1, e000068 (2025).

  8. Wu, E., Wu, K. & Zou, J. ClashEval: quantifying the tug-of-war between an LLM’s internal prior and external evidence. In Advances in Neural Information Processing Systems 37 33402–33422 (Neural Information Processing Systems Foundation, 2024).

  9. Alyakin, A. et al. CNS-Obsidian: a neurosurgical vision-language model built from scientific publications. Neurosurgery https://doi.org/10.1227/neu.0000000000004070 (2026).

  10. O’Sullivan, J. W. et al. A large language model for complex cardiology care. Nat. Med. 32, 616–623 (2026).

  11. Nori, H. et al. Sequential diagnosis with language models. Preprint at https://doi.org/10.48550/arXiv.2506.22405 (2025).

  12. Wu, D. et al. First, do NOHARM: towards clinically safe large language models. Preprint at https://doi.org/10.48550/arXiv.2512.01241 (2025).

  13. Jiang, L. Y. et al. Generalist foundation models are not clinical enough for hospital operations. Preprint at https://doi.org/10.48550/arXiv.2511.13703 (2025).

  14. Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362 (2023).


  15. Alber, D. A. et al. Medical large language models are vulnerable to data-poisoning attacks. Nat. Med. 31, 618–626 (2025).


  16. Malhotra, K. et al. Health system-wide access to generative artificial intelligence: the New York University Langone Health experience. J. Am. Med. Inform. Assoc. 32, 268–274 (2025).



Acknowledgements

We acknowledge N. Mherabi and D. Bar-Sagi for their support of medical AI research at NYU Langone. We thank M. Constantino and the NYULH High-Performance Computing (HPC) Team for computing resources essential to our work.

Funding

E.K.O. is supported by the National Cancer Institute’s Early-Stage Surgeon Scientist Program (3P30CA016087-41S1) and the W.M. Keck Foundation. This work was supported by the Institute for Information & Communications Technology Planning and Evaluation (IITP) grant funded by the Ministry of Science and ICT (MSIT) of the Republic of Korea government (No. RS-2019-II190075, Artificial Intelligence Graduate School Program (KAIST); No. RS-2024-00509279, Global AI Frontier Lab). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Authors and Affiliations

  1. Department of Neurological Surgery, NYU Langone Health, New York, NY, USA

    Krithik Vishwanath, Anton Alyakin, Sean N. Neifert, Cordelia Orillac, Nataniel J. Mandelberg, Hammad A. Khan, Jin Vivian Lee & Eric Karl Oermann

  2. Department of Aerospace Engineering & Engineering Mechanics, The University of Texas at Austin, Austin, TX, USA

    Krithik Vishwanath

  3. Department of Mathematics, The University of Texas at Austin, Austin, TX, USA

    Krithik Vishwanath

  4. Department of Neurosurgery, Washington University School of Medicine, St. Louis, MO, USA

    Anton Alyakin & Jin Vivian Lee

  5. Global AI Frontier Lab, New York University, New York, NY, USA

    Anton Alyakin, Jin Vivian Lee & Eric Karl Oermann

  6. Department of Biomedical Engineering, The University of Texas at Austin, Austin, TX, USA

    Mrigayu Ghosh

  7. Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA

    Mrigayu Ghosh

  8. Department of Cardiothoracic Surgery, NYU Langone Health, New York, NY, USA

    Ali Hage

  9. Department of Orthopedic Surgery, NYU Langone Health, New York, NY, USA

    Jie J. Yao

  10. Department of Medicine, NYU Langone Health, New York, NY, USA

    William Robert Small & Yindalon Aphinyanaphongs

  11. New York University (NYU) Grossman School of Medicine, New York, NY, USA

    William Robert Small & Daniel Alexander Alber

  12. Department of MCIT Health Informatics, NYU Langone Health, New York, NY, USA

    William Robert Small & Yindalon Aphinyanaphongs

  13. Division of Dermatology, NYU Langone Long Island, Mineola, NY, USA

    Aakaash Varma

  14. Department of Medicine, NYU Langone Long Island, Mineola, NY, USA

    Aakaash Varma

  15. Department of Surgery, NYU Langone Health, New York, NY, USA

    D. Brock Hewitt

  16. Department of Population Health, NYU Langone Health, New York, NY, USA

    Yindalon Aphinyanaphongs

  17. Department of Radiology, NYU Langone Health, New York, NY, USA

    Eric Karl Oermann

  18. Center for Data Science, New York University, New York, NY, USA

    Eric Karl Oermann

  19. Neuroscience Institute, NYU Langone Health, New York, NY, USA

    Eric Karl Oermann

Authors

  1. Krithik Vishwanath
  2. Anton Alyakin
  3. Mrigayu Ghosh
  4. Ali Hage
  5. Sean N. Neifert
  6. Cordelia Orillac
  7. Nataniel J. Mandelberg
  8. Hammad A. Khan
  9. Jin Vivian Lee
  10. Jie J. Yao
  11. William Robert Small
  12. Aakaash Varma
  13. D. Brock Hewitt
  14. Yindalon Aphinyanaphongs
  15. Daniel Alexander Alber
  16. Eric Karl Oermann

Contributions

Y.A. and E.K.O. supervised the study. K.V., A.A., D.A.A. and E.K.O. conceptualized and established the study design. K.V. designed and performed the LLM evaluations and scoring. K.V. developed the clinical evaluation platform. M.G., K.V., A.A. and D.A.A. performed the statistical analysis. A.H., S.N.N., C.O., N.J.M., H.A.K., J.V.L., J.J.Y., W.R.S., A.V., D.B.H., A.A. and D.A.A. contributed to study evaluation and data review. K.V. wrote the initial draft. K.V., M.G. and A.A. developed the figures. All authors reviewed and approved the final paper.

Corresponding authors

Correspondence to Krithik Vishwanath or Eric Karl Oermann.

Ethics declarations

Competing interests

E.K.O. reports equity in MarchAI and Artisight, spousal employment by Eikon Therapeutics, and consulting for Sofinnova Partners and Google. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Medicine thanks Leo Anthony Celi and Stephen Gilbert for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Mattia Andreoletti, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Table 1 Preliminary error taxonomy of low-scoring responses on the Real Clinical Queries benchmark


Extended Data Table 2 Cost comparison of evaluated AI systems


Extended Data Fig. 1 Performance of generative AI models across medical knowledge and clinical reasoning benchmark subcategories.

(a) MedQA accuracy by organ system and (b) by competency for five models (OpenEvidence, UpToDate Expert AI, GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6). Data is presented as accuracy ± Wilson’s 95% CI. Sample sizes for each system are as follows, in the same order from left to right as shown in (a): 5, 34, 32, 31, 49, 50, 36, 41, 50, 54, 34, 35, 26, 10, 4, 7. Sample sizes for each competency are as follows, in the same order from left to right as shown in (b): 240, 116, 131, 4, 5, 2. (c) HealthBench scores by theme and (d) by grading axis. Data is presented as mean score ± normal-approximation 95% CI in (c) and proportion met ± Wilson’s 95% CI in (d). Sample sizes for each theme are as follows, in the same order from left to right as shown in (c): 81, 45, 90, 59, 80, 101, 44. Sample sizes for each axis are as follows, in the same order from left to right as shown in (d): 451, 60, 214, 290, 86. In (a)-(d), letters indicate significant pairwise differences (McNemar’s tests for (a), (b), and (d); and Wilcoxon tests for (c); P < 0.05); models sharing a letter do not differ.
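The Wilson 95% intervals quoted above matter most for the small subcategories (for example n = 2, 4 or 5), where a normal-approximation interval can run outside [0, 1]. A minimal sketch of the standard Wilson score interval follows. The function name `wilson_interval` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    # The interval is centred on a value shrunk toward 0.5, not on p_hat itself.
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts (illustration only): 4 correct answers out of 5 questions.
low, high = wilson_interval(4, 5)
print(f"accuracy = {4 / 5:.2f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With so few questions the interval is wide, so accuracy differences in the smallest subcategories should be read cautiously.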

Extended Data Fig. 2 RCQ clinician evaluator rubric.

Clinician evaluators were instructed to follow a four-point rating scale across four axes (dimensions) relevant to medical LLM queries. Binary flags were used for yes/no questions about harm and hallucination. Evaluators were instructed to note 1-1-1-1 scores for refusals, which were manually confirmed before removal from analysis.

Extended Data Fig. 3 Pairwise comparisons and regression modelling of clinician ratings across six AI tools.

(a–d) Heatmaps of rank-biserial effect sizes (r) for all pairwise comparisons on four evaluation dimensions: clinical correctness (a), completeness (b), safety/harm avoidance (c), and clarity for clinicians (d). Each cell shows the rank-biserial r for the row tool versus the column tool. Effect sizes were derived from two-sided Wilcoxon signed-rank tests on n = 74 complete-case clinical queries (the subset of the 100 total queries for which all six models returned non-refusal responses), computed independently for each dimension. Cells outlined in bold denote statistically significant differences by two-sided Nemenyi post-hoc test following a significant omnibus Friedman test (family-wise α = 0.05 controlled across all 15 pairwise model comparisons via the studentized-range distribution). (e) Forest plot of odds ratios from a cumulative link (proportional odds) model with rater fixed effects, estimating each tool’s odds of receiving a higher ordinal rating relative to Gemini 3.1 (reference). Points indicate the odds-ratio point estimate (exp(β)) and horizontal bars the 95% Wald confidence intervals (exp(β ± 1.96 × SE)) for each of the four dimensions. Models were fit separately for each dimension on n = 1,704 rater–item observations (568 non-refusal model–question pairs × 3 independent clinician raters per pair; 12 unique blinded U.S. clinician raters). Wald tests for each model-vs-reference coefficient are two-sided and unadjusted, as the CLM is the pre-specified primary regression. Odds ratios below 1.0 indicate lower odds of a higher rating compared with the reference. (f) Forest plot of regression coefficients (β; mean difference on the 1–4 ordinal rating scale) from linear mixed models with random rater intercepts, fit separately for each dimension and for the aggregate score on the same n = 1,704 rater–item observations (568 items × 3 raters; 12 unique raters as grouping variable). Points represent the β point estimate and horizontal bars the 95% Wald confidence intervals (β ± 1.96 × SE); two-sided Wald P-values, unadjusted. Asterisks denote significance levels: *P < 0.05, **P < 0.01, ***P < 0.001.
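The Wald intervals in panels (e) and (f) follow directly from a coefficient and its standard error. The sketch below shows the arithmetic only. The helper names and the example coefficient are hypothetical, and the values are not estimates from the paper.

```python
import math

Z_95 = 1.96  # normal quantile used for the 95% Wald intervals in the caption


def wald_ci(beta: float, se: float, z: float = Z_95) -> tuple[float, float]:
    """Wald interval on the coefficient scale: beta ± z × SE (panel f)."""
    return beta - z * se, beta + z * se


def odds_ratio_ci(beta: float, se: float, z: float = Z_95) -> tuple[float, float, float]:
    """Odds ratio exp(beta) with its Wald interval exp(beta ± z × SE) (panel e)."""
    low, high = wald_ci(beta, se, z)
    return math.exp(beta), math.exp(low), math.exp(high)


# Hypothetical log-odds coefficient and standard error (illustration only).
beta, se = -0.50, 0.20
odds_ratio, or_low, or_high = odds_ratio_ci(beta, se)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({or_low:.2f}, {or_high:.2f})")

# Observation count stated in the caption: 568 non-refusal pairs × 3 raters.
print(568 * 3)  # 1704
```

Because the interval is built on the log-odds scale and then exponentiated, it is asymmetric around the odds ratio. An interval that excludes 1.0 corresponds to a two-sided Wald P < 0.05.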

Extended Data Fig. 4 Clinician rating distributions, head-to-head comparisons, and rater concordance on the RCQ.

Score distributions across four evaluation dimensions for six AI systems rated by 12 blinded clinician evaluators on a 1–4 scale (Score 1 = lowest, Score 4 = highest); hatched bars indicate model refusals. Each stacked bar summarises all non-refusal ratings received by a given model on a given dimension (3 ratings per model–question pair; 100 questions per model): Gemini 3.1 Pro n = 294, GPT-5.2 n = 291, Claude Opus 4.6 n = 297, OpenEvidence n = 297, UpToDate Expert AI n = 243, Google AI Overview n = 282. Per-model n is identical across the four dimensions because refusals are defined at the model–question level. Brackets denote significant pairwise differences by two-sided Nemenyi post-hoc test following a significant omnibus Friedman test on n = 74 complete-case questions (family-wise α = 0.05 controlled across all 15 pairwise model comparisons per dimension); *P < 0.05, **P < 0.01, ***P < 0.001. Bottom left: head-to-head win rate matrix showing the proportion of queries on which each row model outscored each column model on the aggregate score across n = 74 complete-case questions (ties contributed 0.5 to the row model). Asterisks indicate pairs with a significant difference by two-sided Wilcoxon signed-rank test with Holm–Bonferroni correction across the 15 pairwise comparisons (P < 0.05). Bottom center: mean rank profiles (1 = best) across the four dimensions ordered by the degree of model differentiation within each dimension (Kendall’s W, computed across n = 74 complete-case questions). Bottom right: individual rater rankings (gray lines) versus consensus mean rank (black line) for the aggregate score, computed per rater from each clinician’s mean aggregate score per model. Inter-rater concordance on model ranking was assessed with a two-sided Friedman test across the six models, from which Kendall’s W was derived (W = χ² / [n(k − 1)]; k = 6 models). Agreement was strong (W = 0.651, Friedman χ² = 39.05 with 5 degrees of freedom, P = 2.3 × 10⁻⁷; n = 12 independent clinician raters).
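The concordance figures in this caption can be checked by hand, since Kendall’s W is the Friedman statistic divided by n(k − 1). The sketch below recomputes W from the reported statistic with n = 12 raters and k = 6 models. As an added convenience, not something taken from the paper, it also evaluates the chi-square upper tail for 5 degrees of freedom using the standard closed form for odd degrees of freedom. The function names are hypothetical.

```python
import math


def kendalls_w_from_friedman(chi2: float, n_raters: int, k_models: int) -> float:
    """Kendall's W recovered from a Friedman statistic: W = chi2 / (n * (k - 1))."""
    return chi2 / (n_raters * (k_models - 1))


def chi2_sf_df5(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 5 degrees of freedom.

    Closed form for df = 5:
    erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * (x**0.5 + x**1.5 / 3)
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 / math.pi) * math.exp(-x / 2) * (
        math.sqrt(x) + x ** 1.5 / 3
    )


# Values reported in the caption: chi-square = 39.05, 12 raters, 6 models.
w = kendalls_w_from_friedman(39.05, n_raters=12, k_models=6)
p = chi2_sf_df5(39.05)
print(round(w, 3))  # 0.651
print(f"{p:.1e}")
```

The recomputed W matches the reported 0.651, and the tail probability comes out on the order of 10⁻⁷, in line with the reported P value.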

Extended Data Fig. 5 HealthBench grader prompt.

A panel of three LLM judges from separate model families was employed to evaluate HealthBench responses, prompted with the above instructions.

Extended Data Fig. 6 Clinical evaluation platform and sample question.

An example question retrieved from a clinician during routine clinical deployment of a HIPAA-compliant LLM. Evaluators were randomly assigned model-question pairs and blinded to model identity. Three clinicians reviewed each model-question pair.

Supplementary information

About this article


Cite this article

Vishwanath, K., Alyakin, A., Ghosh, M. et al. General-purpose large language models outperform specialized clinical AI tools on medical benchmarks. Nat Med (2026). https://doi.org/10.1038/s41591-026-04431-5


  • DOI: https://doi.org/10.1038/s41591-026-04431-5