LLM Benchmark All In One
xgqfrms · 2026-03-24 · via 博客园 - xgqfrms

Large language model (LLM) benchmarks

Ollama

Knowledge

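| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU-Pro | 87.4 | 89.5 | 89.8 | 85.7 | 87.1 | 87.8 |
| MMLU-Redux | 95.0 | 95.6 | 95.9 | 92.8 | 94.5 | 94.9 |
| SuperGPQA | 67.9 | 70.6 | 74.0 | 67.3 | 69.2 | 70.4 |
| C-Eval | 90.5 | 92.2 | 93.4 | 93.7 | 94.0 | 93.0 |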
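To read the Knowledge scores above at a glance, it can help to pick the top model per benchmark programmatically. The following is a minimal Python sketch, not part of the original post: the numbers are copied from the Knowledge rows, the column order follows the header row, and the helper name `best_per_benchmark` is hypothetical.

```python
# Knowledge scores copied from the rows above; column order follows the header row.
MODELS = [
    "GPT5.2", "Claude 4.5 Opus", "Gemini-3 Pro",
    "Qwen3-Max-Thinking", "K2.5-1T-A32B", "Qwen3.5-397B-A17B",
]
KNOWLEDGE = {
    "MMLU-Pro": [87.4, 89.5, 89.8, 85.7, 87.1, 87.8],
    "MMLU-Redux": [95.0, 95.6, 95.9, 92.8, 94.5, 94.9],
    "SuperGPQA": [67.9, 70.6, 74.0, 67.3, 69.2, 70.4],
    "C-Eval": [90.5, 92.2, 93.4, 93.7, 94.0, 93.0],
}


def best_per_benchmark(scores, models):
    """Return {benchmark: (model, score)} for the highest score in each row.

    Hypothetical helper for illustration; ties resolve to the leftmost column.
    """
    result = {}
    for bench, row in scores.items():
        idx = max(range(len(row)), key=row.__getitem__)
        result[bench] = (models[idx], row[idx])
    return result


if __name__ == "__main__":
    for bench, (model, score) in best_per_benchmark(KNOWLEDGE, MODELS).items():
        print(f"{bench}: {model} ({score})")
```

The same helper works for any other category in this post if its rows are pasted in with the same column order.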

Instruction Following

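| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| IFEval | 94.8 | 90.9 | 93.5 | 93.4 | 93.9 | 92.6 |
| IFBench | 75.4 | 58.0 | 70.4 | 70.9 | 70.2 | 76.5 |
| MultiChallenge | 57.9 | 54.2 | 64.2 | 63.3 | 62.7 | 67.6 |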

Long Context

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| AA-LCR | 72.7 | 74.0 | 70.7 | 68.7 | 70.0 | 68.7 |
| LongBench v2 | 54.5 | 64.4 | 68.2 | 60.6 | 61.0 | 63.2 |

STEM

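| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| GPQA | 92.4 | 87.0 | 91.9 | 87.4 | 87.6 | 88.4 |
| HLE | 35.5 | 30.8 | 37.5 | 30.2 | 30.1 | 28.7 |
| HLE-Verified¹ | 43.3 | 38.8 | 48.0 | 37.6 | -- | 37.6 |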

Reasoning

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| LiveCodeBench v6 | 87.7 | 84.8 | 90.7 | 85.9 | 85.0 | 83.6 |
| HMMT Feb 25 | 99.4 | 92.9 | 97.3 | 98.0 | 95.4 | 94.8 |
| HMMT Nov 25 | 100 | 93.3 | 93.3 | 94.7 | 91.1 | 92.7 |
| IMOAnswerBench | 86.3 | 84.0 | 83.3 | 83.9 | 81.8 | 80.9 |
| AIME26 | 96.7 | 93.3 | 90.6 | 93.3 | 93.3 | 91.3 |

General Agent

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| BFCL-V4 | 63.1 | 77.5 | 72.5 | 67.7 | 68.3 | 72.9 |
| TAU2-Bench | 87.1 | 91.6 | 85.4 | 84.6 | 77.0 | 86.7 |
| VITA-Bench | 38.2 | 56.3 | 51.6 | 40.9 | 41.9 | 49.7 |
| DeepPlanning | 44.6 | 33.9 | 23.3 | 28.7 | 14.5 | 34.3 |
| Tool Decathlon | 43.8 | 43.5 | 36.4 | 18.8 | 27.8 | 38.3 |
| MCP-Mark | 57.5 | 42.3 | 53.9 | 33.5 | 29.5 | 46.1 |

Search Agent³

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| HLE w/ tool | 45.5 | 43.4 | 45.8 | 49.8 | 50.2 | 48.3 |
| BrowseComp | 65.8 | 67.8 | 59.2 | 53.9 | --/74.9 | 78.6 |
| BrowseComp-zh | 76.1 | 62.4 | 66.8 | 60.9 | -- | 70.3 |
| WideSearch | 76.8 | 76.4 | 68.0 | 57.9 | 72.7 | 74.0 |
| Seal-0 | 45.0 | 47.7 | 45.5 | 46.9 | 57.4 | 46.9 |
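Some Search Agent cells are missing (`--`) or carry two values (`--/74.9`), so a naive average across the category would fail or mislead. The sketch below is an illustration only and not from the original post. It assumes that any cell that is not a single plain number should be skipped, because the post does not explain what the two halves of `--/74.9` mean. The helper names `parse_cell` and `mean_per_model` are hypothetical.

```python
from statistics import mean

MODELS = [
    "GPT5.2", "Claude 4.5 Opus", "Gemini-3 Pro",
    "Qwen3-Max-Thinking", "K2.5-1T-A32B", "Qwen3.5-397B-A17B",
]
# Raw cells copied from the Search Agent rows above; "--" marks a missing score.
SEARCH_AGENT = {
    "HLE w/ tool": ["45.5", "43.4", "45.8", "49.8", "50.2", "48.3"],
    "BrowseComp": ["65.8", "67.8", "59.2", "53.9", "--/74.9", "78.6"],
    "BrowseComp-zh": ["76.1", "62.4", "66.8", "60.9", "--", "70.3"],
    "WideSearch": ["76.8", "76.4", "68.0", "57.9", "72.7", "74.0"],
    "Seal-0": ["45.0", "47.7", "45.5", "46.9", "57.4", "46.9"],
}


def parse_cell(cell):
    """Return the cell as a float, or None if it is not a single plain number.

    Assumption: dual cells such as "--/74.9" are treated as missing.
    """
    try:
        return float(cell)
    except ValueError:
        return None


def mean_per_model(table, models):
    """Return {model: (mean score rounded to 2 places, number of cells used)}."""
    out = {}
    for i, model in enumerate(models):
        parsed = [parse_cell(row[i]) for row in table.values()]
        vals = [v for v in parsed if v is not None]
        out[model] = (round(mean(vals), 2), len(vals)) if vals else (None, 0)
    return out


if __name__ == "__main__":
    for model, (avg, n) in mean_per_model(SEARCH_AGENT, MODELS).items():
        print(f"{model}: mean={avg} over {n} benchmarks")
```

Reporting the count next to each mean makes it visible that K2.5-1T-A32B is averaged over fewer benchmarks than the other models, so its figure is not directly comparable.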

Multilingualism

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMLU | 89.5 | 90.1 | 90.6 | 84.4 | 86.0 | 88.5 |
| MMLU-ProX | 83.7 | 85.7 | 87.7 | 78.5 | 82.3 | 84.7 |
| NOVA-63 | 54.6 | 56.7 | 56.7 | 54.2 | 56.0 | 59.1 |
| INCLUDE | 87.5 | 86.2 | 90.5 | 82.3 | 83.3 | 85.6 |
| Global PIQA | 90.9 | 91.6 | 93.2 | 86.0 | 89.3 | 89.8 |
| PolyMATH | 62.5 | 79.0 | 81.6 | 64.7 | 43.1 | 73.3 |
| WMT24++ | 78.8 | 79.7 | 80.7 | 77.6 | 77.6 | 78.9 |
| MAXIFE | 88.4 | 79.2 | 87.5 | 84.0 | 72.8 | 88.2 |

Coding Agent

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| SWE-bench Verified | 80.0 | 80.9 | 76.2 | 75.3 | 76.8 | 76.2 |
| SWE-bench Multilingual | 72.0 | 77.5 | 65.0 | 66.7 | 73.0 | 69.3 |
| SecCodeBench | 68.7 | 68.6 | 62.4 | 57.5 | 61.3 | 68.3 |
| Terminal Bench 2 | 54.0 | 59.3 | 54.2 | 22.5 | 50.8 | 52.5 |

https://ollama.com/library/qwen3.5

https://ollama.com/search?c=cloud

demos

Gemini 3 Pro

Knowledge

image

Multilingualism

image

STEM

image

Claude 4.5 Opus

Coding Agent

image

GPT 5.2

Reasoning

image

free cloud LLM Models

Antigravity

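Using Gemini 3 models in Antigravity is currently free during the Public Preview, but with some limits, and paid plans are expected later.

Details:

  1. Free Individual Plan: during the current preview, individual users can use models including Gemini 3 Pro and Gemini 3 Flash for free ($0/month). This includes unlimited Tab completions and command requests, provided you stay within the platform's free-tier rate limits.
  2. Quotas: the free tier has usage caps. Very frequent use may trigger quota-limit warnings.
  3. Charges are likely in the future: the current free access is a preview-period benefit. Google has said it expects the platform to move to a regular commercial billing model (possibly a freemium model with a basic free quota, or subscription plans for individuals and teams).
  4. Paid upgrades: if you need a higher request quota, subscribing to Google AI Pro (about $20/month) or Google AI Ultra (about $249.99/month) gives you higher model-call limits. Once the quota is used up, you can keep going by buying AI credits.

Summary: you can use it for everyday programming for free right now, but heavy, frequent use may run into the free-tier quota.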

Gemini 3 Flash Preview on Ollama’s cloud:

$ ollama run gemini-3-flash-preview:cloud
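The same model can also be called from a script rather than the interactive CLI. The sketch below is not from the original post. It assumes a local Ollama server listening on its default address `http://localhost:11434`, and it assumes that the `:cloud` tag can be served through the local `/api/generate` endpoint once you are signed in to Ollama. The helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Assumption: default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Model tag taken from the `ollama run` command above.
MODEL = "gemini-3-flash-preview:cloud"


def build_request(prompt, model=MODEL, url=OLLAMA_URL):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt):
    """Send the prompt and return the generated text.

    Needs a running Ollama server; with "stream": False the reply is a single
    JSON object whose "response" field holds the text.
    """
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the server running, `print(generate("Why is the sky blue?"))` would print the model's reply. Building the request is kept separate from sending it so the payload can be inspected without any network access.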

image

image

https://ollama.com/library/gemini-3-flash-preview

refs

image

https://chatgpt.com/c/69c21608-e228-8326-976c-14b7d3b0af92

©xgqfrms 2012-2021

Published at www.cnblogs.com/xgqfrms: accessible to registered users only!

Original article, copyright ©️xgqfrms. Reproduction prohibited 🈲️; infringement will be pursued ⚠️!