GLM 5.2 Performance Benchmarks | Hacker News
theanonymous · 2026-06-17 · via HN's home page
GLM 5.2 Performance Benchmarks (artificialanalysis.ai)
97 points by theanonymousone 8 hours ago | 35 comments


It does really well on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable. I really like that benchmark because it's one of the few that allows LLMs to decline to answer when they are unsure, and punishes them for trying to bullshit their way through.


This implies that other benchmarks (for which every AI provider is optimizing?) are actively encouraging bullshitting?


A lot of benchmarks are set up to punish false negatives (missing the snippet being looked for) but not false positives (irrelevant answers or extra text).

This leads to answer bloat and/or hallucination if you benchmaxx on those.


There is a tradeoff where as factual accuracy increases, creativity decreases, and the model becomes more "rigid" and less general. Unfortunately it seems that creativity is a good quality for reasoning and ultimately problem solving.

So we have a situation where models that can solve challenging problems also tend to hallucinate, but those hallucinations seem to be the breeding ground for the solutions that earn them their high "Wow"-factor intelligence.


Yes. Most benchmarks just measure how many answers are correct. The best way to optimize that is to confidently state something in the hope that it's correct, which is exactly how most LLMs behave, despite plenty of evidence that they do know whether they "know" something.


If this is the case, then the GLM 5.2 model seems better than GPT 5.5 or maybe even "Fable", depending on what you are trying to achieve.

The Fable model is being removed from Anthropic because of security concerns from the US government (or, well, also partly because of the personal vendetta between the US government and Anthropic).


Bullshitting is how LLMs work. It doesn't require active encouragement. All it takes is a machine without consciousness or physical access to the world and an actually-lived life. A training set that contains lots of confident answers and few to no refusals doesn't help either.


It's simpler than that.

An LLM outputs tokens, one-by-one. It stops the loop if it outputs the end-of-text token. Which is, of course, statistically much rarer than any other kind of token.

(This is why you cannot, in general, prompt an LLM with something like "don't answer if the result is correct". It has to output something, by design.)
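
The loop described above can be sketched with a toy stand-in for the model. This is only an illustration under stated assumptions: `toy_next_token`, its five-word vocabulary, and the weights are hypothetical and not how any real LLM is implemented. The point is just that generation is a loop that keeps emitting tokens until a comparatively rare end-of-text token comes up (or a length cap is hit).

```python
import random

EOS = "<eos>"  # hypothetical end-of-text marker for this toy example


def toy_next_token(context, rng):
    """Stand-in for a model: draws the next token from a fixed toy distribution.

    A real model would condition on `context`; this sketch ignores it.
    """
    vocab = ["the", "answer", "is", "42", EOS]
    weights = [5, 5, 5, 5, 1]  # EOS is deliberately much rarer than the rest
    return rng.choices(vocab, weights=weights, k=1)[0]


def generate(prompt, max_tokens=50, seed=0):
    """Emit tokens one by one until EOS is drawn or max_tokens is reached."""
    rng = random.Random(seed)
    out = []
    for _ in range(max_tokens):
        tok = toy_next_token(prompt + out, rng)
        if tok == EOS:  # the only way the loop ends early
            break
        out.append(tok)
    return out


print(generate(["what", "is", "the", "answer", "?"]))
```

Nothing in this loop represents "I don't know": declining to answer would itself have to be a sequence of ordinary tokens that the model has learned to produce.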


They are, especially multiple-choice questions. The same happens with human exams:

Let's say there are 100 questions, with 4 answers each. A good answer is worth 1 point. By just guessing you get an average of 25/100, way more than 0/100 by not replying.

If instead a wrong answer is -1 point, by just guessing you get on average -50/100 (about 25 right minus 75 wrong), way worse than 0/100.
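
The expected score of blind guessing under each marking scheme is easy to check. A minimal sketch, assuming every guess is uniformly random over the choices; the function name `expected_guess_score` is made up for this example.

```python
from fractions import Fraction


def expected_guess_score(n_questions, n_choices, right=1, wrong=0):
    """Expected total score when every answer is a uniform random guess."""
    p_right = Fraction(1, n_choices)
    per_question = p_right * right + (1 - p_right) * wrong
    return n_questions * per_question


# 100 questions, 4 choices each
no_penalty = expected_guess_score(100, 4)              # wrong answers cost nothing
with_penalty = expected_guess_score(100, 4, wrong=-1)  # wrong answers cost a point
print(no_penalty, with_penalty)
```

With no penalty, guessing beats leaving answers blank (25 vs 0); with a one-point penalty it loses (-50 vs 0). A penalty of exactly -1/3 per wrong answer would make guessing break even with not answering on four-choice questions.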


Where do you see that? I see they have GPT-5.5 (xhigh) at 55, GPT-5.5 (high) at 53, and Muse Spark at 43. Muse Spark does beat GPT-5.4 mini (xhigh) which scores 40, but the key there is "mini".

In the coding index, GPT-5.5 gets 59.1, 58.5, 56.2, and 52.1 for xhigh, high, medium, and low while Muse Spark is behind at 47.5. For agentic, GPT-5.5 gets 74.1, 72.0, 69.4, and 59.7 (xhigh, high, medium, low) while Muse Spark gets 62.0 (beating only GPT-5.5 low).

GPT-5.5 only gets beaten by Opus 4.8 in their general index, is the top spot for coding, and is #3 behind Opus 4.8 and GLM-5.2 for agentic (excluding Fable 5 which takes the top spot, but is unavailable).


It's always nice to see open source models growing. I hope we will have good performance on lower-tier hardware some day.


Local models are already useful today. The next milestone is getting this level of performance onto truly affordable hardware.


NVidia has less than zero reason to ship cards ideal for this at low prices.

AMD’s stock price reflects a hope they launch a CUDA alternative. But this is unlikely for the near future.

There is a lot of interest in preventing China from coming in with cheap AI hardware.

So I expect the direction to be good local models that few can run effectively.


The Chinese will flood the market with cheap AI chips just like they did with EV cars. As consumers we can’t thank them enough.


I think it will eventually result in regulation and a potential grey market, and/or the implosion of the centralized LLM services. I doubt they can keep hardware from becoming cheaper forever, and diminishing returns will make consumer hardware suitable for all but the hardest problems. At that point, the hardware “moat” will be completely gone and will have become an extreme, unrecoverable sunk cost.


Well, most people didn't like Fable when it was available anyway, because it refused to answer questions very often.


On some it does, yes, also in real usage.

It avoided answering 2/21 tests in this specific benchmark, so its maximum possible score is already capped at about 90%.


I'm glad those tests apparently work out for you, but a benchmark where three of the top 5 models are different flavors of Gemini Flash and zero are anything by Anthropic is just so wildly divergent from my personal experience with the models that it's not useful to me.

Whatever it is you're measuring, it's not anything related to what I use models for.


Thanks for the feedback!

What are you using Claude models for? Coding only? Computer use? Which harness?


Not only coding but also general knowledge work, anything from learning about how some things work (e.g. walking me through PNP vs NPN transistors) to summarizing texts, doing web research, and occasionally some OCR.

I've experimented with a few models for all this and have found Gemini the best at OCR but quite a bit worse at the rest. Claude is worse than GPT at web research-shaped things, but Opus 4.8 wins my anecdote benchmark for the other tasks besides those two.

But really, for code or knowledge stuff Gemini is markedly worse than the others, while Opus and GPT 5.5 are very, very close.


There’s no way Anthropic can keep jacking up the prices like this for every marginally better model. I think even tokenmaxxing companies are going to soon balk at $50/million output tokens.


Anthropic wants to ban the alternatives through regulation and ideally provide differential access with differential pricing.