EBM vs. LLMs: Our Kona EBM a 96% vs. 2% Sudoku Benchmark
Topfi · 2026-05-16 · via Hacker News - Newest: "AI"

Kona EBRM Sudoku Demo

Our Sudoku demo is a simple way to see the difference yourself.

We have Kona solve hard sudoku puzzles in real time alongside a group of popular LLMs, including GPT-5.2, Claude Opus 4.5 and Sonnet 4.5, Gemini 3 Pro, and DeepSeek V3.2. After about a week of public access, Kona solved 96.2% of puzzles at an average of 313 milliseconds. Frontier LLMs together had a solve rate of only 2%, taking up to 90 seconds before returning an incorrect answer or timing out.

The reason LLMs struggle with sudoku is not that the rules are especially complicated: a sudoku puzzle is 81 cells arranged in a 9x9 grid, where each row, column, and 3x3 box must contain the digits 1 through 9 exactly once. But LLM architecture isn’t well suited to spatial tasks; solving these puzzles requires holding the whole grid in mind and reasoning about how a change in one cell affects every other cell in the same row, column, and box.
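To make the rules concrete, here is a minimal Python sketch of the validity check they imply. The helper names (`units`, `is_valid_solution`) and the sample grid are illustrative assumptions, not part of the demo.

```python
def units():
    """Yield the 27 units of a 9x9 grid as lists of (row, col) cells."""
    for i in range(9):
        yield [(i, c) for c in range(9)]  # row i
        yield [(r, i) for r in range(9)]  # column i
    for br in range(0, 9, 3):
        for bc in range(0, 9, 3):
            yield [(br + r, bc + c) for r in range(3) for c in range(3)]  # 3x3 box


def is_valid_solution(grid):
    """True if every row, column, and box holds the digits 1-9 exactly once."""
    digits = set(range(1, 10))
    return all({grid[r][c] for r, c in unit} == digits for unit in units())


# A filled grid built from a simple shifting pattern (illustrative only).
solved = [[(3 * r + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
print(is_valid_solution(solved))  # True

# Swapping two cells in one row keeps that row valid but breaks two columns.
broken = [row[:] for row in solved]
broken[0][0], broken[0][1] = broken[0][1], broken[0][0]
print(is_valid_solution(broken))  # False
```

The second check shows why the grid has to be considered as a whole: a change that looks harmless within one row still violates constraints elsewhere.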

Because LLMs generate tokens sequentially, committing to each one as they go, they cannot easily revise earlier decisions when they discover a conflict later. This is not a challenge that can be solved by adding more compute or seeing more sudoku puzzles; it is an inherent structural limitation of autoregressive language models.
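As a loose analogy only (this is not how an LLM decodes, and it is not part of the demo), the Python sketch below fills cells strictly in order and never revisits a choice. The puzzle is an arbitrary illustrative example, and `candidates` and `greedy_fill` are hypothetical helper names.

```python
# Illustrative puzzle (0 = empty cell); not taken from the benchmark.
PUZZLE = """
530070000
600195000
098000060
800060003
400803001
700020006
060000280
000419005
000080079
"""


def candidates(grid, r, c):
    """Digits that do not already appear in the cell's row, column, or box."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[br + i][bc + j] for i in range(3) for j in range(3)}
    return [d for d in range(1, 10) if d not in used]


def greedy_fill(grid):
    """Fill empty cells in reading order, committing to each choice.

    Returns None if the grid gets filled, else the (row, col) of the dead end.
    """
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                options = candidates(grid, r, c)
                if not options:
                    return (r, c)  # no legal digit left, and no way to go back
                grid[r][c] = options[0]  # commit to the first legal digit
    return None


grid = [[int(ch) for ch in line] for line in PUZZLE.split()]
print(greedy_fill(grid))  # (0, 8)
```

Every individual choice is locally legal, yet the procedure is stuck before it finishes the first row, because an earlier commitment ruled out the only digit that would have fit later.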

See the live benchmark at sudoku.logicalintelligence.com.

We disabled code execution for all models in the demo, to prevent the LLMs from writing a brute-force backtracking solver in Python and running it, which would tell us something about their coding ability but nothing about their reasoning ability. With code execution off, the models have to actually work through the puzzle using whatever reasoning capabilities they have, and those capabilities turn out to be poorly suited to constraint satisfaction.
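For reference, the kind of brute-force backtracking solver mentioned above fits in a few lines of Python. This is a generic textbook sketch, not code produced by any of the models in the demo; the puzzle and the function names are illustrative assumptions.

```python
# Illustrative puzzle (0 = empty cell); not taken from the benchmark.
PUZZLE = """
530070000
600195000
098000060
800060003
400803001
700020006
060000280
000419005
000080079
"""


def candidates(grid, r, c):
    """Digits that do not already appear in the cell's row, column, or box."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[br + i][bc + j] for i in range(3) for j in range(3)}
    return [d for d in range(1, 10) if d not in used]


def solve(grid):
    """Fill the zeros in place by depth-first search; return True on success."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in candidates(grid, r, c):
                    grid[r][c] = d
                    if solve(grid):
                        return True
                    grid[r][c] = 0  # undo the guess and try the next digit
                return False  # no digit fits here, so backtrack
    return True  # no empty cells remain


grid = [[int(ch) for ch in line] for line in PUZZLE.split()]
solved = solve(grid)
print(solved)
print("".join(map(str, grid[0])))
```

The solver succeeds because it can undo a guess. That makes it a test of whether a model can write a search routine, not of whether it can reason about the grid directly.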

We also did not fine-tune these LLMs on sudoku problems. Fine-tuning would test pattern recognition over memorized solutions, not reasoning. Nor was Kona trained on solved puzzles; it learns the constraint structure and solves novel configurations from that information alone. The comparison is between reasoning architectures, not between levels of domain-specific exposure.

*In a separate experiment, we trained the model only on partial solutions (with 50% of the solution masked), and the model still learned the rules and was able to generate full solutions at test time.

The LLMs receive a puzzle and start producing chains of reasoning: "Let me analyze by rows, columns and boxes," "for cell (4,1), the row needs 1, 2, 3, 5, 7, 8, 9," and so on, sometimes generating hundreds of tokens of step-by-step analysis. Then they output a grid that has duplicate digits in a row, or a missing digit in a column, or some other constraint violation. DeepSeek V3.2 tends to finish quickly but with many duplicates across rows, columns, and boxes, so it fails to solve the puzzle. The Claude models often think for a long time and then produce something close but still wrong. GPT-5.2 does best among the LLMs at 6.9%, which is still a 93% failure rate on a puzzle that any patient human can solve.

Kona approaches the problem differently because it's a different kind of model. Instead of generating a solution token-by-token, Kona produces a complete candidate grid and evaluates it against all the constraints simultaneously, assigning a scalar energy score where low energy means the constraints are satisfied and high energy means something is violated. If the energy is high, Kona can identify which constraints are broken and revise the relevant parts of the grid without starting over, using gradient information in continuous latent space to move toward valid configurations. This process takes about 313 milliseconds on average, compared to the 30+ seconds the LLMs spend generating reasoning traces that lead nowhere.
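The idea of scoring a whole candidate grid can be illustrated with a hand-written toy energy in Python. This is only a sketch under strong assumptions: Kona's energy is learned and its revisions use gradients in a continuous latent space, whereas this example simply counts duplicated digits. All names here (`units`, `unit_energy`, `energy`, `broken_units`) are hypothetical.

```python
def units():
    """Yield (name, cells) for the 9 rows, 9 columns, and 9 boxes."""
    for i in range(9):
        yield ("row", i), [(i, c) for c in range(9)]
        yield ("col", i), [(r, i) for r in range(9)]
        yield ("box", i), [(3 * (i // 3) + r, 3 * (i % 3) + c)
                           for r in range(3) for c in range(3)]


def unit_energy(grid, cells):
    """Number of repeated digits in one unit (assumes entries are 1-9)."""
    return 9 - len({grid[r][c] for r, c in cells})


def energy(grid):
    """Toy scalar energy: 0 when every constraint holds, larger when more are violated."""
    return sum(unit_energy(grid, cells) for _, cells in units())


def broken_units(grid):
    """Names of the units that currently violate their constraint."""
    return [name for name, cells in units() if unit_energy(grid, cells) > 0]


# A valid grid built from a simple shifting pattern (illustrative only).
grid = [[(3 * r + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
print(energy(grid))  # 0

grid[0][0] = 2  # corrupt a single cell (it was 1)
print(energy(grid))  # 3
print(broken_units(grid))  # [('row', 0), ('col', 0), ('box', 0)]
```

Even in this toy version, the score is computed over the full grid at once, and the list of broken units points to where a revision is needed without restarting from scratch.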

Kona EBRM Compute Cost

The cost difference is also notable. Running Kona on the nearly 13,000 puzzles it has actually solved has so far taken roughly 1.1 hours of compute and cost about $4 in GPU time. The LLM API calls for the same demo, at a 98% failure rate, have cost around $11,000, mostly from Claude's tokens.
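A quick back-of-the-envelope check of these figures, written in Python. The inputs are the rounded numbers quoted above (13,000 stands in for "nearly 13,000"), so the results are approximate.

```python
# Rounded figures quoted in the text; treat the results as approximate.
kona_cost_usd = 4.0
kona_puzzles_solved = 13_000  # "nearly 13,000"
kona_compute_hours = 1.1
llm_cost_usd = 11_000.0

cost_per_solved_puzzle = kona_cost_usd / kona_puzzles_solved
seconds_per_puzzle = kona_compute_hours * 3600 / kona_puzzles_solved
cost_ratio = llm_cost_usd / kona_cost_usd

print(f"Kona cost per solved puzzle: ${cost_per_solved_puzzle:.5f}")
print(f"Kona compute per puzzle: {seconds_per_puzzle:.2f} s")
print(f"LLM spend relative to Kona spend: {cost_ratio:.0f}x")
```

The per-puzzle compute time that falls out of this, about 0.3 seconds, is in line with the 313 millisecond average reported for the demo.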

If a model architecture can’t handle sudoku, where the constraints are simple and the state space is tiny, it is not going to handle supervising industrial control systems or billions of dollars in global transactions. The gap between 96% and 2% is not about sudoku; it’s a demonstration of how differently these architectures are designed.