Google Deepmind's AlphaProof Nexus solves decades-old math problems for a few hundred dollars
Matthias Bas · 2026-05-25 · via The Decoder

AlphaProof Nexus combines LLM-driven proof generation with machine verification to crack open problems in mathematical research that have stumped mathematicians for decades.

Google Deepmind's new framework AlphaProof Nexus has autonomously solved nine out of 353 open Erdős problems it attempted, including two questions that had gone unanswered for 56 years.

The system also proved 44 out of 492 open conjectures from the Online Encyclopedia of Integer Sequences (OEIS), settled a 15-year-old question about Hilbert functions in algebraic geometry, and improved a known bound in convex optimization. Inference costs ran just a few hundred dollars per problem, according to the research paper.

Unlike approaches that appear to rely purely on natural language, such as OpenAI's recent solution, the underlying language model in AlphaProof Nexus (in this case Gemini 3.1 Pro) doesn't have to carry the entire logical chain on its own.

Instead, it generates proof steps in Lean's formal language, and the compiler checks each one. Error messages feed directly back into the next attempt. That way, the LLM gets grounded by symbolic feedback, a safety net that offsets the well-known weaknesses of language models when it comes to logical reasoning. Humans only step in at the very end to check the results.

Four agents, one surprising result

The system consists of four agent variants with increasing complexity. The simplest, Agent (A), deploys independent sub-agents running on Gemini 3.1 Pro in loops: the language model generates proof steps, the Lean compiler checks them, and error messages feed back into the next try.
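
The generate, check, and retry loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Deepmind's implementation: `prove_with_feedback`, `toy_generate`, and `toy_check` are hypothetical names, and the two toy functions merely stand in for a call to the language model and a run of the Lean compiler.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-ins: a real system would call an LLM and the Lean compiler.
Generator = Callable[[str, Optional[str]], str]  # (goal, last error) -> candidate proof
Checker = Callable[[str], Tuple[bool, str]]      # candidate -> (accepted, error message)


def prove_with_feedback(
    goal: str,
    generate: Generator,
    check: Checker,
    max_attempts: int = 8,
) -> Tuple[Optional[str], int]:
    """Retry until the checker accepts a candidate or the budget runs out.

    Returns (accepted proof or None, number of attempts used).
    """
    last_error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(goal, last_error)
        accepted, message = check(candidate)
        if accepted:
            return candidate, attempt
        # The checker's error message becomes context for the next attempt.
        last_error = message
    return None, max_attempts


def toy_generate(goal: str, last_error: Optional[str]) -> str:
    # Toy "model": tries one tactic first, switches after seeing an error.
    return "by simp" if last_error is None else "by omega"


def toy_check(candidate: str) -> Tuple[bool, str]:
    # Toy "compiler": accepts exactly one candidate string.
    if candidate == "by omega":
        return True, ""
    return False, "simp made no progress"


result = prove_with_feedback("a + b = b + a", toy_generate, toy_check)
print(result)
```

In this toy run the first candidate is rejected, the error message is passed back, and the second candidate is accepted. The attempt budget (`max_attempts`) is an assumption made for the sketch; the article only states that solve rates rise with more sub-agents and higher cost.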

Agent (B) adds queries to AlphaProof, Google's reinforcement-learning-based system for olympiad math, which can fill in missing proof segments. Agent (C) introduces an evolutionary component. Inspired by AlphaEvolve, sub-agents share a common population of proof sketches. Rating agents built on Gemini 3.0 Flash score these sketches for plausibility and novelty, then rank them using an Elo system. The fully equipped Agent (D) combines all of these capabilities.
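
To make the Elo ranking step concrete, here is a small sketch using the standard Elo update rule. The article does not say how the rating agents' judgments are turned into ratings, so the pairwise comparisons, the starting rating of 1000, the K-factor of 32, and the sketch names are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner: float, loser: float, k: float = 32.0) -> Tuple[float, float]:
    """Return the new (winner, loser) ratings after one comparison."""
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


# Hypothetical population of proof sketches, all starting at the same rating.
ratings: Dict[str, float] = {"sketch_a": 1000.0, "sketch_b": 1000.0, "sketch_c": 1000.0}

# Hypothetical judgments from a rating agent: (preferred sketch, other sketch).
comparisons: List[Tuple[str, str]] = [
    ("sketch_a", "sketch_b"),
    ("sketch_a", "sketch_c"),
    ("sketch_c", "sketch_b"),
]

for preferred, other in comparisons:
    ratings[preferred], ratings[other] = update_elo(ratings[preferred], ratings[other])

# Highest-rated sketches come first.
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Each update moves the same number of points from the less preferred sketch to the preferred one, so the total rating across the population stays constant.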

Agent (D) was used for the Erdős problems. But a post-hoc analysis turned up a surprise: the simplest Agent (A), which uses only an LLM and compiler feedback, could also prove all nine solved Erdős problems, albeit at higher cost on the hardest ones.

The researchers attribute the simple agent's success to two factors: rapid improvement in the underlying language models and the "power of compiler feedback in grounding LLM reasoning." The fully equipped agent still holds an edge on the toughest tasks for now, but that lead could shrink as LLMs get better. The researchers say this points to a broader trend, describing "an ongoing shift from specialized trained systems toward simple agentic loops as LLMs become more capable."

Figure: Solve rate (y-axis) vs. mean cost in USD (x-axis) for six of the nine solved Erdős problems (12(i), 12(ii), 125, 138, 152, and 26), with the number of sub-agents marked at each data point. On easier tasks like erdos_26, all four agent variants hit high success rates. On harder problems like erdos_125 or erdos_152, clear gaps emerge. The fully equipped Agent (D) sometimes gets there with fewer attempts, but the simple Agent (A) also succeeds given enough budget. | Image: Tsoukalas et al.

Useful even without a complete proof

The system's successes cluster in areas like combinatorics, convex optimization, and number theory, where Lean's math library Mathlib is mature and problems break down into manageable sub-goals. Most Erdős problems remained out of reach, "let alone problems that require extensive new theory," the researchers write. The agents also inherit the unreliability of the underlying language models.

Still, they see value beyond solved problems. Mathematicians who worked with the system reported that even failed proof attempts deepened their understanding of a problem, or as the authors put it, "AI-driven formal proof search can serve not only to solve problems but to deepen human understanding."

Because the sketches were formal, experts could focus on the unsolved sub-goals instead of re-checking the entire argument from scratch. The agents also proved effective at catching flawed formalizations in the literature. "Formal verification can serve as a filter for determining which proofs merit human review," the authors write.

The system is already being used in ongoing research on quantum optics and graph theory, according to the paper. All Lean proofs and selected natural-language proofs are available on GitHub.

Figure: How AlphaProof Nexus solves Erdős problem #125: The agent receives a Lean file in which the actual proof has been replaced by a gap, a sorry placeholder inside EVOLVE-BLOCK markers (a), sees prior attempts with Elo ratings and a current plan in the prompt (b), then breaks the proof down step by step, calls AlphaProof for sub-goals, and refines failed steps by decomposing them into lemmas until all six sub-goals are proved and validated (c). | Image: Tsoukalas et al.
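
For readers unfamiliar with Lean, the "gap" in such an input file is written with the `sorry` keyword, which lets a file compile while marking a proof as missing. The following Lean 4 snippet is a toy illustration only. The statements are not taken from the paper, the theorem names are made up, and the system's own file markers are not reproduced here.

```lean
-- Toy statement with the proof left open, as an agent might receive it.
-- Lean accepts the file but warns that the declaration uses `sorry`.
theorem toy_add_comm (a b : Nat) : a + b = b + a := by
  sorry

-- The same statement with the gap filled; the compiler now checks it fully.
theorem toy_add_comm_filled (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

A proof counts as complete only once no `sorry` remains, which is the kind of check a compiler can perform mechanically.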

Erdős problems become the benchmark for AI math

OpenAI recently used a proprietary reasoning model to disprove Erdős's unit-distance conjecture. Fields Medalist Tim Gowers called it "a milestone in AI mathematics." Before that, GPT-5.2 Pro helped solve Erdős problem #281, with Terence Tao calling the case "perhaps the most unambiguous instance" of an LLM solving an open math problem. GPT-5.4 later solved another Erdős problem.

In some ways, those results are more impressive than Deepmind's: the language model had to carry the entire logical chain through natural language, without a Lean compiler checking each step. AlphaProof Nexus is more systematic and scalable, but it pursues a different goal: building a reliable AI tool for everyday math research. OpenAI could of course integrate Lean into its scaffold as well, but its work is more about testing raw LLM capability.

Tao has previously warned against reading too much into the headlines, though. AI's actual success rate on Erdős problems sits at just one to two percent, concentrated on easier tasks. Google's system cracked only nine of the 353 problems it attempted, about 2.5 percent, which is close to the upper end of Tao's estimate.
