Xiaomi's MiMo Code claims it beats Claude Code past 200 steps
http://www.facebook.com/janakiramm · 2026-06-15 · via The New Stack | DevOps, Open Source, and Cloud Native News

A coding agent that scaffolds a working app over lunch will routinely stall around 30 steps into a production refactor. It locks onto a hypothesis early and keeps patching a wrong assumption, so small errors compound until the run comes apart.

It’s a scenario that Xiaomi’s MiMo AI team may soon consign to history. The team has now open-sourced MiMo Code, a terminal-native harness that the company claims outperforms Anthropic’s Claude Code on agentic tasks that run beyond 200 steps. The benchmark is self-reported, drawn from Xiaomi’s own beta and a survey of 576 developers, so it reads as a marker rather than a finding.

The number matters less than the axis on which Xiaomi chose to compete. Long-horizon reliability — holding a task together across hundreds of dependent steps — is the new front in coding agents. The field is only now learning to measure how far an agent gets before it loses the thread. The endurance gap names that distance, the steps an agent survives before it drops the task.

Where long-horizon agents break

Ask any agent to build a small app from a clean prompt, and it performs. Carrying a single objective through a few hundred steps of editing, testing, and revision is where the failure modes show up on schedule.

Three failures recur in the long run:

  1. Hypotheses harden too early, so the agent keeps patching a wrong assumption.
  2. Step 40 inherits the mistakes of step 12, and the errors compound.
  3. The context that mattered at the start drifts out of view by the middle of the run.

Practitioners have described this collapse for months. One widely shared account from the team behind Ejentum put the breakdown near step thirty, naming hypothesis lock-in and error compounding as the usual culprits. Think of a long batch job run without checkpoints. A crash forces the whole job to restart from the beginning rather than the last saved state.
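The compounding effect can be put in rough numbers. The sketch below is a toy model, not something taken from Xiaomi, Ejentum, or Berkeley: it assumes every step succeeds independently with the same probability, which real agent runs need not satisfy, and the function name `survival_probability` is hypothetical.

```python
def survival_probability(per_step_success: float, steps: int) -> float:
    """Probability that a run gets through `steps` steps with no derailing error.

    Assumption (illustrative only): each step succeeds independently with the
    same probability, so the chance of surviving n steps is p ** n.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # 0.98 is an illustrative per-step success rate, not a measured one.
    for steps in (12, 30, 200):
        print(steps, round(survival_probability(0.98, steps), 3))
```

Under that assumption, a 98 percent per-step success rate leaves roughly a 55 percent chance of reaching step 30 and under a 2 percent chance of reaching step 200, which is why small per-step error rates matter so much on long runs.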

The cleaner way to assess the gap is to grade shipped work rather than rely on demos. A new benchmark from Berkeley set out to do exactly that.

Inside Berkeley’s Last Exam

Dawn Song and postdoc Yiyou Sun at UC Berkeley’s RDI lab built Agents’ Last Exam. The team reported that more than 250 industry experts across 55 occupations shaped it. The benchmark is strict by design, built to expose where agents fall short rather than where they shine. Each task is a real project a professional has already shipped, converted into a code-graded test with no human judge in the loop. The agent gets full access to a graphical interface and a command line. It solves the task however it would, and the benchmark scores only the artifact it leaves behind.
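To make the idea of a code-graded test concrete, here is a minimal sketch of a grader that inspects only the files an agent leaves behind. It is not the benchmark's actual grader; the names `grade_artifact`, `file_exists`, and `json_has_key`, and the example file `report.json`, are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable

# A check looks at the agent's working directory and returns pass or fail.
Check = Callable[[Path], bool]


def file_exists(name: str) -> Check:
    """Build a check that passes when the named file is present."""
    return lambda workdir: (workdir / name).is_file()


def json_has_key(name: str, key: str) -> Check:
    """Build a check that passes when the file parses as a JSON object containing `key`."""
    def check(workdir: Path) -> bool:
        try:
            data = json.loads((workdir / name).read_text())
        except (OSError, ValueError):
            # Missing, unreadable, or malformed files count as a failed check.
            return False
        return isinstance(data, dict) and key in data
    return check


def grade_artifact(workdir: Path, checks: list[Check]) -> bool:
    """Pass only if every check holds on the finished artifact.

    The grader never sees the agent's transcript, only the resulting files.
    """
    return all(check(workdir) for check in checks)
```

A grader of this shape needs no human judge: a task passes or fails on the state of the working directory alone, however the agent got there.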

The headline finding undercuts anyone selling agents as ready for work. Berkeley’s team paired Codex with GPT-5.5, the strongest configuration tested. Even that pairing, the paper reported, passed fewer than 50 percent of tasks on the easiest tier and under 10 percent on the hardest. Most mainstream agents, Claude Code among them, record near-zero pass rates at that difficulty. Song put the takeaway plainly: Today’s agents handle a meaningful slice of professional tasks while the hardest long-horizon work stays out of reach.

Berkeley’s results landed the same week Anthropic shipped Fable 5 into a wave of job-ready marketing. The exam answered that pitch with a number rather than a promise.

The value of the exam is that it grades the finished artifact rather than the demo. A model can top a coding leaderboard and still leave a half-finished deliverable when the task runs long, which is the gap a code-graded floor is built to catch.

The harness layer

The most active work in coding agents now sits in the harness, the layer that holds state, paces the work, and decides which thread to pursue next. Three approaches show where it is heading.

Anthropic reported that nested subagents in Claude Code are capped at five levels deep. A frontier model plans, while cheaper subagents carry out the work and spawn their own helpers. Researchers at Renmin University of China took a different route with Arbor. It pairs a long-lived coordinator with short-lived executors and a persistent hypothesis tree, one that checkpoints progress and resumes after an interruption. Xiaomi’s MiMo Code addresses the same problem from the open-source side, tuning a terminal-native harness for runs beyond 200 steps.

These are not interchangeable, and the reported evidence varies widely in confidence.

Every credible answer externalizes state, and Arbor makes that point most plainly of the three. Its design treats the hypothesis tree, not the context window, as the thing that has to survive in the long run. To appreciate why that holds, we need to look at how durable workflow engines solved the same problem years ago. Whatever a system cannot checkpoint must be redone after a failure.
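What externalized state might look like can be sketched in a few lines. The code below is an illustration only and does not reproduce Arbor's design: the `Hypothesis` class, its status values, and the helper names are hypothetical, and the tree is simply written to a JSON file so a restarted process can pick up where the last one stopped.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class Hypothesis:
    claim: str
    status: str = "open"  # "open", "confirmed", or "rejected"
    children: list["Hypothesis"] = field(default_factory=list)


def save_checkpoint(root: Hypothesis, path: Path) -> None:
    """Write the whole tree to disk.

    The data goes to a temporary file first and is then renamed over the old
    checkpoint, so a crash mid-write does not corrupt the last good copy.
    """
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(asdict(root)))  # asdict recurses into children
    tmp.replace(path)


def load_checkpoint(path: Path) -> Hypothesis:
    """Rebuild the tree from the JSON written by save_checkpoint."""
    def build(node: dict) -> Hypothesis:
        return Hypothesis(
            node["claim"], node["status"], [build(c) for c in node["children"]]
        )
    return build(json.loads(path.read_text()))


def next_open(node: Hypothesis) -> Optional[Hypothesis]:
    """Depth-first search for the first open hypothesis, skipping rejected subtrees."""
    if node.status == "rejected":
        return None
    if node.status == "open":
        return node
    for child in node.children:
        found = next_open(child)
        if found is not None:
            return found
    return None
```

After an interruption, a coordinator in this sketch would call `load_checkpoint` and then `next_open` to find where to resume, rather than reconstructing its reasoning from whatever is left in the context window.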

Why does the endurance gap matter?

For an enterprise wiring agents into a delivery pipeline, the endurance gap is a procurement question rather than a curiosity. An agent that fails at step 30 in production does not announce it. It hands back a plausible artifact built on an early wrong turn. The cost will surface later as rework, silent defects, and reviews that have to assume the worst. A benchmark like Berkeley’s gives buyers a code-graded floor for how far an agent can carry out a task before a human is needed.

Enterprises can now treat endurance as its own line in the evaluation. That means measuring how far an agent carries out a task rather than reading a coding leaderboard score and assuming the rest. Teams should ask how a candidate harness holds state across a long run, whether it can resume from a checkpoint, and where its own published numbers put the ceiling. Some vendors lead with a model benchmark and stay quiet on long-horizon behavior, answering a question the reader stopped asking.
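One way a team might record endurance as its own evaluation line is to log each run step by step and count how many steps complete before a person has to intervene. This is a sketch under stated assumptions: the `StepRecord` fields and both function names are hypothetical, and what counts as an intervention is left for each team to define.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class StepRecord:
    description: str
    human_intervened: bool  # True if a person had to correct or restart the agent


def steps_survived(run: list[StepRecord]) -> int:
    """Number of steps completed before the first human intervention."""
    for completed, record in enumerate(run):
        if record.human_intervened:
            return completed
    return len(run)


def endurance_summary(runs: list[list[StepRecord]]) -> dict[str, float]:
    """Summarize endurance across several runs of the same task."""
    survived = [steps_survived(run) for run in runs]
    return {
        "min": min(survived),
        "median": median(survived),
        "max": max(survived),
    }
```

Reporting the minimum alongside the median keeps one lucky long run from hiding the early failures.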

It would be a mistake to wave either result away. Xiaomi’s 200-step claim could hold up once an independent harness runs it. Berkeley’s hardest tier may overweight tasks that no team would hand to an unsupervised agent in the first place. If models keep extending the span over which they stay coherent, part of the gap closes from the model side, and the harness earns a little breathing room. None of that changes the near-term reality, where an agent’s finished output breaks well before its demo ever does.

What’s next

For developers tracking the tool landscape, the signal has moved away from the model headline. It now rests on two questions: how long an agent stays coherent and who verified it. MiMo Code, Arbor, and Claude Code’s subagents are early entrants in a contest the field has barely learned to score.

There are two things to watch next. One is whether independent runs of Agents’ Last Exam confirm or deflate the endurance claims now stacking up. The other is whether the harness hardens into the place where coding agents are bought and sold, a shift already underway. When the benchmark gets its first contested leaderboard, the figure buyers will check first is the step count an agent survives before a human steps in. That number turns a vendor promise into a procurement test.

In summary, the leaderboard that matters is the one graded on shipped work rather than self-reported wins, and that discipline benefits every team wiring agents into a delivery pipeline.
