We’ve been measuring AI wrong; why economically valuable work is the new benchmark
Adrian Bridgwater · 2026-06-15 · via The New Stack | DevOps, Open Source, and Cloud Native News

As the AI industry gradually builds standardization guidelines and systems, such as those overseen by the Tokenomics Foundation, it still needs a wider set of validated yardsticks for measuring the worth of any given model.

Nvidia recently pointed to AgentPerf from Artificial Analysis as a hardware benchmark for developers to compare systems for agentic AI. Models typically also list an MMLU benchmark score, also from Artificial Analysis.

But while pure performance is nice (if not essential) to have, software engineers and their business counterparts will ultimately want benchmarking tools that are calibrated to real-world business use case effectiveness.

Can agents perform economically valuable work?

A new benchmark surfaced on Thursday last week: Agent’s Last Exam (ALE), an agentic AI scoring measure based on an evaluation of Fable 5, GPT-5.5, Composer 2.5, and a selection of other frontier agent systems. The analysis measures whether AI agents can actually perform useful and effective work across 55 real-world occupations and more than 1,500 real-world tasks.

Leading the research group behind the project is Dawn Song, professor and doctor of philosophy in computer science at the University of California, Berkeley.

Song tells The New Stack that her group grounded Agent’s Last Exam in “economically valuable work” from the real labor market, rather than in some abstract benchmark design.

We’ve been measuring AI wrong

“Everyone wants to know when AI agents will become job-ready,” Song says. “The problem is that we have not been measuring what’s needed to answer this question. Every task in ALE originates from work that a domain expert actually performed in a business, production, or research setting.”

In many cases, ALE asked how long the task took and what level of expertise it required, allowing the tool to estimate the labor value associated with completing it. In that sense, she says, this is not inventing some arbitrary notion of value – it is evaluating work that organizations already pay people to do.
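
As a rough illustration of that kind of estimate, the sketch below multiplies the hours a task took by an hourly rate for the expertise level it required. The rate table, the level names, and the `Task` fields are hypothetical placeholders, not values or a schema taken from ALE.

```python
from dataclasses import dataclass

# Hypothetical hourly rates (USD) per expertise level; not figures from ALE.
HOURLY_RATE_USD = {"junior": 40.0, "mid": 75.0, "senior": 120.0}


@dataclass
class Task:
    name: str
    hours: float      # how long a human expert took
    expertise: str    # key into HOURLY_RATE_USD


def labor_value(task: Task) -> float:
    """Estimate labor value as hours worked times the rate for that expertise."""
    return task.hours * HOURLY_RATE_USD[task.expertise]


tasks = [
    Task("reconcile quarterly ledger", hours=6.0, expertise="mid"),
    Task("audit firmware build", hours=10.0, expertise="senior"),
]
total = sum(labor_value(t) for t in tasks)
print(total)  # 6*75 + 10*120 = 1650.0
```

A real estimate would need wage data per occupation and region; the point here is only that time and expertise together yield a dollar figure that can be compared across tasks.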

“Most benchmarks evaluate isolated skills: answering questions, solving math problems, writing code snippets, or navigating toy environments,” explains Song. “Businesses, however, do not hire people to solve such benchmark questions. They hire people to perform real-world work. As agents become increasingly capable, evaluating real-world work is no longer optional – it’s table stakes.”

Song’s group found the results of initial ALE analyses both “impressive and sobering”, largely because today’s agents can solve a “meaningful fraction” of professional tasks, but clearly have limitations.

When the team looked at the hardest work tasks, those that require sustained reasoning, deep domain expertise, and reliable execution over long horizons, it found that agents are still far from human-level performance. On ALE’s hardest tier, every frontier agent tested, including Fable 5, achieved a 0% success rate.

“Evaluating agents based on economically valuable work provides a common language for comparing progress across systems and understanding where AI can genuinely augment or automate human labor,” says Song. 

Economics is only one dimension of value

That said, she underlines that economic value is “only one dimension of value.” For many operational and professional tasks, labor time and expertise provide a “reasonable proxy” because compensation is closely tied to the work being performed. But, she explains, there are important domains where this breaks down.

Research is a good example: some projects may consume years of effort and produce little impact, while a single breakthrough can create enormous value. In those cases, hours worked and wages paid are poor measures of ultimate contribution.

The ALE team has surmised that “there is no universally best agent” and every frontier model, including Fable 5, has domains where it shines and domains where it struggles. Song has said that the real signal lies in where agents succeed, where they fail, and how those patterns differ across domains.

Mix of models is a mindful maxim

On identical tasks, different models often fail for very different reasons, so does she advocate a mix of models as the most prudent approach to adopt?

“In the near term, yes. If a software engineering team is deploying agents in production today, using a mix of models is often the most practical approach. Different frontier models have different strengths and cost-performance characteristics, and routing tasks to the model that performs best at a certain cost for a given domain is simply good engineering,” Song clarifies.

She further notes that one lesson from ALE is that performance varies significantly not just across models, but across occupations and task types. That makes model diversity particularly valuable today. The question is not which model is best overall, but which model is best for a given class of economically valuable work.
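
One way to picture that kind of routing is a lookup over per-domain results: for each class of work, pick the model with the highest success rate among those that fit a cost budget. The model names, domains, success rates, and costs below are invented for illustration and are not ALE results.

```python
from typing import NamedTuple, Optional


class Result(NamedTuple):
    model: str
    domain: str
    success_rate: float   # fraction of tasks solved, 0.0 to 1.0
    cost_per_task: float  # USD, hypothetical


# Invented numbers for illustration only; not ALE results.
RESULTS = [
    Result("model-a", "accounting", 0.62, 1.80),
    Result("model-b", "accounting", 0.55, 0.40),
    Result("model-a", "devops", 0.31, 2.10),
    Result("model-b", "devops", 0.47, 0.90),
]


def route(domain: str, max_cost: float) -> Optional[str]:
    """Return the best-performing model for a domain within the cost budget."""
    candidates = [
        r for r in RESULTS
        if r.domain == domain and r.cost_per_task <= max_cost
    ]
    if not candidates:
        return None
    # Highest success rate wins; the cheaper model breaks ties.
    best = max(candidates, key=lambda r: (r.success_rate, -r.cost_per_task))
    return best.model


print(route("accounting", max_cost=2.00))  # model-a
print(route("accounting", max_cost=1.00))  # model-b
print(route("devops", max_cost=2.00))      # model-b
```

Tightening the budget changes the answer for the same domain, which is the cost-performance trade-off Song describes.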

For scenarios where an agent only operates in the terminal, the group has also released ALE-CLI, a CLI-only subset of the benchmark. The research group behind ALE is a mix of PhD students and postdoctoral researchers. Song is also director of the campus-wide Berkeley Center for Responsible Decentralized Intelligence (RDI), which has led the research behind Agent’s Last Exam.

“The age of useful agents is here. The age of truly job-ready agents is not,” stated Song.

The hope is that ALE will serve as a “new guidepost and north star” for developing agents capable of reliably performing economically valuable work across a broad range of domains.
