Observations on AI agent token consumption
speckx · 2026-05-18 · via Hacker News - Newest: "AI"

A new paper from researchers at Stanford, Michigan, DeepMind, All Hands, Microsoft AI and MIT is the most detailed open empirical study I’ve seen of how AI agents actually spend tokens at scale [1]. The authors run eight frontier models across 500 SWE-bench Verified tasks with four runs each, capturing full trajectory telemetry decomposed by token type, phase and action. They release the dataset alongside the paper, which is to my knowledge the most granular public corpus of agentic trajectories currently available.

The paper is rigorous, careful about what it claims and puts hard numbers on questions that have until now only been answered with anecdotes. I’d recommend reading it in full.

What follows is a walk through four of the paper’s observations, interleaved with what we are seeing at Flowstate, where the same patterns surface in customer environments. We sit in the request path between the user and the AI provider, which means we observe the same trajectories the paper analyses, but in production, across a much broader set of AI tools than SWE-bench covers.

The two sets of observations are remarkably close. The researchers measured it on a benchmark; we see it on customer devices. The agreement between the two is what makes this paper so useful for anyone trying to actually manage this spend.

Input tokens dominate agentic spend

The paper’s first finding is that agentic coding consumes around 1,000 times more tokens than equivalent code-chat or code-reasoning tasks, with an input-to-output ratio of roughly 153:1 (against 1.33:1 for chat and 0.16:1 for reasoning) [2].

The reason is structural. Agentic workflows accumulate context across rounds, and the same content is fed back into the model on every single turn. Token caching helps at the margins, but the sheer volume of accumulated context dominates the cost.
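The accumulation effect described above can be sketched in a few lines. This is a toy model, not the paper's methodology: the function name, the turn sizes, and the caching parameters (`cached_fraction`, `cache_discount`) are all illustrative assumptions, and real provider caching rules differ.

```python
def cumulative_input_tokens(turn_sizes, cached_fraction=0.0, cache_discount=0.0):
    """Effective input tokens when every turn re-sends the full history.

    turn_sizes: new tokens added to the context on each turn (hypothetical).
    cached_fraction: share of the re-sent history served from cache (assumed).
    cache_discount: price reduction on cached tokens, 0.0 to 1.0 (assumed).
    """
    total = 0.0
    history = 0
    for new_tokens in turn_sizes:
        # Everything from earlier turns is fed back in, partly at a cached rate.
        cached = history * cached_fraction
        uncached = history - cached
        total += uncached + cached * (1.0 - cache_discount) + new_tokens
        history += new_tokens
    return total


turns = [2_000] * 50          # 50 turns, each adding 2,000 tokens of context
fresh = sum(turns)            # cost if each piece of context were sent only once
no_cache = cumulative_input_tokens(turns)
with_cache = cumulative_input_tokens(turns, cached_fraction=0.9, cache_discount=0.9)
print(fresh, no_cache, round(with_cache))
```

With these made-up numbers, 100,000 tokens of actual content turn into 2,550,000 input tokens once the history is re-sent on every turn. Even a generous cache assumption only brings that down to roughly 565,500, which is why caching helps at the margins but does not change the shape.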

This is the exact pattern we see in non-agentic AI usage as well. Chat-style usage of Claude, ChatGPT and similar tools follows the same shape because users continue conversations across days rather than starting fresh sessions with explicit context. One customer described it to us this way:

“We think they’re creating PowerPoints, and then they’re like, ‘change this word on slide three’, and then they’re just continuing to generate these really large documents.”

That is the paper’s finding in human form. A chat session that should have been a fresh prompt becomes a thread that re-pays for its entire history on every turn. The user thinks they are making one small edit. The model is being asked to re-process the entire document. The vendor charges accordingly.

The implication is that a massive share of controllable AI cost sits upstream of the model. Better prompts. Fresh sessions. Explicit context provided once, rather than constructed iteratively over an afternoon. The agent’s behaviour is largely a consequence of how it was set up.

Model choice produces order-of-magnitude cost differences

On the 230 SWE-bench tasks that every tested model successfully solved, Kimi-K2 and Claude Sonnet 4.5 used on average 1.5 million more tokens than GPT-5 [3]. Same problems, same correct answers, vastly different token appetites.

The paper is careful to rule out the obvious explanation: the cost gap persists on both the shared-success subset and the shared-failure subset. The more expensive models were not tackling harder problems. They were simply spending more tokens on the same problems.

This matches a behaviour we observe consistently. Users default to whichever model is most prominent in the UI, and “most prominent” typically means most expensive. Opus when Sonnet would have done the job. Vendors have no commercial incentive to route users toward cheaper models. From another customer conversation:

“We definitely know that people are using just all Opus. The people that are using up their tokens, they’ll continue to do that unless there’s a way to control it. We did not know there was a way to control that in Claude. I know there isn’t.”

There is a way to control it, but it doesn’t live in the vendor’s product. The natural place for it is the layer that can see the task category and route at the request level: boilerplate to the leaner model, long-form planning to the heavier one. The Stanford finding that token efficiency is a property of the model rather than the task is precisely what makes routing viable. If heavier models only burned more tokens on harder problems, routing would be useless. They don’t, so it isn’t.
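A minimal sketch of request-level routing by task category, assuming a proxy that can already classify the task. The category names, model tier labels, and the `Request` type are hypothetical placeholders, not any vendor's API or Flowstate's implementation.

```python
from dataclasses import dataclass

# Hypothetical routing table: task category -> model tier.
ROUTES = {
    "boilerplate": "lean-model",
    "long_form_planning": "heavy-model",
}
# Assumption: unknown categories fall back to the cheaper tier.
DEFAULT_MODEL = "lean-model"


@dataclass(frozen=True)
class Request:
    user: str
    task_category: str
    requested_model: str  # whatever the UI defaulted to


def route(request: Request) -> str:
    """Pick a model from the task category rather than the UI default."""
    return ROUTES.get(request.task_category, DEFAULT_MODEL)


req = Request(user="alice", task_category="boilerplate", requested_model="heavy-model")
print(route(req))
```

The design choice worth noting is that the decision keys on the task, not on the model the user happened to select. That only makes sense because the token gap between models is not explained by task difficulty.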

Token usage is highly variable and difficult to predict

The paper’s third observation is that total token cost varies by up to 30x across task instances, and that even four runs of the same model on the same task diverge: the most expensive run on a given problem costs roughly twice the cheapest on average [4]. As cost goes up, predictability goes down.

More pointedly: the authors test whether agents can predict their own token usage before executing a task. They find correlations of at best 0.39, and all eight models systematically underestimate [5]. Even the agent does not know what a task will cost.
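For readers who want to run the same checks on their own logs, here is a rough sketch of the three statistics involved: the max-to-min spread across repeated runs, the Pearson correlation between predicted and actual usage, and the share of predictions that fall below the actual value. All numbers below are invented for illustration, and the helper names are my own rather than anything from the paper's code.

```python
from math import sqrt


def max_min_ratio(run_costs):
    """Spread across repeated runs of one task: most expensive over cheapest."""
    return max(run_costs) / min(run_costs)


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def underestimate_rate(predicted, actual):
    """Fraction of tasks where the prediction was below the actual usage."""
    return sum(p < a for p, a in zip(predicted, actual)) / len(actual)


# Invented token totals (millions) for four runs of three tasks.
runs = {
    "task_a": [1.0, 1.4, 2.0, 1.2],
    "task_b": [3.0, 6.0, 4.5, 3.3],
    "task_c": [0.5, 0.5, 1.0, 0.8],
}
spread = {task: max_min_ratio(costs) for task, costs in runs.items()}

# Invented self-predictions versus what the tasks actually used.
predicted = [0.8, 2.5, 0.4]
actual = [1.4, 4.2, 0.7]
print(spread, pearson(predicted, actual), underestimate_rate(predicted, actual))
```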

What we see on the customer side is leadership trying to manage spend with the only data available to them, usually a chat count from a vendor admin dashboard:

“What are these five people doing? They’re always saying they don’t have enough tokens.”

A chat count does not answer this. A token count answers how much was spent, but not why. The “why” is structurally invisible from the invoice. You can only see it at the request layer, where the actual work is observable. No amount of upfront forecasting will close the gap, because the work itself is stochastic.

Higher cost does not deliver higher accuracy

The paper segments runs into cost quartiles and finds that accuracy peaks at the second-cheapest quartile and plateaus from there. The most expensive runs do not deliver better outcomes than modestly priced ones [6].
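The quartile analysis is easy to reproduce on any set of run logs. The sketch below assumes a single global split by cost into four equal buckets, which may not match how the paper segments its runs, and it uses invented data.

```python
def accuracy_by_cost_quartile(runs):
    """runs: list of (cost, solved) pairs, with at least four entries.

    Returns four accuracies, cheapest quartile first.
    """
    ordered = sorted(runs, key=lambda r: r[0])
    n = len(ordered)
    accuracies = []
    for q in range(4):
        bucket = ordered[q * n // 4:(q + 1) * n // 4]
        solved = sum(1 for _, ok in bucket if ok)
        accuracies.append(solved / len(bucket))
    return accuracies


# Invented runs: (cost in dollars, whether the task was solved).
sample = [
    (0.4, False), (0.5, False), (0.6, True),   # cheapest quartile
    (0.9, True), (1.0, False), (1.1, True),
    (1.6, True), (1.8, True), (2.0, False),
    (3.5, False), (4.0, True), (5.2, True),    # most expensive quartile
]
print(accuracy_by_cost_quartile(sample))
```

In this made-up sample, accuracy rises from the first bucket to the second and then stays flat, which is the shape the paper reports. The numbers themselves carry no evidential weight.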

The authors trace this to a specific behavioural pattern: in the highest-cost quartile, repeated file modifications are roughly 4x more frequent than in the cheapest quartile, and repeated file views are 2x as frequent [7]. The expensive runs are not doing more work. They are doing the same work, again, on the same files.
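One simple way to surface this pattern in a trajectory log is to count how often the same action hits the same file more than once. The definition of "repeated" here (any occurrence beyond the first for an action and path pair) is my assumption, not necessarily the paper's, and the trajectory is invented.

```python
from collections import Counter


def repeat_counts(trajectory):
    """trajectory: list of (action, path) pairs in execution order.

    Returns, per action type, how many steps repeated an earlier
    identical action on the same file.
    """
    occurrences = Counter(trajectory)
    repeats = Counter()
    for (action, _path), count in occurrences.items():
        repeats[action] += count - 1
    return repeats


# Invented trajectory for a single run.
steps = [
    ("view", "a.py"), ("edit", "a.py"), ("view", "a.py"),
    ("edit", "a.py"), ("edit", "a.py"), ("view", "b.py"),
]
print(repeat_counts(steps))
```

Comparing these counts between cheap and expensive runs of the same task is a quick check for whether extra spend is buying new exploration or just another pass over the same files.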

The paper politely describes this as “unproductive exploration rather than deeper reasoning.” We see the same shape in non-coding AI usage. Repeated regeneration of the same artefact with marginal changes. Long-running sessions where the user disengaged hours ago. Identical prompts re-issued after a typo correction. None of these are agent failures; they are user-driven patterns the agent inherits.

The measurement gap

None of the patterns above can be addressed at scale without measurement at the request layer. Vendor dashboards aggregate by tool and tenant. AI gateways (proxies that sit between you and your AI provider) cover server-side production routing. Engineering effectiveness tools cover narrow coding assistants and stop there.

We built Flowstate because the measurement layer required to actually act on these patterns didn’t exist anywhere in the stack.

Flowstate observes every AI call a user makes, whether it’s ChatGPT in the browser, Claude Code in the terminal, Midjourney for images or Suno for audio, and ties each call back to a user, project, model and cost class [8]. Customers keep their own contracts and their own API keys with every provider they use. We don’t sell access to AI and we don’t restrict which tools people can reach for.
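To make the attribution idea concrete, here is a generic sketch of a per-call record and a rollup over any of its dimensions. The `CallRecord` fields and the `rollup` helper are hypothetical and are not Flowstate's actual schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    # Hypothetical fields for one observed AI request.
    user: str
    project: str
    model: str
    cost_class: str
    input_tokens: int
    output_tokens: int


def rollup(records, key):
    """Total tokens grouped by one attribute, e.g. 'user' or 'model'."""
    totals = defaultdict(int)
    for record in records:
        totals[getattr(record, key)] += record.input_tokens + record.output_tokens
    return dict(totals)


# Invented calls for illustration.
calls = [
    CallRecord("alice", "deck", "heavy-model", "premium", 120_000, 800),
    CallRecord("alice", "deck", "heavy-model", "premium", 150_000, 900),
    CallRecord("bob", "api", "lean-model", "standard", 4_000, 1_200),
]
print(rollup(calls, "user"))
print(rollup(calls, "model"))
```

The same records answer both "how much" and "why": grouping by user shows who is spending, while grouping by project or model shows what the spend is going towards.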

That architectural position has consequences beyond cost measurement. The same instrumentation that surfaces token wastage also surfaces patterns that matter for security. Prompts containing customer PII heading out to a consumer AI tool. Source code pasted into ChatGPT. Employees running side projects on the company subscription. We see these in the field solely because the request layer is the only place they are visible.

The Stanford paper makes a clean economic case from a benchmark. Our observations make the exact same case from real corporate environments. The patterns driving AI cost are large, measurable and consistent. You just need the plumbing to see them.


Footnotes

  1. Bai, L., Huang, Z., Wang, X., Sun, J., Mihalcea, R., Brynjolfsson, E., Pentland, A., and Pei, J. (2026). How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks. arXiv:2604.22750v2. The authors acknowledge concurrent work on token distribution in multi-agent systems (Salim et al. 2026; Wang et al. 2025) and pricing dynamics in reasoning models (Chen et al. 2026), but the combination of scale, granularity and open data release makes this paper the most useful one I’ve seen for understanding what agentic spend actually looks like. The authors also publish a project website with the trajectory dataset, an analysis code repo for replicating the figures, and a fun interactive Can You Guess the Token Cost? game that drives the paper’s headline finding home in about thirty seconds.

  2. Bai et al., Figure 1. Agentic coding averages 4.17M tokens per task and $1.86 in cost, against 3.39k tokens for code-chat tasks and 1.19k tokens for single-turn code reasoning. On those figures the ratio against chat is roughly 1,200x and the ratio against reasoning is roughly 3,500x, so the 1,000x headline is a conservative round number.

  3. Bai et al., Figure 6 and Section 4. Section 4 specifically addresses the “harder tasks naturally cost more” objection by showing the gap persists on the shared-success subset (n=230 tasks solved by every tested model). The authors describe the difference as “model-specific behaviour rather than intrinsic task difficulty.”

  4. Bai et al., Figure 2a and 2b. Up to 30x variance across instances; on the same task across four runs, the most expensive run costs roughly 2x the cheapest on average.

  5. Bai et al., Figure 10 and Figure 11. Best correlation across all eight models is 0.39 (Claude Sonnet 4.5, output tokens). Input-token prediction is uniformly worse than output-token prediction. Every model underestimates systematically; Figure 11 shows predictions clustering well below the diagonal across the board.

  6. Bai et al., Figure 3b. Accuracy increases significantly from the cheapest to the second-cheapest quartile, then plateaus. The third and fourth quartiles are not statistically distinguishable from the second.

  7. Bai et al., Figure 4 and Appendix A. Mixed-effects regression coefficients of roughly 4x for repeated modifications and 2x for repeated views at the highest-cost quartile, both significant at p < 0.001 against the minimum-cost group, controlling for model identity. Output-token analysis in the appendix shows the same pattern.

  8. Flowstate. I co-founded it, so the obvious conflict-of-interest disclosure applies.