Estimating the Productivity of an Autonomous AI Software Engineer | Cognition
weltview · 2026-06-15 · via Hacker News - Newest: "AI"

Six months ago, every CTO was concerned their team wasn't using enough tokens. That trend has reversed as token usage and AI spend have skyrocketed. Engineering leaders are now trying to figure out how to measure actual output, because not every token delivers real value. Some save engineering hours and accelerate projects; others are wasted on useless sessions and bad prompting.

For any organization of sufficient scale, it's essentially impossible to manually measure value across thousands of sessions and billions of tokens. We set out to automate this with AI.

Predicting real ROI is hard, for reasons we’ll go into later. So we focused on a key sub-task: estimating how many productive engineering hours each Devin session is worth.

LLMs are notoriously bad at time estimates, so we expected this to be a struggle. After carefully tuning the system, however, our model has an $r_{log}$ of 0.74 and appears to be unbiased. Individual predictions aren’t perfect, but the model is good enough to be used for estimating aggregated totals. The estimates are also convertible to dollar amounts using engineering salaries, getting us closer to business value.

In our system, an agent reviews each completed Devin session — first classifying whether it produced useful output, then estimating how long a human engineer would have taken to produce the same work. We validated it by asking human engineers how long they would have spent on the same tasks.

The system is now running with customers. To our knowledge, this is the first automated system measuring AI engineering productivity in production.

Choosing a Metric

How did we land on productive engineering hours as our metric? The first question we needed to answer was what to estimate. Ideally, we'd measure dollar impact directly, such as revenue attributable to features shipped or costs avoided by bugs fixed. In practice, this is still an unsolved problem in our field. It’s incredibly hard for an engineer to know how many dollars of business value they created through the PRs shipped last week.

On the other end of the spectrum, we could measure raw activity: lines of code, commits, PRs, tokens consumed. These are easy to collect but don't correspond to effort. A mechanical refactor can touch thousands of lines in an afternoon; a two-line bug fix can represent hours of investigation. Many valuable tasks — triaging bugs, running analytics queries, reviewing code — produce no code at all.

The middle ground we decided to measure is human engineering hours: how long would a human engineer have taken to produce the same output? Hours are already how organizations value engineering work — salaries and contractor rates are denominated in time. When leadership evaluates an investment, they think in terms of time and cost savings. Hours are standardized across organizations, independent of business context, and convertible to dollars via engineering rates.

But not all hours are equal. For example, if all PRs created by a session were closed, it likely wasn’t valuable. We wanted to measure only productive engineering hours. So, we also had to build a system to classify whether the sessions were actually productive.

Collecting a Dataset

We collected a ground-truth dataset by asking Devin users to review recent representative sessions and estimate how long each completed session would have taken without Devin. Our dataset consists of 258 sessions from 126 users across a diverse set of enterprise customers. We collected the data via live interviews and a survey.

Every Devin session has a full execution trace: the user's request, every action taken, the resulting code, codebase context. This gives us a record of production engineering work at a level of detail that is difficult to obtain from surveys, aggregate activity metrics, or open-source benchmarks alone.

The charts below break down our dataset, which reflects real enterprise workloads spanning a variety of languages, frameworks, session types, and hour estimates.

Filtering for Useful Work

As we reviewed our dataset, we realized that not all sessions correspond to useful work, and that we would need to build a classifier to filter out unproductive sessions.

For sessions with a PR, this is relatively straightforward: if any PR from the session is merged, we include the estimate; if not, we discard it. This is slightly lossy, since sessions whose PRs were all closed can still have delivered productive work, but we wanted to err on the conservative side.

Sessions without a PR are more complicated. We built a classifier to filter out unproductive sessions, which removes roughly 1% to 20% of sessions, depending on the customer. Many non-PR sessions are genuinely productive (e.g., finding unused dependencies, scanning for security vulnerabilities, reviewing a pull request, running analytics queries, triaging a bug), and we retain those. We discard sessions where the agent lacked access to carry out the task, sessions where the agent asked for clarification and the user never replied, and other scenarios where Devin was unable to meaningfully advance the task.
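The filtering rule above can be sketched in a few lines. This is a minimal illustration, not our production code: the `Session` record, its field names, and the PR state strings are hypothetical, and the non-PR classifier is represented only by its boolean verdict.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Session:
    """Hypothetical record of one completed session (field names are assumptions)."""
    estimated_hours: float
    pr_states: List[str] = field(default_factory=list)  # e.g. "merged", "closed", "open"
    classifier_productive: Optional[bool] = None  # classifier verdict for non-PR sessions


def is_productive(session: Session) -> bool:
    """Apply the conservative filter: merged-PR rule first, classifier otherwise."""
    if session.pr_states:
        # PR sessions: keep the estimate only if at least one PR was merged.
        return any(state == "merged" for state in session.pr_states)
    # Non-PR sessions: defer to the classifier; a missing verdict counts as unproductive.
    return bool(session.classifier_productive)


def productive_hours(sessions: List[Session]) -> float:
    """Sum estimated hours over the sessions that pass the filter."""
    return sum(s.estimated_hours for s in sessions if is_productive(s))
```

Treating a missing classifier verdict as unproductive is a choice made for this sketch; it follows the same conservative bias as discarding sessions whose PRs were all closed.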

Building the Estimator

After we identified which sessions were productive, we needed to actually estimate their equivalent engineering hours. To do this, we built an estimator agent with two key components: context and prompts.

On the context side, we included as much information about the session as possible — the user's messages, the PR produced (if applicable), the full agent trace (viewing logs, tracing through code, fixing lint, etc.), and additional codebase context from DeepWiki.

On the prompt side, we set aside 25 sessions as a development set for iteration. By manually triaging agent runs on these sessions and reasoning critically about failure modes, we arrived at the following design principles: credit only the work Devin actually saved and compare Devin against a conservative human reference. Concretely, we:

  • Reason about the human's path instead of the agent's. The agent's own trajectory is sometimes a poor proxy for human effort. Agents take detours, recover from environment and setup failures, and produce artifacts like summary reports that a solo engineer wouldn't. The estimator reasons about the path a human would have taken to the same output, discounting things like retries, environment setup, and artifacts the agent produced along the way.
  • Credit only the work the user did not specify. The same output can represent very different amounts of effort depending on how much the user already figured out before asking Devin. The estimator looks at what the user actually said to determine how much of the problem Devin had to solve on its own. For example, if a user comes with a bug report and no proposed fix, we include the time to triage the bug, whereas if the user comes with an implementation plan, we only count the implementation time.
  • Account for codebase familiarity. The same task can take minutes or a day depending on how well the engineer knows the code. Users told us this is one of Devin’s clearest advantages: tasks in unfamiliar or legacy codebases that would cost a developer a day of ramp-up are often delivered quickly. The estimator infers the relevant level of familiarity from what the session reveals. When the user asks how parts of the system work, it includes the exploration time that orienting in the code would have required. When there's no signal either way, it assumes typical familiarity: an engineer who knows the high-level architecture but hasn't memorized every function.
  • Assume relevant expertise. A recurring theme in our interviews was that the agent lets people do things they could not have done on their own — a backend engineer delivering work that would previously have required frontend and data-science colleagues. Crediting that cross-disciplinary reach would inflate estimates, so we conservatively assume the reference engineer already has the expertise the task demands. This understates the effort in many cases where a human would first have had to learn an unfamiliar language or framework.

Evaluation

On the held-out evaluation set of 233 sessions, our estimator has an $r_{log}$ of 0.74 and an $r_{log}^2$ of 0.54. The correlation is highly statistically significant ($F(1, 231) = 279.9$, $p < 10^{-5}$). Around 50% of sessions fall within a factor of 2 of the human estimate. Individual estimates are noisy (errors of 2× to 3× in either direction are common), but because errors are roughly unbiased and independent, they cancel as session count grows and the aggregate converges toward the human-reported total.
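For readers who want to reproduce these metrics on their own data, here is a minimal sketch. It assumes $r_{log}$ is the Pearson correlation between log-transformed human and model hours, which the text implies but does not define explicitly; the function names are illustrative.

```python
import numpy as np


def log_space_metrics(human_hours, model_hours):
    """Return (r_log, r_log squared, share of sessions within a factor of 2)."""
    log_h = np.log(np.asarray(human_hours, dtype=float))
    log_m = np.log(np.asarray(model_hours, dtype=float))
    # Pearson correlation of the log-transformed estimates (assumed definition of r_log).
    r_log = float(np.corrcoef(log_h, log_m)[0, 1])
    # A prediction is within a factor of 2 when the absolute log error is at most log(2).
    within_2x = float(np.mean(np.abs(log_h - log_m) <= np.log(2.0)))
    return r_log, r_log ** 2, within_2x


def f_statistic(r, n):
    """F statistic of a one-predictor linear regression on n points: F(1, n - 2)."""
    return (r ** 2 / (1.0 - r ** 2)) * (n - 2)
```

Plugging the rounded values r = 0.74 and n = 233 into `f_statistic` gives roughly 280, in line with the reported F(1, 231) = 279.9.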

A lot of the noise comes from variance between users, both in how they estimate and in genuine differences in speed. Roughly half the residual disagreement lies between users rather than within a user's own sessions (ICC = 0.58). We considered per-user calibration, for example by prompting users in-product to give a few estimates for bootstrapping. We decided against it, both for simplicity and because, for our purpose of aggregation, estimating relative to an "average" user is sufficient.

Our initial, uncalibrated model consistently underestimated. To correct this, we fit a linear regression in log-space: $h = 2.28 \times m^{0.923}$, where $m$ is the uncalibrated model estimate and $h$ is the corresponding human estimate, both in hours. This is close to a constant multiplier, with a slope slightly below one. Constraining the slope to one (a single multiplicative constant of 2.08) changes every metric by at most 0.01. Residuals by bucket show no systematic trend after calibration.
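A log-space calibration of this kind can be sketched with an ordinary least-squares fit on log-transformed hours. The helper names below are illustrative, and the code is a sketch of the general technique rather than our exact fitting procedure; only the coefficients 2.28 and 0.923 come from the fit described above.

```python
import numpy as np


def fit_log_calibration(model_hours, human_hours):
    """Fit log(h) = log(a) + b * log(m) by least squares and return (a, b)."""
    log_m = np.log(np.asarray(model_hours, dtype=float))
    log_h = np.log(np.asarray(human_hours, dtype=float))
    # np.polyfit returns coefficients from highest degree down: slope, then intercept.
    b, log_a = np.polyfit(log_m, log_h, deg=1)
    return float(np.exp(log_a)), float(b)


def fit_constant_multiplier(model_hours, human_hours):
    """Slope constrained to one: the best multiplier is the geometric mean of h / m."""
    log_ratio = (np.log(np.asarray(human_hours, dtype=float))
                 - np.log(np.asarray(model_hours, dtype=float)))
    return float(np.exp(np.mean(log_ratio)))


def calibrate(model_hours, a=2.28, b=0.923):
    """Apply h = a * m**b; the defaults are the coefficients reported above."""
    return a * np.asarray(model_hours, dtype=float) ** b
```

With the reported coefficients, a raw estimate of 1 hour maps to 2.28 hours and a raw estimate of 8 hours maps to about 15.5 hours, so the correction shrinks slightly in relative terms as sessions get longer.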

Even after this correction, the total of the human estimates remains 1.4× the total of the corrected model estimates. This gap is expected: an estimator that is unbiased in log-space becomes biased once its predictions are summed in linear space, systematically underestimating the total. To see why, consider a simplified estimator that is equally likely to be off by a factor of two in either direction. When it predicts 2 hours, the true value is equally likely to be 1 hour or 4 hours, giving an expected value of $\frac{1+4}{2} = 2.5$ hours, 25% above the prediction. Summing such predictions therefore underestimates the true total. Rather than apply a further correction for this, we report the unadjusted figure as a deliberately conservative underestimate.
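The simplified two-point estimator above is easy to check numerically. The sketch below is a toy model under the stated assumption that every error is exactly a factor of two in either direction; it is not fitted to our data, and the function names are illustrative.

```python
import numpy as np


def expected_ratio_two_point(factor=2.0):
    """E[true / predicted] when the truth is predicted / factor or predicted * factor with equal odds."""
    return 0.5 * (1.0 / factor + factor)


def simulate_totals(predictions, factor=2.0, seed=0):
    """Draw one true value per prediction; return (sum of truths) / (sum of predictions)."""
    rng = np.random.default_rng(seed)
    predictions = np.asarray(predictions, dtype=float)
    # Each log error is +log(factor) or -log(factor) with equal probability,
    # so the estimator is unbiased in log-space by construction.
    signs = rng.choice([-1.0, 1.0], size=predictions.shape)
    truths = predictions * factor ** signs
    return float(truths.sum() / predictions.sum())
```

`expected_ratio_two_point()` returns 1.25, matching the 25% gap in the worked example, and `simulate_totals` over a large batch of 2-hour predictions lands close to the same ratio. Wider error distributions produce a larger gap.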

We also tested simpler predictors to understand how much of the signal comes from the final code change versus the full Devin session. The first regresses a single scalar, the total lines changed (additions + deletions summed across all PRs in the session), against our human estimates. It performed poorly, with an $R^2_{log}$ of 0.27, confirming that code volume is a weak proxy for engineering effort. We then evaluated an estimator agent given only the trace of the agent's edit tool calls as context, with no user messages or other session activity. It performed better, but still lagged the full estimator, suggesting that important signal lives outside the diff.

These results match how engineering work actually happens. The effort in a session often comes from investigation, diagnosis, environment setup, reasoning through tradeoffs, or producing useful non-code outputs. Those signals are visible in the session trace (user messages, actions taken, intermediate observations, and codebase context) but not always in the final code change.

The regression results below are for the 129 sessions in our evaluation dataset for which we have line-count data (the number of lines added and deleted across the session's pull requests). A total of 170 sessions in our evaluation dataset created PRs, but our integrations with some of our customers' Git hosting platforms do not capture diff statistics.

Comparison to Prior Work

Two recent studies have also used LLMs for effort estimation, both supporting the feasibility of this approach. METR (2026) used a combination of GPT-4o and GPT-5 to estimate human-equivalent times from compressed Claude Code transcripts. These transcripts were collected from 7 METR technical staff. On 34 sessions labeled with human ground truth, their estimator had an $r_{log}$ of 0.83. Our $r_{log}$ is lower, likely because our data was collected from a much more diverse set of users.

Anthropic (2026) estimated task duration on 1,000 open-source Jira tickets using Claude, but the estimator only had the ticket title and description to work with. They had an $r_{log}$ of 0.46; human developers estimating the same tickets reached $r_{log} = 0.67$. Our system achieves a stronger correlation than Anthropic’s ($r_{log} = 0.74$ vs. 0.46) because we have far more granular data per session. As we’ve shown in our evaluation experiments, that additional context significantly improves accuracy.

Threats to Validity

  • Ground truth bias. Our ground truth is self-reported. In our video interviews, users were aware they were talking to Cognition, which may have biased their estimates.
  • Sampling. Users who voluntarily responded may skew toward more engaged users. We asked for "representative" sessions, so abandoned or failed sessions might be underrepresented. Nonetheless, since we filter unproductive sessions before aggregating, the evaluation distribution should be closer to the filtered production population than to raw traffic.
  • Hours are not business value. We measure engineering capacity, not whether that capacity was deployed on high-value work. One hour spent fixing a critical production bug has very different business value than one hour spent on a project that eventually gets canceled. We also don't capture second-order effects, like freeing engineers for higher-leverage tasks.
  • Hours don't account for quality. If the agent introduces a subtle bug that takes a long time to debug and fix, its uplift on that task is negative. The merged-PR filter removes the clearest failures but not defects discovered after merging.

Conclusion

We presented a system for estimating the engineering output delivered by an autonomous coding agent, measured in equivalent human engineering hours. Validated against 126 users across eight deployments, the estimator has an $r_{log}$ of 0.74 on held-out sessions. Individual estimates are noisy but approximately unbiased; aggregated across a deployment, errors cancel and the total converges toward what engineers report. The system is calibrated to underestimate rather than overestimate delivered output. It is currently in use with Devin customers.