



























Six months ago, every CTO was concerned their team wasn't using enough tokens. That trend has reversed as token usage and AI spend have skyrocketed. Engineering leaders are now trying to figure out how to measure actual output, because not every token delivers real value. Some save engineering hours and accelerate projects; others are wasted on useless sessions and bad prompting.
For any organization of sufficient scale, it's essentially impossible to measure value over thousands of sessions & billions of tokens. We set out to automate this with AI.
Predicting real ROI is hard, for reasons we’ll go into later. So we focused on a key sub-task: estimating how many productive engineering hours each Devin session is worth.
LLMs are notoriously bad at time estimates — so we expected this to be a struggle. After carefully tuning the system, however, our model has an rlogr_{log} of 0.740.74 and appears to be unbiased. Individual predictions aren’t perfect, but the model is good enough to be used for estimating aggregated totals. They’re also convertible to dollar amounts using engineering salaries, getting us closer to business value.
In our system, an agent reviews each completed Devin session — first classifying whether it produced useful output, then estimating how long a human engineer would have taken to produce the same work. We validated it by asking human engineers how long they would have spent on the same tasks.
The system is now running with customers. To our knowledge, this is the first automated system measuring AI engineering productivity in production.
How did we land on productive engineering hours as our metric? The first question we needed to answer was what to estimate. Ideally, we'd measure dollar impact directly, such as revenue attributable to features shipped or costs avoided by bugs fixed. In practice, this is still an unsolved problem in our field. It’s incredibly hard for an engineer to know how many dollars of business value they created through the PRs shipped last week.
On the other end of the spectrum, we could measure raw activity: lines of code, commits, PRs, tokens consumed. These are easy to collect but don't correspond to effort. A mechanical refactor can touch thousands of lines in an afternoon; a two-line bug fix can represent hours of investigation. Many valuable tasks — triaging bugs, running analytics queries, reviewing code — produce no code at all.
The middle ground we decided to measure is human engineering hours: how long would a human engineer have taken to produce the same output? Hours are already how organizations value engineering work — salaries and contractor rates are denominated in time. When leadership evaluates an investment, they think in terms of time and cost savings. Hours are standardized across organizations, independent of business context, and convertible to dollars via engineering rates.
But not all hours are equal. For example, if all PRs created by a session were closed, it likely wasn’t valuable. We wanted to measure only productive engineering hours. So, we also had to build a system to classify whether the sessions were actually productive.
We collected a ground-truth dataset by asking Devin users to review recent representative sessions, and estimate how long each completed session would have taken without Devin. Our dataset consists of 258258 sessions from 126126 users across a diverse set of enterprise customers. We collected the data via live interviews and a survey.
Every Devin session has a full execution trace: the user's request, every action taken, the resulting code, codebase context. This gives us a record of production engineering work at a level of detail that is difficult to obtain from surveys, aggregate activity metrics, or open-source benchmarks alone.
In the charts below we analyze our dataset. Our dataset consists of a distribution representing real enterprise workloads, spanning a variety of languages, frameworks, session types, and hour estimates.
As we reviewed our dataset, we realized that not all sessions correspond to useful work, and that we would need to build a classifier to filter out unproductive sessions.
For sessions with a PR, this is relatively straightforward: if any PR from the session is merged, we include the estimate; if not, we discard it. This is slightly lossy; sessions with all closed PRs can still have delivered productive work, but we wanted to err conservative.
Sessions without a PR are more complicated. We built a classifier to filter out unproductive sessions, which removes around 1−20%1-20\% of sessions, depending on the customer. Many non-PR sessions are genuinely productive — e.g., finding unused dependencies, scanning for security vulnerabilities, reviewing a pull request, running analytics queries, triaging a bug — and we retain those. We discard sessions where the agent lacked access to carry out the task, sessions where the agent asked for clarification and the user never replied, and other scenarios where Devin was unable to meaningfully advance the task.
After we identified which sessions were productive, we needed to actually estimate their equivalent engineering hours. To do this, we built an estimator agent with two key components: context and prompts.
On the context side, we included as much information about the session as possible — the user's messages, the PR produced (if applicable), the full agent trace (viewing logs, tracing through code, fixing lint, etc.), and additional codebase context from DeepWiki.
On the prompt side, we set aside 2525 sessions as a development set for iteration. By manually triaging agent runs on these sessions and reasoning critically about failure modes, we arrived at the following design principles: credit only the work Devin actually saved and compare Devin against a conservative human reference. Concretely, we:
On the held-out evaluation set of 233233 sessions, our estimator has an rlogr_{log} of 0.740.74 and rlog2r_{log}^2 of 0.540.54. The correlation is highly statistically significant (F(1,231)=279.9F(1,231) = 279.9, p<10−5p < 10^{-5}). Around 50%50\% of sessions fall within a factor of 22 of the true estimate. Individual estimates are noisy — 22–33× errors in either direction are common — but because errors are roughly unbiased and independent, they cancel as session count grows and the aggregate converges toward the human-reported total.
A lot of the noise comes from variance between users, both in how they estimate and in genuine differences in speed. Roughly half the residual disagreement lies between users rather than within a user's own sessions ( ICC=0.58ICC =0.58). We considered per-user calibration, for example, prompting users in-product to give a few estimates for bootstrapping. We decided against it for simplicity and since, for our purpose of aggregation, estimating relative to an "average" user is sufficient.
Our initial, uncalibrated model consistently underestimated. To correct this, we fit a linear regression in log-space: h=2.28×m0.923h = 2.28 \times m^{0.923}. This is close to a constant multiplier, with a slope slightly below one. Constraining the slope to one (a single multiplicative constant of 2.082.08) changes every metric by at most 0.010.01. Residuals by bucket show no systematic trend after calibration.
Even after this correction, the total of the human estimates remains 1.4×1.4\times the total of the corrected model estimates. This gap is expected: an estimator that is unbiased in log-space becomes biased once its predictions are summed in linear space, systematically underestimating the total. To see why, consider a simplified estimator that is equally likely to be off by a factor of two in either direction. When it predicts 22 hours, the true value is equally likely to be 11 hour or 44 hours, giving an expected value of 1+42=2.5\frac{1+4}{2} = 2.5 — 25%25\% above the prediction. Summing such predictions therefore underestimates the true total. Rather than apply a further correction for this, we report the unadjusted figure as a deliberately conservative underestimate.
We also tested simpler predictors to understand how much of the signal comes from the final code change versus the full Devin session. The first regresses a single scalar, the total lines changed (additions + deletions summed across all PRs in the session), against our human estimates. It performed poorly, with an Rlog2R^2_{log} of 0.270.27, confirming that code volume is a weak proxy for engineering effort. We then evaluated an estimator agent given only the trace of the agent's edit tool calls as context, with no user messages or other session activity. It performed better, but still lagged the full estimator, suggesting that important signal lives outside the diff.
These results match how engineering work actually happens. The effort in a session often comes from investigation, diagnosis, environment setup, reasoning through tradeoffs, or producing useful non-code outputs. Those signals are visible in the session trace (user messages, actions taken, intermediate observations, and codebase context) but not always in the final code change.
The regression results below are for the 129129 sessions in our evaluation dataset for which we have line-count data (the number of lines added and deleted across the session's pull requests). A total of 170170 sessions in our evaluation dataset created PRs, but our integrations with some of our customer's git hosting platforms do not capture diff statistics.

Two recent studies have also used LLMs for effort estimation, both positively confirming the feasibility of this approach. METR (2026) used a combination of GPT-4o and GPT-5 to estimate the human-equivalent times from compressed Claude Code transcripts. These transcripts were collected from 77 METR technical staff. On 3434 sessions labeled on human ground truth, their estimator had an rlogr_{log} of 0.830.83 . Our rlogr_{log} is lower likely due to our data being collected from a much more diverse set of users.
Anthropic (2026) estimated task duration on 1,0001,000 open-source Jira tickets using Claude, but the estimator only had the ticket title and description to work with. They had an rlogr_{log} of 0.460.46; human developers estimating the same tickets reached rlog=0.67r_{log} = 0.67. Our system establishes stronger correlation than Anthropic’s ( rlog=0.74r_{log} = 0.74 vs 0.460.46 ) because we have far more granular data per session. As we’ve shown in our evaluation experiments, that significantly improves the accuracy.
We presented a system for estimating the engineering output delivered by an autonomous coding agent, measured in equivalent human engineering hours. Validated against 126126 users across eight deployments, the estimator has an rlogr_{log} of 0.740.74 on held-out sessions. Individual estimates are noisy but approximately unbiased; aggregated across a deployment, errors cancel and the total converges toward what engineers report. The system is calibrated to underestimate rather than overestimate delivered output. It is currently in use with Devin customers.
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。