Your AI Coding ROI Is Disappearing and Your Dashboard Won't Tell You

The dashboard looks great. The delivery numbers don't.

Your AI coding dashboard looks great. Acceptance rate up. Lines generated up. Developer satisfaction scores up. Your team is thrilled. Management is impressed. The slide deck practically writes itself.

Now ask a different question: has your cycle time improved? Has your post-merge defect rate gone down? Has your review burden per PR decreased?

If you don't know the answers to those questions, you don't know if AI is helping. You know your team feels good. That's not the same thing.

Engineering leaders are measuring AI coding ROI with the wrong instruments. The metrics that are easy to capture look great. The metrics that would tell you whether the AI is actually making your team more effective are mostly going unmeasured. And that gap is where AI investments are disappearing.

The Metrics Everyone Uses (And Why They're Misleading)

Lines of code generated and autocomplete acceptance rate are the default starting points for most AI coding dashboards. They're easy to pull, easy to trend, and easy to show in a QBR. They are also almost entirely useless as productivity signals.

These metrics reward volume, not quality. You can 10x both numbers and slow your team down. Bigger is not better when it comes to code (unless it's "lines of code removed from the codebase"). More lines means more surface area to review, more places for bugs to hide, and more cognitive load for every engineer who touches the code after the author. More to maintain. More to refactor later. The AI doesn't know it's supposed to be frugal. It is, by definition, generative (it's in the name!). Measuring how much it generates and celebrating when the number goes up is like measuring how many ingredients your chef used and calling it a restaurant review. The best dishes are all about quality ingredients and phenomenal execution -- so too with code.

Developer satisfaction is the sneakiest misleading metric of the three. People love feeling fast. The sensation of code appearing faster than you can type it is genuinely mind-blowing, and addictive. (I've considered starting a 12-step program for coding agent users, and it's not a joke. I've counted 17 open terminal windows on my desktop, each working a DIFFERENT project. It's not rare for me to want to start 'just one more' process long after I should be sleeping. That's definitely a conversation for another post!) It feels like productivity, but it often isn't.

There's a well-documented cognitive bias at play here: when a tool makes early-stage work feel effortless, people systematically rate their overall productivity higher, even when downstream costs eat the gains. The DORA 2025 data makes this concrete at scale: teams nearly doubled their PR merge rate and reported high enthusiasm about AI tools, while organizational delivery metrics stayed flat [1]. Satisfaction scores captured the feeling. The delivery numbers told a different story.

Time to first commit is the third common trap. It measures the wrong finish line. A commit that took 10 minutes to generate but 3 hours to review and 2 days to hunt down and fix the bugs it introduced did not save time. It shifted costs downstream and made them invisible to the metric that was being tracked. You look fast on the front end. The system slows down on the back end. Nobody connects the two. I wrote about this "waterbed problem" some months ago -- I'll include the link at the end of the article if you'd like to read further.

The Numbers You're Ignoring

The research is not subtle about this problem.

DORA's 2025 State of DevOps report found that AI tools increased tasks completed by 21% and PRs merged by 98% [1]. Those are the numbers that end up in the AI vendor case study. Here's what doesn't: organizational delivery metrics stayed flat. More PRs merged. Same delivery performance. The throughput increased. The outcomes didn't follow.

That finding deserves a moment. Organizations nearly doubled their PR merge rate and saw no improvement in delivery. Something in the system was absorbing all the gains. The code was moving faster into the pipeline ... but the pipeline wasn't getting faster.

On quality: CodeRabbit analyzed 470 real-world PRs in December 2025 and found that AI-generated code produces 1.7 times more issues overall and 1.4 times more critical issues than human-authored code [2]. Veracode's data is sharper: AI-generated code contains 2.74 times more security vulnerabilities, with a 45% security flaw rate overall and 72% when just the Java code was reviewed [3].

And on confidence: only 3.8% of developers report both low hallucination rates and high confidence shipping AI-generated code without human review [4]. The other 96.2% are, at minimum, uncertain. Many are doing substantial review work that isn't being measured anywhere.

The PR Size Problem Nobody Is Talking About

DORA 2025 found that AI tools consistently increased PR size by 154% [1].

That is important -- PR size is not a neutral variable. Larger PRs are harder to review. Review quality degrades as PR size increases. Reviewers shift from actually understanding the changes to pattern-matching for obvious errors. Bugs slip through not because reviewers are bad at their jobs but because human attention has limits and a 600-line PR is a different cognitive task than a 400-line one.

You code faster but your pipeline chokes. The AI generates more code per session. That code lands in larger PRs. Those PRs take longer to review and are reviewed less carefully. More issues make it through to merge. Post-merge defect rates climb. Incident rates follow.

This is a systems problem. You optimized one node in the pipeline and degraded the downstream nodes. The metric you were watching (lines generated, PRs merged) went up. The metric you should have been watching (cycle time, defect rate) didn't.

The bottleneck didn't disappear. It moved. And most teams don't have the measurement infrastructure to see where it went.

What to Measure Instead

Four metrics. These aren't exotic. Most engineering teams can instrument them.

Cycle time, commit to deploy. Not commit to commit, not task started to PR opened. Commit to deploy. This captures the full pipeline cost including review time, CI/CD wait time, and any rework loops. If AI is genuinely accelerating delivery, this number should move. If it's flat or growing while PR volume increases, you have the same problem DORA documented.
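Here is a minimal sketch of that calculation, assuming you can export each PR's first-commit and production-deploy timestamps from your own tooling. The record layout, field names, and sample values are hypothetical, not taken from any particular platform.

```python
from datetime import datetime
from statistics import median


def cycle_time_hours(first_commit_at: str, deployed_at: str) -> float:
    """Hours from a PR's first commit to its production deploy (ISO 8601 strings)."""
    start = datetime.fromisoformat(first_commit_at)
    end = datetime.fromisoformat(deployed_at)
    return (end - start).total_seconds() / 3600


# Hypothetical export: one record per deployed PR.
prs = [
    {"id": 101, "first_commit_at": "2025-01-06T09:00:00", "deployed_at": "2025-01-07T15:00:00"},
    {"id": 102, "first_commit_at": "2025-01-06T10:00:00", "deployed_at": "2025-01-06T16:00:00"},
    {"id": 103, "first_commit_at": "2025-01-07T08:00:00", "deployed_at": "2025-01-10T08:00:00"},
]

hours = [cycle_time_hours(p["first_commit_at"], p["deployed_at"]) for p in prs]
median_cycle_time = median(hours)
print(f"Median commit-to-deploy: {median_cycle_time:.1f} hours")
```

The sketch uses the median rather than the mean so that one PR stuck in review for a week doesn't swamp the trend.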

Post-merge defect rate, segmented by AI-assisted versus human-authored code. This is the quality signal that autocomplete acceptance rate completely misses. Track bugs filed against features and fixes, tag the originating PRs, and compare defect rates across code origin. The CodeRabbit and Veracode numbers suggest you will find a meaningful difference. That difference has a cost you can now put a number on.
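A rough sketch of the segmentation, assuming each bug in your tracker can be linked back to the PR that introduced it. The `origin` and `caused_by_pr` fields are hypothetical names for illustration, and the sample data is made up.

```python
from collections import defaultdict


def defect_rate_by_origin(prs, defects):
    """Return {origin: defects per merged PR} for each code-origin tag."""
    origin_of = {p["id"]: p["origin"] for p in prs}
    merged = defaultdict(int)
    bugs = defaultdict(int)
    for p in prs:
        merged[p["origin"]] += 1
    for d in defects:
        origin = origin_of.get(d["caused_by_pr"])
        if origin is not None:  # skip bugs that can't be traced to a tagged PR
            bugs[origin] += 1
    return {origin: bugs[origin] / count for origin, count in merged.items()}


# Hypothetical data: eight merged PRs and three traced bugs.
prs = [
    {"id": 1, "origin": "ai-assisted"}, {"id": 2, "origin": "ai-assisted"},
    {"id": 3, "origin": "ai-assisted"}, {"id": 4, "origin": "ai-assisted"},
    {"id": 5, "origin": "human-authored"}, {"id": 6, "origin": "human-authored"},
    {"id": 7, "origin": "human-authored"}, {"id": 8, "origin": "human-authored"},
]
defects = [{"caused_by_pr": 1}, {"caused_by_pr": 3}, {"caused_by_pr": 6}]

rates = defect_rate_by_origin(prs, defects)
print(rates)
```

Defects per merged PR is only one possible denominator. Since AI-assisted PRs tend to be larger, defects per thousand changed lines may be the fairer comparison for your codebase.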

Review burden per PR. Time to first review, number of review iterations, and reviewer time spent. This tells you whether the code landing in review is ready to review. If AI-generated PRs are consuming disproportionate reviewer attention, that's a real cost that isn't showing up anywhere in your current dashboard.
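As a sketch, the three review signals can be averaged per tag like this. It assumes you already collect hours to first review, review iterations, and reviewer minutes for each PR. All field names and numbers below are hypothetical.

```python
from statistics import mean

FIELDS = ("hours_to_first_review", "review_iterations", "reviewer_minutes")


def review_burden(prs):
    """Average each review signal per code-origin tag."""
    groups = {}
    for p in prs:
        groups.setdefault(p["origin"], []).append(p)
    return {
        origin: {field: mean(p[field] for p in group) for field in FIELDS}
        for origin, group in groups.items()
    }


# Hypothetical review records, one per PR.
prs = [
    {"origin": "ai-assisted", "hours_to_first_review": 6, "review_iterations": 3, "reviewer_minutes": 90},
    {"origin": "ai-assisted", "hours_to_first_review": 10, "review_iterations": 5, "reviewer_minutes": 150},
    {"origin": "human-authored", "hours_to_first_review": 4, "review_iterations": 2, "reviewer_minutes": 45},
    {"origin": "human-authored", "hours_to_first_review": 2, "review_iterations": 2, "reviewer_minutes": 35},
]

burden = review_burden(prs)
for origin, stats in burden.items():
    print(origin, stats)
```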

Rework rate within 30 days. How much AI-generated code gets substantially rewritten within a month of merge? Code that has to be redone isn't a cost savings. It's a deferral. The initial PR looked like velocity. The rewrite is where you pay it back, with interest.
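One way to put a number on this, sketched under a big assumption: that you can attribute later line changes back to the PR that originally added those lines (for example, from blame data). The definition used here, reworked lines divided by lines originally added, is my working definition for illustration, not a standard one. The field names and sample data are hypothetical.

```python
from datetime import date, timedelta


def rework_rate(pr, rewrites, window_days=30):
    """Share of a PR's added lines that were rewritten within the window after merge."""
    cutoff = pr["merged_on"] + timedelta(days=window_days)
    reworked = sum(
        r["lines"]
        for r in rewrites
        if r["pr_id"] == pr["id"] and pr["merged_on"] < r["on"] <= cutoff
    )
    # Cap at 1.0 in case the same lines are rewritten more than once.
    return min(reworked / pr["lines_added"], 1.0)


# Hypothetical PR and the later changes that touched its lines.
pr = {"id": 42, "merged_on": date(2025, 3, 1), "lines_added": 200}
rewrites = [
    {"pr_id": 42, "on": date(2025, 3, 10), "lines": 50},
    {"pr_id": 42, "on": date(2025, 3, 25), "lines": 30},
    {"pr_id": 42, "on": date(2025, 4, 15), "lines": 60},  # outside the 30-day window
]

rate = rework_rate(pr, rewrites)
print(f"Rework within 30 days: {rate:.0%}")
```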

Implementing the Shift

This doesn't require a new platform. It requires tagging.

Start by tagging PRs by AI involvement. The simplest version: developers mark PRs as AI-assisted, AI-generated, or human-authored. You don't need perfect granularity to start seeing signal.
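If your PR tool supports labels, the tag can be read straight off them. A minimal sketch, assuming developers apply one of three label strings. The label names are my own choice for illustration, not a convention of any platform.

```python
from enum import Enum
from typing import Iterable, Optional


class Origin(Enum):
    AI_GENERATED = "ai-generated"
    AI_ASSISTED = "ai-assisted"
    HUMAN = "human-authored"


def origin_from_labels(labels: Iterable[str]) -> Optional[Origin]:
    """Map a PR's labels to an Origin, or None if the PR was never tagged."""
    normalized = {label.strip().lower() for label in labels}
    # If a PR carries more than one origin label, the most AI-heavy one wins.
    for origin in (Origin.AI_GENERATED, Origin.AI_ASSISTED, Origin.HUMAN):
        if origin.value in normalized:
            return origin
    return None


print(origin_from_labels(["bug", "AI-Assisted"]))
print(origin_from_labels(["bug"]))
```

Returning `None` for untagged PRs, rather than defaulting them to human-authored, keeps missing tags visible so they don't quietly skew the comparison.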

Then run a 60-day baseline on the four metrics above, segmented by those tags. You will probably see what the research predicts: AI-assisted code moves faster into the pipeline and creates more downstream work. The net effect on cycle time will depend on how your specific team and codebase absorb that tradeoff.
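The baseline itself can be as simple as a windowed, per-tag summary. This sketch does it for cycle time only, assuming each PR record already carries its tag, merge date, and commit-to-deploy hours. Field names and data are hypothetical.

```python
from datetime import date, timedelta
from statistics import median


def baseline(prs, start, days=60):
    """Per-tag PR count and median cycle time for PRs merged in the window."""
    end = start + timedelta(days=days)
    groups = {}
    for p in prs:
        if start <= p["merged_on"] < end:
            groups.setdefault(p["origin"], []).append(p["cycle_hours"])
    return {
        origin: {"prs": len(hours), "median_cycle_hours": median(hours)}
        for origin, hours in groups.items()
    }


# Hypothetical tagged PRs; the last one merged after the window closed.
prs = [
    {"origin": "ai-assisted", "merged_on": date(2025, 1, 10), "cycle_hours": 20},
    {"origin": "ai-assisted", "merged_on": date(2025, 2, 3), "cycle_hours": 44},
    {"origin": "ai-assisted", "merged_on": date(2025, 2, 20), "cycle_hours": 30},
    {"origin": "human-authored", "merged_on": date(2025, 1, 15), "cycle_hours": 26},
    {"origin": "human-authored", "merged_on": date(2025, 2, 10), "cycle_hours": 22},
    {"origin": "ai-assisted", "merged_on": date(2025, 3, 15), "cycle_hours": 10},
]

summary = baseline(prs, start=date(2025, 1, 1))
for origin, stats in summary.items():
    print(origin, stats)
```

Keeping the PR count next to each median matters: a segment with a handful of PRs is an anecdote, not a baseline.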

The point isn't to prove AI doesn't work. Some teams will find it does, clearly and measurably. The point is to get honest about where the value is and where the costs are landing. Right now most engineering leaders are flying on instruments that measure activity, not outcomes. You can't optimize what you're not measuring.

Stop celebrating PR volume. Start measuring what happens after the PR.

One practical starting point: pick one team, one sprint, and instrument cycle time and post-merge defects by PR tag. You'll have more signal from that one experiment than from three months of acceptance rate data.

Two more things to track across this same timeframe are token volume and token cost. Track both: cost per token has dropped, but that trajectory could change soon as OpenAI gears up to go public and as the business model of subsidized tokens grows less and less tenable. Tracking costs allows legitimate ROI conversations. Tracking token count allows comparison over time as pricing changes.
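A small sketch of why both numbers matter, using made-up figures for two quarters. The function name, field names, and all values are hypothetical.

```python
def token_metrics(tokens: int, cost_usd: float, merged_prs: int) -> dict:
    """Normalize token spend so periods with different pricing can be compared."""
    return {
        "tokens_per_pr": tokens / merged_prs,
        "cost_per_pr": cost_usd / merged_prs,
        "cost_per_million_tokens": cost_usd * 1_000_000 / tokens,
    }


# Hypothetical quarters: the per-token price falls between Q1 and Q2.
q1 = token_metrics(tokens=400_000_000, cost_usd=2_000.0, merged_prs=200)
q2 = token_metrics(tokens=600_000_000, cost_usd=2_400.0, merged_prs=240)

print("Q1:", q1)
print("Q2:", q2)
```

In this made-up case, cost per PR is flat at $10, so a cost-only view says nothing changed. Token use per PR rose 25% (2.0M to 2.5M), and a falling price hid it. If prices rise later, the token trend is what tells you how exposed you are.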

The Bottom Line

The metrics most teams are using to measure AI coding ROI are measuring effort and sentiment. They are not measuring delivery performance. They are not measuring quality. They are not measuring whether the system your engineers are embedded in is getting faster or slower, and are not tracking whether any actual improvements have measurable ROI.

DORA found that teams nearly doubled their PR merge rate while delivery outcomes stayed flat [1]. CodeRabbit found 1.7 times more issues in AI-generated code [2]. Veracode found 2.74 times more security vulnerabilities [3]. Developer satisfaction scores climbed while cycle time stayed flat. The dashboard looked great. The numbers didn't lie--they just measured the wrong things.

Measure cycle time. Measure post-merge defects. Measure review burden. Measure rework. If AI is helping your team deliver better software faster, those numbers will tell you. If it's helping your team feel productive while shifting costs downstream, those numbers will tell you that too.

Measure token count and cost. This is the only way to determine actual ROI.

The dashboard that tells you what you want to hear is not a monitoring system. It's a press release.

If this resonated, here are some related articles:

For how AI coding changes what engineers actually need to do: What "100% of Our Code Is Written by AI" Actually Means | Substack
For how AI adoption creates downstream chaos when only one team speeds up: The AI Bullwhip: What The Beer Game Teaches Us About Uneven AI Adoption | Substack
For how AI coding is reshaping the software development process itself: The Irony of AI Development: How Context Engineering Is Taking Us Back to Waterfall | Substack
For what skills actually make engineers productive with AI: The Best AI Engineers Are Product Managers | Substack

References

What metrics are you using to evaluate AI coding tools in your org? Curious whether teams are seeing the same disconnect between activity metrics and delivery outcomes. Drop your experience in the comments.

Keith MacKay is a technology strategy consultant and CTO in EY-Parthenon's Software Strategy Group (SSG), specializing in AI disruption and technology diligence for private equity and corporate clients. SSG's AI Disruption Lab conducts rapid assessments of how AI transforms and threatens existing business models and value chains. Keith teaches at Northeastern University and writes about strategy, management, and AI/technology, with an AI collaborator.
