How To Measure If AI Agents Actually Improve Developer Productivity

In 2025, a research nonprofit called METR ran a careful experiment. They took 16 experienced open-source developers, gave them 246 real tasks on codebases they'd worked in for years, and randomly let them use AI tools on some tasks and not others. Then they timed everything.

The developers expected AI to make them about 24% faster. After the study, they reported feeling about 20% faster.

They were actually 19% slower.

Read that again, because it's the whole problem in three numbers. The people doing the work were confident AI sped them up. The stopwatch said the opposite. And if those developers couldn't trust their own gut about whether AI was helping, your engineering org definitely can't trust a vibe in a planning meeting either.

So how do you actually tell? Not "does AI feel productive," because anyone will say yes, but "is this thing making the team ship better software faster, or just generating more motion?" That's a measurement question, and most of the ways people answer it are wrong. Let's fix that.

Why "are we faster?" is the wrong first question

The instinct, when you roll out Copilot or Cursor or a fleet of coding agents, is to ask one question: are we faster now? Find the number that proves it, put it on a slide, move on.

That single-number reflex is exactly what gets you into trouble. Productivity isn't one dimension, and the moment you compress it into one you start optimizing the compression instead of the thing.

The people who study this for a living have been saying so for years. When Nicole Forsgren and a team from Microsoft Research, GitHub, and the University of Victoria published the SPACE framework in ACM Queue in 2021, their entire opening argument was that developer productivity is multidimensional, and teams that try to capture it in a single number consistently make decisions on incomplete information.

AI makes this worse, not better. An AI agent can inflate almost any single metric you pick. Want more commits? It'll write them. More lines of code? Trivially. More pull requests? Sure. None of those tell you whether the product got better or the team got happier. So before picking what to measure, accept the premise: you need a small set of signals from different angles, and at least one of them has to be uncomfortable to game.

The metrics that lie to you

Here's the uncomfortable part. The metrics that are easiest to pull from your tools are the ones AI corrupts fastest.

Lines of code. The oldest bad metric in software, and AI revived it from the dead. An agent will happily produce 400 lines where a senior engineer would've written 40. More code isn't output, it's liability you now have to read, test, and maintain. If your "productivity" went up because the diff sizes tripled, you didn't get faster. You got a bigger surface area to debug.

Pull requests merged. Feels meaningful: a PR is a unit of finished work, right? Except AI lowers the cost of opening a PR to near zero, so the count climbs while the value per PR quietly drops. You'll see "PRs merged up 90%" in vendor case studies. That number on its own tells you nothing about whether those PRs fixed real problems or just churned the codebase.

Suggestion acceptance rate. This is the one AI vendors love, because it's the one they can show you. "Developers accept 30% of suggestions!" Okay, and then how many of those accepted lines survive code review unchanged? How many get reverted next week? Acceptance is the start of the story, not the end. A developer can accept a suggestion, fight it for twenty minutes, and end up slower than if they'd typed it themselves. (That's roughly what happened to METR's developers.)

Commit frequency, keystrokes saved, time-in-editor. Activity metrics. They measure motion, not progress. A team can be furiously busy and shipping nothing that matters.

There's a name for why all of these fail: Goodhart's law, which says that when a measure becomes a target, it stops being a good measure. It was sharp before AI. With an agent that can generate infinite plausible-looking activity on demand, it's lethal. The instant your team learns that "PRs merged" is how AI ROI gets judged, you'll get more PRs and worse software.

The tell for a vanity metric is simple: ask "could an AI agent move this number without making anything actually better?" If yes, it's a vanity metric. Don't put it on the dashboard as a success measure. (It's fine as a diagnostic; more on that later.)

What actually moves the needle

Strip away the vanity metrics and you're left with a much shorter list of things that are genuinely hard to fake, because each one ties to an outcome a customer or a teammate actually feels.

Cycle time is the big one. How long from "started work on this" to "it's running in production"? Not how fast you typed, not how fast the first draft appeared, but the whole journey, including review, CI, and the rework that comes back from review. AI can shrink the first part dramatically and still leave cycle time flat, because the time it saved on writing gets eaten somewhere downstream. If your cycle time isn't dropping, your developers aren't shipping faster, no matter how fast the code appears in the editor.
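One way to see where the saved writing time goes is to split cycle time into stages. What follows is a minimal sketch, not a standard definition: the `PullRequest` record and its five timestamp fields are hypothetical names, so map them to whatever your Git host and deploy tooling actually record.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class PullRequest:
    # Hypothetical field names; map them to your own Git host / deploy data.
    first_commit: datetime
    pr_opened: datetime
    first_review: datetime
    merged: datetime
    deployed: datetime


def hours(start: datetime, end: datetime) -> float:
    """Elapsed wall-clock hours between two timestamps."""
    return (end - start).total_seconds() / 3600


def stage_medians(prs: list[PullRequest]) -> dict[str, float]:
    """Median hours spent in each stage, plus the end-to-end cycle time."""
    return {
        "coding": median(hours(p.first_commit, p.pr_opened) for p in prs),
        "waiting_for_review": median(hours(p.pr_opened, p.first_review) for p in prs),
        "review_and_rework": median(hours(p.first_review, p.merged) for p in prs),
        "merge_to_production": median(hours(p.merged, p.deployed) for p in prs),
        "cycle_time": median(hours(p.first_commit, p.deployed) for p in prs),
    }
```

If "coding" shrinks after an AI rollout while "cycle_time" stays flat, one of the middle stages absorbed the difference, and this breakdown tells you which.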

Review load. This is where AI's hidden cost usually lives. A reviewer can only read so much per day, and AI doesn't make humans read faster. Track three things here: average PR size, review latency (how long PRs wait for a review), and rework rate (how often a PR bounces back for changes). When AI floods the pipe with larger and more numerous PRs, review becomes the bottleneck, and it's a bottleneck you created by going "faster" upstream.
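Those three review-load numbers can be computed from a handful of per-PR fields. The sketch below assumes a hypothetical `ReviewedPR` record; the field names are illustrative, and "rework rate" is taken here to mean the share of PRs that received at least one change request, which is only one of several reasonable definitions.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class ReviewedPR:
    # Hypothetical fields; adapt them to what your Git host exposes.
    lines_changed: int            # additions + deletions
    hours_to_first_review: float  # PR opened -> first human review
    change_requests: int          # times a reviewer asked for changes


def review_load(prs: list[ReviewedPR]) -> dict[str, float]:
    """Average PR size, median review latency, and rework rate for a set of PRs."""
    return {
        "avg_pr_size": mean(p.lines_changed for p in prs),
        "median_review_latency_hours": median(p.hours_to_first_review for p in prs),
        # Share of PRs that bounced back at least once.
        "rework_rate": sum(p.change_requests > 0 for p in prs) / len(prs),
    }
```

Run it over the same window before and after a rollout; the absolute values matter less than the direction each one moves.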

Change failure rate and defect escape. What fraction of your deployments cause a problem that needs a hotfix, rollback, or patch? AI-generated code that passed a quick skim can carry subtle bugs: a plausible-looking error handler that swallows the wrong exception, a config that's almost right. If your change failure rate creeps up after adopting AI, that's the real cost of the speed you think you gained, and it's the one metric a vanity dashboard will never show you.
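Change failure rate is just a ratio once you decide what counts as a failed deployment. Here is a small sketch under the definition in the paragraph above; the `Deployment` record and its `needed_remediation` flag are hypothetical, and in practice the hard part is populating that flag honestly from incident and rollback data.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    # Hypothetical record: needed_remediation is True when the deployment
    # was followed by a hotfix, rollback, or patch.
    id: str
    needed_remediation: bool


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that needed a hotfix, rollback, or patch."""
    if not deployments:
        raise ValueError("no deployments in the window")
    failed = sum(d.needed_remediation for d in deployments)
    return failed / len(deployments)
```

Compute it for the baseline window and the post-rollout window separately, using the same failure definition in both, or the comparison is meaningless.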

Developer-reported friction. The squishy one, and the one teams skip, which is a mistake. Ask developers directly, on a regular cadence: how much of your week goes to deep work versus fighting tools? Is it easier or harder to ship than three months ago? Self-report has limits (see: those METR developers who felt faster while being slower), so never use it alone. But paired with the hard delivery numbers, it catches things metrics miss, like a team that's shipping fine but quietly burning out from reviewing a firehose of agent output.
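Pairing self-report with a measured number also lets you track the gap between the two, which is the thing the METR result exposed. A tiny sketch, with one interpretive assumption stated up front: the article's "20% faster" and "19% slower" are read here as percentage changes in task completion time, and the function name is made up for illustration.

```python
def perception_gap(perceived_time_change_pct: float, measured_time_change_pct: float) -> float:
    """Percentage points between what developers report and what was measured.

    Negative inputs mean "tasks took less time"; positive mean "tasks took longer".
    A positive result means people felt faster than the clock says they were.
    """
    return measured_time_change_pct - perceived_time_change_pct


# Figures quoted earlier in this article, read as changes in completion time
# (an interpretive assumption): felt about 20% faster, measured 19% slower.
gap = perception_gap(perceived_time_change_pct=-20.0, measured_time_change_pct=19.0)
print(f"perception gap: {gap:.0f} percentage points")
```

A gap near zero suggests the survey is tracking reality; a large one means the survey is measuring enthusiasm.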

Notice the shape of this list. Two of these are speed and flow, one is quality, one is human. That's not an accident: it's the multidimensional principle from SPACE, applied. No single number; a small basket that's hard to game in all directions at once.

Borrow a framework, don't invent one

You don't need to design a measurement system from scratch. Three well-tested ones already exist, and the smart move is to steal the parts that fit.

DORA (DevOps Research and Assessment) grew out of the State of DevOps research behind the book Accelerate (Forsgren, Humble, Kim, 2018) and has been part of Google since 2018. It's team-level and delivery-focused, built on four keys: deployment frequency, lead time for changes, change failure rate, and time to restore service. It's the gold standard for "is our delivery pipeline healthy," and it's deliberately blind to individuals, which is a feature.

SPACE (2021) is the wider lens. Five dimensions: Satisfaction and well-being, Performance, Activity, Communication and collaboration, Efficiency and flow. Its core rule is to never measure productivity from a single dimension; pull metrics from at least three. SPACE isn't a fixed list of numbers, it's a checklist for making sure your numbers aren't all measuring the same narrow thing.
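The "at least three dimensions" rule is easy to turn into a sanity check on a dashboard. The sketch below is illustrative only: which dimension a given metric belongs to is a judgment call, and the mappings in the example are assumptions rather than anything SPACE prescribes.

```python
SPACE_DIMENSIONS = {
    "satisfaction",   # Satisfaction and well-being
    "performance",
    "activity",
    "communication",  # Communication and collaboration
    "efficiency",     # Efficiency and flow
}


def dimensions_covered(metric_to_dimension: dict[str, str]) -> set[str]:
    """Return the distinct SPACE dimensions a set of metrics touches."""
    covered = set(metric_to_dimension.values())
    unknown = covered - SPACE_DIMENSIONS
    if unknown:
        raise ValueError(f"not a SPACE dimension: {sorted(unknown)}")
    return covered


def passes_space_rule(metric_to_dimension: dict[str, str], minimum: int = 3) -> bool:
    """True when the metrics span at least `minimum` different dimensions."""
    return len(dimensions_covered(metric_to_dimension)) >= minimum


# Illustrative mappings (assumptions, not an official classification).
activity_only = {"prs_merged": "activity", "commits": "activity", "lines_added": "activity"}
balanced = {
    "cycle_time": "efficiency",
    "change_failure_rate": "performance",
    "quarterly_pulse_survey": "satisfaction",
    "prs_merged": "activity",
}
print(passes_space_rule(activity_only), passes_space_rule(balanced))
```

The first dashboard fails because three metrics all sit in one dimension, which is exactly the narrow view the rule is meant to prevent.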

DX Core 4 (from the DX team, late 2024) tries to unify DORA, SPACE, and DevEx into four practical dimensions: Speed, Effectiveness, Quality, and Impact. Speed leans on "diffs per engineer," Quality reuses DORA's change failure rate, Impact introduces "percentage of time spent on new capabilities," and Effectiveness uses a survey-based Developer Experience Index (DXI). DX's own research suggests each one-point gain in DXI correlates with roughly 13 minutes saved per developer per week, a nice example of turning that squishy "friction" signal into something you can trend.
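To get a feel for what that 13-minute figure implies at team scale, here is the arithmetic as a short sketch. It assumes the relationship is linear and treats a correlation as if it were a conversion rate, so read the output as an order-of-magnitude estimate; the team size and DXI gain in the example are hypothetical.

```python
# The per-point figure attributed to DX's research in the paragraph above.
MINUTES_PER_DXI_POINT_PER_WEEK = 13


def weekly_hours_saved(dxi_gain_points: float, developers: int) -> float:
    """Rough hours per week associated with a DXI gain across a team.

    Assumes a linear relationship, which the underlying correlation does not guarantee.
    """
    return dxi_gain_points * MINUTES_PER_DXI_POINT_PER_WEEK * developers / 60


# Hypothetical example: a 3-point DXI gain across 100 developers.
print(weekly_hours_saved(3, 100))  # 65.0
```

That is 3 x 13 x 100 = 3,900 minutes, or 65 hours a week, which is enough to trend but not enough to promise anyone.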

Here's how they line up against what we said actually matters:

| What you want to know | DORA | SPACE | DX Core 4 |
| --- | --- | --- | --- |
| Are we shipping faster? | Lead time, deploy frequency | Efficiency & flow | Speed |
| Is quality holding? | Change failure rate, restore time | Performance | Quality |
| Are developers okay? | not covered | Satisfaction & well-being | Effectiveness (DXI) |
| Are we building the right things? | not covered | not covered | Impact |
| Guards against single-number traps? | Partly (4 keys) | Yes (explicit rule) | Yes (4 dimensions) |

Tip: Don't adopt all three. Pick DORA's four keys as your delivery backbone because they're battle-tested and hard to fake, then add one human signal (a SPACE-style satisfaction pulse or a DXI survey). That's a complete, AI-resistant picture for most teams. The framework police are not coming to your standup.

The reallocation trap

Now for the part that explains why AI productivity gains keep evaporating between the demo and the quarterly numbers.

AI is very good at one thing: making the creation of code cheaper. Typing the first draft, scaffolding a component, sketching a test. What it doesn't do is remove the work that comes after creation: understanding the change, reviewing it, verifying it's correct, and owning it when it breaks at 2am.

So the time doesn't disappear. It moves.

Google's 2025 DORA report put real data behind this. AI adoption among developers hit around 90%, and, reversing the previous year's gloomier finding, AI is now associated with higher delivery throughput. Good news. But the same report found AI still has a negative relationship with delivery stability. Teams generate more change, faster, and without strong testing and review practices to absorb it, that extra volume turns into instability downstream. Their framing is the one to remember: AI is an amplifier. It magnifies the strengths of healthy teams and the dysfunctions of struggling ones.

That's the reallocation trap in one sentence: the time you save writing code gets spent auditing it. If you only measure the creation step (acceptance rate, lines generated, "time to first draft"), you'll see a huge win and wonder why nothing ships faster. The win was real. It just got handed to your reviewers, your CI queue, and your on-call rotation.

This is also why measuring only individuals is dangerous. An AI agent can make one developer's personal output metrics soar while quietly increasing the load on everyone reviewing their PRs. The individual looks 2x. The team is flat or worse. Measure the system, not the seat.

A measurement setup you can actually run

Frameworks are nice. Here's how to turn this into something concrete without hiring a research team.

Start with a baseline before you scale up. This is the step everyone skips and then regrets. You can't prove AI changed anything if you don't know where you were. Pull at least a few weeks, ideally a couple of months, of your delivery numbers before a big rollout. The good news is most of this is already sitting in your Git host and CI logs. Cycle time, for instance, is mostly a query over PR timestamps:

cycle_time.sql

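-- Median hours from first commit to merge, by week (PostgreSQL syntax).
-- Merge stands in for "running in production" here; use a deploy timestamp if you record one.
-- Run this against your PR/commit warehouse before and after AI rollout.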
SELECT
  date_trunc('week', pr.merged_at)              AS week,
  percentile_cont(0.5) WITHIN GROUP (
    ORDER BY extract(epoch FROM pr.merged_at - first_commit.committed_at) / 3600
  )                                             AS median_cycle_hours,
  count(*)                                      AS prs
FROM pull_requests pr
JOIN LATERAL (
  SELECT min(committed_at) AS committed_at
  FROM commits c
  WHERE c.pr_id = pr.id
) first_commit ON true
WHERE pr.merged_at IS NOT NULL
GROUP BY 1
ORDER BY 1;

The exact schema doesn't matter. The point is that cycle time is a measurable, boring SQL query, not a survey. Run the same query in three months and you have a real before/after instead of a feeling.

Run a comparison, not just a trend. A plain before/after is vulnerable to confounders: maybe the team also got more senior, or the quarter was just calmer. If you can, do what METR did on a smaller scale. For a set of similar tasks, let AI be used on some and not others, and compare. You won't get a publishable RCT, but even a rough split is far more honest than "the number went up after we bought the tool, therefore the tool did it."
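A rough split like that can be analyzed in a few lines. The sketch below is not METR's analysis; it is a simple stand-in that randomly assigns tasks, compares median completion times, and uses a bootstrap to show how wide the uncertainty is. All function names are made up for illustration, and the 95% level and resample count are arbitrary defaults.

```python
import random
from statistics import median


def assign_tasks(task_ids: list[str], seed: int = 0) -> dict[str, str]:
    """Randomly mark each task as 'ai' or 'no_ai' before work starts."""
    rng = random.Random(seed)
    return {task: rng.choice(["ai", "no_ai"]) for task in task_ids}


def time_ratio(ai_hours: list[float], no_ai_hours: list[float]) -> float:
    """Median AI-task time divided by median no-AI-task time.

    Below 1.0 means AI tasks finished faster; above 1.0 means slower.
    """
    return median(ai_hours) / median(no_ai_hours)


def bootstrap_interval(
    ai_hours: list[float],
    no_ai_hours: list[float],
    resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float]:
    """Rough 95% percentile interval for the ratio, resampling with replacement."""
    rng = random.Random(seed)
    ratios = sorted(
        time_ratio(
            rng.choices(ai_hours, k=len(ai_hours)),
            rng.choices(no_ai_hours, k=len(no_ai_hours)),
        )
        for _ in range(resamples)
    )
    return ratios[int(0.025 * resamples)], ratios[int(0.975 * resamples) - 1]
```

If the interval comfortably straddles 1.0, the honest conclusion is "we can't tell yet," which is still more useful than a confident slide. Assigning tasks before anyone looks at them matters: letting developers choose which tasks get AI reintroduces the bias the split was meant to remove.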

Always pair a speed number with a counterweight, hard or soft. Cycle time dropped? Great. But did defect rate climb to pay for it? PRs are up? Fine, but are reviewers drowning? A single metric moving is a question, not an answer. The whole reason for the multidimensional approach is that gaming one number usually shows up as damage in another, provided you're watching the other one.

Watch for the reallocation, specifically. Add review latency and rework rate to your dashboard on day one. They're your early-warning system for the trap above. If creation-side metrics improve while review latency climbs, you've found exactly where your AI gains are going.
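That early-warning check can be automated against your baseline. A minimal sketch, assuming you already collect three numbers per period under the hypothetical keys `hours_to_first_draft`, `review_latency_hours`, and `rework_rate`; the 10% threshold is an arbitrary starting point, not a recommended value.

```python
def pct_change(before: float, after: float) -> float:
    """Percentage change from a baseline value to a current value."""
    return (after - before) / before * 100


def reallocation_warning(
    baseline: dict[str, float],
    current: dict[str, float],
    threshold_pct: float = 10.0,
) -> bool:
    """True when drafting got faster while review got slower or more rework-heavy."""
    creation_faster = (
        pct_change(baseline["hours_to_first_draft"], current["hours_to_first_draft"])
        <= -threshold_pct
    )
    review_slower = (
        pct_change(baseline["review_latency_hours"], current["review_latency_hours"])
        >= threshold_pct
    )
    more_rework = (
        pct_change(baseline["rework_rate"], current["rework_rate"]) >= threshold_pct
    )
    return creation_faster and (review_slower or more_rework)
```

A `True` here doesn't mean the rollout failed. It means the savings moved downstream, and review capacity or test coverage is where to look next.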

Keep vanity metrics as diagnostics, not scorecards. Acceptance rate and PR count aren't useless; they're just not success measures. They tell you whether people are using the tool and how the work is shaped. Track them to understand behavior. Never use them to declare victory.

The honest answer

Here's the thing the METR study really teaches, and it isn't "AI makes developers slower." Their result was a snapshot of specific tools, expert developers, and codebases they knew cold, and they were careful to say it doesn't generalize to every setting. (Their 2026 follow-up already shows different numbers.) The durable lesson is smaller and more useful: perception is not measurement. Smart, experienced people were confidently, measurably wrong about their own productivity. The only thing that caught it was a stopwatch and a control group.

Your team is not special enough to be the exception. So if you're rolling out AI agents and someone asks "is it working?", don't answer with how it feels, and don't answer with the metric your vendor put on a slide. Answer with cycle time, review load, change failure rate, and what your developers actually tell you, measured against a baseline you bothered to capture.

That's more work than nodding along to "everyone says it's faster." It's also the only way you'll ever know.

Go capture your baseline before your next rollout. You can't get it back later.

Originally published at nazarboyko.com.
