How To Measure If AI Agents Actually Improve Developer Productivity
Nazar Boyko · 2026-06-21 · via DEV Community

In 2025, a research nonprofit called METR ran a careful experiment. They took 16 experienced open-source developers, gave them 246 real tasks on codebases they'd worked in for years, and randomly let them use AI tools on some tasks and not others. Then they timed everything.

The developers expected AI to make them about 24% faster. After the study, they reported feeling about 20% faster.

They were actually 19% slower.

Read that again, because it's the whole problem in three numbers. The people doing the work were confident AI sped them up. The stopwatch said the opposite. And if those developers couldn't trust their own gut about whether AI was helping, your engineering org definitely can't trust a vibe in a planning meeting either.

So how do you actually tell? Not "does AI feel productive," because anyone will say yes, but "is this thing making the team ship better software faster, or just generating more motion?" That's a measurement question, and most of the ways people answer it are wrong. Let's fix that.

Why "are we faster?" is the wrong first question

The instinct, when you roll out Copilot or Cursor or a fleet of coding agents, is to ask one question: are we faster now? Find the number that proves it, put it on a slide, move on.

That single-number reflex is exactly what gets you into trouble. Productivity isn't one dimension, and the moment you compress it into one you start optimizing the compression instead of the thing.

The people who study this for a living have been saying so for years. When Nicole Forsgren and a team from Microsoft Research, GitHub, and the University of Victoria published the SPACE framework in ACM Queue in 2021, their entire opening argument was that developer productivity is multidimensional, and teams that try to capture it in a single number consistently make decisions on incomplete information.

AI makes this worse, not better. An AI agent can inflate almost any single metric you pick. Want more commits? It'll write them. More lines of code? Trivially. More pull requests? Sure. None of those tell you whether the product got better or the team got happier. So before picking what to measure, accept the premise: you need a small set of signals from different angles, and at least one of them has to be uncomfortable to game.

The metrics that lie to you

Here's the uncomfortable part. The metrics that are easiest to pull from your tools are the ones AI corrupts fastest.

Lines of code. The oldest bad metric in software, and AI revived it from the dead. An agent will happily produce 400 lines where a senior engineer would've written 40. More code isn't output, it's liability you now have to read, test, and maintain. If your "productivity" went up because the diff sizes tripled, you didn't get faster. You got a bigger surface area to debug.

Pull requests merged. Feels meaningful: a PR is a unit of finished work, right? Except AI lowers the cost of opening a PR to near zero, so the count climbs while the value per PR quietly drops. You'll see "PRs merged up 90%" in vendor case studies. That number on its own tells you nothing about whether those PRs fixed real problems or just churned the codebase.

Suggestion acceptance rate. This is the one AI vendors love, because it's the one they can show you. "Developers accept 30% of suggestions!" Okay, and then how many of those accepted lines survive code review unchanged? How many get reverted next week? Acceptance is the start of the story, not the end. A developer can accept a suggestion, fight it for twenty minutes, and end up slower than if they'd typed it themselves. (That's roughly what happened to METR's developers.)

Commit frequency, keystrokes saved, time-in-editor. Activity metrics. They measure motion, not progress. A team can be furiously busy and shipping nothing that matters.

There's a name for why all of these fail: Goodhart's law, which says that when a measure becomes a target, it stops being a good measure. It was sharp before AI. With an agent that can generate infinite plausible-looking activity on demand, it's lethal. The instant your team learns that "PRs merged" is how AI ROI gets judged, you'll get more PRs and worse software.

The tell for a vanity metric is simple: ask "could an AI agent move this number without making anything actually better?" If yes, it's a vanity metric. Don't put it on the dashboard as a success measure. (It's fine as a diagnostic, more on that later.)

What actually moves the needle

Strip away the vanity metrics and you're left with a much shorter list of things that are genuinely hard to fake, because each one ties to an outcome a customer or a teammate actually feels.

Cycle time is the big one. How long from "started work on this" to "it's running in production"? Not how fast you typed, not how fast the first draft appeared, but the whole journey, including review, CI, and the rework that comes back from review. AI can shrink the first part dramatically and still leave cycle time flat, because the time it saved on writing gets eaten somewhere downstream. If your cycle time isn't dropping, your developers aren't shipping faster, no matter how fast the code appears in the editor.

Review load. This is where AI's hidden cost usually hides. A reviewer can only read so much per day, and AI doesn't make humans read faster. Track three things here: average PR size, review latency (how long PRs wait), and rework rate (how often a PR bounces back for changes). When AI floods the pipe with larger, more numerous PRs, review becomes the bottleneck, and it's a bottleneck you created by going "faster" upstream.

Change failure rate and defect escape. What fraction of your deployments cause a problem that needs a hotfix, rollback, or patch? AI-generated code that passed a quick skim can carry subtle bugs: a plausible-looking error handler that swallows the wrong exception, a config that's almost right. If your change failure rate creeps up after adopting AI, that's the real cost of the speed you think you gained, and it's the one metric a vanity dashboard will never show you.

Developer-reported friction. The squishy one, and the one teams skip, which is a mistake. Ask developers directly, on a regular cadence: how much of your week goes to deep work versus fighting tools? Is it easier or harder to ship than three months ago? Self-report has limits (see: those METR developers who felt faster while being slower), so you never use it alone. But paired with the hard delivery numbers, it catches things metrics miss, like a team that's shipping fine but quietly burning out from reviewing a firehose of agent output.

Notice the shape of this list. Two of these are speed and flow, one is quality, one is human. That's not an accident: it's the multidimensional principle from SPACE, applied. No single number; a small basket that's hard to game in all directions at once.
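
To make the review-load and change-failure signals above concrete, here's a minimal sketch. The `PullRequest` record and its field names are hypothetical; map them from whatever your Git host and deploy logs actually export. It computes the three review-load numbers (average PR size, review latency, rework rate) and a plain change failure rate.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median


@dataclass
class PullRequest:
    # Hypothetical record shape; adapt the fields to your own Git host export.
    lines_changed: int
    opened_at: datetime
    first_review_at: datetime
    review_rounds: int  # 1 = approved on first pass, >1 = bounced back for changes


def review_load(prs: list[PullRequest]) -> dict[str, float]:
    """Return the three review-load signals: PR size, review latency, rework rate."""
    waits_hours = [
        (pr.first_review_at - pr.opened_at).total_seconds() / 3600 for pr in prs
    ]
    return {
        "avg_pr_size": mean(pr.lines_changed for pr in prs),
        "median_review_latency_hours": median(waits_hours),
        "rework_rate": sum(pr.review_rounds > 1 for pr in prs) / len(prs),
    }


def change_failure_rate(total_deploys: int, failed_deploys: int) -> float:
    """Fraction of deployments that needed a hotfix, rollback, or patch."""
    if total_deploys == 0:
        return 0.0
    return failed_deploys / total_deploys


# Made-up sample data, only to show the shape of the output.
t0 = datetime(2026, 1, 5, 9, 0)
sample = [
    PullRequest(120, t0, t0 + timedelta(hours=2), 1),
    PullRequest(480, t0, t0 + timedelta(hours=10), 3),
    PullRequest(60, t0, t0 + timedelta(hours=4), 1),
    PullRequest(900, t0, t0 + timedelta(hours=30), 2),
]
print(review_load(sample))
print(change_failure_rate(total_deploys=40, failed_deploys=6))
```

Measuring latency to the first review is one choice among several. Time to approval or time to merge would work too, as long as you use the same definition before and after the rollout.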

Borrow a framework, don't invent one

You don't need to design a measurement system from scratch. Three well-tested ones already exist, and the smart move is to steal the parts that fit.

DORA grew out of the research program behind the book Accelerate (Forsgren, Humble, Kim, 2018) and is now run by Google. It's team-level and delivery-focused, built on four keys: deployment frequency, lead time for changes, change failure rate, and time to restore service. It's the gold standard for "is our delivery pipeline healthy," and it's deliberately blind to individuals, which is a feature.

SPACE (2021) is the wider lens. Five dimensions: Satisfaction and well-being, Performance, Activity, Communication and collaboration, Efficiency and flow. Its core rule is to never measure productivity from a single dimension; pull metrics from at least three. SPACE isn't a fixed list of numbers, it's a checklist for making sure your numbers aren't all measuring the same narrow thing.

DX Core 4 (from the DX team, late 2024) tries to unify DORA, SPACE, and DevEx into four practical dimensions: Speed, Effectiveness, Quality, and Impact. Speed leans on "diffs per engineer," Quality reuses DORA's change failure rate, Impact introduces "percentage of time spent on new capabilities," and Effectiveness uses a survey-based Developer Experience Index (DXI). DX's own research suggests each one-point gain in DXI correlates with roughly 13 minutes saved per developer per week, a nice example of turning that squishy "friction" signal into something you can trend.
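
To see what that 13-minute figure adds up to, here's a rough back-of-the-envelope sketch. The team size, the DXI gain, and the 46 working weeks per year are assumptions for illustration, and the underlying figure is a correlation, so treat the result as an order of magnitude and not a forecast.

```python
MINUTES_PER_POINT_PER_DEV_WEEK = 13  # figure quoted above from DX's research


def hours_saved_per_year(
    dxi_gain_points: float, developers: int, working_weeks: int = 46
) -> float:
    """Scale the per-point, per-developer, per-week figure up to a team-year."""
    minutes = (
        MINUTES_PER_POINT_PER_DEV_WEEK * dxi_gain_points * developers * working_weeks
    )
    return minutes / 60


# Hypothetical team: 50 developers, DXI up 3 points.
print(hours_saved_per_year(dxi_gain_points=3, developers=50))  # 1495.0
```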

Here's how they line up against what we said actually matters:

| What you want to know | DORA | SPACE | DX Core 4 |
| --- | --- | --- | --- |
| Are we shipping faster? | Lead time, deploy frequency | Efficiency & flow | Speed |
| Is quality holding? | Change failure rate, restore time | Performance | Quality |
| Are developers okay? | not covered | Satisfaction & well-being | Effectiveness (DXI) |
| Are we building the right things? | not covered | not covered | Impact |
| Guards against single-number traps? | Partly (4 keys) | Yes (explicit rule) | Yes (4 dimensions) |

Tip
Don't adopt all three. Pick DORA's four keys as your delivery backbone because they're battle-tested and hard to fake, then add one human signal (a SPACE-style satisfaction pulse or a DXI survey). That's a complete, AI-resistant picture for most teams. The framework police are not coming to your standup.
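
If you want to sanity-check a dashboard against SPACE's "at least three dimensions" rule, a small sketch like this is enough. The metric names and the mapping from metric to dimension are assumptions; SPACE doesn't prescribe them, so tag your own metrics however your team agrees.

```python
SPACE_DIMENSIONS = {
    "satisfaction",
    "performance",
    "activity",
    "communication",
    "efficiency",
}


def space_coverage(dashboard: dict[str, str], minimum: int = 3) -> tuple[bool, set[str]]:
    """Check how many SPACE dimensions a dashboard covers.

    `dashboard` maps a metric name to the SPACE dimension you tagged it with.
    Returns (passes_rule, covered_dimensions).
    """
    covered = set(dashboard.values())
    unknown = covered - SPACE_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown SPACE dimension(s): {sorted(unknown)}")
    return len(covered) >= minimum, covered


# A vanity dashboard: three metrics, one dimension.
vanity = {"prs_merged": "activity", "commits": "activity", "lines_of_code": "activity"}

# DORA backbone plus one human signal, tagged the way the table above maps them.
balanced = {
    "lead_time": "efficiency",
    "change_failure_rate": "performance",
    "satisfaction_pulse": "satisfaction",
    "prs_merged": "activity",  # kept as a diagnostic only
}

print(space_coverage(vanity)[0])    # False
print(space_coverage(balanced)[0])  # True
```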

The reallocation trap

Now for the part that explains why AI productivity gains keep evaporating between the demo and the quarterly numbers.

AI is very good at one thing: making the creation of code cheaper. Typing the first draft, scaffolding a component, sketching a test. What it doesn't do is remove the work that comes after creation: understanding the change, reviewing it, verifying it's correct, and owning it when it breaks at 2am.

So the time doesn't disappear. It moves.

Google's 2025 DORA report put real data behind this. AI adoption among developers hit around 90%, and, reversing the previous year's gloomier finding, AI is now associated with higher delivery throughput. Good news. But the same report found AI still has a negative relationship with delivery stability. Teams generate more change, faster, and without strong testing and review practices to absorb it, that extra volume turns into instability downstream. Their framing is the one to remember: AI is an amplifier. It magnifies the strengths of healthy teams and the dysfunctions of struggling ones.

That's the reallocation trap in one sentence: the time you save writing code gets spent auditing it. If you only measure the creation step (acceptance rate, lines generated, "time to first draft"), you'll see a huge win and wonder why nothing ships faster. The win was real. It just got handed to your reviewers, your CI queue, and your on-call rotation.

This is also why measuring only individuals is dangerous. An AI agent can make one developer's personal output metrics soar while quietly increasing the load on everyone reviewing their PRs. The individual looks 2x. The team is flat or worse. Measure the system, not the seat.

Flow diagram: AI shrinks the write-code stage by about half, but the saved time reappears as added load on the review and verify/test stages, which grow larger, while operate stays steady.

A measurement setup you can actually run

Frameworks are nice. Here's how to turn this into something concrete without hiring a research team.

Start with a baseline before you scale up. This is the step everyone skips and then regrets. You can't prove AI changed anything if you don't know where you were. Pull at least a few weeks, ideally a couple of months, of your delivery numbers before a big rollout. The good news is most of this is already sitting in your Git host and CI logs. Lead time, for instance, is mostly a query over PR timestamps:

cycle_time.sql

-- Median hours from first commit to merge, by week.
-- Run this against your PR/commit warehouse before and after AI rollout.
SELECT
  date_trunc('week', pr.merged_at)              AS week,
  percentile_cont(0.5) WITHIN GROUP (
    ORDER BY extract(epoch FROM pr.merged_at - first_commit.committed_at) / 3600
  )                                             AS median_cycle_hours,
  count(*)                                      AS prs
FROM pull_requests pr
JOIN LATERAL (
  SELECT min(committed_at) AS committed_at
  FROM commits c
  WHERE c.pr_id = pr.id
) first_commit ON true
WHERE pr.merged_at IS NOT NULL
GROUP BY 1
ORDER BY 1;

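The exact schema doesn't matter. The point is that cycle time is a measurable, boring SQL query, not a survey. This version stops the clock at merge; if you record deploy timestamps, swap in the deploy time so the number covers the whole journey to production. Run the same query in three months and you have a real before/after instead of a feeling.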

Run a comparison, not just a trend. A plain before/after is vulnerable to confounders: maybe the team also got more senior, or the quarter was just calmer. If you can, do what METR did on a smaller scale. For a set of similar tasks, let AI be used on some and not others, and compare. You won't get a publishable RCT, but even a rough split is far more honest than "the number went up after we bought the tool, therefore the tool did it."
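
A rough split like that can be sketched in a few lines. This is an illustration under assumptions, not METR's protocol: the task IDs and hours are made up, the function names are hypothetical, and with a handful of tasks the result will be noisy.

```python
import random
from statistics import median


def assign_tasks(task_ids, seed=0):
    """Randomly split similar tasks into an AI-allowed and an AI-disallowed group."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = list(task_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "ai_allowed": sorted(shuffled[:half]),
        "ai_disallowed": sorted(shuffled[half:]),
    }


def relative_change(ai_hours, no_ai_hours):
    """Relative difference in median completion time.

    Positive means the AI-allowed tasks took longer; negative means they were faster.
    """
    base = median(no_ai_hours)
    return (median(ai_hours) - base) / base


groups = assign_tasks(range(1, 11), seed=42)
print(groups)

# Made-up completion times in hours, recorded after the tasks are done.
ai_hours = [5.0, 6.0, 7.0]
no_ai_hours = [4.0, 5.0, 6.0]
print(f"{relative_change(ai_hours, no_ai_hours):+.0%}")  # +20%
```

Assigning by coin flip, and not by who volunteers, is the part that matters. If developers pick which tasks get AI, they'll pick the ones where they already expect it to help.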

Always pair a hard number with a soft one. Cycle time dropped? Great. But did defect rate climb to pay for it? PRs are up? Fine, but are reviewers drowning? A single metric moving is a question, not an answer. The whole reason for the multidimensional approach is that gaming one number usually shows up as damage in another, if you're watching the other one.

Watch for the reallocation, specifically. Add review latency and rework rate to your dashboard on day one. They're your early-warning system for the trap above. If creation-side metrics improve while review latency climbs, you've found exactly where your AI gains are going.
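
That early-warning check can be automated with something as small as the sketch below. The metric keys (`draft_hours`, `review_latency_hours`, `rework_rate`) and the 5% tolerance are assumptions; use whatever creation-side and review-side numbers you actually collect.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from the baseline value."""
    return (after - before) / before


def reallocation_warning(baseline: dict, current: dict, tolerance: float = 0.05) -> list[str]:
    """Flag review-side metrics that got worse while the creation side got faster."""
    draft = pct_change(baseline["draft_hours"], current["draft_hours"])
    warnings = []
    if draft < -tolerance:  # writing code got meaningfully faster
        for key in ("review_latency_hours", "rework_rate"):
            change = pct_change(baseline[key], current[key])
            if change > tolerance:  # ...and this downstream metric got worse
                warnings.append(
                    f"{key} up {change:.0%} while draft time fell {-draft:.0%}"
                )
    return warnings


# Made-up numbers: drafts take half as long, but PRs wait much longer for review.
baseline = {"draft_hours": 6.0, "review_latency_hours": 8.0, "rework_rate": 0.20}
current = {"draft_hours": 3.0, "review_latency_hours": 14.0, "rework_rate": 0.20}

for line in reallocation_warning(baseline, current):
    print(line)  # review_latency_hours up 75% while draft time fell 50%
```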

Keep vanity metrics as diagnostics, not scorecards. Acceptance rate and PR count aren't useless; they're just not success measures. They tell you whether people are using the tool and how the work is shaped. Track them to understand behavior. Never use them to declare victory.

The honest answer

Here's the thing the METR study really teaches, and it isn't "AI makes developers slower." Their result was a snapshot of specific tools, expert developers, and codebases they knew cold, and they were careful to say it doesn't generalize to every setting. (Their 2026 follow-up already shows different numbers.) The durable lesson is smaller and more useful: perception is not measurement. Smart, experienced people were confidently, measurably wrong about their own productivity. The only thing that caught it was a stopwatch and a control group.

Your team is not special enough to be the exception. So if you're rolling out AI agents and someone asks "is it working?", don't answer with how it feels, and don't answer with the metric your vendor put on a slide. Answer with cycle time, review load, change failure rate, and what your developers actually tell you, measured against a baseline you bothered to capture.

That's more work than nodding along to "everyone says it's faster." It's also the only way you'll ever know.

Go capture your baseline before your next rollout. You can't get it back later.


Originally published at nazarboyko.com.