Why Story Points Don't Work in the AI Era, and What Should Take Their Place

Imagine this scenario at your next sprint review: your velocity graph looks good, but half your team is quietly struggling. Estimation has devolved into Russian roulette. One "2-point" ticket is done in 45 minutes. Another "2-point" ticket takes 4 days because AI-written code introduced a nasty bug that slipped through to staging unnoticed.

This isn't an effort problem; it's a predictability problem. Story points were designed to predict the output of humans working at a fairly stable pace, and they have no way to represent this situation.

I believe it's time to ditch story points. Let me explain why, and what to use instead.

Why I Think the Old System Is Defective

Story points are a commitment to the idea that "this task is about twice as hard as that one." That idea rests on two assumptions:

A developer's capacity stays roughly constant
Complexity correlates well with duration

AI breaks both assumptions.

Throughput is no longer fixed. The same developer, doing the same kind of work on the same day, can finish in 40% less time after one tweak to the model. Or the model goes wrong and they spend two days untangling a hallucinated abstraction.

Complexity is no longer linked to duration. A very difficult task can become very easy if the AI gets it right, and a very simple task can become very difficult if the AI gets it wrong. The spread of outcomes is now as large as the average, and a story point captures only the average.

Here's what I'd expect a typical team to see after a couple of sprints with AI:

Task Type         | Pre-AI Avg  | Post-AI Range
------------------|-------------|----------------
API endpoint      | 2 days      | 3 hours – 2.5 days
DB migration      | 3 days      | 1 day – 4 days  
UI component      | 1.5 days    | 30 min – 3 days
Legacy refactor   | 5 days      | 2 days – 8 days

If the range is wider than the estimate itself, the estimate is noise.
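To check that claim against the table, here is a minimal sketch that divides the width of each post-AI range by the pre-AI average. The 8-hour working day is an assumption (the article does not define one), and `spread_ratio` is a hypothetical helper name.

```python
# Figures copied from the table above, converted to hours.
# Assumption: one working day = 8 hours (not stated in the article).
HOURS_PER_DAY = 8

tasks = {
    # task type: (pre-AI average, post-AI low, post-AI high), all in hours
    "API endpoint": (2 * HOURS_PER_DAY, 3, 2.5 * HOURS_PER_DAY),
    "DB migration": (3 * HOURS_PER_DAY, 1 * HOURS_PER_DAY, 4 * HOURS_PER_DAY),
    "UI component": (1.5 * HOURS_PER_DAY, 0.5, 3 * HOURS_PER_DAY),
    "Legacy refactor": (5 * HOURS_PER_DAY, 2 * HOURS_PER_DAY, 8 * HOURS_PER_DAY),
}


def spread_ratio(pre_ai_avg, low, high):
    """Width of the post-AI range divided by the old average estimate."""
    return (high - low) / pre_ai_avg


ratios = {name: spread_ratio(*row) for name, row in tasks.items()}
for name, ratio in ratios.items():
    print(f"{name:16s} range width / old estimate = {ratio:.2f}")
```

Under that assumption every row comes out at 1.0 or higher, meaning the range is at least as wide as the number the team used to plan on.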

The Hidden Cost That No One Is Estimating

This is my hypothesis on what most teams are failing to estimate: verifying the AI output is the highest-cost activity for almost any task.

A realistic breakdown of where the time goes would look something like this:

20% of the time: figuring out what to build
15% of the time: the AI building it
65% of the time: going through it, looking for problems

And that third one, the curation tax, is invisible in almost everyone's estimates. Teams estimate the time spent on construction, not on curation. That is like planning a home renovation based only on how fast someone can swing a hammer.

If teams made review and validation an explicit part of their estimates, I'm convinced their accuracy would improve drastically.
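As a rough illustration of how much a construction-only estimate leaves out, the sketch below scales such an estimate up using the 20/15/65 split. The shares are the hypothesis stated above, not measured data, and treating "construction" as specify plus generate is an assumption made here; `with_curation` is a hypothetical helper.

```python
# Shares from the breakdown above (a hypothesis, not measured data).
SHARE = {"specify": 0.20, "generate": 0.15, "verify": 0.65}


def with_curation(construction_hours):
    """Expand a construction-only estimate into a full-task estimate.

    Assumption: "construction" covers specify + generate (35% of the total).
    """
    construction_share = SHARE["specify"] + SHARE["generate"]
    total = construction_hours / construction_share
    return {"total": total, "verify": total * SHARE["verify"]}


result = with_curation(3.5)
print(f"total ~ {result['total']:.1f} h, of which verify ~ {result['verify']:.1f} h")
```

With these shares, a 3.5-hour construction estimate corresponds to roughly 10 hours of total work, most of it review.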

What I Suggest to Replace Story Points

1. Confidence-Tagged Estimates

What I suggest: every ticket gets two things, a time estimate and a confidence tag.

ticket: INDE-002 — Migrate auth service to new SDK
estimate: 1.5 days
confidence: low
reason: "New SDK, AI hasn't seen our auth patterns before"
action: spike first (0.5 day), then re-estimate

The confidence tag is the important part. It tells the PM whether to trust the number or treat it as a hypothesis.

Three bands:

| Tag | What it means | How to plan |
|-----|---------------|-------------|
| `high` | Done this before with AI, know the variance | Plan on the estimate |
| `medium` | Familiar territory, some unknowns | Buffer 2x |
| `low` | Novel task or AI-unfriendly domain | Spike first, don't commit |

My recommended rule: always spike a ticket marked low before adding it to a sprint. My guess is that this one rule can prevent almost all sprint meltdowns on AI-driven teams.
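The three bands translate directly into a planning rule. The following is a minimal sketch, not a prescribed implementation; the `Ticket` class and `plan` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    key: str
    estimate_days: float
    confidence: str  # "high", "medium", or "low"


def plan(ticket):
    """Return (days_to_reserve, note) following the three bands above.

    days_to_reserve is None for low-confidence tickets: they get a spike
    and a re-estimate instead of a sprint commitment.
    """
    if ticket.confidence == "high":
        return ticket.estimate_days, "plan on the estimate"
    if ticket.confidence == "medium":
        return ticket.estimate_days * 2, "buffered 2x"
    if ticket.confidence == "low":
        return None, "spike first, do not commit"
    raise ValueError(f"unknown confidence tag: {ticket.confidence!r}")


print(plan(Ticket("INDE-002", 1.5, "low")))
print(plan(Ticket("EXAMPLE-1", 1.5, "medium")))  # hypothetical ticket key
```

Returning `None` rather than a padded number for `low` tickets is a deliberate choice in this sketch: it makes it impossible to add the ticket to a sprint total by accident.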

2. "Free Second Time" Paradigm

I keep noticing a pattern: the first time you do any kind of task with AI help, it is very costly. The second time, the cost drops by about 60%. By the third time, it is close to negligible.

Why? The first attempt forces you to develop a workflow: how to structure the prompt, what to put in the context window, and what tends to go wrong during execution.

So I believe estimates should be adjusted by how many times the team has already done that type of task:

First instance of task type:  estimate × 2 (you're building the workflow)
Second instance:              estimate × 0.8 (refining the workflow)
Third+ instance:              estimate × 0.4 (executing the workflow)

Example: migrating 12 microservices to a new observability SDK.

Service 1: 6 hours (working out the approach, writing prompts)
Service 2: 2.5 hours (fine-tuning, dealing with edge cases)
Services 3-12: ~45 minutes each (batched using the known procedure)

Old estimate: 12 x 4 hours = 48 hours. This approach: ~16 hours. But only if you invest the effort on service 1 rather than rushing through it.
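The arithmetic can be sketched in a few lines. `novelty_multiplier` encodes the three multipliers listed earlier, and `batch_hours` is a hypothetical helper. Note that the per-service hours in the example are the article's own figures rather than an output of the multipliers, so the two totals differ; both come in well under the flat estimate.

```python
def novelty_multiplier(instance):
    """Multiplier for the n-th instance (1-based) of a task type."""
    if instance < 1:
        raise ValueError("instance is 1-based")
    if instance == 1:
        return 2.0  # building the workflow
    if instance == 2:
        return 0.8  # refining the workflow
    return 0.4      # executing the workflow


def batch_hours(base_hours, count):
    """Total hours for `count` similar tasks sharing one base estimate."""
    return sum(base_hours * novelty_multiplier(i) for i in range(1, count + 1))


flat_total = 12 * 4                  # old estimate: 48 hours
rule_total = batch_hours(4, 12)      # multipliers applied to the 4-hour base
example_total = 6 + 2.5 + 10 * 0.75  # hours listed in the example above

print(flat_total, round(rule_total, 1), example_total)  # prints: 48 27.2 16.0
```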

3. Review-Weighted Sizing

I don't believe tickets should be sized by "how difficult will it be to develop?" They should be sized by "how difficult will it be to verify?"

The pieces that are easiest to generate are often the hardest to review (large refactors, verbose migrations), while pieces that are hard to generate can be simple to review (small algorithmic fixes with explicit test cases).

The sizing rubric must be inverted:

| Old thinking | New thinking |
|-------------|-------------|
| "Lots of code = big ticket" | "Lots of code to review = big ticket" |
| "Complex logic = big ticket" | "Ambiguous correctness criteria = big ticket" |
| "New framework = big ticket" | "AI-unfamiliar patterns = big ticket" |

A 500-line boilerplate migration should be sized as large, not because it is difficult to generate (an AI can do that within minutes), but because checking 500 lines of code for subtle differences is genuinely costly.
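One way to apply the "new thinking" column is as a checklist that flags a ticket as big when any of the three review-side signals is present. This is only a sketch: `big_ticket_reasons` is a hypothetical helper, and the 500-line cutoff is borrowed from the example above as an illustrative default, not a recommended threshold.

```python
def big_ticket_reasons(lines_to_review, ambiguous_correctness,
                       ai_unfamiliar_patterns, large_review_lines=500):
    """Return the review-side reasons a ticket should be sized as big.

    An empty list means none of the three signals applies.
    The default cutoff of 500 lines is an illustrative assumption.
    """
    reasons = []
    if lines_to_review >= large_review_lines:
        reasons.append("lots of code to review")
    if ambiguous_correctness:
        reasons.append("ambiguous correctness criteria")
    if ai_unfamiliar_patterns:
        reasons.append("AI-unfamiliar patterns")
    return reasons


# Boilerplate migration: trivial to generate, expensive to check.
print(big_ticket_reasons(500, False, False))
# Small algorithmic fix with explicit test cases.
print(big_ticket_reasons(20, False, False))
```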

How This Changes The PM Conversation

The hardest part of any estimation paradigm shift isn’t technical. It’s explaining the change.

Old conversation:

"This epic is 34 story points. At our velocity of 21/sprint, it’ll take ~1.6 sprints."

Where I think this discussion needs to go:

"This epic has 8 tickets. 5 are high confidence (we expect to meet those estimates). 2 are medium confidence (double the estimates). 1 is low confidence (we need a spike day before we can commit). Optimistic estimate: 1 sprint. Pessimistic: 1.5 sprints. And if the low-confidence ticket turns out to be a problem? 2.5 sprints."

Longer? Yes. More useful? Absolutely. The PM now has real options: "Pull out the low-confidence ticket and ship everything else on time" is a conversation you can actually have.
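A forecast like that can be produced mechanically from tagged tickets. The sketch below is one possible approach: the per-ticket day estimates are hypothetical (the article gives only the ticket counts), and `forecast` is an illustrative helper that doubles medium-confidence tickets for the pessimistic case and counts low-confidence tickets that still need a spike.

```python
def forecast(tickets):
    """tickets: list of (estimate_days, confidence) pairs.

    Returns (optimistic_days, pessimistic_days, tickets_needing_spike).
    Medium-confidence tickets are doubled in the pessimistic total;
    low-confidence tickets are counted separately because their
    estimates are hypotheses until a spike has been done.
    """
    optimistic = sum(days for days, _ in tickets)
    pessimistic = sum(days * (2 if conf == "medium" else 1)
                      for days, conf in tickets)
    needs_spike = sum(1 for _, conf in tickets if conf == "low")
    return optimistic, pessimistic, needs_spike


# 5 high, 2 medium, 1 low, as in the conversation above.
# The day values are made up for illustration.
epic = [(1.0, "high")] * 5 + [(1.5, "medium")] * 2 + [(2.0, "low")]
print(forecast(epic))  # prints: (10.0, 13.0, 1)
```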

Metrics Worth Tracking Instead

Velocity should be scrapped in favor of more useful measures. I would propose tracking the following:

Curation ratio: review time relative to creation time. Goal: below 3:1.
Confidence success rate: proportion of 'high' tickets that land within their estimate.
Process reuse rate: how often an existing process is reused for a second similar task instead of being created anew.
Spike conversion rate: how often a 'low' ticket turns into 'medium' or 'high' after a spike.

These measures show how well the team is collaborating with the AI, rather than just how 'fast' it is going.
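For teams that want to try these, here is a minimal sketch of computing all four from per-ticket records. The `Record` fields and the exact definitions (for example, counting a 'high' ticket as a success when actual hours do not exceed the estimate) are assumptions made for illustration; the article does not specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    confidence: str            # tag at planning time
    estimate_hours: float
    actual_hours: float
    creation_hours: float      # time spent specifying and generating
    review_hours: float        # time spent reviewing and validating
    repeat_task: bool = False  # a similar task had been done before
    reused_process: bool = False
    post_spike_confidence: Optional[str] = None  # set only if spiked


def ratio(numerator, denominator):
    """Safe division: None when there is nothing to measure."""
    return numerator / denominator if denominator else None


def team_metrics(records):
    high = [r for r in records if r.confidence == "high"]
    repeats = [r for r in records if r.repeat_task]
    spiked = [r for r in records
              if r.confidence == "low" and r.post_spike_confidence]
    return {
        "curation_ratio": ratio(sum(r.review_hours for r in records),
                                sum(r.creation_hours for r in records)),
        "confidence_success_rate": ratio(
            sum(r.actual_hours <= r.estimate_hours for r in high), len(high)),
        "process_reuse_rate": ratio(
            sum(r.reused_process for r in repeats), len(repeats)),
        "spike_conversion_rate": ratio(
            sum(r.post_spike_confidence in ("medium", "high") for r in spiked),
            len(spiked)),
    }


# Made-up sample data.
records = [
    Record("high", 4, 3.5, creation_hours=1, review_hours=2.5,
           repeat_task=True, reused_process=True),
    Record("high", 4, 6, creation_hours=2, review_hours=4,
           repeat_task=True, reused_process=False),
    Record("low", 12, 10, creation_hours=3, review_hours=7,
           post_spike_confidence="medium"),
]
metrics = team_metrics(records)
print(metrics)
```

On this sample the curation ratio is 2.25, which is under the 3:1 goal.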

TL;DR – The Replacement Kit

If you are still Fibonacci-estimating stories in 2026 and wondering why your sprints feel like Russian roulette, here's my suggestion:

Confidence tagging for estimates: the confidence level matters more than the estimate itself
Estimate curation effort, not just construction effort: curation is the hard work
Novelty tracking for each task: new tasks are 2-3 times costlier than recurring tasks
Size tasks by how hard they are to review, not to generate: reverse the complexity paradigm
Spike before taking on high-uncertainty tasks: one simple rule to massively reduce blowups

For me, the fundamental change is this: estimation used to be about how long it takes to build something. Now it needs to be about how long it takes to validate it. Change the paradigm, and sprint planning starts reflecting reality.

This is obviously just one way of seeing things. There are many brilliant people out there who have figured out how to make story points work with AI-driven adjustments. Where do you stand: adapting the old system or building a new one from scratch?
