Your AI Evaluation Is Biased — By Design

Ask an AI team how they know their system is working and you’ll usually hear a version of the same answer: “We ran it a few times. It seemed pretty good.”

This is vibes-based evaluation. It’s not a failure of inexperienced teams — it’s the default evaluation strategy of the AI era. It requires zero infrastructure. You already have the system, you already have your eyes, you can start evaluating in zero seconds.

The problem isn’t that vibes are lazy. It’s that they’re biased in a specific, dangerous way.

When you informally review AI outputs — skimming through examples, spot-checking responses — you’re not running a random sample. You’re running a biased one.

Impressive outputs are memorable. You notice them, you share them, you hold them as evidence that the system works. Failures are easy to rationalize: unusual input, edge case, bad day. Over dozens of informal reviews, the memorable successes stack up while failures get explained away one by one.

The result: you build confidence in a system based on a sample weighted heavily toward its best performance. You have no idea what’s happening in the tail.
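One way to take the selection out of your own hands is to let a random number generator pick which traces you read. The sketch below is a minimal illustration, not a prescribed tool: the function name `sample_traces_for_review` and the idea that each trace has a string ID are assumptions made for the example.

```python
import random


def sample_traces_for_review(trace_ids, k, seed=0):
    """Return up to k trace IDs chosen uniformly at random, without replacement.

    Hypothetical helper: trace IDs are assumed to be plain strings.
    """
    # Deduplicate and sort so the seed alone determines the result.
    ids = sorted(set(trace_ids))
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


if __name__ == "__main__":
    all_ids = [f"trace-{i}" for i in range(1000)]  # made-up IDs
    for trace_id in sample_traces_for_review(all_ids, k=5, seed=42):
        print(trace_id)
```

Fixing the seed means a teammate can reproduce the exact same review set, and nobody gets to swap in the examples they already like.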

Vibes don’t tell you anything about the distribution of inputs you haven’t checked. They tell you nothing about whether a system that impresses you 80% of the time is catastrophically wrong the other 20%. And they tell you nothing about whether the cases your actual users encounter — at scale, across contexts you didn’t anticipate — resemble the ones you happened to test.
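To put a rough number on how little a handful of spot checks can support, here is a sketch that computes a Wilson score interval for an observed pass rate. The choice of the Wilson interval and the sample sizes are assumptions made for illustration, and the interval only means something if the examples were drawn at random rather than remembered.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate (z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Same observed 80% pass rate, two different sample sizes.
    for passed, total in [(16, 20), (160, 200)]:
        low, high = wilson_interval(passed, total)
        print(f"{passed}/{total}: plausible pass rate {low:.2f} to {high:.2f}")
```

With 16 good outputs out of 20, the plausible range runs from roughly 0.58 to 0.92. "Seemed pretty good" is consistent with a system that fails four times in ten.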

Many AI teams have deployed to production with this level of evidence and called it validated. That’s not a criticism of intent. It’s a description of what zero-infrastructure evaluation actually produces.

Hamel Husain is an independent AI consultant who has built evaluation systems for over thirty organizations. His diagnosis is consistent across all of them: teams invest heavily in building complex AI systems but can’t tell whether their changes are helping or hurting. The teams that succeed, he’s found, barely talk about models or tools. They obsess over measurement.

His prescription — the thing teams consistently resist until they’ve been burned — is also the most boring possible advice: read your traces.

Open the logs. Read actual conversations your system had with real users. Don’t skim for sentiment; take notes on what went wrong and why. Don’t write “bad” or “good”; write descriptions. For example: the model misunderstood that the user was asking about rescheduling, not canceling. Or: the response gave the right answer but failed to mention the exception. Keep going until failures stop surprising you and start looking familiar. That’s the pattern emerging.
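If it helps to give the notes a shape, a record per trace can be this small. This is a sketch under assumptions: the `TraceNote` class, its field names, and the trace IDs are hypothetical, and a spreadsheet with the same three columns would serve just as well.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TraceNote:
    """One reviewer note about one trace (hypothetical structure)."""
    trace_id: str
    observation: str                    # a description, not a "good"/"bad" verdict
    failure_mode: Optional[str] = None  # filled in later, once patterns repeat


def still_unnamed(notes: List[TraceNote]) -> List[TraceNote]:
    """Notes whose failure has not yet been assigned to a named pattern."""
    return [n for n in notes if n.failure_mode is None]


if __name__ == "__main__":
    notes = [
        TraceNote("trace-017", "Treated a rescheduling request as a cancellation."),
        TraceNote("trace-052", "Right answer, but never mentioned the exception."),
    ]
    # Name a pattern only after the same kind of failure keeps showing up.
    notes[0].failure_mode = "rescheduling vs. canceling"
    print(len(still_unnamed(notes)), "note(s) still waiting for a name")
```

The free-text observation comes first and the label comes later, which keeps you from forcing early traces into categories you invented before reading anything.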

One case study from Husain’s practice illustrates the payoff. A team doing systematic trace analysis found that three failure modes — conversation flow issues, handoff failures, and date-handling problems — accounted for over 60% of all observed problems. One of those failure modes, once specifically addressed, improved from a 33% success rate to 95%. A single failure mode. Addressed because someone read the logs and named it.
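Once failures have names, finding the few that dominate is a counting exercise. The sketch below assumes you have one failure-mode label per problematic trace. The function name is hypothetical and the labels in the usage example are invented for illustration; they are not the figures from the case study.

```python
from collections import Counter


def failure_mode_shares(labels):
    """Rank failure modes by frequency and attach the cumulative share.

    Returns a list of (mode, count, cumulative_share) tuples, most common first.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    rows = []
    running = 0
    for mode, count in counts.most_common():
        running += count
        rows.append((mode, count, running / total))
    return rows


if __name__ == "__main__":
    # Invented labels, one per failing trace.
    labels = (
        ["conversation flow"] * 9
        + ["handoff"] * 7
        + ["date handling"] * 6
        + ["tone"] * 4
        + ["other"] * 4
    )
    for mode, count, cumulative in failure_mode_shares(labels):
        print(f"{mode:<18} {count:>3}  cumulative {cumulative:.0%}")
```

Reading the cumulative column top to bottom shows where a short list of named failures starts covering most of what users actually hit.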

The teams that skip this step optimize endlessly for things that don’t matter while the problems that actually affect users go unnamed, and therefore unfixed.

The barrier isn’t technical. You already have the logs, the time, and the attention. The barrier is psychological.

Reading your system’s failures means confronting your system’s failures. In aggregate. Systematically. Without the rationalizations that make individual failures feel like edge cases.

There’s a second reason teams avoid it: the output of trace review doesn’t look like progress on a roadmap. Nobody celebrates “we read 200 traces and named five failure patterns.” There’s no demo, no new feature, no launch announcement. The work is invisible until the day someone asks “how do we know our system is working?” and one team can answer it and the other cannot.

Here’s why this matters beyond individual product quality: evaluation data is the actual moat.

Not model access — by 2026, frontier model access is a commodity. Every competitor can call the same APIs. What they can’t replicate is a labeled corpus of your specific production failures, at your scale, with your users, in your domain.

That corpus takes real production experience to generate. It captures the specific ways your use case diverges from general benchmarks. It becomes the foundation for every downstream improvement: better prompts, validated fine-tuning, automated evaluation that you trust because you built it from ground truth you generated yourself.
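One concrete form that trust can take: before relying on an automated evaluator, compare its verdicts with your own labels on the same traces. The sketch below is an illustration under assumptions. The function name is hypothetical, verdicts are simplified to pass/fail booleans, and the data in the usage example is made up.

```python
def judge_agreement(human, judge):
    """Compare automated pass/fail verdicts with human labels (True = pass).

    Returns overall agreement, plus how often the judge matches the human
    on passing traces and on failing traces (None if there are none).
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(human, judge))
    on_pass = [j == h for h, j in pairs if h]
    on_fail = [j == h for h, j in pairs if not h]
    return {
        "agreement": sum(h == j for h, j in pairs) / len(pairs),
        "agreement_on_passes": sum(on_pass) / len(on_pass) if on_pass else None,
        "agreement_on_failures": sum(on_fail) / len(on_fail) if on_fail else None,
    }


if __name__ == "__main__":
    human = [True, True, False, False, True, False]  # made-up human labels
    judge = [True, False, False, False, True, True]  # made-up judge verdicts
    print(judge_agreement(human, judge))
```

Splitting agreement by passes and failures matters because a judge that approves everything can still score well overall when failures are rare.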

Every team with frontier model access has the same starting point on day one. The teams that build durable, improving systems are the ones that systematically turn their production failures into proprietary signal.

The teams that don’t are running their evaluation on vibes. They’ll get better when the model provider ships a better model — not because they learned anything.

This post expands on Chapter 9 of Wrong by Default: What AI Builders Know That Everyone Else Doesn’t by Alokit. Available on Kindle ($7.99): amazon.com/dp/B0GZCY9CGF
