Your AI Evaluation Is Biased — By Design
avikalp · 2026-05-26 · via Hacker News - Newest: "AI"

Ask an AI team how they know their system is working and you’ll usually hear a version of the same answer: “We ran it a few times. It seemed pretty good.”

This is vibes-based evaluation. It’s not a failure of inexperienced teams — it’s the default evaluation strategy of the AI era. It requires zero infrastructure. You already have the system, you already have your eyes, you can start evaluating in zero seconds.

The problem isn’t that vibes are lazy. It’s that they’re biased in a specific, dangerous way.

When you informally review AI outputs — skimming through examples, spot-checking responses — you’re not running a random sample. You’re running a biased one.

Impressive outputs are memorable. You notice them, you share them, you hold them as evidence that the system works. Failures are easy to rationalize: unusual input, edge case, bad day. Over dozens of informal reviews, the memorable successes stack up while failures get explained away one by one.

The result: you build confidence in a system based on a sample weighted heavily toward its best performance. You have no idea what’s happening in the tail.

Vibes don’t tell you anything about the distribution of inputs you haven’t checked. They tell you nothing about whether a system that impresses you 80% of the time is catastrophically wrong the other 20%. And they tell you nothing about whether the cases your actual users encounter — at scale, across contexts you didn’t anticipate — resemble the ones you happened to test.
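A minimal simulation sketch can make the sampling bias concrete. The numbers below are illustrative assumptions rather than measurements: a system that succeeds 80% of the time, a reviewer who remembers 90% of successes but only 30% of failures, and a hypothetical helper named `simulate_review`.

```python
import random


def simulate_review(true_success_rate=0.8, n_outputs=10_000,
                    recall_success=0.9, recall_failure=0.3, seed=0):
    """Compare a uniform random sample with a memory-weighted one.

    Every parameter here is an assumption chosen for illustration.
    """
    rng = random.Random(seed)
    # True means the output was good, False means it was a failure.
    outputs = [rng.random() < true_success_rate for _ in range(n_outputs)]

    # Random sample: every output is equally likely to be reviewed.
    random_sample = rng.sample(outputs, 500)

    # "Vibes" sample: successes stick in memory more often than failures.
    remembered = [
        ok for ok in outputs
        if rng.random() < (recall_success if ok else recall_failure)
    ]

    return {
        "true": sum(outputs) / len(outputs),
        "random_sample": sum(random_sample) / len(random_sample),
        "remembered": sum(remembered) / len(remembered),
    }


rates = simulate_review()
print(rates)
```

Under these assumptions the remembered sample should land near 0.72 / (0.72 + 0.06) ≈ 0.92, well above the true 0.80, while the random sample stays close to the true rate. The exact figures depend entirely on the assumed recall probabilities.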

Many AI teams have deployed to production with this level of evidence and called it validated. That’s not a criticism of intent. It’s a description of what zero-infrastructure evaluation actually produces.

Hamel Husain is an independent AI consultant who has built evaluation systems for over thirty organizations. His diagnosis is consistent across all of them: teams invest heavily in building complex AI systems but can’t tell whether their changes are helping or hurting. The teams that succeed, he’s found, barely talk about models or tools. They obsess over measurement.

His prescription — the thing teams consistently resist until they’ve been burned — is also the most boring possible advice: read your traces.

Open the logs. Read actual conversations your system had with real users. Don’t skim for sentiment; take notes on what went wrong and why. Not “bad” or “good,” but descriptions: “The model misunderstood that the user was asking about rescheduling, not canceling.” “The response gave the right answer but failed to mention the exception.” Keep going until failures stop surprising you and start looking familiar. That’s the pattern emerging.
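One way to keep such notes, sketched here under assumptions: each reviewed trace gets a free-text description plus a short pattern name, and a small helper reports how many pattern names are new in the most recent notes. `TraceNote`, `new_patterns_in_last`, the trace IDs, and the pattern names are all hypothetical, and the third note is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class TraceNote:
    trace_id: str      # hypothetical identifier of the reviewed trace
    description: str   # what went wrong and why, in plain words
    pattern: str       # short name the reviewer gives the failure


def new_patterns_in_last(notes, window=20):
    """Count patterns that first appear within the last `window` notes."""
    if window <= 0:
        raise ValueError("window must be positive")
    seen_before = {n.pattern for n in notes[:-window]}
    recent = {n.pattern for n in notes[-window:]}
    return len(recent - seen_before)


notes = [
    TraceNote("t1", "Model misunderstood that the user was asking about "
                    "rescheduling, not canceling.", "reschedule-vs-cancel"),
    TraceNote("t2", "Right answer, but failed to mention the exception.",
              "missing-exception"),
    TraceNote("t3", "Treated a reschedule request as a cancellation.",
              "reschedule-vs-cancel"),
]

print(new_patterns_in_last(notes, window=1))
```

When this count stays at zero across a reasonably large window, that is one rough signal that failures have stopped surprising you. The window size is a judgment call, not a rule.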

One case study from Husain’s practice illustrates the payoff. A team doing systematic trace analysis found that three failure modes — conversation flow issues, handoff failures, and date-handling problems — accounted for over 60% of all observed problems. One of those failure modes, once specifically addressed, improved from a 33% success rate to 95%. A single failure mode. Addressed because someone read the logs and named it.
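Once failures have names, ranking them is a simple tally. The sketch below assumes each failed trace carries one label and finds the smallest set of top failure modes that covers a chosen share of all failures. The counts are made up for illustration and are not data from the case study; `top_failure_modes` is a hypothetical helper.

```python
from collections import Counter


def top_failure_modes(labels, coverage=0.6):
    """Return the most frequent modes until `coverage` of failures is reached.

    Each item is (mode, count, share_of_all_failures).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    selected, covered = [], 0
    for mode, count in counts.most_common():
        selected.append((mode, count, count / total))
        covered += count
        if covered / total >= coverage:
            break
    return selected


# Invented label counts, one label per failed trace.
labels = (
    ["conversation_flow"] * 30
    + ["handoff"] * 20
    + ["date_handling"] * 15
    + ["formatting"] * 14
    + ["tone"] * 11
    + ["other"] * 10
)

top = top_failure_modes(labels)
for mode, count, share in top:
    print(f"{mode}: {count} ({share:.0%})")
```

With these invented counts, three modes cover 65 of 100 failures, which is the kind of concentration that tells a team where to spend effort first.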

The teams that skip this step optimize endlessly for things that don’t matter while the problems that actually affect users go unnamed, and therefore unfixed.

The barrier isn’t technical. You already have the logs, the time, and the attention. The barrier is psychological.

Reading your system’s failures means confronting your system’s failures. In aggregate. Systematically. Without the rationalizations that make individual failures feel like edge cases.

There’s a second reason teams avoid it: the output of trace review doesn’t look like progress on a roadmap. Nobody celebrates “we read 200 traces and named five failure patterns.” There’s no demo, no new feature, no launch announcement. The work is invisible until the day someone asks “how do we know our system is working?” and one team can answer it and the other cannot.

Here’s why this matters beyond individual product quality: evaluation data is the actual moat.

Not model access — by 2026, frontier model access is a commodity. Every competitor can call the same APIs. What they can’t replicate is a labeled corpus of your specific production failures, at your scale, with your users, in your domain.

That corpus takes real production experience to generate. It captures the specific ways your use case diverges from general benchmarks. It becomes the foundation for every downstream improvement: better prompts, validated fine-tuning, automated evaluation that you trust because you built it from ground truth you generated yourself.
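As a rough sketch of how a labeled failure corpus can feed automated evaluation, each named failure can become a small regression case: an input, a check on the response, and the failure mode it belongs to. Everything here is hypothetical, including `EvalCase`, `pass_rate_by_mode`, the example inputs, and the stub standing in for a real system.

```python
from collections import defaultdict
from typing import Callable, NamedTuple


class EvalCase(NamedTuple):
    failure_mode: str               # name assigned during trace review
    user_input: str                 # input that triggered the failure
    passes: Callable[[str], bool]   # check applied to the system's response


def pass_rate_by_mode(cases, system):
    """Run every case through `system` and report pass rate per failure mode."""
    results = defaultdict(list)
    for case in cases:
        response = system(case.user_input)
        results[case.failure_mode].append(case.passes(response))
    return {mode: sum(r) / len(r) for mode, r in results.items()}


def stub_system(text):
    # Stand-in for the real AI system; it only recognizes the word "move".
    if "move" in text.lower():
        return "Your appointment has been rescheduled."
    return "Your appointment has been canceled."


cases = [
    EvalCase("reschedule-vs-cancel", "Can I move my appointment to Friday?",
             lambda r: "rescheduled" in r),
    EvalCase("reschedule-vs-cancel", "I need a different time slot.",
             lambda r: "rescheduled" in r),
]

mode_rates = pass_rate_by_mode(cases, stub_system)
print(mode_rates)
```

Rerunning the same cases after each prompt or model change gives a per-failure-mode number to compare, instead of an impression. Simple string checks like these are only a starting point; real checks depend on the domain.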

Every team with frontier model access has the same starting point on day one. The teams that build durable, improving systems are the ones that systematically turn their production failures into proprietary signal.

The teams that don’t are running their evaluation on vibes. They’ll get better when the model provider ships a better model — not because they learned anything.

This post expands on Chapter 9 of Wrong by Default: What AI Builders Know That Everyone Else Doesn’t by Alokit. Available on Kindle ($7.99): amazon.com/dp/B0GZCY9CGF
