I Got 96% Recall on LLM Hallucination Detection With No ML Model – Just 50 Lines of Python
Ritika · 2026-05-26 · via DEV Community

Most hallucination detection approaches tell you to train another model. I did not want to do that. I used four statistical signals, a combined score, and a tunable threshold. No fine-tuning. No GPU. No external API. Tested on 10,000 real examples from the HaluEval dataset.
Soft flag result: precision 0.71, recall 0.96.
Strict flag result: precision 1.00, recall 0.38.
Here’s how it works.

Why Not Just Use a Model?

Approaches like SelfCheckGPT require multiple model samples and significant compute. That adds up fast when you are scoring thousands of answers a day. You also end up with a black box sitting on top of another black box. When something goes wrong, you have no idea which layer failed.
I wanted something where every flag has a reason you can actually read.

The Core Idea

Hallucinated answers behave differently from grounded ones in ways you can measure. You do not need a model for this. You just need to look at the right things.
Four signals ended up doing most of the work.

Signal 1: Length Ratio
When a model does not know the answer, it pads. It generates more text to sound convincing instead of staying close to the facts.

df['answer_len'] = df['answer'].str.split().str.len()
df['knowledge_len'] = df['knowledge'].str.split().str.len()
df['length_ratio'] = df['answer_len'] / df['knowledge_len']

Average length ratio: hallucinated 0.22 vs not hallucinated 0.05

Signal 2: Unknown Word Rate
Grounded answers stay close to the source. Hallucinated answers introduce words that never appeared in the reference text.

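def unknown_word_rate(row):
    """Fraction of unique answer words that never appear in the knowledge text."""
    knowledge_words = set(str(row['knowledge']).lower().split())
    answer_words = set(str(row['answer']).lower().split())
    if not answer_words:
        return 0
    unknown = answer_words - knowledge_words
    return len(unknown) / len(answer_words)

# Store the signal under the column name the combined score expects.
df['unknown_word_rate'] = df.apply(unknown_word_rate, axis=1)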

Average unknown word rate: hallucinated 0.46 vs not hallucinated 0.30
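
The rate above depends on how words are split. With a plain `.split()`, trailing punctuation makes `2017.` and `2017` different words, so a grounded answer can still pick up "unknown" words. The sketch below compares the plain split with a punctuation-insensitive variant. The `normalize_tokens` helper and the sample strings are hypothetical and not part of the original code, and switching tokenizers would likely mean recalibrating the 0.4 threshold.

```python
import string

def normalize_tokens(text):
    """Hypothetical helper: lowercase, split on whitespace, strip edge punctuation."""
    tokens = (tok.strip(string.punctuation) for tok in str(text).lower().split())
    return {tok for tok in tokens if tok}

def unknown_rate(answer_words, knowledge_words):
    """Fraction of answer words missing from the knowledge words."""
    if not answer_words:
        return 0.0
    return len(answer_words - knowledge_words) / len(answer_words)

# Illustrative strings only; not taken from HaluEval.
knowledge = "The album was released in 2017."
answer = "Released in 2017"

# Plain split: "2017" is not equal to "2017.", so one of three words is unknown.
raw_rate = unknown_rate(set(answer.lower().split()), set(knowledge.lower().split()))

# Normalized tokens: every answer word is found in the knowledge text.
norm_rate = unknown_rate(normalize_tokens(answer), normalize_tokens(knowledge))

print(round(raw_rate, 2), norm_rate)
```

Whether normalization helps on real data is an open question; it is worth measuring on your own labeled examples before adopting it.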

Signal 3: Question-Answer Overlap
When a model fabricates, it often just echoes the question back. Instead of pulling from the source, it repeats the question words in the answer.

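def question_answer_overlap(row):
    """Fraction of unique question words that reappear in the answer."""
    question_words = set(str(row['question']).lower().split())
    answer_words = set(str(row['answer']).lower().split())
    if not question_words:
        return 0
    overlap = question_words & answer_words
    return len(overlap) / len(question_words)

# Store the signal under the column name the combined score expects.
df['qa_overlap'] = df.apply(question_answer_overlap, axis=1)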

Average overlap: hallucinated 0.39 vs not hallucinated 0.02
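
As a worked example, the overlap can be computed by hand for the Zilpo Road question and answer quoted in the Real Examples section. This is a minimal standalone sketch of the same set arithmetic on plain strings rather than DataFrame rows; the function name `qa_overlap` is chosen here for illustration.

```python
def qa_overlap(question, answer):
    """Share of unique question words that reappear in the answer."""
    question_words = set(str(question).lower().split())
    answer_words = set(str(answer).lower().split())
    if not question_words:
        return 0.0
    return len(question_words & answer_words) / len(question_words)

question = ("What U.S. highway gives access to Zilpo Road, "
            "and is also known as Midland Trail?")
answer = "It's actually Zilpo Road that is known as Midland Trail, not US 60."

# Words shared by question and answer after lowercasing and whitespace splitting.
shared = set(question.lower().split()) & set(answer.lower().split())
overlap = qa_overlap(question, answer)

print(sorted(shared))     # ['as', 'is', 'known', 'midland', 'zilpo']
print(round(overlap, 2))  # 0.33
```

Five of the fifteen unique question words reappear, which is above the 0.2 threshold used in the combined score. Note that `road,` and `trail?` do not count as matches for `road` and `trail,` because punctuation stays attached to the words.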

Signal 4: Numeric Inconsistency
Numbers are where models hallucinate most confidently. The general concept might be right but the date, quantity, or statistic is just wrong.

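import re

def numeric_inconsistency(row):
    """Fraction of numbers in the answer that do not appear in the knowledge text."""
    knowledge_nums = set(re.findall(r'\b\d+\b', str(row['knowledge'])))
    answer_nums = set(re.findall(r'\b\d+\b', str(row['answer'])))
    if not answer_nums:
        return 0
    inconsistent = answer_nums - knowledge_nums
    return len(inconsistent) / len(answer_nums)

df['numeric_inconsistency'] = df.apply(numeric_inconsistency, axis=1)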

Average numeric inconsistency: hallucinated 0.087 vs not hallucinated 0.0001
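
It helps to know what the pattern `\b\d+\b` does and does not match before trusting this signal. The sketch below runs it on a few illustrative strings. The sample texts, including the knowledge sentence, are made up for demonstration and are not from the dataset.

```python
import re

NUM_RE = re.compile(r'\b\d+\b')

def extract_numbers(text):
    """Return the set of standalone digit runs found in the text."""
    return set(NUM_RE.findall(str(text)))

# Thousands separators and decimals are split into separate digit runs.
print(sorted(extract_numbers("1,200 units at 3.5%")))  # ['1', '200', '3', '5']

# Digits glued to letters are skipped: no word boundary between "0" and "s".
print(extract_numbers("the 1990s"))  # set()

# Hypothetical knowledge text paired with a wrong year in the answer.
knowledge_nums = extract_numbers("Dua Lipa's debut album was released in 2017.")
answer_nums = extract_numbers("The album was released in 2018.")
rate = len(answer_nums - knowledge_nums) / len(answer_nums)
print(rate)  # 1.0
```

Because formatted numbers such as `1,200` are compared piece by piece, a knowledge text that writes the same quantity as `1200` would register as inconsistent. If your sources mix number formats, normalizing them first may be worth testing.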

Combining Into a Score

Each signal contributes one point if it crosses its threshold. Every answer gets a score from 0 to 4.

df['score'] = (
    (df['length_ratio'] > 0.1).astype(int) +
    (df['unknown_word_rate'] > 0.4).astype(int) +
    (df['qa_overlap'] > 0.2).astype(int) +
    (df['numeric_inconsistency'] > 0.5).astype(int)
)

Non-hallucinated answers cluster at 0 and 1. Hallucinated answers cluster at 2, 3, and 4.
Average score: hallucinated 2.18 vs not hallucinated 0.39

Two Thresholds Depending on Your Risk Tolerance

Soft flag (score >= 1): precision 0.71, recall 0.96. Use this when missing a hallucination costs more than a false alarm. Think financial services, healthcare, legal.
Strict flag (score >= 3): precision 1.00, recall 0.38. Use this when your review capacity is limited and you only want the obvious cases.
You can tune the threshold without retraining anything. That matters in production.
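
Tuning the threshold amounts to recomputing precision and recall at each cut-off on labeled data. Here is a minimal sketch of such a sweep in plain Python. The `scores` and `labels` lists are toy values for illustration only, not HaluEval results, and `precision_recall` is a helper name introduced here.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when flagging every answer with score >= threshold.

    labels: True means the answer is actually hallucinated.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy data: scores from 0 to 4 with ground-truth labels.
scores = [0, 0, 1, 1, 2, 3, 3, 4]
labels = [False, False, False, True, True, True, True, True]

# Evaluate every possible cut-off and pick the one that fits your risk tolerance.
sweep = {t: precision_recall(scores, labels, t) for t in range(1, 5)}
for threshold, (precision, recall) in sweep.items():
    print(f"score >= {threshold}: precision {precision:.2f}, recall {recall:.2f}")
```

On this toy data, a cut-off of 1 catches every hallucination at the cost of one false alarm, while a cut-off of 3 produces no false alarms but misses two hallucinations. That is the same shape of tradeoff as the soft and strict flags.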

Plugging It In

import re

def score_answer(knowledge, question, answer):
    """Return a 0-4 score; each signal over its threshold adds one point."""
    knowledge_tokens = str(knowledge).lower().split()
    answer_tokens = str(answer).lower().split()
    knowledge_words = set(knowledge_tokens)
    answer_words = set(answer_tokens)
    question_words = set(str(question).lower().split())
    knowledge_nums = set(re.findall(r'\b\d+\b', str(knowledge)))
    answer_nums = set(re.findall(r'\b\d+\b', str(answer)))

    # Use total word counts (not unique words) so the ratio matches Signal 1.
    answer_len = len(answer_tokens)
    knowledge_len = len(knowledge_tokens) if knowledge_tokens else 1

    length_ratio = answer_len / knowledge_len
    unknown_word_rate = len(answer_words - knowledge_words) / len(answer_words) if answer_words else 0
    qa_overlap = len(question_words & answer_words) / len(question_words) if question_words else 0
    numeric_inconsistency = len(answer_nums - knowledge_nums) / len(answer_nums) if answer_nums else 0
    score = (
        int(length_ratio > 0.1) +
        int(unknown_word_rate > 0.4) +
        int(qa_overlap > 0.2) +
        int(numeric_inconsistency > 0.5)
    )
    return score

# knowledge, question, and answer are strings supplied by your pipeline.
score = score_answer(knowledge, question, answer)
if score >= 3:
    action = "block"
elif score >= 1:
    action = "flag"
else:
    action = "pass"

This runs in milliseconds. No model to load, no GPU, no API call. Log the score and individual signal values for every answer. Over time that becomes your calibration dataset.
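
One way to log the individual signal values is to compute them into a dictionary and serialize the whole record. The sketch below is an assumption about how that could look, not the author's implementation. The helper names (`signal_values`, `build_log_record`), the record layout, and the sample knowledge text are all hypothetical. Thresholds are the ones from the scoring code.

```python
import json
import re

THRESHOLDS = {
    "length_ratio": 0.1,
    "unknown_word_rate": 0.4,
    "qa_overlap": 0.2,
    "numeric_inconsistency": 0.5,
}

def signal_values(knowledge, question, answer):
    """Compute the four raw signals for one answer."""
    k_tokens = str(knowledge).lower().split()
    a_tokens = str(answer).lower().split()
    k_words, a_words = set(k_tokens), set(a_tokens)
    q_words = set(str(question).lower().split())
    k_nums = set(re.findall(r'\b\d+\b', str(knowledge)))
    a_nums = set(re.findall(r'\b\d+\b', str(answer)))
    return {
        "length_ratio": len(a_tokens) / max(len(k_tokens), 1),
        "unknown_word_rate": len(a_words - k_words) / len(a_words) if a_words else 0.0,
        "qa_overlap": len(q_words & a_words) / len(q_words) if q_words else 0.0,
        "numeric_inconsistency": len(a_nums - k_nums) / len(a_nums) if a_nums else 0.0,
    }

def build_log_record(knowledge, question, answer):
    """Bundle the score, the signals that fired, and the raw values."""
    signals = signal_values(knowledge, question, answer)
    fired = [name for name, value in signals.items() if value > THRESHOLDS[name]]
    return {"score": len(fired), "fired": fired, "signals": signals}

record = build_log_record(
    knowledge="Dua Lipa's debut album was released in 2017.",  # hypothetical text
    question="In what year was Dua Lipa's debut album released?",
    answer="The album was released in 2018.",
)
print(json.dumps(record))  # one JSON line per answer, ready for a log file
```

Keeping the list of fired signals next to the raw values means every flag carries a readable reason, and the raw values can be used later to re-tune thresholds without re-running the pipeline.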

Real Examples

Hallucinated, score 3/4
Question: What U.S. highway gives access to Zilpo Road, and is also known as Midland Trail?
Answer: It's actually Zilpo Road that is known as Midland Trail, not US 60.
The model deflected and contradicted the source instead of answering. Caught.

Hallucinated, score 3/4
Question: Dua Lipa's debut album spawned "New Rules". In what year was it released?
Answer: The album was released in 2018.
The correct year is 2017. Confident and wrong; the numeric flag caught it.

Not hallucinated, score 0/4
Question: The Dutch-Belgian series "House of Anubis" was based on ... first aired in what year?
Answer: 2006.
Correct, grounded, one word. Score zero.

Limitations Worth Knowing

This only works if you have source knowledge to compare against. It does not apply to open-ended generation without a retrievable source. Best fit is RAG pipelines and QA systems.
It uses word-level matching, not semantic understanding. A hallucination that paraphrases the source closely might slip through. The thresholds were tuned on HaluEval, so if you are working in a specialized domain, recalibrate on your own data first.
Precision of 0.71 on the soft flag means about 3 in 10 flags are false alarms. That is a tradeoff, not a flaw. Monitor it.
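
To see what that tradeoff means for a review queue, the arithmetic can be sketched as below. The daily volume of 10,000 answers and the 10% hallucination rate are assumptions chosen for illustration. Precision measured on HaluEval also need not hold at a different hallucination rate, so treat the output as a rough planning estimate. `review_load` is a hypothetical helper.

```python
def review_load(n_answers, hallucination_rate, precision, recall):
    """Estimate flag counts from precision and recall at a given base rate."""
    hallucinated = n_answers * hallucination_rate
    true_positives = recall * hallucinated
    # precision = true_positives / flagged, so flagged = true_positives / precision
    flagged = true_positives / precision if precision else 0.0
    return {
        "flagged": flagged,
        "true_positives": true_positives,
        "false_alarms": flagged - true_positives,
        "missed": hallucinated - true_positives,
    }

# Soft flag figures from the article; volume and base rate are assumed.
load = review_load(n_answers=10_000, hallucination_rate=0.10,
                   precision=0.71, recall=0.96)
for name, value in load.items():
    print(f"{name}: {value:.0f}")
```

Under these assumptions, roughly 1,350 answers would be flagged per day, about 390 of them false alarms, with about 40 hallucinations slipping through.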

Final Thought

AI produces what it receives. If the outputs are not being validated, you will not know what you are getting. This framework is one way to start checking without adding a lot of infrastructure.
Full code on GitHub: github.com/ritikade2/llm-hallucination-detector