Optimizing Chunking and Data Extraction for Zero-Hallucination RAG
AlterLab · 2026-05-28 · via DEV Community

## TL;DR

To achieve near-zero hallucination in RAG pipelines, you must extract web content as structured Markdown or JSON rather than raw HTML, and apply DOM-aware semantic chunking. This preserves contextual boundaries and prevents irrelevant boilerplate or bot-challenge pages from poisoning your vector database.

## Why Standard Web Scraping Breaks RAG Pipelines

Retrieval-Augmented Generation (RAG) relies entirely on the quality of the context provided to the LLM. If your retrieval system feeds the model fragmented, noisy, or irrelevant data, the LLM will hallucinate to fill in the semantic gaps.

Most engineering teams initially build RAG ingestion pipelines by scraping public documentation wholesale, stripping HTML tags to get raw text, and splitting that text into arbitrary 1,000-token chunks. This approach invites hallucination for three reasons:

  1. Semantic Decapitation: Arbitrary token splitting frequently cuts concepts in half. A chunk might contain the arguments of a function but not the function signature itself.
  2. DOM Noise: Headers, footers, navigation sidebars, and cookie banners are embedded into the text stream. The vector database treats "Accept All Cookies" as equally semantically important as the actual documentation content.
  3. Context Poisoning: When scrapers get blocked by anti-bot systems, they often ingest the text of a CAPTCHA or "Access Denied" page. This poisons the vector space with irrelevant security warnings.
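
The DOM noise problem is easy to reproduce. The following is a minimal sketch using only Python's standard library; the `TagStripper` class and the sample HTML (including its rate-limit sentence) are made up for illustration and do not come from any real site.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Naive extractor: collects every text node with no notion of page structure."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Keep any non-empty text, whether it is content or page chrome
        if data.strip():
            self.parts.append(data.strip())


# Hypothetical page: one sentence of documentation wrapped in navigation and a cookie banner
html = """
<nav><a href="/">Home</a><a href="/pricing">Pricing</a></nav>
<main><h2>Rate limits</h2><p>Each API key allows 100 requests per minute.</p></main>
<footer><button>Accept All Cookies</button></footer>
"""

stripper = TagStripper()
stripper.feed(html)
raw_text = " ".join(stripper.parts)
print(raw_text)
```

The navigation links and the cookie button land in the same text stream as the single sentence of real documentation, and nothing in `raw_text` tells the embedding model which part matters.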

To fix this, we need to completely overhaul the ingestion pipeline from the extraction layer up.

## Extracting Structured Data at the Source

Instead of extracting raw HTML and attempting to clean it locally, your scraping infrastructure should return pre-structured formats like Markdown. Markdown implicitly carries DOM hierarchy (headers, lists, tables) without the syntactic noise of HTML tags.

Below is how you configure a pipeline to extract clean, LLM-ready Markdown using AlterLab. Notice how we explicitly request Markdown format and enable JavaScript rendering to ensure we capture dynamically loaded content.

First, the standard HTTP approach:

```bash title="Terminal"
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/public-documentation",
    "format": "markdown",
    "render_js": true
  }'
```

For production Python pipelines, you can use the [Python SDK](https://alterlab.io/web-scraping-api-python) to handle extraction synchronously within your ingestion workers. If you are setting up a new environment, reference the [quickstart guide](https://alterlab.io/docs/quickstart/installation) for installation prerequisites.



```python title="ingestion_worker.py" {4-8,11}
import alterlab  # AlterLab Python SDK, installed per the quickstart guide
client = alterlab.Client("YOUR_API_KEY")

# Extract the page directly as clean, structured Markdown
response = client.scrape(
    url="https://example.com/public-documentation",
    format="markdown",
    render_js=True
)

# This content is now free of HTML tags, scripts, and CSS
clean_markdown = response.content
print(clean_markdown)
```

## Semantic vs. Token-Based Chunking

Once you have clean Markdown, you must chunk it intelligently.

Fixed-size token or character splitters, as commonly configured in LangChain or LlamaIndex, cut the text into fixed-size windows regardless of its structure. If a code block spans 1,500 tokens but your chunk size is 1,000, the code block is split across two separate database entries. When a user queries the system, the vector similarity search might retrieve only the bottom half of the code block. The LLM, lacking the variable definitions from the top half, will hallucinate them.
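
To make the failure mode concrete, here is a minimal sketch of a fixed-size splitter. It is a deliberately naive stand-in, not the LangChain or LlamaIndex implementation, and it counts characters rather than tokens to stay dependency-free. The `fixed_window_split` function and the sample document are hypothetical.

```python
def fixed_window_split(text: str, window: int) -> list[str]:
    """Naive splitter: cut every `window` characters, ignoring structure."""
    return [text[i:i + window] for i in range(0, len(text), window)]


FENCE = "`" * 3  # code-fence marker, built programmatically so this example nests in Markdown

# Hypothetical Markdown document: one header followed by one fenced code block
doc = "\n".join([
    "## Retry helper",
    FENCE + "python",
    "MAX_RETRIES = 3",
    "def fetch(url):",
    "    for attempt in range(MAX_RETRIES):",
    "        ...",
    FENCE,
])

pieces = fixed_window_split(doc, window=60)

# A piece with an odd number of fence markers holds only part of a code block
broken = [p for p in pieces if p.count(FENCE) % 2 == 1]
print(f"{len(pieces)} pieces, {len(broken)} with a partial code block")
```

Each piece that ends up in `broken` would be embedded as its own vector, so a similarity search can return the loop body without the `MAX_RETRIES` definition it depends on.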

Semantic chunking parses the Markdown syntax to split the document along structural boundaries—primarily headers (##, ###) and code blocks.

### Implementing a Markdown-Aware Chunker

Here is a practical implementation of a chunker that respects Markdown structural boundaries, ensuring complete concepts are grouped together in single vectors.

```python title="semantic_chunker.py" {11-14,24-25}
import re

def semantic_markdown_chunking(markdown_text, max_chunk_size=2000):
    """
    Splits document based on H2 (##) and H3 (###) headers
    to preserve semantic boundaries for vector search.

    Note: max_chunk_size is accepted and chunk length is tracked, but
    this version splits on headers only and does not enforce a size cap.
    """
    chunks = []
    current_chunk = []
    current_length = 0

    # Split by lines, but keep code blocks intact
    lines = markdown_text.split('\n')
    in_code_block = False

    for line in lines:
        # A fence line toggles whether we are inside a code block
        if line.startswith('```'):
            in_code_block = not in_code_block

        # If we hit a new header and we aren't inside a code block, split.
        is_header = re.match(r'^#{2,3}\s', line)
        if is_header and not in_code_block and current_chunk:
            chunks.append('\n'.join(current_chunk))
            current_chunk = [line]
            current_length = len(line)
        else:
            current_chunk.append(line)
            current_length += len(line)

    # Append the final chunk
    if current_chunk:
        chunks.append('\n'.join(current_chunk))

    return chunks

# Example Usage:
# (vector_db and embed stand in for your own vector store client and embedding function)
# chunks = semantic_markdown_chunking(clean_markdown)
# for chunk in chunks:
#     vector_db.upsert(embed(chunk))
```

This ensures that if a technical tutorial contains a step-by-step process under a specific `###` header, the entire process is embedded as a single vector. The LLM receives the complete thought, drastically reducing hallucination.
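
One caveat with header-only splitting is that a long section can still outgrow your embedding model's input limit. The sketch below shows one possible fallback and is not part of the chunker above: `split_oversized` is a hypothetical helper that breaks an oversized chunk on blank lines while leaving fenced code blocks whole. It measures size in characters, which is an assumption; swap in a token count if your embedding model needs one.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def split_oversized(chunk: str, max_chunk_size: int = 2000) -> list[str]:
    """Split a chunk at paragraph boundaries when it exceeds max_chunk_size.

    Fenced code blocks are never cut, so a piece can still exceed the cap
    if a single paragraph or code block is larger than max_chunk_size.
    """
    if len(chunk) <= max_chunk_size:
        return [chunk]

    # 1) Cut into units at blank lines that sit outside fenced code blocks
    units, current, in_code_block = [], [], False
    for line in chunk.split("\n"):
        if line.startswith(FENCE):
            in_code_block = not in_code_block
        if line.strip() == "" and not in_code_block:
            if current:
                units.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        units.append("\n".join(current))

    # 2) Greedily pack units into pieces that stay under the cap where possible
    pieces, packed, packed_length = [], [], 0
    for unit in units:
        if packed and packed_length + len(unit) > max_chunk_size:
            pieces.append("\n\n".join(packed))
            packed, packed_length = [], 0
        packed.append(unit)
        packed_length += len(unit)
    if packed:
        pieces.append("\n\n".join(packed))
    return pieces


# Hypothetical section: a paragraph, a code block with an internal blank line, another paragraph
section = (
    "### Setup\n" + "a" * 50 + "\n\n"
    + FENCE + "python\nx = 1\n\ny = 2\n" + FENCE + "\n\n"
    + "b" * 50
)
pieces = split_oversized(section, max_chunk_size=80)
for piece in pieces:
    print(len(piece), repr(piece[:20]))
```

Splitting only at blank lines outside fences is a trade-off: it keeps every code block retrievable as a unit, at the cost of occasionally exceeding the cap.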

## Preventing Context Poisoning with Smart Rendering

The most insidious cause of RAG hallucination is vector database poisoning from failed data extraction. 

Many high-value public data sources (like financial records, API documentation, and e-commerce catalogs) sit behind aggressive CDN-level bot protection. If your scraping pipeline makes a raw `requests.get()` call, it will likely be served a 403 Forbidden page or a CAPTCHA challenge.

If your pipeline blindly vectorizes that 403 page, your RAG context is now polluted with text like "Please verify you are a human." When the retriever searches the database for "API rate limits," it might surface the CAPTCHA text because of overlapping security-related keywords, and the LLM then produces hallucinated, nonsensical answers.

Robust [anti-bot handling](https://alterlab.io/smart-rendering-api) built directly into the extraction layer ensures that your pipeline either receives the actual, rendered public content, or it receives a definitive HTTP 500/403 failure from the scraping API—which your pipeline can explicitly catch and discard, preventing bad data from ever reaching the vector database.
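
On the pipeline side, the "catch and discard" step can be a small gate in front of the embedding call. This is a minimal sketch under stated assumptions: `ScrapeResult` is a hypothetical container, not the AlterLab SDK's response type, and the marker phrases and `min_length` threshold are illustrative values you would tune for your own sources.

```python
from dataclasses import dataclass

# Illustrative phrases that tend to appear on bot-challenge or block pages
CHALLENGE_MARKERS = (
    "verify you are a human",
    "access denied",
    "enable javascript and cookies",
    "captcha",
)


@dataclass
class ScrapeResult:
    """Hypothetical container for whatever your scraping client returns."""
    status_code: int
    content: str


def is_safe_to_index(result: ScrapeResult, min_length: int = 200) -> bool:
    """Return True only if the result looks like real page content."""
    # Explicit failures are discarded rather than vectorized
    if not 200 <= result.status_code < 300:
        return False

    text = result.content.strip().lower()
    # A short page that contains challenge wording is treated as a block page.
    # Long pages are allowed through, since real documentation may mention CAPTCHAs.
    if len(text) < min_length and any(marker in text for marker in CHALLENGE_MARKERS):
        return False
    return True


blocked = ScrapeResult(403, "Access Denied")
challenge = ScrapeResult(200, "Please verify you are a human.")
real_page = ScrapeResult(200, "## Rate limits\n" + "Requests are throttled per key. " * 10)

for result in (blocked, challenge, real_page):
    print(result.status_code, is_safe_to_index(result))
```

Only results that pass the gate should reach the chunker and the vector store. Logging the rejected URLs gives you a retry queue instead of silent gaps.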

## Takeaway

Eliminating hallucination in RAG pipelines requires treating data extraction and chunking as semantic engineering tasks, not just data dumping. By shifting away from raw HTML and token-based splitting toward Markdown extraction and DOM-aware chunking, you provide the LLM with complete, structurally sound concepts. Coupling this with robust rendering layers ensures that your vector database remains a high-signal source of truth, free from bot-challenge noise and fragmented context.
