# Optimizing Chunking and Data Extraction for Zero-Hallucination RAG

## TL;DR

To achieve near-zero hallucination in RAG pipelines, you must extract web content as structured Markdown or JSON rather than raw HTML, and apply DOM-aware semantic chunking. This preserves contextual boundaries and prevents irrelevant boilerplate or bot-challenge pages from poisoning your vector database.

## Why Standard Web Scraping Breaks RAG Pipelines

Retrieval-Augmented Generation (RAG) relies entirely on the quality of the context provided to the LLM. If your retrieval system feeds the model fragmented, noisy, or irrelevant data, the LLM will hallucinate to fill in the semantic gaps.

Most engineering teams initially build RAG ingestion pipelines by blindly scraping public documentation, stripping HTML tags to get raw text, and splitting that text into arbitrary 1,000-token chunks. This approach guarantees hallucination for three reasons:

1. **Semantic Decapitation:** Arbitrary token splitting frequently cuts concepts in half. A chunk might contain the arguments of a function but not the function signature itself.
2. **DOM Noise:** Headers, footers, navigation sidebars, and cookie banners are embedded into the text stream. The vector database treats "Accept All Cookies" as equally semantically important as the actual documentation content.
3. **Context Poisoning:** When scrapers get blocked by anti-bot systems, they often ingest the text of a CAPTCHA or "Access Denied" page. This poisons the vector space with irrelevant security warnings.
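
To make the DOM-noise problem concrete, here is a minimal sketch that uses only Python's standard-library `html.parser`. The sample page and the `strip_tags` helper are illustrative assumptions, not part of any real pipeline. The point is that naive tag stripping leaves navigation links and cookie-banner text mixed in with the one sentence of documentation you actually wanted.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects every text node, the way a naive 'strip the tags' step does."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def strip_tags(html):
    """Return all visible text in the page, with no notion of page regions."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page: one sentence of documentation wrapped in site chrome.
SAMPLE_PAGE = """
<nav><a href="/">Home</a> <a href="/pricing">Pricing</a></nav>
<main><h2>Rate limits</h2><p>Each key allows 100 requests per minute.</p></main>
<footer><button>Accept All Cookies</button></footer>
"""

raw_text = strip_tags(SAMPLE_PAGE)
print(raw_text)
```

The printed text contains "Home", "Pricing", and "Accept All Cookies" alongside the rate-limit sentence, and nothing in the string tells an embedding model which part is the content.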

To fix this, we need to completely overhaul the ingestion pipeline from the extraction layer up.

## Extracting Structured Data at the Source

Instead of extracting raw HTML and attempting to clean it locally, your scraping infrastructure should return pre-structured formats like Markdown. Markdown implicitly carries DOM hierarchy (headers, lists, tables) without the syntactic noise of HTML tags.

Below is how you configure a pipeline to extract clean, LLM-ready Markdown using AlterLab. Notice how we explicitly request Markdown format and enable JavaScript rendering to ensure we capture dynamically loaded content.

First, the standard HTTP approach:

```bash title="Terminal"
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/public-documentation",
    "format": "markdown",
    "render_js": true
  }'
```

For production Python pipelines, you can use the [Python SDK](https://alterlab.io/web-scraping-api-python) to handle extraction synchronously within your ingestion workers. If you are setting up a new environment, reference the [quickstart guide](https://alterlab.io/docs/quickstart/installation) for installation prerequisites.



```python title="ingestion_worker.py" {4-8,11}
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Extract the page directly as clean, structured Markdown
response = client.scrape(
    url="https://example.com/public-documentation",
    format="markdown",
    render_js=True
)

# This content is now free of HTML tags, scripts, and CSS
clean_markdown = response.content
print(clean_markdown)
```

## Semantic vs. Token-Based Chunking

Once you have clean Markdown, you must chunk it intelligently.

Token-based splitters, such as the ones LangChain and LlamaIndex provide, cut text into windows of a fixed size, optionally with some overlap, with no awareness of Markdown structure. If a code block spans 1,500 tokens but your chunk size is 1,000, the code block is split across two separate database entries. When a user queries the system, the vector similarity search might retrieve only the bottom half of the code block. The LLM, lacking the variable definitions from the top half, is likely to invent them.
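
The failure mode is easy to reproduce. The sketch below uses a plain character window as a stand-in for a token window, an assumption made to keep the example dependency-free; it is not the actual splitter from LangChain or LlamaIndex. The `DOC` snippet and the `connect` call inside it are hypothetical.

```python
def fixed_window_chunks(text, size):
    """Cut text into consecutive windows of `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


FENCE = "`" * 3  # Markdown code fence, built here to keep this example readable

# Hypothetical documentation snippet: a heading followed by a short code sample.
DOC = (
    "## Connecting\n"
    + FENCE + "python\n"
    + "timeout = 30\n"
    + "retries = 3\n"
    + "client = connect(host, timeout=timeout, retries=retries)\n"
    + FENCE + "\n"
)

chunks = fixed_window_chunks(DOC, size=60)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)

# The chunk that uses `timeout` as an argument no longer contains its definition.
usage_chunk = next(c for c in chunks if "timeout=timeout" in c)
print("definition present:", "timeout = 30" in usage_chunk)
```

With a 60-character window the cut lands in the middle of the `connect(...)` line, so the second chunk references `timeout` and `retries` without the lines that define them.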

Semantic chunking parses the Markdown syntax to split the document along structural boundaries, primarily headers (`##`, `###`), while keeping fenced code blocks intact.

### Implementing a Markdown-Aware Chunker

Here is a practical implementation of a chunker that respects Markdown structural boundaries, ensuring complete concepts are grouped together in single vectors.

```python title="semantic_chunker.py" {11-14,24-25}
import re


def semantic_markdown_chunking(markdown_text, max_chunk_size=2000):
    """
    Splits document based on H2 (##) and H3 (###) headers
    to preserve semantic boundaries for vector search.

    Note: max_chunk_size is accepted but not enforced here; a section
    is never split below header level.
    """
    chunks = []
    current_chunk = []

    # Split by lines, but keep code blocks intact
    lines = markdown_text.split('\n')
    in_code_block = False

    for line in lines:
        # Track fenced code blocks so a '##' line inside code is not
        # mistaken for a Markdown header
        if line.startswith('```'):
            in_code_block = not in_code_block

        # If we hit a new header and we aren't inside a code block, split.
        is_header = re.match(r'^#{2,3}\s', line)
        if is_header and not in_code_block and current_chunk:
            chunks.append('\n'.join(current_chunk))
            current_chunk = [line]
        else:
            current_chunk.append(line)

    # Append the final chunk
    if current_chunk:
        chunks.append('\n'.join(current_chunk))

    return chunks


# Example usage (vector_db and embed stand in for your own vector store
# client and embedding function):
# chunks = semantic_markdown_chunking(clean_markdown)
# for chunk in chunks:
#     vector_db.upsert(embed(chunk))
```

This ensures that if a technical tutorial contains a step-by-step process under a specific `###` header, the entire process is embedded as a single vector. The LLM receives the complete thought, drastically reducing hallucination.
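
The chunker above accepts a `max_chunk_size` argument but splits only on headers, so a very long section still becomes one large chunk. One possible way to enforce the limit is a second pass over each header-level chunk. The sketch below is an assumption about how that could look, not part of the original implementation: the helper names `split_into_blocks` and `split_oversized` are hypothetical, and the strategy (cut only at blank lines outside fenced code blocks) is one choice among several.

```python
FENCE = "`" * 3  # Markdown code fence marker


def split_into_blocks(chunk):
    """Split on blank lines, but treat a fenced code block as one unit."""
    blocks, current, in_code_block = [], [], False
    for line in chunk.split("\n"):
        if line.lstrip().startswith(FENCE):
            in_code_block = not in_code_block
        if line.strip() == "" and not in_code_block:
            # Blank line outside code: close the current block, if any.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def split_oversized(chunk, max_chunk_size=2000):
    """
    Greedily pack blocks into pieces of at most max_chunk_size characters.
    A single block longer than the limit is emitted whole rather than cut.
    """
    pieces, current, current_length = [], [], 0
    for block in split_into_blocks(chunk):
        added = len(block) + (2 if current else 0)  # 2 for the "\n\n" joiner
        if current and current_length + added > max_chunk_size:
            pieces.append("\n\n".join(current))
            current, current_length = [], 0
            added = len(block)
        current.append(block)
        current_length += added
    if current:
        pieces.append("\n\n".join(current))
    return pieces


# Hypothetical section: a header, two paragraphs, and a short code block.
section = (
    "### Setup\n\n" + "A" * 50 + "\n\n" + "B" * 50 + "\n\n"
    + FENCE + "\nx = 1\n\ny = 2\n" + FENCE
)
pieces = split_oversized(section, max_chunk_size=80)
for piece in pieces:
    print(len(piece), repr(piece[:20]))
```

Note the trade-off: every piece after the first loses its section header, so you may want to prepend the header text to each piece before embedding.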

## Preventing Context Poisoning with Smart Rendering

The most insidious cause of RAG hallucination is vector database poisoning from failed data extraction. 

Many high-value public data sources (like financial records, API documentation, and e-commerce catalogs) sit behind aggressive CDN-level bot protection. If your scraping pipeline makes a raw `requests.get()` call, it will likely be served a 403 Forbidden page or a CAPTCHA challenge.

If your pipeline blindly vectorizes that 403 page, your vector database is now polluted with text like "Please verify you are a human." When a user asks about "API rate limits," the retriever might pull the CAPTCHA text because of overlapping security-related vocabulary, and the LLM then produces hallucinated, nonsensical answers.

Robust [anti-bot handling](https://alterlab.io/smart-rendering-api) built directly into the extraction layer ensures that your pipeline receives either the actual, rendered public content or a definitive HTTP 500/403 failure from the scraping API. Your pipeline can explicitly catch and discard that failure, preventing bad data from ever reaching the vector database.
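
A guard of this kind can be sketched in a few lines. The function name `is_ingestable`, the marker phrases, and the length threshold below are illustrative assumptions rather than part of any scraping API; treat it as a heuristic to tune against your own sources.

```python
# Hypothetical phrases that often appear on bot-challenge or block pages.
CHALLENGE_MARKERS = (
    "verify you are a human",
    "access denied",
    "enable javascript and cookies",
)


def is_ingestable(status_code, text, short_page_chars=500):
    """Return True only if a scrape result looks like real page content."""
    # Any non-2xx status is an explicit failure: discard it.
    if not 200 <= status_code < 300:
        return False
    lowered = text.lower()
    looks_like_challenge = any(marker in lowered for marker in CHALLENGE_MARKERS)
    # Heuristic: reject short pages that contain a challenge phrase. Long pages
    # that merely mention such a phrase (e.g. security docs) are allowed through.
    return not (looks_like_challenge and len(text) < short_page_chars)


# Hypothetical scrape results as (status_code, text) pairs.
results = [
    (200, "## Rate limits\nEach key allows 100 requests per minute."),
    (403, "Forbidden"),
    (200, "Please verify you are a human."),
]
clean_pages = [text for status, text in results if is_ingestable(status, text)]
print(len(clean_pages))
```

Only the pages that pass the guard should be chunked and embedded; everything else is better logged and retried than stored.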

## Takeaway

Eliminating hallucination in RAG pipelines requires treating data extraction and chunking as semantic engineering tasks, not just data dumping. By shifting away from raw HTML and token-based splitting toward Markdown extraction and DOM-aware chunking, you provide the LLM with complete, structurally sound concepts. Coupling this with robust rendering layers ensures that your vector database remains a high-signal source of truth, free from bot-challenge noise and fragmented context.
