98. RAG: Give Your AI Access to Your Documents
Akhilesh · 2026-05-27 · via DEV Community

You ask ChatGPT about your company's internal policies. It makes something up. It sounds confident. It's wrong.

That's the hallucination problem. LLMs generate text based on what they learned during training. If the answer wasn't in the training data, they tend to fabricate one that sounds plausible.

RAG (Retrieval-Augmented Generation) addresses this. Before generating, the system retrieves relevant documents from your own knowledge base. The LLM reads those documents and generates an answer grounded in real content.

Your documents. Your data. Accurate answers.


What You'll Learn Here

  • Why RAG beats fine-tuning for knowledge-heavy tasks
  • The complete RAG pipeline: chunk, embed, retrieve, generate
  • Chunking strategies that actually work
  • Building RAG from scratch with sentence-transformers and a local LLM
  • Building RAG with LangChain for real projects
  • Evaluating RAG: what good looks like and what breaks it
  • Common failure modes and how to fix them

RAG vs Fine-Tuning: When to Use Which

Both give LLMs access to new knowledge. They're solving different problems.

Fine-tuning:
  - Best for: teaching style, format, behavior
  - Updates model weights
  - Needs retraining when data changes
  - Can't cite sources easily
  - Expensive to update frequently

RAG:
  - Best for: factual knowledge, documents, databases
  - No weight updates
  - Update knowledge base anytime, instantly
  - Can cite exact source passages
  - Perfect for private or frequently changing data

Rule of thumb:
  Behavior/style change → fine-tune
  Knowledge/facts/documents → RAG
  Both → fine-tune + RAG



The Complete RAG Pipeline

1. INDEXING (done once, offline)
   Load documents
   → Split into chunks
   → Embed each chunk
   → Store in vector database

2. RETRIEVAL (done at query time)
   User sends question
   → Embed the question
   → Find top-k similar chunks
   → Return chunks as context

3. GENERATION (done at query time)
   Build prompt: question + retrieved chunks
   → Send to LLM
   → LLM generates answer grounded in chunks
   → Return answer to user
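
To make the three stages concrete before bringing in real libraries, here is a minimal sketch in plain Python. It is only an illustration: `toy_embed` is a hypothetical stand-in that uses word counts instead of a real embedding model, the sample chunks are made up, and the LLM call itself is left out (the sketch stops at the prompt).

```python
import math
import re
from collections import Counter
from typing import List, Tuple

def toy_embed(text: str) -> Counter:
    # Toy "embedding": lowercase word counts (stand-in for a real embedding model)
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def toy_cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def toy_build_index(chunks: List[str]) -> List[Tuple[str, Counter]]:
    # 1. INDEXING: embed every chunk once and keep the vectors
    return [(chunk, toy_embed(chunk)) for chunk in chunks]

def toy_retrieve(index: List[Tuple[str, Counter]], question: str, top_k: int = 2) -> List[str]:
    # 2. RETRIEVAL: embed the question, rank chunks by similarity, keep the top k
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: toy_cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

def toy_build_prompt(question: str, context: List[str]) -> str:
    # 3. GENERATION: build the prompt that would be sent to the LLM
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"

toy_chunks = [
    "Refunds are issued within 14 days of purchase.",
    "The office is closed on public holidays.",
    "Support is available by email on weekdays.",
]
toy_index = toy_build_index(toy_chunks)
toy_question = "How many days do I have to get a refund?"
toy_top = toy_retrieve(toy_index, toy_question, top_k=1)
print(toy_build_prompt(toy_question, toy_top))
```

Word counts only match exact words ("refund" does not match "refunds"), which is the gap a real embedding model closes.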



Step 1: Chunking Documents

The most underrated step. How you split documents dramatically affects retrieval quality.

import re
from typing import List

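# Strategy 1: Fixed-size chunking
def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    if not 0 <= overlap < chunk_size:
        # start advances by (chunk_size - overlap); a non-positive step would loop forever
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start  = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap   # overlap to preserve context at boundaries
    return chunks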

# Strategy 2: Sentence-aware chunking (better)
def chunk_by_sentences(text: str, max_chunk_size: int = 500) -> List[str]:
    # Split on sentence boundaries
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks    = []
    current   = ""

    for sentence in sentences:
        if len(current) + len(sentence) <= max_chunk_size:
            current += " " + sentence if current else sentence
        else:
            if current:
                chunks.append(current.strip())
            current = sentence

    if current:
        chunks.append(current.strip())

    return chunks

# Strategy 3: Paragraph-aware chunking (often best for structured docs)
def chunk_by_paragraphs(text: str, max_chunk_size: int = 800) -> List[str]:
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks     = []
    current    = ""

    for para in paragraphs:
        if len(current) + len(para) + 2 <= max_chunk_size:
            current += "\n\n" + para if current else para
        else:
            if current:
                chunks.append(current.strip())
            current = para

    if current:
        chunks.append(current.strip())

    return chunks

# Test on sample text
sample_text = """
Machine learning is a branch of artificial intelligence that enables computers to learn from data. 
It has three main types: supervised, unsupervised, and reinforcement learning.

Supervised learning uses labeled examples where the correct answers are known. 
The model learns to map inputs to outputs by minimizing error on training data. 
Common algorithms include linear regression, decision trees, and neural networks.

Unsupervised learning finds patterns in data without labels. 
Clustering algorithms group similar examples together. 
Dimensionality reduction simplifies data while preserving structure.

Reinforcement learning trains an agent to take actions in an environment to maximize reward.
It learns through trial and error, receiving feedback from the environment.
Applications include game playing, robotics, and recommendation systems.
"""

chunks_fixed = chunk_fixed(sample_text, chunk_size=200, overlap=30)
chunks_sents = chunk_by_sentences(sample_text, max_chunk_size=300)
chunks_paras = chunk_by_paragraphs(sample_text, max_chunk_size=400)

print(f"Fixed chunks:     {len(chunks_fixed)}")
print(f"Sentence chunks:  {len(chunks_sents)}")
print(f"Paragraph chunks: {len(chunks_paras)}")

print(f"\nParagraph chunk 1:\n'{chunks_paras[0]}'")
print(f"\nParagraph chunk 2:\n'{chunks_paras[1]}'")


Output:

Fixed chunks:     7
Sentence chunks:  4
Paragraph chunks: 4

Paragraph chunk 1:
'Machine learning is a branch of artificial intelligence that enables computers to learn from data. 
It has three main types: supervised, unsupervised, and reinforcement learning.'

Paragraph chunk 2:
'Supervised learning uses labeled examples where the correct answers are known. 
The model learns to map inputs to outputs by minimizing error on training data. 
Common algorithms include linear regression, decision trees, and neural networks.'


Chunking guidelines:

  • Chunk size 300-600 characters works well for most use cases
  • Always include overlap (50-100 chars) so context isn't lost at boundaries
  • Paragraph chunking preserves semantic units better than fixed-size
  • Smaller chunks: better precision (more specific retrieval)
  • Larger chunks: better recall (more context per chunk)
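
The overlap guideline can also be applied to sentence-aware chunking. The sketch below is one possible way to do it, not the only one: `chunk_sentences_with_overlap` and its `overlap_sentences` parameter are hypothetical names, and the overlap is counted in whole sentences rather than characters so that no sentence is cut in half.

```python
import re
from typing import List

def chunk_sentences_with_overlap(text: str, max_chunk_size: int = 500,
                                 overlap_sentences: int = 1) -> List[str]:
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []

    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chunk_size:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) forward so neighbouring chunks share context.
            # Note: carried sentences plus a long new sentence can exceed the limit.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)

    if current:
        chunks.append(" ".join(current))
    return chunks

overlap_demo = chunk_sentences_with_overlap(
    "A one. B two. C three. D four.", max_chunk_size=16, overlap_sentences=1
)
print(overlap_demo)
# ['A one. B two.', 'B two. C three.', 'C three. D four.']
```

Each chunk repeats the last sentence of the previous one, so a fact that straddles a boundary still appears intact in at least one chunk.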

Step 2: Building the Index

from typing import List

from sentence_transformers import SentenceTransformer
import chromadb

# Knowledge base: a collection of ML documentation
knowledge_base = {
    'doc1.txt': """
        Linear regression predicts a continuous output variable from input features.
        It fits a straight line (or hyperplane in multiple dimensions) through the data.
        The model minimizes the mean squared error between predictions and true values.
        The learned equation is: y = w1*x1 + w2*x2 + ... + b
        Used for: house price prediction, sales forecasting, temperature prediction.
    """,
    'doc2.txt': """
        Logistic regression is used for binary classification despite its name.
        It applies a sigmoid function to the linear combination of features.
        Output is a probability between 0 and 1.
        The decision boundary is where the probability equals 0.5.
        Used for: spam detection, disease diagnosis, fraud detection.
    """,
    'doc3.txt': """
        Random forests combine many decision trees to reduce overfitting.
        Each tree is trained on a random subset of data (bagging).
        Each split considers a random subset of features.
        Final prediction is the majority vote (classification) or average (regression).
        Feature importance can be extracted from the forest.
    """,
    'doc4.txt': """
        XGBoost builds trees sequentially, each one correcting errors from the previous.
        It uses gradient boosting with regularization to prevent overfitting.
        Learning rate controls how much each tree contributes.
        Early stopping prevents overtraining.
        Dominates Kaggle competitions on tabular data.
    """,
    'doc5.txt': """
        Cross-validation gives a reliable estimate of model performance.
        K-fold CV splits data into k equal parts, trains on k-1, tests on 1.
        This is repeated k times with different test sets.
        Average score across folds is the final estimate.
        Prevents optimistic bias from a single train/test split.
    """,
    'doc6.txt': """
        The confusion matrix shows all four prediction outcomes.
        True positives: correctly predicted positive.
        True negatives: correctly predicted negative.
        False positives: incorrectly predicted positive (Type I error).
        False negatives: incorrectly predicted negative (Type II error).
        Precision = TP / (TP + FP). Recall = TP / (TP + FN).
    """,
    'doc7.txt': """
        Overfitting occurs when a model performs well on training data but poorly on test data.
        Signs: large gap between train and validation accuracy.
        Causes: model too complex, too little data, training too long.
        Fixes: regularization, dropout, more data, early stopping, simpler model.
        The bias-variance tradeoff describes the fundamental tension.
    """,
    'doc8.txt': """
        Transformers use self-attention to process sequences in parallel.
        Self-attention computes relationships between all token pairs simultaneously.
        Multi-head attention runs several attention operations in parallel.
        Positional encoding adds position information to token embeddings.
        Layer normalization and residual connections stabilize training.
    """,
}

class RAGIndexer:
    def __init__(self, model_name='sentence-transformers/all-MiniLM-L6-v2'):
        self.model       = SentenceTransformer(model_name)
        self.chroma      = chromadb.Client()
        # get_or_create_collection avoids an error when this code is re-run in the same session
        self.collection  = self.chroma.get_or_create_collection(
            name='rag_knowledge_base',
            metadata={'hnsw:space': 'cosine'}
        )

    def index_documents(self, documents: dict, chunk_size: int = 400):
        all_chunks = []
        all_ids    = []
        all_meta   = []

        for doc_name, content in documents.items():
            chunks = chunk_by_sentences(content, max_chunk_size=chunk_size)
            for i, chunk in enumerate(chunks):
                if len(chunk.strip()) < 30:   # skip tiny chunks
                    continue
                chunk_id = f"{doc_name}_chunk{i}"
                all_chunks.append(chunk.strip())
                all_ids.append(chunk_id)
                all_meta.append({'source': doc_name, 'chunk_idx': i})

        if not all_chunks:
            return

        # Encode all chunks
        print(f"Encoding {len(all_chunks)} chunks...")
        embeddings = self.model.encode(all_chunks, show_progress_bar=False)

        # Add to ChromaDB (upsert overwrites existing IDs, so re-indexing is safe)
        self.collection.upsert(
            ids        = all_ids,
            documents  = all_chunks,
            embeddings = [e.tolist() for e in embeddings],
            metadatas  = all_meta
        )
        print(f"Indexed {len(all_chunks)} chunks from {len(documents)} documents")

    def retrieve(self, query: str, top_k: int = 3) -> List[dict]:
        query_embedding = self.model.encode([query])[0].tolist()

        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )

        retrieved = []
        for doc, meta, dist in zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        ):
            retrieved.append({
                'text':       doc,
                'source':     meta['source'],
                'similarity': 1 - dist   # ChromaDB returns distance, not similarity
            })

        return retrieved

# Build the index
indexer = RAGIndexer()
indexer.index_documents(knowledge_base)

# Test retrieval
query   = "How do I prevent a model from overfitting?"
results = indexer.retrieve(query, top_k=3)

print(f"\nQuery: '{query}'")
print("-" * 60)
for i, r in enumerate(results):
    print(f"\n{i+1}. [{r['similarity']:.3f}] From: {r['source']}")
    print(f"   {r['text'][:150]}...")


Output:

Indexed 16 chunks from 8 documents

Query: 'How do I prevent a model from overfitting?'
------------------------------------------------------------

1. [0.712] From: doc7.txt
   Overfitting occurs when a model performs well on training data but poorly on test data...

2. [0.531] From: doc4.txt
   XGBoost builds trees sequentially, each one correcting errors from the previous...

3. [0.489] From: doc3.txt
   Random forests combine many decision trees to reduce overfitting...
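
The scores in brackets come from the `1 - dist` conversion in `retrieve`. With the cosine space configured on the collection, the distance is one minus the cosine similarity, so subtracting it from 1 recovers the similarity. A small worked example with hand-made 2D vectors (illustrative values, not real embeddings):

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine distance as used here: 0 = same direction, 1 = unrelated, 2 = opposite
    return 1.0 - cosine_similarity(a, b)

query_vec  = [1.0, 0.0]
same_dir   = [2.0, 0.0]    # same direction, different length
orthogonal = [0.0, 3.0]    # nothing in common
opposite   = [-1.0, 0.0]   # points the other way

for name, vec in [("same_dir", same_dir), ("orthogonal", orthogonal), ("opposite", opposite)]:
    dist = cosine_distance(query_vec, vec)
    print(f"{name:<10} distance={dist:.1f} similarity={1 - dist:.1f}")
```

Vector length does not matter, only direction, which is why `same_dir` scores a similarity of 1.0 even though it is twice as long as the query.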



Step 3: Generation With Retrieved Context

# Using a local model via HuggingFace Transformers
from transformers import pipeline

# flan-t5-base keeps this demo lightweight; for a real project, use a larger model or connect to the OpenAI API
generator = pipeline(
    'text2text-generation',
    model='google/flan-t5-base',
    max_new_tokens=200
)

def build_rag_prompt(question: str, context_chunks: List[dict]) -> str:
    context = "\n\n".join([
        f"[Source: {c['source']}]\n{c['text']}"
        for c in context_chunks
    ])

    prompt = f"""Answer the question based only on the provided context.
If the context doesn't contain enough information, say "I don't have enough information to answer this."

Context:
{context}

Question: {question}

Answer:"""

    return prompt

class RAGPipeline:
    def __init__(self, indexer: RAGIndexer, generator_pipeline):
        self.indexer   = indexer
        self.generator = generator_pipeline

    def answer(self, question: str, top_k: int = 3, verbose: bool = False) -> dict:
        # Step 1: Retrieve relevant chunks
        chunks = self.indexer.retrieve(question, top_k=top_k)

        # Step 2: Build prompt
        prompt = build_rag_prompt(question, chunks)

        if verbose:
            print("=== RETRIEVED CONTEXT ===")
            for c in chunks:
                print(f"[{c['source']}] sim={c['similarity']:.3f}: {c['text'][:100]}...")
            print("\n=== PROMPT ===")
            print(prompt[:500] + "...")

        # Step 3: Generate answer
        result = self.generator(prompt)[0]['generated_text']

        return {
            'question': question,
            'answer':   result.strip(),
            'sources':  [c['source'] for c in chunks],
            'chunks':   chunks
        }

# Build the RAG pipeline
rag = RAGPipeline(indexer, generator)

# Ask questions
questions = [
    "What causes overfitting and how do I fix it?",
    "How is precision different from recall?",
    "What makes XGBoost good for competitions?",
    "How do transformers process sequences?",
]

for question in questions:
    result = rag.answer(question)
    print(f"\nQ: {question}")
    print(f"A: {result['answer']}")
    print(f"Sources: {result['sources']}")
    print("-" * 60)
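
One practical detail: the prompt grows with `top_k`, and small local models only accept a limited amount of input. A simple guard is to keep the best-scoring chunks that fit a size budget. The sketch below assumes chunks shaped like the dicts returned by `retrieve` (`text`, `source`, `similarity`). The helper name `fit_chunks_to_budget` and the default budget are hypothetical, and characters are used as a rough proxy for tokens.

```python
from typing import Dict, List

def fit_chunks_to_budget(chunks: List[Dict], max_context_chars: int = 1200) -> List[Dict]:
    """Keep the highest-similarity chunks whose combined text fits the budget."""
    kept: List[Dict] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["similarity"], reverse=True):
        size = len(chunk["text"])
        if used + size > max_context_chars:
            continue  # too big for the remaining budget; a smaller chunk may still fit
        kept.append(chunk)
        used += size
    return kept

budget_demo = [
    {"source": "a.txt", "text": "x" * 50, "similarity": 0.7},
    {"source": "b.txt", "text": "y" * 40, "similarity": 0.5},
    {"source": "c.txt", "text": "z" * 30, "similarity": 0.4},
]
budget_kept = fit_chunks_to_budget(budget_demo, max_context_chars=85)
print([c["source"] for c in budget_kept])   # ['a.txt', 'c.txt']
```

In the pipeline this would sit between retrieval and `build_rag_prompt`. For a tighter fit, the character count could be replaced with the model tokenizer's token count.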



Using the OpenAI API for Better Generation

For production quality, use a real LLM API. The retrieval stays the same. Only the generation step changes.

# Replace the generator with OpenAI API
# pip install openai

import openai

def generate_with_openai(prompt: str, model: str = 'gpt-3.5-turbo') -> str:
    client = openai.OpenAI()   # reads OPENAI_API_KEY from environment

    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                'role': 'system',
                'content': 'You are a helpful assistant. Answer questions based only on the provided context. If the context does not contain enough information, say so clearly.'
            },
            {
                'role': 'user',
                'content': prompt
            }
        ],
        temperature=0.1,   # low temperature for factual answers
        max_tokens=300
    )

    return response.choices[0].message.content

# Integrate into RAG pipeline
class RAGWithOpenAI:
    def __init__(self, indexer: RAGIndexer):
        self.indexer = indexer

    def answer(self, question: str, top_k: int = 3) -> dict:
        chunks = self.indexer.retrieve(question, top_k=top_k)
        prompt = build_rag_prompt(question, chunks)
        answer = generate_with_openai(prompt)

        return {
            'question': question,
            'answer':   answer,
            'sources':  list(dict.fromkeys(c['source'] for c in chunks))   # de-duplicate, keep retrieval order
        }

# rag_openai = RAGWithOpenAI(indexer)
# result = rag_openai.answer("What causes overfitting?")
print("OpenAI RAG pipeline ready (requires OPENAI_API_KEY)")
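
Since only the generation step changes, the pipeline can take the generator as a plain function instead of hard-coding one. The sketch below shows that idea with stubs so it runs without a model or an API key. `make_rag`, `stub_retrieve`, and `stub_generate` are hypothetical names, and the prompt format is simplified.

```python
from typing import Callable, Dict, List

def make_rag(retrieve: Callable[[str, int], List[Dict]],
             generate: Callable[[str], str]) -> Callable[..., Dict]:
    """Wire any retriever and any generator into one answer() function."""
    def answer(question: str, top_k: int = 3) -> Dict:
        found = retrieve(question, top_k)
        context = "\n\n".join(c["text"] for c in found)
        prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
        return {
            "question": question,
            "answer": generate(prompt),
            # de-duplicate sources while keeping retrieval order
            "sources": list(dict.fromkeys(c["source"] for c in found)),
        }
    return answer

def stub_retrieve(question: str, top_k: int) -> List[Dict]:
    # Stand-in for a real retriever: always returns the same ranked chunks
    docs = [
        {"source": "doc7.txt", "text": "Overfitting fixes: regularization, more data."},
        {"source": "doc7.txt", "text": "Signs: gap between train and validation accuracy."},
        {"source": "doc4.txt", "text": "Early stopping prevents overtraining."},
    ]
    return docs[:top_k]

def stub_generate(prompt: str) -> str:
    # Stand-in for an LLM call
    return f"(stub answer, prompt was {len(prompt)} characters)"

stub_rag = make_rag(stub_retrieve, stub_generate)
stub_result = stub_rag("What causes overfitting?")
print(stub_result["sources"])   # ['doc7.txt', 'doc4.txt']
```

With the pieces from this post, `make_rag(indexer.retrieve, generate_with_openai)` would produce the OpenAI-backed version, and swapping the second argument swaps the model.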



LangChain: RAG in 30 Lines

LangChain abstracts the entire RAG pipeline into composable components.

pip install langchain langchain-community langchain-chroma


from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma   # provided by the langchain-chroma package installed above
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline as hf_pipeline

# 1. Load documents
from langchain.schema import Document

docs = [
    Document(
        page_content=content,
        metadata={'source': name}
    )
    for name, content in knowledge_base.items()
]

# 2. Split
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
    separators=['\n\n', '\n', '. ', ' ', '']
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")

# 3. Embed and store
embeddings = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2'
)
vectorstore = Chroma.from_documents(chunks, embeddings)
retriever   = vectorstore.as_retriever(search_kwargs={'k': 3})

# 4. Generation model
gen_pipe = hf_pipeline('text2text-generation', model='google/flan-t5-base', max_new_tokens=200)
llm      = HuggingFacePipeline(pipeline=gen_pipe)

# 5. Chain it together
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type='stuff',            # stuff all chunks into one prompt
    return_source_documents=True
)

# 6. Ask questions
result = qa_chain.invoke({'query': 'What causes overfitting?'})   # calling the chain directly is deprecated
print(f"Answer: {result['result']}")
print(f"Sources: {[d.metadata['source'] for d in result['source_documents']]}")
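
The `separators` list passed to the splitter is tried in order: paragraphs first, then lines, then sentences, then words, then single characters. The sketch below illustrates that idea in plain Python. It is a simplified illustration and not LangChain's actual implementation: `recursive_split` is a hypothetical function, it has no overlap, and it drops the separator at chunk boundaries.

```python
from typing import List

def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split on the coarsest separator first; recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)       # flush what fits so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks

split_demo = recursive_split(
    "Para one is short.\n\nPara two is a bit longer than the limit allows. It has two sentences.",
    chunk_size=50,
    separators=["\n\n", ". ", " ", ""],
)
for piece in split_demo:
    print(len(piece), repr(piece))
```

The short paragraph survives whole, and only the long paragraph is broken down further at a sentence boundary. That is the reason to prefer this strategy over a fixed-size cut.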



Common RAG Failure Modes and Fixes

failures = {
    "Retrieval finds wrong chunks": {
        "symptoms": "Answer is off-topic or doesn't address the question",
        "causes":   ["Chunk too large (contains many topics)", "Poor embedding model for domain"],
        "fixes":    ["Smaller chunks (200-400 chars)", "Domain-specific embedding model",
                     "Hybrid search (keyword + semantic)"]
    },
    "Chunks miss key information": {
        "symptoms": "Model says 'I don't know' but answer is in the documents",
        "causes":   ["Chunk boundary cut the relevant sentence",
                     "top_k too small", "Query and document phrasing too different"],
        "fixes":    ["Add overlap between chunks", "Increase top_k to 5-7",
                     "Query expansion (rephrase query multiple ways and merge results)"]
    },
    "Model ignores retrieved context": {
        "symptoms": "Answer doesn't match the retrieved chunks at all",
        "causes":   ["LLM is too small", "Prompt not clear about using only context"],
        "fixes":    ["Use larger/better LLM", "Stronger prompt instructions",
                     "Lower temperature"]
    },
    "Too much irrelevant context": {
        "symptoms": "Model is confused, answer is vague",
        "causes":   ["top_k too high", "All chunks have low similarity scores"],
        "fixes":    ["Filter chunks below similarity threshold",
                     "Reduce top_k to 2-3", "Check if query is answerable"]
    },
    "Hallucination despite retrieval": {
        "symptoms": "Model generates facts not in the retrieved context",
        "causes":   ["Model overrides context with training knowledge",
                     "Prompt not clear enough"],
        "fixes":    ["Explicit 'only use context' instruction in system prompt",
                     "Ask model to quote from context",
                     "Use an LLM that follows instructions more reliably"]
    }
}

for failure, info in failures.items():
    print(f"\n{failure}")
    print(f"  Symptoms: {info['symptoms']}")
    print("  Causes:")
    for cause in info['causes']:
        print(f"    - {cause}")
    print("  Fixes:")
    for fix in info['fixes']:
        print(f"    - {fix}")
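
The "filter chunks below similarity threshold" fix is short enough to sketch. It assumes chunks shaped like the dicts returned by `retrieve` and takes the generator as a plain function. The names `filter_by_similarity` and `answer_or_refuse` are hypothetical, and the 0.4 default is only a starting point that should be tuned per embedding model.

```python
from typing import Callable, Dict, List

NO_ANSWER = "I don't have information about this."

def filter_by_similarity(chunks: List[Dict], min_similarity: float = 0.4) -> List[Dict]:
    # Drop chunks that are only weakly related to the question
    return [c for c in chunks if c["similarity"] >= min_similarity]

def answer_or_refuse(question: str, chunks: List[Dict],
                     generate: Callable[[str], str],
                     min_similarity: float = 0.4) -> str:
    relevant = filter_by_similarity(chunks, min_similarity)
    if not relevant:
        return NO_ANSWER   # refuse instead of letting the model guess
    context = "\n\n".join(c["text"] for c in relevant)
    return generate(f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:")

threshold_demo = [
    {"source": "doc7.txt", "text": "Overfitting fixes: regularization.", "similarity": 0.71},
    {"source": "doc2.txt", "text": "Logistic regression outputs a probability.", "similarity": 0.22},
]
print(len(filter_by_similarity(threshold_demo)))                              # 1
print(answer_or_refuse("Who won the match?", threshold_demo[1:], lambda p: "unused"))
```

Refusing early also saves an LLM call when nothing relevant was retrieved.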



Evaluating RAG Quality

# Simple evaluation: does the answer contain key information?
def evaluate_rag_answer(answer: str, expected_keywords: List[str]) -> dict:
    answer_lower   = answer.lower()
    found_keywords = [k for k in expected_keywords if k.lower() in answer_lower]
    # Guard against an empty keyword list to avoid dividing by zero
    coverage       = len(found_keywords) / len(expected_keywords) if expected_keywords else 0.0

    return {
        'coverage':         coverage,
        'found_keywords':   found_keywords,
        'missing_keywords': [k for k in expected_keywords if k not in found_keywords]
    }

# Test cases
test_cases = [
    {
        'question': "What causes overfitting?",
        'keywords': ['complex', 'training', 'gap', 'regularization']
    },
    {
        'question': "How does cross-validation work?",
        'keywords': ['k-fold', 'split', 'average', 'estimate']
    },
]

print("RAG Evaluation Results:")
print("-" * 60)
for test in test_cases:
    result = rag.answer(test['question'])
    eval_  = evaluate_rag_answer(result['answer'], test['keywords'])

    print(f"\nQ: {test['question']}")
    print(f"A: {result['answer'][:150]}...")
    print(f"Keyword coverage: {eval_['coverage']:.1%}")
    print(f"Missing: {eval_['missing_keywords']}")
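
Keyword coverage mixes retrieval and generation errors together. It often helps to score retrieval on its own: for each test question, did the expected document show up in the top-k results? The sketch below assumes a retriever with the same call shape as `indexer.retrieve`. `retrieval_hit_rate`, the `expected_source` field, and the stub retriever are hypothetical and exist only to make the example runnable.

```python
from typing import Callable, Dict, List

def retrieval_hit_rate(retrieve: Callable[[str, int], List[Dict]],
                       cases: List[Dict], top_k: int = 3) -> float:
    """Fraction of questions whose expected source appears in the top-k results."""
    if not cases:
        return 0.0
    hits = 0
    for case in cases:
        sources = [c["source"] for c in retrieve(case["question"], top_k)]
        if case["expected_source"] in sources:
            hits += 1
    return hits / len(cases)

def stub_ranked_retriever(question: str, top_k: int) -> List[Dict]:
    # Stand-in for a real retriever: fixed rankings per question
    ranked = {
        "What causes overfitting?":        ["doc7.txt", "doc4.txt", "doc3.txt"],
        "How does cross-validation work?": ["doc6.txt", "doc1.txt", "doc5.txt"],
    }
    return [{"source": s} for s in ranked[question][:top_k]]

retrieval_cases = [
    {"question": "What causes overfitting?",        "expected_source": "doc7.txt"},
    {"question": "How does cross-validation work?", "expected_source": "doc5.txt"},
]
print(retrieval_hit_rate(stub_ranked_retriever, retrieval_cases, top_k=3))   # 1.0
print(retrieval_hit_rate(stub_ranked_retriever, retrieval_cases, top_k=2))   # 0.5
```

If the hit rate is low, no prompt change will fix the answers, so chunking and embeddings are the place to look first.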


For production, use RAGAS (Retrieval Augmented Generation Assessment), which evaluates faithfulness, answer relevancy, and context precision automatically.


Quick Cheat Sheet

| Step | What it does | Key decision |
|---|---|---|
| Chunking | Split docs into pieces | Size 300-600 chars, overlap 50-100 |
| Embedding | Convert chunks to vectors | all-MiniLM-L6-v2 to start |
| Indexing | Store in vector DB | ChromaDB for dev, Pinecone for prod |
| Retrieval | Find top-k similar chunks | k=3 to 5 usually works |
| Generation | Build prompt + call LLM | Include retrieved context explicitly |

| Problem | Quick fix |
|---|---|
| Wrong chunks retrieved | Smaller chunks, better embedding model |
| Answer not in chunks | Add overlap, increase top-k |
| Model ignores context | Stronger prompt, lower temperature |
| Too slow | Smaller embedding model, FAISS ANN index |
| Hallucinations | Explicit "only use context" in system prompt |

Practice Challenges

Level 1:
Pick any 10 Wikipedia articles on a topic you know. Chunk them, embed them, and store them in ChromaDB. Ask 5 questions where you already know the answer. Did RAG get them right?

Level 2:
Compare three chunking strategies (fixed-size, sentence-aware, paragraph-aware) on the same document set. For each strategy, retrieve the top-3 chunks for 5 queries. Judging by eye, which strategy retrieves the most relevant chunks?

Level 3:
Build a complete RAG pipeline with source citations. For each answer, show which document chunk it came from and highlight the specific sentence that grounded the answer. Add a similarity threshold: if the top-k chunks all score below 0.4, return "I don't have information about this" instead of guessing.


Next up, Post 99: Build a Chatbot With Memory. Conversation history, context management, multi-turn dialogue. We build a chatbot that actually remembers what you said earlier in the conversation.