You ask ChatGPT about your company's internal policies. It makes something up. It sounds confident. It's wrong.
That's the hallucination problem. LLMs generate text based on patterns they learned during training. If the answer wasn't in the training data, they often fabricate one that sounds plausible.
RAG (Retrieval-Augmented Generation) addresses this. Before generating, the system retrieves relevant documents from your own knowledge base. The LLM reads those documents and generates an answer grounded in real content.
Your documents. Your data. Accurate answers.
What You'll Learn Here
- Why RAG beats fine-tuning for knowledge-heavy tasks
- The complete RAG pipeline: chunk, embed, retrieve, generate
- Chunking strategies that actually work
- Building RAG from scratch with sentence-transformers and a local LLM
- Building RAG with LangChain for real projects
- Evaluating RAG: what good looks like and what breaks it
- Common failure modes and how to fix them
RAG vs Fine-Tuning: When to Use Which
Both give LLMs access to new knowledge. They're solving different problems.
Fine-tuning:
- Best for: teaching style, format, behavior
- Updates model weights
- Needs retraining when data changes
- Can't cite sources easily
- Expensive to update frequently
RAG:
- Best for: factual knowledge, documents, databases
- No weight updates
- Update knowledge base anytime, instantly
- Can cite exact source passages
- Perfect for private or frequently changing data
Rule of thumb:
Behavior/style change → fine-tune
Knowledge/facts/documents → RAG
Both → fine-tune + RAG
The Complete RAG Pipeline
1. INDEXING (done once, offline)
Load documents
→ Split into chunks
→ Embed each chunk
→ Store in vector database
2. RETRIEVAL (done at query time)
User sends question
→ Embed the question
→ Find top-k similar chunks
→ Return chunks as context
3. GENERATION (done at query time)
Build prompt: question + retrieved chunks
→ Send to LLM
→ LLM generates answer grounded in chunks
→ Return answer to user
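Before bringing in real models, here is a minimal sketch of the three stages using only the standard library. The bag-of-words "embedding", the function names (`toy_embed`, `toy_retrieve`, `toy_prompt`) and the policy text are all made up for illustration; a real pipeline swaps in a trained embedding model and a vector database, as the steps below do.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. INDEXING: split into chunks (one sentence each here) and embed every chunk
document = (
    "Refunds are issued within 14 days. "
    "Support is available on weekdays. "
    "Passwords must be rotated every 90 days."
)
toy_chunks = [s.strip() for s in document.split(". ") if s.strip()]
toy_index = [(chunk, toy_embed(chunk)) for chunk in toy_chunks]

# 2. RETRIEVAL: embed the question and rank chunks by similarity
def toy_retrieve(question, top_k=1):
    q = toy_embed(question)
    ranked = sorted(toy_index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# 3. GENERATION: build the prompt that would be sent to the LLM
def toy_prompt(question, context_chunks):
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

question = "How often must passwords be rotated?"
top_chunks = toy_retrieve(question)
print(toy_prompt(question, top_chunks))
```

Word-count vectors only match on shared words, which is exactly the weakness learned embeddings are meant to overcome.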
Step 1: Chunking Documents
The most underrated step. How you split documents dramatically affects retrieval quality.
import re
from typing import List

# Strategy 1: Fixed-size chunking
def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    # An overlap >= chunk_size would never advance and loop forever
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; don't emit a tail that only repeats the overlap
        start = end - overlap  # overlap to preserve context at boundaries
    return chunks

# Strategy 2: Sentence-aware chunking (better)
def chunk_by_sentences(text: str, max_chunk_size: int = 500) -> List[str]:
    # Split on sentence boundaries
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if len(current) + len(sentence) <= max_chunk_size:
            current += " " + sentence if current else sentence
        else:
            if current:
                chunks.append(current.strip())
            current = sentence
    if current:
        chunks.append(current.strip())
    return chunks

# Strategy 3: Paragraph-aware chunking (often best for structured docs)
def chunk_by_paragraphs(text: str, max_chunk_size: int = 800) -> List[str]:
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        if len(current) + len(para) + 2 <= max_chunk_size:
            current += "\n\n" + para if current else para
        else:
            if current:
                chunks.append(current.strip())
            current = para
    if current:
        chunks.append(current.strip())
    return chunks

# Test on sample text (blank lines separate the paragraphs)
sample_text = """
Machine learning is a branch of artificial intelligence that enables computers to learn from data.
It has three main types: supervised, unsupervised, and reinforcement learning.

Supervised learning uses labeled examples where the correct answers are known.
The model learns to map inputs to outputs by minimizing error on training data.
Common algorithms include linear regression, decision trees, and neural networks.

Unsupervised learning finds patterns in data without labels.
Clustering algorithms group similar examples together.
Dimensionality reduction simplifies data while preserving structure.

Reinforcement learning trains an agent to take actions in an environment to maximize reward.
It learns through trial and error, receiving feedback from the environment.
Applications include game playing, robotics, and recommendation systems.
"""
chunks_fixed = chunk_fixed(sample_text, chunk_size=200, overlap=30)
chunks_sents = chunk_by_sentences(sample_text, max_chunk_size=300)
chunks_paras = chunk_by_paragraphs(sample_text, max_chunk_size=400)
print(f"Fixed chunks: {len(chunks_fixed)}")
print(f"Sentence chunks: {len(chunks_sents)}")
print(f"Paragraph chunks: {len(chunks_paras)}")
print(f"\nParagraph chunk 1:\n'{chunks_paras[0]}'")
print(f"\nParagraph chunk 2:\n'{chunks_paras[1]}'")
Output:
Fixed chunks: 5
Sentence chunks: 4
Paragraph chunks: 4
Paragraph chunk 1:
'Machine learning is a branch of artificial intelligence that enables computers to learn from data.
It has three main types: supervised, unsupervised, and reinforcement learning.'
Paragraph chunk 2:
'Supervised learning uses labeled examples where the correct answers are known.
The model learns to map inputs to outputs by minimizing error on training data.
Common algorithms include linear regression, decision trees, and neural networks.'
Chunking guidelines:
- Chunk size 300-600 characters works well for most use cases
- Always include overlap (50-100 chars) so context isn't lost at boundaries
- Paragraph chunking preserves semantic units better than fixed-size
- Smaller chunks: better precision (more specific retrieval)
- Larger chunks: better recall (more context per chunk)
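The sentence and paragraph chunkers above don't carry any overlap between chunks, so here is one way to combine the two ideas. This is a sketch, not the only approach: `chunk_sentences_with_overlap` and its `overlap_sentences` parameter are names made up for this example, and the overlap is counted in whole sentences rather than characters so no sentence is cut in half.

```python
import re
from typing import List

def chunk_sentences_with_overlap(text: str, max_chunk_size: int = 500,
                                 overlap_sentences: int = 1) -> List[str]:
    """Sentence-aware chunking where each chunk starts with the tail of the previous one."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chunk_size:
            chunks.append(" ".join(current))
            # Carry the last N sentences into the next chunk as overlap
            carry = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop the overlap if it would push the next chunk over the limit
            if len(" ".join(carry + [sentence])) > max_chunk_size:
                carry = []
            current = carry
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

demo_text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
demo_chunks = chunk_sentences_with_overlap(demo_text, max_chunk_size=40, overlap_sentences=1)
for chunk in demo_chunks:
    print(repr(chunk))
```

Each chunk repeats the last sentence of the one before it, so a fact that spans a chunk boundary still appears whole in at least one chunk. A single sentence longer than `max_chunk_size` is kept as its own oversized chunk.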
Step 2: Building the Index
from typing import List

from sentence_transformers import SentenceTransformer
import chromadb
# Knowledge base: a collection of ML documentation
knowledge_base = {
'doc1.txt': """
Linear regression predicts a continuous output variable from input features.
It fits a straight line (or hyperplane in multiple dimensions) through the data.
The model minimizes the mean squared error between predictions and true values.
The learned equation is: y = w1*x1 + w2*x2 + ... + b
Used for: house price prediction, sales forecasting, temperature prediction.
""",
'doc2.txt': """
Logistic regression is used for binary classification despite its name.
It applies a sigmoid function to the linear combination of features.
Output is a probability between 0 and 1.
The decision boundary is where the probability equals 0.5.
Used for: spam detection, disease diagnosis, fraud detection.
""",
'doc3.txt': """
Random forests combine many decision trees to reduce overfitting.
Each tree is trained on a random subset of data (bagging).
Each split considers a random subset of features.
Final prediction is the majority vote (classification) or average (regression).
Feature importance can be extracted from the forest.
""",
'doc4.txt': """
XGBoost builds trees sequentially, each one correcting errors from the previous.
It uses gradient boosting with regularization to prevent overfitting.
Learning rate controls how much each tree contributes.
Early stopping prevents overtraining.
Dominates Kaggle competitions on tabular data.
""",
'doc5.txt': """
Cross-validation gives a reliable estimate of model performance.
K-fold CV splits data into k equal parts, trains on k-1, tests on 1.
This is repeated k times with different test sets.
Average score across folds is the final estimate.
Prevents optimistic bias from a single train/test split.
""",
'doc6.txt': """
The confusion matrix shows all four prediction outcomes.
True positives: correctly predicted positive.
True negatives: correctly predicted negative.
False positives: incorrectly predicted positive (Type I error).
False negatives: incorrectly predicted negative (Type II error).
Precision = TP / (TP + FP). Recall = TP / (TP + FN).
""",
'doc7.txt': """
Overfitting occurs when a model performs well on training data but poorly on test data.
Signs: large gap between train and validation accuracy.
Causes: model too complex, too little data, training too long.
Fixes: regularization, dropout, more data, early stopping, simpler model.
The bias-variance tradeoff describes the fundamental tension.
""",
'doc8.txt': """
Transformers use self-attention to process sequences in parallel.
Self-attention computes relationships between all token pairs simultaneously.
Multi-head attention runs several attention operations in parallel.
Positional encoding adds position information to token embeddings.
Layer normalization and residual connections stabilize training.
""",
}
class RAGIndexer:
    def __init__(self, model_name='sentence-transformers/all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.chroma = chromadb.Client()
        # get_or_create_collection avoids an "already exists" error when this code is re-run
        self.collection = self.chroma.get_or_create_collection(
            name='rag_knowledge_base',
            metadata={'hnsw:space': 'cosine'}
        )

    def index_documents(self, documents: dict, chunk_size: int = 400):
        all_chunks = []
        all_ids = []
        all_meta = []
        for doc_name, content in documents.items():
            chunks = chunk_by_sentences(content, max_chunk_size=chunk_size)
            for i, chunk in enumerate(chunks):
                if len(chunk.strip()) < 30:  # skip tiny chunks
                    continue
                chunk_id = f"{doc_name}_chunk{i}"
                all_chunks.append(chunk.strip())
                all_ids.append(chunk_id)
                all_meta.append({'source': doc_name, 'chunk_idx': i})
        if not all_chunks:
            return
        # Encode all chunks
        print(f"Encoding {len(all_chunks)} chunks...")
        embeddings = self.model.encode(all_chunks, show_progress_bar=False)
        # Add to ChromaDB (upsert overwrites existing IDs, so re-indexing is safe)
        self.collection.upsert(
            ids=all_ids,
            documents=all_chunks,
            embeddings=[e.tolist() for e in embeddings],
            metadatas=all_meta
        )
        print(f"Indexed {len(all_chunks)} chunks from {len(documents)} documents")

    def retrieve(self, query: str, top_k: int = 3) -> List[dict]:
        query_embedding = self.model.encode([query])[0].tolist()
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )
        retrieved = []
        for doc, meta, dist in zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        ):
            retrieved.append({
                'text': doc,
                'source': meta['source'],
                'similarity': 1 - dist  # ChromaDB returns cosine distance, not similarity
            })
        return retrieved
# Build the index
indexer = RAGIndexer()
indexer.index_documents(knowledge_base)
# Test retrieval
query = "How do I prevent a model from overfitting?"
results = indexer.retrieve(query, top_k=3)
print(f"\nQuery: '{query}'")
print("-" * 60)
for i, r in enumerate(results):
    print(f"\n{i+1}. [{r['similarity']:.3f}] From: {r['source']}")
    print(f"   {r['text'][:150]}...")
Output:
Indexed 16 chunks from 8 documents
Query: 'How do I prevent a model from overfitting?'
------------------------------------------------------------
1. [0.712] From: doc7.txt
Overfitting occurs when a model performs well on training data but poorly on test data...
2. [0.531] From: doc4.txt
XGBoost builds trees sequentially, each one correcting errors from the previous...
3. [0.489] From: doc3.txt
Random forests combine many decision trees to reduce overfitting...
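To make the similarity scores less of a black box, here is a sketch of what a cosine-space query computes, written as an exact brute-force search in plain Python. The two-dimensional vectors are toy values invented for the example (real MiniLM embeddings have 384 dimensions), and `cosine_similarity` and `brute_force_top_k` are hypothetical helper names. ChromaDB itself uses an approximate HNSW index rather than scanning every vector.

```python
import math
from typing import List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def brute_force_top_k(query_vec: Sequence[float],
                      chunk_vecs: List[Sequence[float]],
                      k: int = 3) -> List[Tuple[int, float]]:
    """Score every chunk against the query and return the k best (index, similarity) pairs."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up chunk embeddings
toy_query = [1.0, 0.0]                               # made-up query embedding
hits = brute_force_top_k(toy_query, toy_vectors, k=2)
for idx, sim in hits:
    # Cosine distance is 1 - similarity, which is why retrieve() converts with 1 - dist
    print(f"chunk {idx}: similarity={sim:.3f}, distance={1 - sim:.3f}")
```

Brute force is fine for a few thousand chunks. An approximate index starts to matter once the collection is large enough that scanning everything per query gets slow.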
Step 3: Generation With Retrieved Context
# Using a local model via HuggingFace Transformers
from transformers import pipeline

# flan-t5-base is small enough to run locally on a CPU.
# For a real project, use a larger model or connect to the OpenAI API.
generator = pipeline(
    'text2text-generation',
    model='google/flan-t5-base',
    max_new_tokens=200
)

def build_rag_prompt(question: str, context_chunks: List[dict]) -> str:
    context = "\n\n".join([
        f"[Source: {c['source']}]\n{c['text']}"
        for c in context_chunks
    ])
    prompt = f"""Answer the question based only on the provided context.
If the context doesn't contain enough information, say "I don't have enough information to answer this."
Context:
{context}
Question: {question}
Answer:"""
    return prompt
class RAGPipeline:
    def __init__(self, indexer: RAGIndexer, generator_pipeline):
        self.indexer = indexer
        self.generator = generator_pipeline

    def answer(self, question: str, top_k: int = 3, verbose: bool = False) -> dict:
        # Step 1: Retrieve relevant chunks
        chunks = self.indexer.retrieve(question, top_k=top_k)
        # Step 2: Build prompt
        prompt = build_rag_prompt(question, chunks)
        if verbose:
            print("=== RETRIEVED CONTEXT ===")
            for c in chunks:
                print(f"[{c['source']}] sim={c['similarity']:.3f}: {c['text'][:100]}...")
            print("\n=== PROMPT ===")
            print(prompt[:500] + "...")
        # Step 3: Generate answer
        result = self.generator(prompt)[0]['generated_text']
        return {
            'question': question,
            'answer': result.strip(),
            'sources': [c['source'] for c in chunks],
            'chunks': chunks
        }
# Build the RAG pipeline
rag = RAGPipeline(indexer, generator)
# Ask questions
questions = [
"What causes overfitting and how do I fix it?",
"How is precision different from recall?",
"What makes XGBoost good for competitions?",
"How do transformers process sequences?",
]
for question in questions:
    result = rag.answer(question)
    print(f"\nQ: {question}")
    print(f"A: {result['answer']}")
    print(f"Sources: {result['sources']}")
    print("-" * 60)
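One thing to watch with small local models: their input window is short, so three or four retrieved chunks plus the instructions can get truncated before the model ever sees the question. A simple guard is to cap how much context goes into the prompt. The sketch below uses a character budget as a rough proxy for tokens, which is an assumption (a tokenizer-based count would be more exact), and `fit_chunks_to_budget` and `max_context_chars` are names invented for this example.

```python
from typing import Dict, List

def fit_chunks_to_budget(chunks: List[Dict], max_context_chars: int = 1200) -> List[Dict]:
    """Keep the best-ranked chunks that fit in the budget (chunks assumed sorted best-first)."""
    selected: List[Dict] = []
    used = 0
    for chunk in chunks:
        cost = len(chunk['text'])
        if used + cost > max_context_chars:
            break  # stop at the first chunk that doesn't fit, preserving rank order
        selected.append(chunk)
        used += cost
    # Always keep at least the top chunk, truncated if it is too long on its own
    if not selected and chunks:
        best = dict(chunks[0])
        best['text'] = best['text'][:max_context_chars]
        selected.append(best)
    return selected

# Placeholder chunks in the same shape retrieve() returns
ranked_chunks = [
    {'source': 'doc7.txt', 'text': 'x' * 500, 'similarity': 0.71},
    {'source': 'doc4.txt', 'text': 'y' * 500, 'similarity': 0.53},
    {'source': 'doc3.txt', 'text': 'z' * 500, 'similarity': 0.49},
]
fitted = fit_chunks_to_budget(ranked_chunks, max_context_chars=1200)
print([c['source'] for c in fitted])
```

In the pipeline above this would sit between `retrieve` and `build_rag_prompt`.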
Using the OpenAI API for Better Generation
For production quality, use a real LLM API. The retrieval stays the same. Only the generation step changes.
# Replace the generator with OpenAI API
# pip install openai
import openai

def generate_with_openai(prompt: str, model: str = 'gpt-3.5-turbo') -> str:
    client = openai.OpenAI()  # reads OPENAI_API_KEY from environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                'role': 'system',
                'content': 'You are a helpful assistant. Answer questions based only on the provided context. If the context does not contain enough information, say so clearly.'
            },
            {
                'role': 'user',
                'content': prompt
            }
        ],
        temperature=0.1,  # low temperature for factual answers
        max_tokens=300
    )
    return response.choices[0].message.content

# Integrate into RAG pipeline
class RAGWithOpenAI:
    def __init__(self, indexer: RAGIndexer):
        self.indexer = indexer

    def answer(self, question: str, top_k: int = 3) -> dict:
        chunks = self.indexer.retrieve(question, top_k=top_k)
        prompt = build_rag_prompt(question, chunks)
        answer = generate_with_openai(prompt)
        return {
            'question': question,
            'answer': answer,
            # dict.fromkeys removes duplicates while keeping retrieval rank order
            'sources': list(dict.fromkeys(c['source'] for c in chunks))
        }

# rag_openai = RAGWithOpenAI(indexer)
# result = rag_openai.answer("What causes overfitting?")
print("OpenAI RAG pipeline ready (requires OPENAI_API_KEY)")
LangChain: RAG in 30 Lines
LangChain abstracts the entire RAG pipeline into composable components.
pip install langchain langchain-community langchain-chroma
# Import paths shift between LangChain releases; these match the packages installed above
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma  # provided by the langchain-chroma package
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline
from langchain.schema import Document
from transformers import pipeline as hf_pipeline
# 1. Load documents
docs = [
Document(
page_content=content,
metadata={'source': name}
)
for name, content in knowledge_base.items()
]
# 2. Split
splitter = RecursiveCharacterTextSplitter(
chunk_size=400,
chunk_overlap=50,
separators=['\n\n', '\n', '. ', ' ', '']
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")
# 3. Embed and store
embeddings = HuggingFaceEmbeddings(
model_name='sentence-transformers/all-MiniLM-L6-v2'
)
vectorstore = Chroma.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={'k': 3})
# 4. Generation model
gen_pipe = hf_pipeline('text2text-generation', model='google/flan-t5-base', max_new_tokens=200)
llm = HuggingFacePipeline(pipeline=gen_pipe)
# 5. Chain it together
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=retriever,
chain_type='stuff', # stuff all chunks into one prompt
return_source_documents=True
)
# 6. Ask questions
result = qa_chain.invoke({'query': 'What causes overfitting?'})  # calling the chain directly is deprecated
print(f"Answer: {result['result']}")
print(f"Sources: {[d.metadata['source'] for d in result['source_documents']]}")
Common RAG Failure Modes and Fixes
failures = {
"Retrieval finds wrong chunks": {
"symptoms": "Answer is off-topic or doesn't address the question",
"causes": ["Chunk too large (contains many topics)", "Poor embedding model for domain"],
"fixes": ["Smaller chunks (200-400 chars)", "Domain-specific embedding model",
"Hybrid search (keyword + semantic)"]
},
"Chunks miss key information": {
"symptoms": "Model says 'I don't know' but answer is in the documents",
"causes": ["Chunk boundary cut the relevant sentence",
"top_k too small", "Query and document phrasing too different"],
"fixes": ["Add overlap between chunks", "Increase top_k to 5-7",
"Query expansion (rephrase query multiple ways and merge results)"]
},
"Model ignores retrieved context": {
"symptoms": "Answer doesn't match the retrieved chunks at all",
"causes": ["LLM is too small", "Prompt not clear about using only context"],
"fixes": ["Use larger/better LLM", "Stronger prompt instructions",
"Lower temperature"]
},
"Too much irrelevant context": {
"symptoms": "Model is confused, answer is vague",
"causes": ["top_k too high", "All chunks have low similarity scores"],
"fixes": ["Filter chunks below similarity threshold",
"Reduce top_k to 2-3", "Check if query is answerable"]
},
"Hallucination despite retrieval": {
"symptoms": "Model generates facts not in the retrieved context",
"causes": ["Model overrides context with training knowledge",
"Prompt not clear enough"],
"fixes": ["Explicit 'only use context' instruction in system prompt",
"Ask model to quote from context", "Use a model that follows instructions more reliably"]
}
}
for failure, info in failures.items():
    print(f"\n{failure}")
    print(f"  Symptoms: {info['symptoms']}")
    print(f"  Fixes:")
    for fix in info['fixes']:
        print(f"    - {fix}")
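Two of the fixes above, query expansion and similarity filtering, combine naturally. The sketch below merges the results from several phrasings of the same question, keeps the best score per chunk, and drops anything below a threshold. The function name `merge_and_filter`, the example scores, and the default cutoff of 0.4 are illustrative assumptions; the right threshold depends on your embedding model and data.

```python
from typing import Dict, List

def merge_and_filter(result_lists: List[List[Dict]],
                     min_similarity: float = 0.4,
                     top_k: int = 3) -> List[Dict]:
    """Merge results from several query phrasings, dedupe by text, filter weak matches."""
    best: Dict[str, Dict] = {}
    for results in result_lists:
        for r in results:
            key = r['text']
            # Keep the highest similarity seen for each distinct chunk
            if key not in best or r['similarity'] > best[key]['similarity']:
                best[key] = r
    kept = [r for r in best.values() if r['similarity'] >= min_similarity]
    kept.sort(key=lambda r: r['similarity'], reverse=True)
    return kept[:top_k]

# Made-up results for two phrasings of the same question
original_results = [
    {'text': 'Overfitting occurs when a model performs well on training data.',
     'source': 'doc7.txt', 'similarity': 0.71},
    {'text': 'XGBoost builds trees sequentially.',
     'source': 'doc4.txt', 'similarity': 0.35},
]
rephrased_results = [
    {'text': 'Overfitting occurs when a model performs well on training data.',
     'source': 'doc7.txt', 'similarity': 0.64},
    {'text': 'Random forests combine many decision trees to reduce overfitting.',
     'source': 'doc3.txt', 'similarity': 0.52},
]
merged = merge_and_filter([original_results, rephrased_results])
for r in merged:
    print(f"[{r['similarity']:.2f}] {r['source']}")
```

If the merged list comes back empty, the question is probably not answerable from the knowledge base, and returning "I don't have enough information" without calling the LLM is the safer choice.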
Evaluating RAG Quality
# Simple evaluation: does the answer contain key information?
def evaluate_rag_answer(answer: str, expected_keywords: List[str]) -> dict:
    answer_lower = answer.lower()
    found_keywords = [k for k in expected_keywords if k.lower() in answer_lower]
    # Guard against an empty keyword list to avoid dividing by zero
    coverage = len(found_keywords) / len(expected_keywords) if expected_keywords else 0.0
    return {
        'coverage': coverage,
        'found_keywords': found_keywords,
        'missing_keywords': [k for k in expected_keywords if k not in found_keywords]
    }
# Test cases
test_cases = [
{
'question': "What causes overfitting?",
'keywords': ['complex', 'training', 'gap', 'regularization']
},
{
'question': "How does cross-validation work?",
'keywords': ['k-fold', 'split', 'average', 'estimate']
},
]
print("RAG Evaluation Results:")
print("-" * 60)
for test in test_cases:
    result = rag.answer(test['question'])
    eval_ = evaluate_rag_answer(result['answer'], test['keywords'])
    print(f"\nQ: {test['question']}")
    print(f"A: {result['answer'][:150]}...")
    print(f"Keyword coverage: {eval_['coverage']:.1%}")
    print(f"Missing: {eval_['missing_keywords']}")
For production, use RAGAS (Retrieval Augmented Generation Assessment), which evaluates faithfulness, answer relevancy, and context precision automatically.
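Keyword coverage only scores the final answer. It helps to score retrieval separately, because a wrong answer is often a retrieval miss rather than a generation problem. Below is a sketch of two common retrieval metrics, hit rate at k and mean reciprocal rank. The function names and the example rankings are invented for illustration; in practice you would label each test question with the document that should be retrieved and feed in the sources your retriever returns.

```python
from typing import List

def hit_rate_at_k(retrieved: List[List[str]], expected: List[str], k: int = 3) -> float:
    """Fraction of questions whose expected source appears in the top-k retrieved sources."""
    if not expected:
        return 0.0
    hits = sum(1 for got, want in zip(retrieved, expected) if want in got[:k])
    return hits / len(expected)

def mean_reciprocal_rank(retrieved: List[List[str]], expected: List[str]) -> float:
    """Average of 1/rank of the expected source (0 when it was not retrieved at all)."""
    if not expected:
        return 0.0
    total = 0.0
    for got, want in zip(retrieved, expected):
        if want in got:
            total += 1.0 / (got.index(want) + 1)
    return total / len(expected)

# Made-up rankings for three test questions
retrieved_sources = [
    ['doc7.txt', 'doc4.txt', 'doc3.txt'],  # expected doc ranked 1st
    ['doc1.txt', 'doc5.txt', 'doc2.txt'],  # expected doc ranked 2nd
    ['doc2.txt', 'doc1.txt', 'doc8.txt'],  # expected doc missing
]
expected_sources = ['doc7.txt', 'doc5.txt', 'doc6.txt']

hit_rate = hit_rate_at_k(retrieved_sources, expected_sources, k=3)
mrr = mean_reciprocal_rank(retrieved_sources, expected_sources)
print(f"Hit rate@3: {hit_rate:.2f}")
print(f"MRR: {mrr:.2f}")
```

A low hit rate points at chunking or the embedding model. A good hit rate with poor answers points at the prompt or the LLM.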
Quick Cheat Sheet
| Step | What it does | Key decision |
|---|---|---|
| Chunking | Split docs into pieces | Size 300-600 chars, overlap 50-100 |
| Embedding | Convert chunks to vectors | all-MiniLM-L6-v2 to start |
| Indexing | Store in vector DB | ChromaDB for dev, Pinecone for prod |
| Retrieval | Find top-k similar chunks | k=3 to 5 usually works |
| Generation | Build prompt + call LLM | Include retrieved context explicitly |

| Problem | Quick fix |
|---|---|
| Wrong chunks retrieved | Smaller chunks, better embedding model |
| Answer not in chunks | Add overlap, increase top-k |
| Model ignores context | Stronger prompt, lower temperature |
| Too slow | Smaller embedding model, FAISS ANN index |
| Hallucinations | Explicit "only use context" in system prompt |
Practice Challenges
Level 1:
Pick any 10 Wikipedia articles on a topic you know. Chunk them, embed them, and store in ChromaDB. Ask 5 questions where you already know the answer. Did RAG get them right?
Level 2:
Compare three chunking strategies (fixed-size, sentence-aware, paragraph-aware) on the same document set. For each strategy, retrieve the top-3 chunks for 5 queries. Which strategy retrieves more relevant chunks by eye?
Level 3:
Build a complete RAG pipeline with source citations. For each answer, show which document chunk it came from and highlight the specific sentence that grounded the answer. Add a similarity threshold: if the top-k chunks all score below 0.4, return "I don't have information about this" instead of guessing.
References
- Original RAG paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- LangChain RAG tutorial
- ChromaDB docs
- RAGAS: RAG evaluation framework
- LlamaIndex: RAG framework
Next up, Post 99: Build a Chatbot With Memory. Conversation history, context management, multi-turn dialogue. We build a chatbot that actually remembers what you said earlier in the conversation.