86. RAG: Giving Language Models Long-Term Memory

Large language models know a lot. They do not know everything.

They were trained on internet text up to a cutoff date. They have no idea what happened last week. They have no idea what is in your company's internal documentation. They have no idea what your customer support tickets say.

When asked about things outside their training, they have two choices: say they do not know, or hallucinate something plausible-sounding. They often choose the latter. Confidently. Wrongly.

RAG (Retrieval-Augmented Generation) solves this by giving the model access to an external knowledge base at query time. You retrieve relevant documents and include them in the prompt as context. The model reads the context, then answers based on what it finds there, not what it vaguely remembers from training.

The result: answers grounded in your own, current knowledge base. Hallucinations typically drop, though they do not disappear: the model can still misread or ignore the context it was given. The model can cite its sources. The knowledge base updates without retraining the model.

The RAG Pipeline

import os
import re
import numpy as np
from typing import List, Dict
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings("ignore")

np.random.seed(42)

print("RAG Pipeline Components:")
print()
print("1. DOCUMENT INGESTION")
print("   Load documents → chunk → embed → store in vector DB")
print()
print("2. RETRIEVAL")
print("   User query → embed → find top-k similar chunks")
print()
print("3. AUGMENTED GENERATION")
print("   context = retrieved chunks")
print("   prompt  = question + context")
print("   answer  = LLM(prompt)")
print()
print("The LLM never sees the full knowledge base.")
print("It only sees: the question + the retrieved context.")
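
The "chunk" step in document ingestion is not shown in the code that follows, so here is a minimal sketch of one common approach: fixed-size character windows with overlap. The function name `chunk_text` and its defaults (512 characters, 64 overlap) are illustrative assumptions, not part of any library, and real systems often split on sentences or tokens instead.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample text, repeated to get something long enough to split.
sample = "RAG retrieves relevant chunks and adds them to the prompt. " * 30
pieces = chunk_text(sample, chunk_size=512, overlap=64)
print(f"{len(sample)} characters -> {len(pieces)} chunks")
print(f"Overlap preserved: {pieces[0][-64:] == pieces[1][:64]}")
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighboring chunk, at the cost of storing some text twice.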

Building a Complete RAG System

class SimpleRAG:
    """A minimal but complete RAG implementation."""

    def __init__(self, embedding_model_name="all-MiniLM-L6-v2"):
        self.embed_model = SentenceTransformer(embedding_model_name)
        self.documents   = []
        self.embeddings  = None

    def add_documents(self, docs: List[Dict]):
        """Add documents with their metadata."""
        self.documents.extend(docs)
        texts = [d["text"] for d in docs]
        new_embeddings = self.embed_model.encode(texts, show_progress_bar=False)
        if self.embeddings is None:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])
        print(f"  Added {len(docs)} documents. Total: {len(self.documents)}")

    def retrieve(self, query: str, top_k: int = 3) -> List[Dict]:
        """Find top-k most similar documents to the query."""
        if self.embeddings is None:
            return []  # nothing indexed yet, so there is nothing to rank
        query_emb   = self.embed_model.encode([query])
        similarities = cosine_similarity(query_emb, self.embeddings)[0]
        top_indices  = np.argsort(similarities)[::-1][:top_k]
        results = []
        for idx in top_indices:
            doc = self.documents[idx].copy()
            doc["score"] = float(similarities[idx])
            results.append(doc)
        return results

    def build_prompt(self, query: str, retrieved: List[Dict]) -> str:
        """Build the augmented prompt with retrieved context."""
        context_parts = []
        for i, doc in enumerate(retrieved, 1):
            source = doc.get("source", f"Document {i}")
            context_parts.append(f"[{i}] Source: {source}\n{doc['text']}")

        context = "\n\n".join(context_parts)
        prompt  = f"""You are a helpful assistant. Answer the question based ONLY on the provided context.
If the answer is not in the context, say "I cannot find this information in the provided context."
Always cite the source numbers [1], [2], etc. when using information from the context.

Context:
{context}

Question: {query}

Answer:"""
        return prompt

    def answer(self, query: str, top_k: int = 3,
               llm_fn=None, verbose=False) -> Dict:
        """Full RAG pipeline: retrieve + generate."""
        retrieved  = self.retrieve(query, top_k=top_k)
        prompt     = self.build_prompt(query, retrieved)

        if verbose:
            print(f"Retrieved {len(retrieved)} documents:")
            for r in retrieved:
                print(f"  [{r['score']:.3f}] {r['text'][:60]}...")
            print()

        if llm_fn is not None:
            answer_text = llm_fn(prompt)
        else:
            answer_text = "[LLM not configured — see prompt below]"

        return {
            "query":     query,
            "retrieved": retrieved,
            "prompt":    prompt,
            "answer":    answer_text,
        }

rag = SimpleRAG()

knowledge_base = [
    {
        "text": "Our Q3 2024 revenue was $4.2 million, representing a 23% increase year-over-year. "
                "Subscription revenue grew 31% to $3.1 million. Professional services declined 5%.",
        "source": "Q3 Financial Report"
    },
    {
        "text": "The refund policy allows returns within 30 days of purchase. "
                "Digital products are non-refundable once downloaded. "
                "Hardware returns require original packaging.",
        "source": "Customer Service Policy"
    },
    {
        "text": "Our machine learning platform supports Python 3.8+. "
                "Required packages: torch>=2.0, transformers>=4.30, datasets>=2.0. "
                "GPU with 8GB+ VRAM recommended for fine-tuning.",
        "source": "Technical Documentation"
    },
    {
        "text": "The CEO is Sarah Chen, appointed in January 2023. "
                "CTO is James Park, with the company since 2019. "
                "The company was founded in 2018 and is headquartered in San Francisco.",
        "source": "Company Overview"
    },
    {
        "text": "Premium plan costs $49/month and includes unlimited API calls, "
                "priority support, and access to all models. "
                "Free tier allows 1000 API calls per month.",
        "source": "Pricing Page"
    },
    {
        "text": "Model training typically takes 2-4 hours for small datasets (< 10K examples) "
                "and 12-24 hours for large datasets (> 100K examples) on a single A100 GPU. "
                "Distributed training can reduce this by 4-8x.",
        "source": "Technical Documentation"
    },
    {
        "text": "To contact support: email support@company.com or use the in-app chat. "
                "Response time: 2 hours for premium, 24 hours for free tier. "
                "Enterprise clients have a dedicated account manager.",
        "source": "Support Guide"
    },
    {
        "text": "Q4 2024 product roadmap includes: LLaMA 3 integration (January), "
                "multi-modal support (February), AutoML features (March), "
                "and enterprise SSO (April).",
        "source": "Product Roadmap"
    },
]

print("Loading knowledge base into RAG system:")
rag.add_documents(knowledge_base)
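
Under the hood, `retrieve` is just a cosine-similarity ranking. The sketch below reproduces that step with NumPy alone on hand-made vectors, so you can see the arithmetic without downloading a model. The helper `top_k_cosine` and the 3-dimensional vectors are hypothetical; real all-MiniLM-L6-v2 embeddings have 384 dimensions.

```python
import numpy as np


def top_k_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k rows most similar to query_vec."""
    # Normalize to unit length so a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return order, scores[order]


# Hypothetical toy "embeddings", one row per document.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.1, 0.0])

indices, scores = top_k_cosine(query, docs, k=2)
for i, s in zip(indices, scores):
    print(f"doc {i}: cosine similarity {s:.3f}")
```

A full sort is fine at this scale. With millions of chunks you would normally hand this step to a vector database or an approximate nearest-neighbor index.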

Testing the RAG System

test_queries = [
    "What was the Q3 revenue growth?",
    "Can I get a refund for a digital product?",
    "What Python version is required?",
    "Who is the CEO?",
    "How much does the premium plan cost?",
    "What is the support response time for free users?",
    "What new features are coming in February?",
]

print("\nTesting RAG retrieval:")
print("=" * 65)

for query in test_queries:
    result = rag.answer(query, top_k=2, verbose=False)
    print(f"\nQ: {query}")
    print("Top retrieved document:")
    top_doc = result["retrieved"][0]
    print(f"  Source: {top_doc['source']}  (score={top_doc['score']:.3f})")
    print(f"  Text:   {top_doc['text'][:80]}...")

print()
print("=" * 65)
print("\nSample prompt for 'What was the Q3 revenue growth?':")
result = rag.answer("What was the Q3 revenue growth?", top_k=2)
print(result["prompt"])

Connecting to a Real LLM

print("\nConnecting RAG to a real LLM:")
print()
print("Option 1: OpenAI API")
openai_integration = """
import openai

client = openai.OpenAI(api_key="your_key")

def call_gpt(prompt):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,  # Low temperature for factual answers
        max_tokens=500
    )
    return response.choices[0].message.content

result = rag.answer("What was Q3 revenue?", llm_fn=call_gpt)
print(result["answer"])
"""
print(openai_integration)

print("Option 2: Anthropic Claude API")
claude_integration = """
import anthropic

client = anthropic.Anthropic(api_key="your_key")

def call_claude(prompt):
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

result = rag.answer("What was Q3 revenue?", llm_fn=call_claude)
print(result["answer"])
"""
print(claude_integration)

print("Option 3: Local LLM (Ollama)")
ollama_integration = """
import requests

def call_ollama(prompt, model="llama3"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()["response"]

result = rag.answer("What was Q3 revenue?", llm_fn=call_ollama)
print(result["answer"])
"""
print(ollama_integration)

Advanced RAG Techniques

print("Advanced RAG Techniques:")
print()
print("1. HYBRID SEARCH")
print("   Combine dense (embedding) and sparse (BM25/keyword) retrieval.")
print("   Dense: good at semantic understanding")
print("   Sparse: good at exact keyword matching")
print("   Hybrid: combine scores with alpha weighting")
print("   α=0: pure BM25, α=1: pure embedding, α=0.5: balanced")
print()

print("2. RERANKING")
print("   First-pass: fast retrieval with embeddings (top-20)")
print("   Second-pass: precise reranking with cross-encoder (top-3)")
print("   Cross-encoders jointly encode query+document for better scoring")
print("   Models: cross-encoder/ms-marco-MiniLM-L-6-v2 (fast)")
print()

reranker_code = """
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# First pass: fast retrieval
candidates = rag.retrieve(query, top_k=20)

# Second pass: precise reranking
pairs  = [(query, doc["text"]) for doc in candidates]
scores = reranker.predict(pairs)

# Sort by reranker score
reranked = sorted(zip(scores, candidates), key=lambda x: -x[0])
top_3    = [doc for _, doc in reranked[:3]]
"""
print(reranker_code)

print("3. QUERY REWRITING")
print("   Original query: 'How does it work?'")
print("   Problem: 'it' is ambiguous")
print("   Rewritten: 'How does the RAG pipeline retrieve documents?'")
print("   Use LLM to rewrite queries before embedding")
print()

print("4. MULTI-QUERY RETRIEVAL")
print("   Generate 3-5 variations of the original query")
print("   Retrieve for each variation")
print("   Deduplicate and merge results")
print("   Captures more relevant documents with diverse phrasings")
print()

print("5. PARENT-CHILD CHUNKING")
print("   Store large parent chunks (512 tokens)")
print("   Index small child chunks (128 tokens) for precise retrieval")
print("   When child is retrieved, return its parent for more context")
print("   Best of both worlds: precise matching, rich context")
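
Hybrid search (technique 1 above) has no code in this post, so here is a rough sketch. It uses a small hand-written BM25 (one common variant, with the usual defaults k1=1.5 and b=0.75) and blends it with dense scores using the same α convention as above. All function names are illustrative, and the `dense` values are made-up stand-ins for real cosine similarities.

```python
import math
import re
from collections import Counter
from typing import List

import numpy as np


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> np.ndarray:
    """BM25 score of every document for the query (assumes docs is non-empty)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = np.zeros(n_docs)
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            scores[i] += idf * tf[term] * (k1 + 1) / norm
    return scores


def min_max(x) -> np.ndarray:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span


def hybrid_scores(dense, sparse, alpha: float = 0.5) -> np.ndarray:
    # alpha=0: pure BM25, alpha=1: pure embedding, as in the list above.
    return alpha * min_max(dense) + (1 - alpha) * min_max(sparse)


docs = [
    "Premium plan costs $49/month and includes unlimited API calls.",
    "The refund policy allows returns within 30 days of purchase.",
    "Error code E4021 means the API key has expired.",
]
query = "What does error E4021 mean?"

dense = np.array([0.35, 0.22, 0.31])  # hypothetical embedding similarities
sparse = bm25_scores(query, docs)

for alpha in (0.0, 0.5, 1.0):
    best = int(np.argmax(hybrid_scores(dense, sparse, alpha)))
    print(f"alpha={alpha}: top document = {best}")
```

In this toy setup the made-up dense scores prefer the wrong document, and the exact match on the identifier "E4021" lets BM25 pull the right one back to the top. That is the typical argument for hybrid search: identifiers, product codes, and rare names are where keyword matching helps most.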

Evaluating RAG Quality

def evaluate_retrieval(rag_system, test_cases):
    """
    test_cases: list of dicts with 'query' and 'relevant_sources'
    """
    results = []
    for case in test_cases:
        retrieved = rag_system.retrieve(case["query"], top_k=3)
        retrieved_sources = [r["source"] for r in retrieved]
        relevant = case["relevant_sources"]

        # Count distinct relevant sources, so two chunks from the same
        # source cannot push recall above 1.0.
        hits = len(set(retrieved_sources) & set(relevant))
        recall_at_3 = hits / len(relevant) if relevant else 0.0
        results.append({
            "query":       case["query"],
            "recall@3":    recall_at_3,
            "retrieved":   retrieved_sources,
            "relevant":    relevant,
        })
    return results

test_cases = [
    {"query": "Q3 revenue figures",      "relevant_sources": ["Q3 Financial Report"]},
    {"query": "refund for software",     "relevant_sources": ["Customer Service Policy"]},
    {"query": "Python requirements",     "relevant_sources": ["Technical Documentation"]},
    {"query": "company leadership team", "relevant_sources": ["Company Overview"]},
    {"query": "pricing monthly plan",    "relevant_sources": ["Pricing Page"]},
]

eval_results = evaluate_retrieval(rag, test_cases)

print("RAG Retrieval Evaluation:")
print(f"{'Query':<35} {'Recall@3':>10} {'Correct?':>9}")
print("=" * 58)
for r in eval_results:
    correct = "✓" if r["recall@3"] == 1.0 else "✗"
    print(f"{r['query']:<35} {r['recall@3']:>10.2f} {correct:>9}")

avg_recall = np.mean([r["recall@3"] for r in eval_results])
print(f"\nMean Recall@3: {avg_recall:.3f}")
print()
print("Evaluation frameworks for production RAG:")
print("  RAGAS: ragasframework.com — automated RAG evaluation")
print("  TruLens: trulens.org — LLM app evaluation")
print("  DeepEval: confident-ai.com — unit tests for LLM applications")
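
Recall@3 tells you whether the right source showed up, but not how high it ranked. A complementary metric is mean reciprocal rank (MRR), sketched below on the same kind of source lists. The function names and the example retrieval results are hypothetical, written by hand rather than produced by the system above.

```python
from typing import List, Tuple


def reciprocal_rank(retrieved_sources: List[str], relevant_sources: List[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_sources)
    for rank, source in enumerate(retrieved_sources, start=1):
        if source in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[List[str], List[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Hypothetical retrieval results: (retrieved sources in rank order, relevant sources).
runs = [
    (["Q3 Financial Report", "Pricing Page", "Company Overview"], ["Q3 Financial Report"]),
    (["Support Guide", "Customer Service Policy", "Pricing Page"], ["Customer Service Policy"]),
    (["Pricing Page", "Support Guide", "Product Roadmap"], ["Technical Documentation"]),
]

mrr = mean_reciprocal_rank(runs)
print(f"MRR: {mrr:.3f}")
```

A hit at rank 1 scores 1.0, a hit at rank 2 scores 0.5, and a miss scores 0. Since the LLM tends to rely most on the first chunks in the prompt, a system with good recall but low MRR may still benefit from reranking.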

A Resource Worth Reading

The original RAG paper, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Lewis et al. (2020) from Facebook AI, introduces the concept and evaluates it on open-domain QA benchmarks. Short and readable. It reports that RAG models generate more specific and factual text than a comparable generator without retrieval. Search "Lewis retrieval-augmented generation knowledge-intensive NLP 2020."

Lilian Weng wrote "LLM Powered Autonomous Agents" at lilianweng.github.io, which covers retrieval-backed memory as part of the broader agent architecture landscape, alongside planning and tool use. It also walks through the fast nearest-neighbor search methods that vector retrieval depends on. One of the more comprehensive technical posts on LLM applications available. Search "Lilian Weng LLM powered autonomous agents."

Try This

Create rag_practice.py.

Part 1: build a RAG system over a real document set. Use 20+ Wikipedia articles on a topic you know well (machine learning, history, sports, whatever). Load them, chunk into 512-character pieces with 64-character overlap, embed with all-MiniLM-L6-v2, store in a simple NumPy array.

Part 2: implement retrieval. Given 10 different queries, retrieve top 3 chunks. Print the query, the source document name, the similarity score, and the chunk text. Are the retrieved chunks relevant?

Part 3: connect to an LLM. Use OpenAI, Anthropic, or Ollama. Build the full pipeline: question → retrieve → augment prompt → LLM → answer. Ask 5 questions. Compare the answers with and without RAG context. Do the RAG answers cite specific facts from the documents?

Part 4: test hallucination. Ask a question whose answer is NOT in your knowledge base. With a good RAG prompt, the LLM should say "I cannot find this information." Without RAG, the base model will likely hallucinate. Document both behaviors.

What's Next

RAG gives models knowledge. The next post puts it all together: building a complete chatbot with conversation memory, RAG, and a web interface. This is the practical application that everything in Phase 8 has been building toward.
