86. RAG: Giving Language Models Long-Term Memory
Akhilesh · 2026-05-17 · via DEV Community

Large language models know a lot. They do not know everything.

They were trained on internet text up to a cutoff date. They have no idea what happened last week. They have no idea what is in your company's internal documentation. They have no idea what your customer support tickets say.

When asked about things outside their training, they have two choices: say they do not know, or hallucinate something plausible-sounding. They often choose the latter. Confidently. Wrongly.

RAG (Retrieval-Augmented Generation) solves this by giving the model access to an external knowledge base at query time. You retrieve relevant documents and include them in the prompt as context. The model reads the context, then answers based on what it finds there, not what it vaguely remembers from training.

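The result: answers grounded in your specific, current, accurate knowledge base. Hallucinations typically drop, because the model is asked to answer from supplied text rather than from memory, though a bad retrieval can still lead to a wrong answer. The model can cite its sources. The knowledge base can be updated without retraining the model.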


The RAG Pipeline

import os
import re
import numpy as np
from typing import List, Dict
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings("ignore")

np.random.seed(42)

print("RAG Pipeline Components:")
print()
print("1. DOCUMENT INGESTION")
print("   Load documents → chunk → embed → store in vector DB")
print()
print("2. RETRIEVAL")
print("   User query → embed → find top-k similar chunks")
print()
print("3. AUGMENTED GENERATION")
print("   context = retrieved chunks")
print("   prompt  = question + context")
print("   answer  = LLM(prompt)")
print()
print("The LLM never sees the full knowledge base.")
print("It only sees: the question + the retrieved context.")
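The ingestion step lists "chunk" before "embed", but the code in this post stores each document whole. A minimal sketch of fixed-size chunking with overlap is below. The function name `chunk_text`, the extra `start` key, and the default sizes (512 characters, 64 overlap) are assumptions for illustration; character windows are only one of several ways to split text.

```python
from typing import Dict, List


def chunk_text(text: str, source: str,
               chunk_size: int = 512, overlap: int = 64) -> List[Dict]:
    """Split text into overlapping character windows, tagged with their source."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip windows that are only whitespace
            chunks.append({"text": piece, "source": source, "start": start})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Illustrative input only: a repeated sentence standing in for a real document.
sample = "RAG retrieves relevant chunks before generating an answer. " * 20
chunks = chunk_text(sample, source="demo.txt", chunk_size=200, overlap=40)
print(f"{len(chunks)} chunks, starting at {[c['start'] for c in chunks]}")
```

Each returned dict carries `text` and `source`, so a list of chunks could be passed to a method shaped like the `add_documents` shown in the next section. The overlap is there so that a sentence cut at a window boundary still appears intact in one of the two neighbouring chunks.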



Building a Complete RAG System

class SimpleRAG:
    """A minimal but complete RAG implementation."""

    def __init__(self, embedding_model_name="all-MiniLM-L6-v2"):
        self.embed_model = SentenceTransformer(embedding_model_name)
        self.documents   = []
        self.embeddings  = None

    def add_documents(self, docs: List[Dict]):
        """Add documents with their metadata."""
        self.documents.extend(docs)
        texts = [d["text"] for d in docs]
        new_embeddings = self.embed_model.encode(texts, show_progress_bar=False)
        if self.embeddings is None:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])
        print(f"  Added {len(docs)} documents. Total: {len(self.documents)}")

    def retrieve(self, query: str, top_k: int = 3) -> List[Dict]:
        """Find top-k most similar documents to the query."""
        if self.embeddings is None:  # nothing has been indexed yet
            return []
        query_emb    = self.embed_model.encode([query])
        similarities = cosine_similarity(query_emb, self.embeddings)[0]
        top_indices  = np.argsort(similarities)[::-1][:top_k]
        results = []
        for idx in top_indices:
            doc = self.documents[idx].copy()
            doc["score"] = float(similarities[idx])
            results.append(doc)
        return results

    def build_prompt(self, query: str, retrieved: List[Dict]) -> str:
        """Build the augmented prompt with retrieved context."""
        context_parts = []
        for i, doc in enumerate(retrieved, 1):
            source = doc.get("source", f"Document {i}")
            context_parts.append(f"[{i}] Source: {source}\n{doc['text']}")

        context = "\n\n".join(context_parts)
        prompt  = f"""You are a helpful assistant. Answer the question based ONLY on the provided context.
If the answer is not in the context, say "I cannot find this information in the provided context."
Always cite the source numbers [1], [2], etc. when using information from the context.

Context:
{context}

Question: {query}

Answer:"""
        return prompt

    def answer(self, query: str, top_k: int = 3,
               llm_fn=None, verbose=False) -> Dict:
        """Full RAG pipeline: retrieve + generate."""
        retrieved  = self.retrieve(query, top_k=top_k)
        prompt     = self.build_prompt(query, retrieved)

        if verbose:
            print(f"Retrieved {len(retrieved)} documents:")
            for r in retrieved:
                print(f"  [{r['score']:.3f}] {r['text'][:60]}...")
            print()

        if llm_fn is not None:
            answer_text = llm_fn(prompt)
        else:
            answer_text = "[LLM not configured — see prompt below]"

        return {
            "query":     query,
            "retrieved": retrieved,
            "prompt":    prompt,
            "answer":    answer_text,
        }

rag = SimpleRAG()

knowledge_base = [
    {
        "text": "Our Q3 2024 revenue was $4.2 million, representing a 23% increase year-over-year. "
                "Subscription revenue grew 31% to $3.1 million. Professional services declined 5%.",
        "source": "Q3 Financial Report"
    },
    {
        "text": "The refund policy allows returns within 30 days of purchase. "
                "Digital products are non-refundable once downloaded. "
                "Hardware returns require original packaging.",
        "source": "Customer Service Policy"
    },
    {
        "text": "Our machine learning platform supports Python 3.8+. "
                "Required packages: torch>=2.0, transformers>=4.30, datasets>=2.0. "
                "GPU with 8GB+ VRAM recommended for fine-tuning.",
        "source": "Technical Documentation"
    },
    {
        "text": "The CEO is Sarah Chen, appointed in January 2023. "
                "CTO is James Park, with the company since 2019. "
                "The company was founded in 2018 and is headquartered in San Francisco.",
        "source": "Company Overview"
    },
    {
        "text": "Premium plan costs $49/month and includes unlimited API calls, "
                "priority support, and access to all models. "
                "Free tier allows 1000 API calls per month.",
        "source": "Pricing Page"
    },
    {
        "text": "Model training typically takes 2-4 hours for small datasets (< 10K examples) "
                "and 12-24 hours for large datasets (> 100K examples) on a single A100 GPU. "
                "Distributed training can reduce this by 4-8x.",
        "source": "Technical Documentation"
    },
    {
        "text": "To contact support: email support@company.com or use the in-app chat. "
                "Response time: 2 hours for premium, 24 hours for free tier. "
                "Enterprise clients have a dedicated account manager.",
        "source": "Support Guide"
    },
    {
        "text": "Q4 2024 product roadmap includes: LLaMA 3 integration (January), "
                "multi-modal support (February), AutoML features (March), "
                "and enterprise SSO (April).",
        "source": "Product Roadmap"
    },
]

print("Loading knowledge base into RAG system:")
rag.add_documents(knowledge_base)



Testing the RAG System

test_queries = [
    "What was the Q3 revenue growth?",
    "Can I get a refund for a digital product?",
    "What Python version is required?",
    "Who is the CEO?",
    "How much does the premium plan cost?",
    "What is the support response time for free users?",
    "What new features are coming in February?",
]

print("\nTesting RAG retrieval:")
print("=" * 65)

for query in test_queries:
    result = rag.answer(query, top_k=2, verbose=False)
    print(f"\nQ: {query}")
    print(f"Top retrieved document:")
    top_doc = result["retrieved"][0]
    print(f"  Source: {top_doc['source']}  (score={top_doc['score']:.3f})")
    print(f"  Text:   {top_doc['text'][:80]}...")

print()
print("=" * 65)
print("\nSample prompt for 'What was the Q3 revenue growth?':")
result = rag.answer("What was the Q3 revenue growth?", top_k=2)
print(result["prompt"])



Connecting to a Real LLM

print("\nConnecting RAG to a real LLM:")
print()
print("Option 1: OpenAI API")
openai_integration = """
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment; avoid hardcoding keys

def call_gpt(prompt):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,  # Low temperature for factual answers
        max_tokens=500
    )
    return response.choices[0].message.content

result = rag.answer("What was Q3 revenue?", llm_fn=call_gpt)
print(result["answer"])
"""
print(openai_integration)

print("Option 2: Anthropic Claude API")
claude_integration = """
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment; avoid hardcoding keys

def call_claude(prompt):
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

result = rag.answer("What was Q3 revenue?", llm_fn=call_claude)
print(result["answer"])
"""
print(claude_integration)

print("Option 3: Local LLM (Ollama)")
ollama_integration = """
import requests

def call_ollama(prompt, model="llama3"):
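    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,  # local generation can be slow; fail instead of hanging forever
    )
    response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return response.json()["response"]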

result = rag.answer("What was Q3 revenue?", llm_fn=call_ollama)
print(result["answer"])
"""
print(ollama_integration)
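All three options plug in the same way: `llm_fn` is just a callable that takes the prompt string and returns a string. That means the pipeline can be exercised without any API key by passing a stand-in. The sketch below is an assumption-heavy illustration, not a model: `stub_llm` and `REFUSAL` are hypothetical names, and the function simply echoes the first context passage, relying on the "Context:" / "Question:" layout that `build_prompt` produces above.

```python
REFUSAL = "I cannot find this information in the provided context."


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model: returns the first context passage, cited as [1]."""
    start_marker, end_marker = "Context:\n", "\n\nQuestion:"
    start = prompt.find(start_marker)
    end = prompt.find(end_marker, start)
    if start == -1 or end == -1:
        return REFUSAL  # prompt does not follow the expected layout

    context = prompt[start + len(start_marker):end].strip()
    if not context:
        return REFUSAL  # nothing was retrieved

    first_passage = context.split("\n\n")[0]
    # Drop the "[1] Source: ..." header line if one is present.
    header, _, body = first_passage.partition("\n")
    return f"{body or header} [1]"


# Hand-written prompt in the same layout, for illustration only.
demo_prompt = (
    "Answer the question based ONLY on the provided context.\n\n"
    "Context:\n"
    "[1] Source: Pricing Page\n"
    "Premium plan costs $49/month.\n\n"
    "[2] Source: Support Guide\n"
    "Response time: 2 hours for premium.\n\n"
    "Question: How much does the premium plan cost?\n\n"
    "Answer:"
)
print(stub_llm(demo_prompt))
```

With the `rag` object from earlier, `rag.answer("How much does the premium plan cost?", llm_fn=stub_llm)` would run the full retrieve-then-generate path offline. It is useful for checking the plumbing, not for judging answer quality.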



Advanced RAG Techniques

print("Advanced RAG Techniques:")
print()
print("1. HYBRID SEARCH")
print("   Combine dense (embedding) and sparse (BM25/keyword) retrieval.")
print("   Dense: good at semantic understanding")
print("   Sparse: good at exact keyword matching")
print("   Hybrid: combine scores with alpha weighting")
print("   α=0: pure BM25, α=1: pure embedding, α=0.5: balanced")
print()

print("2. RERANKING")
print("   First-pass: fast retrieval with embeddings (top-20)")
print("   Second-pass: precise reranking with cross-encoder (top-3)")
print("   Cross-encoders jointly encode query+document for better scoring")
print("   Models: cross-encoder/ms-marco-MiniLM-L-6-v2 (fast)")
print()

reranker_code = """
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# First pass: fast retrieval
candidates = rag.retrieve(query, top_k=20)

# Second pass: precise reranking
pairs  = [(query, doc["text"]) for doc in candidates]
scores = reranker.predict(pairs)

# Sort by reranker score
reranked = sorted(zip(scores, candidates), key=lambda x: -x[0])
top_3    = [doc for _, doc in reranked[:3]]
"""
print(reranker_code)

print("3. QUERY REWRITING")
print("   Original query: 'How does it work?'")
print("   Problem: 'it' is ambiguous")
print("   Rewritten: 'How does the RAG pipeline retrieve documents?'")
print("   Use LLM to rewrite queries before embedding")
print()

print("4. MULTI-QUERY RETRIEVAL")
print("   Generate 3-5 variations of the original query")
print("   Retrieve for each variation")
print("   Deduplicate and merge results")
print("   Captures more relevant documents with diverse phrasings")
print()

print("5. PARENT-CHILD CHUNKING")
print("   Store large parent chunks (512 tokens)")
print("   Index small child chunks (128 tokens) for precise retrieval")
print("   When child is retrieved, return its parent for more context")
print("   Best of both worlds: precise matching, rich context")
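The alpha weighting described under hybrid search can be sketched in a few lines. This assumes you already have one dense score and one sparse score per document. Min-max normalization is an assumption added here so the two scales are comparable before blending; the function names and the sample numbers are made up for illustration.

```python
from typing import List, Sequence


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(dense: Sequence[float], sparse: Sequence[float],
                  alpha: float = 0.5) -> List[float]:
    """Blend per-document scores: alpha=1 is pure dense, alpha=0 is pure sparse."""
    if len(dense) != len(sparse):
        raise ValueError("dense and sparse must score the same documents")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    d, s = min_max(dense), min_max(sparse)
    return [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical scores for four documents (made-up numbers).
dense_scores = [0.82, 0.40, 0.75, 0.10]   # e.g. cosine similarities
sparse_scores = [1.2, 7.5, 6.9, 0.0]      # e.g. BM25 scores

for alpha in (0.0, 0.5, 1.0):
    blended = hybrid_scores(dense_scores, sparse_scores, alpha)
    print(f"alpha={alpha}: top-2 document indices {top_k(blended, k=2)}")
```

In this toy data, document 2 is second-best under both signals and comes out on top at α=0.5, which is the kind of case hybrid search is meant to catch. Rank-based fusion is a common alternative when raw scores are hard to normalize.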



Evaluating RAG Quality

def evaluate_retrieval(rag_system, test_cases):
    """
    test_cases: list of dicts with 'query' and 'relevant_sources'
    """
    results = []
    for case in test_cases:
        retrieved = rag_system.retrieve(case["query"], top_k=3)
        retrieved_sources = [r["source"] for r in retrieved]
        relevant = case["relevant_sources"]

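        # Count distinct relevant sources found, so two chunks from the
        # same source cannot push recall above 1.
        hits = len(set(retrieved_sources) & set(relevant))
        recall_at_3 = hits / len(relevant) if relevant else 0.0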
        results.append({
            "query":       case["query"],
            "recall@3":    recall_at_3,
            "retrieved":   retrieved_sources,
            "relevant":    relevant,
        })
    return results

test_cases = [
    {"query": "Q3 revenue figures",      "relevant_sources": ["Q3 Financial Report"]},
    {"query": "refund for software",     "relevant_sources": ["Customer Service Policy"]},
    {"query": "Python requirements",     "relevant_sources": ["Technical Documentation"]},
    {"query": "company leadership team", "relevant_sources": ["Company Overview"]},
    {"query": "pricing monthly plan",    "relevant_sources": ["Pricing Page"]},
]

eval_results = evaluate_retrieval(rag, test_cases)

print("RAG Retrieval Evaluation:")
print(f"{'Query':<35} {'Recall@3':>10} {'Correct?':>9}")
print("=" * 58)
for r in eval_results:
    correct = "yes" if r["recall@3"] == 1.0 else "no"
    print(f"{r['query']:<35} {r['recall@3']:>10.2f} {correct:>9}")

avg_recall = np.mean([r["recall@3"] for r in eval_results])
print(f"\nMean Recall@3: {avg_recall:.3f}")
print()
print("Evaluation frameworks for production RAG:")
print("  RAGAS: ragasframework.com — automated RAG evaluation")
print("  TruLens: trulens.org — LLM app evaluation")
print("  DeepEval: confident-ai.com — unit tests for LLM applications")
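Recall@3 says whether the right source was retrieved, but not how high it ranked. One common complement is mean reciprocal rank (MRR), which scores 1 for a hit at rank 1, 1/2 at rank 2, and so on. The sketch below assumes result dicts with `retrieved` and `relevant` keys, the same shape `evaluate_retrieval` returns above; the function names and the sample runs are hypothetical.

```python
from typing import Dict, List


def reciprocal_rank(retrieved_sources: List[str],
                    relevant_sources: List[str]) -> float:
    """Return 1/rank of the first relevant source, or 0.0 if none was retrieved."""
    relevant = set(relevant_sources)
    for rank, source in enumerate(retrieved_sources, start=1):
        if source in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Dict]) -> float:
    """Average reciprocal rank over runs with 'retrieved' and 'relevant' keys."""
    if not runs:
        return 0.0
    total = sum(reciprocal_rank(r["retrieved"], r["relevant"]) for r in runs)
    return total / len(runs)


# Hypothetical retrieval runs, for illustration only.
sample_runs = [
    {"retrieved": ["Pricing Page", "Support Guide"], "relevant": ["Pricing Page"]},
    {"retrieved": ["Support Guide", "Company Overview"], "relevant": ["Company Overview"]},
    {"retrieved": ["Product Roadmap", "Pricing Page"], "relevant": ["Q3 Financial Report"]},
]
print(f"MRR: {mean_reciprocal_rank(sample_runs):.3f}")
```

Because the keys match, `mean_reciprocal_rank(eval_results)` should work directly on the output of the evaluation above.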



A Resource Worth Reading

The original RAG paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Lewis et al. (2020) from Facebook AI introduces the concept and evaluates it on open-domain QA benchmarks. Short and readable. It reports that RAG models generate more specific and factual text than a comparable model without retrieval. Search "Lewis retrieval-augmented generation knowledge-intensive NLP 2020."

Lilian Weng wrote "LLM Powered Autonomous Agents" at lilianweng.github.io, which covers retrieval as the long-term memory component of the broader agent architecture, alongside planning and tool use. It includes a survey of approximate nearest-neighbor search methods used for fast retrieval. One of the most comprehensive technical posts on LLM applications available. Search "Lilian Weng LLM powered autonomous agents."


Try This

Create rag_practice.py.

Part 1: build a RAG system over a real document set. Use 20+ Wikipedia articles on a topic you know well (machine learning, history, sports, whatever). Load them, chunk into 512-character pieces with 64-character overlap, embed with all-MiniLM-L6-v2, store in a simple NumPy array.

Part 2: implement retrieval. Given 10 different queries, retrieve top 3 chunks. Print the query, the source document name, the similarity score, and the chunk text. Are the retrieved chunks relevant?

Part 3: connect to an LLM. Use either OpenAI, Anthropic, or Ollama. Build the full pipeline: question → retrieve → augment prompt → LLM → answer. Ask 5 questions. Compare the answers with and without RAG context. Do the RAG answers cite specific facts from the documents?

Part 4: test hallucination. Ask a question whose answer is NOT in your knowledge base. With a good RAG prompt, the LLM should say "I cannot find this information." Without RAG, the base model will likely hallucinate. Document both behaviors.
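For Part 4, it helps to tally refusals automatically instead of reading every answer by hand. A minimal sketch follows, assuming your prompt asks the model to reply with the "I cannot find this information" sentence used in `build_prompt` above. The helper names and the sample answers are hypothetical, and a substring check will miss refusals that the model phrases differently.

```python
from typing import List

# Lowercased fragment of the refusal sentence requested in the RAG prompt.
REFUSAL_MARKER = "cannot find this information"


def is_refusal(answer: str) -> bool:
    """True if the answer contains the refusal phrase (case-insensitive)."""
    return REFUSAL_MARKER in answer.lower()


def refusal_rate(answers: List[str]) -> float:
    """Fraction of answers that are refusals; 0.0 for an empty list."""
    if not answers:
        return 0.0
    return sum(is_refusal(a) for a in answers) / len(answers)


# Hypothetical answers to out-of-knowledge-base questions (made up for illustration).
with_rag = [
    "I cannot find this information in the provided context.",
    "I cannot find this information in the provided context.",
]
without_rag = [
    "The company was acquired in 2021 for $300 million.",
    "I cannot find this information, but it is probably around 40%.",
]

print(f"Refusal rate with RAG prompt:    {refusal_rate(with_rag):.2f}")
print(f"Refusal rate without RAG prompt: {refusal_rate(without_rag):.2f}")
```

Note that the second `without_rag` answer counts as a refusal even though it goes on to guess, so read a sample of the flagged answers before trusting the number.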


What's Next

RAG gives models knowledge. The next post puts it all together: building a complete chatbot with conversation memory, RAG, and a web interface. This is the practical application that everything in Phase 8 has been building toward.