Master RAG Systems: Build an End-to-End LangChain Pipeline with Milvus, Reranking & Azure OpenAI 🚀
Sridhar S · 2026-05-26 · via DEV Community

Beyond Basic RAG: Learn LangChain + RAG End-to-End 🚀


Introduction

Retrieval-Augmented Generation (RAG) is one of the most important concepts in modern Generative AI.

Large Language Models (LLMs) like GPT-4, Claude, LLaMA, and Gemini are powerful. However, they suffer from one major issue:

Hallucination

Hallucination means:

The model confidently generates incorrect information.

Example:

Question:

Who is the CEO of my company?

Without access to your internal company data, an LLM may generate a completely wrong answer.

This is where RAG (Retrieval-Augmented Generation) becomes useful.

Instead of relying only on pretrained knowledge, RAG retrieves relevant information from external sources and provides context to the LLM before generating a response.


What is RAG?

RAG stands for:

Retrieval-Augmented Generation

Instead of:

Question → LLM → Answer


We do:

Question
   ↓
Retrieve Relevant Documents
   ↓
Provide Context to LLM
   ↓
Generate Grounded Response


This makes responses:

✅ More accurate

✅ Context-aware

✅ Less prone to hallucination

✅ Enterprise-ready
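
The flow above can be sketched in a few lines of plain Python. This is only a toy illustration: word overlap stands in for real embedding search, the sample documents are made up, and no LLM is called. The function names are hypothetical, not part of LangChain.

```python
# Toy illustration of the RAG flow: retrieve first, then build a grounded prompt.
# Word overlap stands in for embedding search; no LLM is called here.

def tokenize(text: str) -> set[str]:
    """Lowercase the text, drop simple punctuation, and return its words."""
    return set(text.lower().replace("?", "").replace(".", "").split())


def retrieve(question: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k documents that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_grounded_prompt(question: str, context_docs: list[str]) -> str:
    """Put the retrieved context in front of the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion:\n{question}"


# Made-up internal documents (hypothetical data)
documents = [
    "The CEO of Acme Corp is Jane Doe.",
    "Acme Corp was founded in 1999.",
    "The cafeteria opens at 8 am.",
]

question = "Who is the CEO of Acme Corp?"
top_docs = retrieve(question, documents)
prompt = build_grounded_prompt(question, top_docs)
print(prompt)
```

The prompt now carries the fact the model needs, so it no longer has to guess. The rest of this guide replaces each toy piece with a real component.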


Complete RAG Architecture

Documents (PDFs, DOCX, TXT)
            ↓
      Document Loading
            ↓
         Chunking
            ↓
         Embeddings
            ↓
      Vector Database
            ↓
      Similarity Search
            ↓
         Reranking
            ↓
       Context Building
            ↓
            LLM
            ↓
         Final Answer
            ↓
     Monitoring & Evaluation



Required Installation

Before starting, install all dependencies.

pip install langchain
pip install langchain-community
pip install langchain-core
pip install langchain-openai
pip install langchain-text-splitters
pip install langchain-nvidia-ai-endpoints
pip install pymilvus
pip install pymupdf
pip install pypdf
pip install langfuse
pip install python-dotenv



Project Structure

project/
│
├── data/
│   ├── pdf/
│   └── text/
│
├── .env
├── rag_pipeline.py
└── requirements.txt



Environment Variables (.env)

Never hardcode API keys.

Create a .env file.

NVIDIA_API_KEY=your_key
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_KEY=your_key
AZURE_OPENAI_DEPLOYMENT=gpt-4o

LANGFUSE_PUBLIC_KEY=your_key
LANGFUSE_SECRET_KEY=your_key
LANGFUSE_BASE_URL=https://cloud.langfuse.com



1. Understanding LangChain Document Structure

LangChain stores documents in a standardized format.

A document contains:

  1. page_content
  2. metadata

page_content

This contains actual text.

Example:

page_content = "Generative AI is growing rapidly."



metadata

Metadata stores additional information.

Examples:

  • file name
  • author
  • created date
  • source
  • page number

Creating a LangChain Document

Import

from langchain_core.documents import Document


Code

from langchain_core.documents import Document

doc = Document(
    page_content="""
    Generative AI is a subset of Artificial Intelligence
    focused on creating content.
    """,
    metadata={
        "source": "genai.pdf",
        "author": "Sridhar",
        "pages": 10
    }
)

print(doc)


Output

Document(
    page_content='Generative AI...',
    metadata={
        'source': 'genai.pdf',
        'author': 'Sridhar',
        'pages': 10
    }
)


Why does metadata matter?

In enterprise AI, you often want answers that cite where they came from, for example:

“Show answer from document X page 5”

Metadata helps with traceability.
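
As a minimal sketch of that traceability, the helper below turns a metadata dictionary into a citation string. The function name `format_citation` is hypothetical, and it assumes the metadata uses the `source` and `page` keys shown earlier.

```python
def format_citation(metadata: dict) -> str:
    """Build a human-readable citation from document metadata."""
    source = metadata.get("source", "unknown source")
    page = metadata.get("page")
    if page is None:
        # No page information stored, so cite the file only
        return source
    return f"{source}, page {page}"


citation = format_citation({"source": "genai.pdf", "page": 5})
print(citation)
```

Appending a string like this to each answer lets a reader check the original document.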


2. Loading Documents

Before processing documents, we must load them.

LangChain provides multiple loaders.


TextLoader

Used for:

  • .txt files
  • plain text files

Import

from langchain_community.document_loaders import TextLoader


Example

loader = TextLoader(
    "data/text/sample.txt",
    encoding="utf-8"
)

documents = loader.load()

print(documents)



DirectoryLoader

Loads multiple files from a folder.

Useful when:

You have:

100 PDFs
50 TXT files
many documents


Import

from langchain_community.document_loaders import DirectoryLoader


Example

loader = DirectoryLoader(
    "data/text",
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={
        "encoding":"utf-8"
    }
)

documents = loader.load()

print(documents)



PDF Loader

Most enterprise RAG systems use PDFs.

LangChain supports:

PyPDFLoader

Simple and fast.

Import

from langchain_community.document_loaders import PyPDFLoader


Example

loader = PyPDFLoader(
    "data/pdf/rag_guide.pdf"
)

documents = loader.load()

print(documents[0])


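Each page becomes its own Document. Page numbers in the metadata start at 0, and the file path is stored as the source:

Document(
    page_content="Page text",
    metadata={"source": "data/pdf/rag_guide.pdf", "page": 0}
)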



3. Chunking Documents

Chunking is one of the most important parts of RAG.

Why?

Because LLMs have token limits.

You cannot send a 500-page PDF to GPT in a single prompt.

Instead:

We split documents into smaller chunks.


Why Chunking Matters?

Bad chunking causes:

❌ poor retrieval

❌ hallucination

❌ context loss

Good chunking improves:

✅ retrieval quality

✅ relevance

✅ accuracy


RecursiveCharacterTextSplitter

Most commonly used splitter.

Import

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter
)


Code

text_splitter = (
    RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        length_function=len,
        separators=[
            "\n\n",
            "\n",
            " ",
            ""
        ]
    )
)

chunks = text_splitter.split_documents(
    documents
)

print(len(chunks))


Parameters Explained

chunk_size

How large each chunk should be.

Example:

chunk_size=500


means:

At most 500 characters per chunk (length is measured by length_function, here len).


chunk_overlap

Prevents context loss.

Example:

Chunk 1:

Artificial Intelligence is...


Chunk 2 starts with:

Intelligence is...


This preserves continuity.
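
To see how chunk_size and chunk_overlap interact, here is a simplified sketch of a fixed-size splitter in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to cut at separators such as paragraph breaks); the function `split_text` is a hypothetical stand-in that only shows the sliding-window idea.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_text(text, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.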


Best Practices

Recommended:

chunk_size = 300-800
chunk_overlap = 30-100


for most enterprise RAG systems.

4. Understanding Embeddings

Once chunking is completed, we need to convert text into a format machines can understand.

Models operate on numbers (vectors), not raw text.

This is where Embeddings come in.


What are Embeddings?

Embeddings convert text into numerical vector representations.

Example:

Text:

"Artificial Intelligence"


becomes:

[0.24, -0.76, 0.88, ....]


These vectors help us find:

Semantic Meaning

Example:

What is AI?


and

Explain Artificial Intelligence


have similar meanings.

Embedding models place them close together in vector space.
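
"Close together" is usually measured with cosine similarity. The sketch below computes it in plain Python. The three-dimensional vectors are hand-made for illustration only; real embedding vectors have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (assumption: not produced by any real model)
what_is_ai = [0.9, 0.1, 0.0]
explain_ai = [0.8, 0.2, 0.1]
pizza_recipe = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(what_is_ai, explain_ai)
sim_unrelated = cosine_similarity(what_is_ai, pizza_recipe)

print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

The two AI questions score close to 1, while the unrelated text scores close to 0.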


Why Embeddings are Important in RAG?

Without embeddings:

Search becomes:

Keyword matching


Example:

Searching:

CEO


Only returns exact keyword matches.

With embeddings:

Search becomes:

Semantic Search


Meaning-based retrieval.

Even if wording differs.


NVIDIA Embeddings

We will use:

NVIDIA Llama Nemotron Embedding Model


Advantages:

✅ Fast

✅ High-quality embeddings

✅ Good semantic understanding

✅ Free developer tier


Import Required Libraries

import os

from dotenv import load_dotenv

from langchain_nvidia_ai_endpoints import (
    NVIDIAEmbeddings
)



Load Environment Variables

load_dotenv()



Initialize Embedding Model

# Create the embedding client; the API key is read from .env
embedding_model = NVIDIAEmbeddings(
    model="nvidia/llama-nemotron-embed-vl-1b-v2",
    nvidia_api_key=os.getenv("NVIDIA_API_KEY"),
)



Convert Chunks into Embeddings

Before embedding:

We only need:

page_content


from chunks.

Extract Text

texts = [
    chunk.page_content
    for chunk in chunks
]



Generate Embeddings

embedded_vectors = (
    embedding_model.embed_documents(
        texts
    )
)



Check Embedding Dimension

print(
    len(
        embedded_vectors
    )
)

print(
    len(
        embedded_vectors[0]
    )
)


Example output (the first number depends on how many chunks your documents produce):

50
2048


Meaning:

50 chunks
2048 dimensional vector



Query Embedding

User questions also need embeddings.

Example:

query = (
    "What is RAG?"
)

query_embedding = (
    embedding_model.embed_query(
        query
    )
)


Now query and document vectors can be compared.


5. Vector Databases (Milvus)

Imagine storing millions of embeddings in a SQL table and comparing every query against every row.

That would be very slow.

Traditional databases are not optimized for:

Similarity Search


We need:

Vector Database

Examples:

  • Pinecone
  • FAISS
  • Chroma
  • Milvus
  • Weaviate

We will use:

Milvus

Why?

✅ Fast retrieval

✅ Open-source

✅ Enterprise-ready

✅ Optimized for vectors


Install Milvus

pip install pymilvus



Import Milvus

from pymilvus import (
    MilvusClient
)



Create Milvus Connection

client = MilvusClient(
    uri="milvus_demo.db"
)

print(
    "Connected Successfully"
)



Create Collection

A collection is like a SQL table for vector data.


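# Create the collection only if it does not exist yet,
# instead of catching and printing every exception
if client.has_collection(collection_name="rag_collection"):
    print("Collection already exists")
else:
    client.create_collection(
        collection_name="rag_collection",
        dimension=2048,  # must match the embedding dimension
    )
    print("Collection Created")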



Why Dimension Matters?

Embedding vector size:

2048


Collection dimension must match embedding dimension.

Otherwise:

Insertion will fail

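
One way to avoid a mismatch is to read the dimension from the vectors themselves rather than typing it by hand. The helper below is a small sketch in plain Python; `infer_dimension` is a hypothetical name, and the sample vectors are made up.

```python
def infer_dimension(vectors: list[list[float]]) -> int:
    """Return the length shared by all vectors, or raise if they differ."""
    if not vectors:
        raise ValueError("no vectors to inspect")
    dimension = len(vectors[0])
    for index, vector in enumerate(vectors):
        if len(vector) != dimension:
            raise ValueError(
                f"vector {index} has {len(vector)} dimensions, expected {dimension}"
            )
    return dimension


# Made-up vectors; in the pipeline this would be embedded_vectors
sample_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
dimension = infer_dimension(sample_vectors)
print(dimension)
```

The returned value can then be passed as the dimension when creating the collection, so switching to a different embedding model does not silently break insertion.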


Insert Data into Milvus

We store:

  1. ID
  2. Embedding vector
  3. Chunk text

Prepare Data

# Build one row per chunk: an integer ID, the embedding vector, and the chunk text
data = [
    {"id": i, "vector": embedding, "text": chunk.page_content}
    for i, (chunk, embedding) in enumerate(zip(chunks, embedded_vectors))
]



Insert into Collection

client.insert(
    collection_name=
    "rag_collection",

    data=data
)

print(
    "Inserted Successfully"
)



6. Similarity Retrieval

Now comes the real magic.

When a user asks:

"What is RAG?"


We do:

  1. Convert query → embedding
  2. Search similar vectors
  3. Return relevant chunks
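
The three steps above can be sketched as a brute-force search in plain Python. A vector database does the same job with indexes so it scales to millions of rows; this toy version simply scores every row. The two-dimensional vectors and the `search` function are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector: list[float], rows: list[dict], limit: int = 2) -> list[dict]:
    """Score every stored row against the query and return the best `limit` rows."""
    scored = [
        {"score": cosine_similarity(query_vector, row["vector"]), "text": row["text"]}
        for row in rows
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:limit]


# Toy rows in the same shape as the data inserted earlier (id, vector, text)
rows = [
    {"id": 0, "vector": [1.0, 0.0], "text": "RAG combines retrieval with generation."},
    {"id": 1, "vector": [0.0, 1.0], "text": "Kubernetes schedules containers."},
    {"id": 2, "vector": [0.8, 0.6], "text": "Retrieval supplies context to the LLM."},
]

query_vector = [0.9, 0.1]  # stands in for the embedded user question
hits = search(query_vector, rows, limit=2)
for hit in hits:
    print(round(hit["score"], 3), hit["text"])
```

The Kubernetes row points in a different direction from the query, so it is left out of the top two.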

Generate Query Embedding

query = (
    "What is RAG?"
)

query_embedding = (
    embedding_model.embed_query(
        query
    )
)



Search in Milvus

results = client.search(

    collection_name=
    "rag_collection",

    data=[
        query_embedding
    ],

    limit=5,

    output_fields=[
        "text"
    ]
)



Understanding Parameters

limit

How many chunks to retrieve.

Example:

limit=5


returns:

Top 5 relevant chunks



output_fields

Fields to return.

Example:

"text"


returns chunk text.


View Retrieved Chunks

for result in results[0]:

    print(
        result["entity"]
        ["text"]
    )

    print(
        "----------------"
    )



Problem with Similarity Search

Sometimes:

Top results are not the best.

Example:

Query:

What is RAG?


Retrieved:

Machine Learning


instead of:

Retrieval-Augmented Generation


This happens because:

Vector similarity is only an approximation of relevance.

Solution?

Reranking


7. Reranking

Reranking improves retrieval quality.

Instead of trusting:

Top K vectors


We re-score chunks.


Why Reranking Matters?

Without reranking:

Bad chunks may enter context.

Result:

❌ hallucination

❌ irrelevant answers

With reranking:

Only the most relevant chunks are sent to the LLM.
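
The two-stage idea can be sketched without any external service. In the toy example below, the candidates arrive in the order a first-stage search might return them, and a second scorer reorders them. The scorer `toy_relevance_score` is a hypothetical stand-in based on word overlap; a real reranker uses a trained model that reads the query and the chunk together.

```python
def toy_relevance_score(query: str, text: str) -> float:
    """Fraction of query words that appear in the text (stand-in for a reranker model)."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


def rerank(query: str, candidates: list[str], top_n: int = 2) -> list[str]:
    """Re-score first-stage candidates and keep only the top_n best."""
    ordered = sorted(
        candidates,
        key=lambda text: toy_relevance_score(query, text),
        reverse=True,
    )
    return ordered[:top_n]


# First-stage results, with a weak match ranked first on purpose
candidates = [
    "machine learning is a field of ai",
    "retrieval augmented generation grounds answers in documents",
    "retrieval augmented generation adds retrieval before generation",
]

query = "retrieval augmented generation"
reranked = rerank(query, candidates, top_n=2)
for text in reranked:
    print(text)
```

The machine learning chunk was first in the candidate list but is dropped after re-scoring, which is exactly the effect reranking is meant to have.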


Import Reranker

from langchain_nvidia_ai_endpoints import (
    NVIDIARerank
)



Initialize Reranker

reranker = (
    NVIDIARerank(
        nvidia_api_key=
        os.getenv(
            "NVIDIA_API_KEY"
        )
    )
)



Convert Milvus Results → Documents

Reranker expects:

LangChain Documents


not strings.

from langchain_core.documents import (
    Document
)

retrieved_docs = [

    Document(
        page_content=
        r["entity"]
        ["text"]
    )

    for r in results[0]
]



Run Reranking

reranked_docs = (
    reranker.compress_documents(

        documents=
        retrieved_docs,

        query=query
    )
)



View Reranked Results

for doc in reranked_docs:

    print(
        doc.page_content
    )


The chunks are now ordered by the reranker's relevance score, so the best matches come first.


8. Azure OpenAI Response Generation

Finally, we generate the answer.


Import Azure OpenAI

from langchain_openai import (
    AzureChatOpenAI
)



Initialize LLM

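llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    # Read the deployment name from .env instead of hardcoding it
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT", "gpt-4o"),
    # Azure OpenAI needs an API version; the fallback value here is an
    # assumption, so use the version your Azure resource supports
    api_version=os.getenv("OPENAI_API_VERSION", "2024-06-01"),
    temperature=0.2,
)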



Why Low Temperature?

Lower:

temperature=0.2


means:

More deterministic answers that stay closer to the provided context.

Good for:

RAG systems



Build Context

context = "\n".join([

    doc.page_content

    for doc in reranked_docs
])



Prompt Engineering

prompt = f"""

Answer ONLY
from context.

Context:

{context}

Question:

{query}

"""


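Strict prompt:

Reduces hallucination by telling the model to stay within the retrieved context. It cannot eliminate hallucination entirely.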
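
A slightly more defensive variant numbers each chunk and tells the model what to do when the context has no answer. This is a sketch under my own assumptions: the function name `build_prompt`, the exact wording, and the "I don't know" fallback are illustrative choices, not part of any library.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Build a grounded prompt with numbered chunks and an explicit fallback."""
    numbered_context = "\n\n".join(
        f"[{number}] {chunk.strip()}"
        for number, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using ONLY the context below.\n"
        "If the context does not contain the answer, reply exactly: I don't know.\n"
        "Cite the chunk numbers you used, for example [1].\n\n"
        f"Context:\n{numbered_context}\n\n"
        f"Question:\n{question}\n"
    )


# Made-up chunks; in the pipeline these come from the reranked documents
example_chunks = [
    "RAG retrieves documents before generating an answer.",
    "Reranking re-scores retrieved chunks.",
]
example_prompt = build_prompt("What is RAG?", example_chunks)
print(example_prompt)
```

Numbered chunks make it possible to trace each claim in the answer back to a specific piece of context, and the fallback gives the model a safe way out instead of guessing.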


Generate Answer

response = llm.invoke(
    prompt
)

print(
    response.content
)



9. Langfuse Observability

Production AI systems require monitoring.

Questions:

Did retrieval work?
Did hallucination happen?
Was response relevant?


Langfuse helps answer these questions by recording what each step received and returned.


Install

pip install langfuse



Import

from langfuse import (
    Langfuse
)



Initialize Langfuse

langfuse = Langfuse(

    public_key=
    os.getenv(
        "LANGFUSE_PUBLIC_KEY"
    ),

    secret_key=
    os.getenv(
        "LANGFUSE_SECRET_KEY"
    ),

    host=
    os.getenv(
        "LANGFUSE_BASE_URL"
    )
)



Log Retrieval

langfuse.create_event(

    name="retrieval",

    input={
        "query":
        query
    },

    output={
        "chunks":
        context
    }
)



10. RAG Evaluation

We evaluate:

Retrieval Quality

Were the retrieved chunks relevant?


Faithfulness

Was the answer grounded in the context?


Hallucination Score

Did the model invent information?


Answer Relevance

Did the answer actually address the query?


Example evaluation prompt:

evaluation_prompt = f"""

Evaluate:

Question:
{query}

Answer:
{response.content}

Context:
{context}

Score:
1. faithfulness
2. hallucination
3. relevance
"""

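
Free-text scores are hard to track over time, so one option is to ask the judge model for JSON and validate the reply. The sketch below assumes a 1 to 5 scale and the three score names from the prompt above; the function `parse_scores` and the sample reply are hypothetical, and no model is called.

```python
import json

SCORE_KEYS = ("faithfulness", "hallucination", "relevance")


def parse_scores(raw_reply: str) -> dict[str, int]:
    """Parse a JSON reply from the judge model and check every score is in 1..5."""
    data = json.loads(raw_reply)
    scores = {}
    for key in SCORE_KEYS:
        value = int(data[key])  # raises KeyError if the judge skipped a score
        if not 1 <= value <= 5:
            raise ValueError(f"{key} score {value} is outside the 1-5 range")
        scores[key] = value
    return scores


# Made-up reply, shaped like what a judge model could be asked to return
fake_reply = '{"faithfulness": 5, "hallucination": 1, "relevance": 4}'
scores = parse_scores(fake_reply)
print(scores)
```

Validated numbers like these can be stored next to each trace, which makes regressions visible when chunking or prompts change.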


Production RAG Pipeline

PDFs
 ↓
Loaders
 ↓
Chunking
 ↓
Embeddings
 ↓
Milvus
 ↓
Retrieval
 ↓
Reranking
 ↓
Prompt Building
 ↓
GPT-4o
 ↓
Answer
 ↓
Langfuse Monitoring
 ↓
Evaluation



Common Challenges

Bad Retrieval

Fix:

✅ Better chunking

✅ Reranking

✅ Hybrid Search
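
Hybrid search needs a way to merge a keyword ranking with a vector ranking. Reciprocal rank fusion (RRF) is one common choice, shown here as a plain Python sketch. The document IDs are made up, and k=60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: each document earns 1 / (k + rank) per list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Made-up results from two retrievers
keyword_hits = ["doc_b", "doc_a", "doc_d"]
vector_hits = ["doc_a", "doc_c", "doc_b"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
# ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```

Documents found by both retrievers rise to the top, while documents found by only one still stay in the list.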


Hallucination

Fix:

✅ Strict prompts

✅ Low temperature

✅ Better retrieval


Large PDFs

Fix:

✅ Chunking strategy

✅ Metadata filtering


Advanced RAG Techniques

Multi-Vector Retrieval

One chunk → multiple embeddings (for example, the chunk itself plus a summary of it).

A query can match any of them, which improves retrieval.


HyDE

Generate a hypothetical answer first.

Then embed that answer and search with it instead of the raw question.
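
The HyDE flow can be sketched with stand-ins for the model calls. Everything below is a toy: `toy_generate` returns a canned answer instead of calling an LLM, and `toy_embed` counts words from a tiny made-up vocabulary instead of calling an embedding model.

```python
from typing import Callable

VOCAB = ["rag", "retrieval", "generation", "pizza"]  # toy vocabulary


def toy_embed(text: str) -> list[int]:
    """Count each vocabulary word in the text (stand-in for an embedding model)."""
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]


def toy_generate(question: str) -> str:
    """Stand-in for an LLM call that drafts a hypothetical answer."""
    return "rag means retrieval augmented generation"


def hyde_search(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], list[int]],
    documents: dict[str, str],
) -> str:
    """Embed a hypothetical answer, then return the ID of the closest document."""
    hypothetical_answer = generate(question)
    query_vector = embed(hypothetical_answer)

    def score(doc_id: str) -> int:
        doc_vector = embed(documents[doc_id])
        return sum(q * d for q, d in zip(query_vector, doc_vector))

    return max(documents, key=score)


documents = {
    "doc_rag": "retrieval augmented generation combines retrieval and generation",
    "doc_pizza": "pizza dough needs time to rise",
}

best = hyde_search("what is rag", toy_generate, toy_embed, documents)
print(best)
```

In this toy setup the question alone shares no vocabulary word with either document, but the hypothetical answer does, which is the gap HyDE is meant to bridge.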


RAPTOR

Hierarchical retrieval tree built from recursive summaries of the chunks.

Better long document understanding.


Semantic Routing

Route each query to the most suitable retriever, index, or prompt based on its meaning.


ColBERT

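Token-level retrieval: query tokens and document tokens are embedded separately and matched one by one.

Highly accurate, at the cost of a larger index.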


Final Thoughts

Basic RAG:

Retrieve → Generate


Production RAG:

Retrieve
→ Rerank
→ Evaluate
→ Monitor
→ Improve


That is how enterprise AI systems are built 🚀