Master RAG Systems: Build an End-to-End LangChain Pipeline with Milvus, Reranking & Azure OpenAI 🚀

Beyond Basic RAG: Learn LangChain + RAG End-to-End 🚀

Introduction

Retrieval-Augmented Generation (RAG) is one of the most important concepts in modern Generative AI.

Large Language Models (LLMs) like GPT-4, Claude, LLaMA, and Gemini are powerful. However, they suffer from one major issue:

Hallucination

Hallucination means:

The model confidently generates incorrect information.

Example:

Question:

Who is the CEO of my company?

Without access to your internal company data, an LLM may generate a completely wrong answer.

This is where RAG (Retrieval-Augmented Generation) becomes useful.

Instead of relying only on pretrained knowledge, RAG retrieves relevant information from external sources and provides context to the LLM before generating a response.

What is RAG?

RAG stands for:

Retrieval-Augmented Generation

Instead of:

Question → LLM → Answer

We do:

Question
   ↓
Retrieve Relevant Documents
   ↓
Provide Context to LLM
   ↓
Generate Grounded Response

This makes responses:

✅ More accurate

✅ Context-aware

✅ Less hallucinated

✅ Enterprise-ready
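The retrieve-then-generate flow above can be sketched without any external service. This is a toy version under stated assumptions: the documents are made up, word overlap stands in for vector search, and no LLM is called. It only builds the grounded prompt that would be sent.

```python
import re

# Hypothetical in-memory knowledge base; the sentences are made up for illustration.
DOCUMENTS = [
    "Acme Corp was founded in 2010 in Pune.",
    "The CEO of Acme Corp is Priya Nair.",
    "Acme Corp sells industrial sensors.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=1):
    """Rank documents by words shared with the question (stand-in for vector search)."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, context_docs):
    """Place the retrieved text in front of the question."""
    context = "\n".join(context_docs)
    return (
        "Answer ONLY from the context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}"
    )


question = "Who is the CEO of Acme Corp?"
prompt = build_prompt(question, retrieve(question, DOCUMENTS))
print(prompt)
```

The rest of this guide replaces each toy piece with a real component: embeddings and Milvus for retrieval, and Azure OpenAI for generation.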

Complete RAG Architecture

Documents (PDFs, DOCX, TXT)
            ↓
      Document Loading
            ↓
         Chunking
            ↓
         Embeddings
            ↓
      Vector Database
            ↓
      Similarity Search
            ↓
         Reranking
            ↓
       Context Building
            ↓
            LLM
            ↓
         Final Answer
            ↓
     Monitoring & Evaluation

Required Installation

Before starting, install all dependencies.

pip install langchain
pip install langchain-community
pip install langchain-core
pip install langchain-openai
pip install langchain-text-splitters
pip install langchain-nvidia-ai-endpoints
pip install pymilvus
pip install pymupdf
pip install pypdf
pip install langfuse
pip install python-dotenv

Project Structure

project/
│
├── data/
│   ├── pdf/
│   └── text/
│
├── .env
├── rag_pipeline.py
└── requirements.txt

Environment Variables (.env)

Never hardcode API keys.

Create a .env file.

NVIDIA_API_KEY=your_key
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_KEY=your_key
AZURE_OPENAI_DEPLOYMENT=gpt-4o

LANGFUSE_PUBLIC_KEY=your_key
LANGFUSE_SECRET_KEY=your_key
LANGFUSE_BASE_URL=https://cloud.langfuse.com
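A missing key usually shows up later as a confusing authentication error, so it can help to check the settings once at startup. A minimal sketch, assuming the key names from the .env file above; the helper names missing_settings and require_settings are made up for this example.

```python
import os

# Key names taken from the .env file shown above.
REQUIRED_KEYS = [
    "NVIDIA_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
]


def missing_settings(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the given mapping."""
    return [key for key in required if not env.get(key)]


def require_settings(env=None):
    """Raise a clear error if any required setting is missing."""
    env = os.environ if env is None else env
    missing = missing_settings(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Call require_settings() right after load_dotenv() so the script stops early with a readable message.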

1. Understanding LangChain Document Structure

LangChain stores documents in a standardized format.

A document contains:

page_content
metadata

page_content

This contains actual text.

Example:

page_content = "Generative AI is growing rapidly."

metadata

Metadata stores additional information.

Examples:

file name
author
created date
source
page number

Creating a LangChain Document

Import

from langchain_core.documents import Document

Code

from langchain_core.documents import Document

doc = Document(
    page_content="""
    Generative AI is a subset of Artificial Intelligence
    focused on creating content.
    """,
    metadata={
        "source": "genai.pdf",
        "author": "Sridhar",
        "pages": 10
    }
)

print(doc)

Output

Document(
    page_content='Generative AI...',
    metadata={
        'source': 'genai.pdf',
        'author': 'Sridhar',
        'pages': 10
    }
)

Why metadata matters?

In enterprise AI:

You often want:

“Show answer from document X page 5”

Metadata helps with traceability.
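Here is a small sketch of how metadata can become a citation next to an answer. SimpleDoc is a hypothetical stand-in with the same two fields as a LangChain Document, used only so the example runs without extra packages; format_citation is also a made-up helper.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in with the same two fields as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_citation(doc):
    """Build a short 'source, page N' label from whatever metadata is present."""
    source = doc.metadata.get("source", "unknown source")
    page = doc.metadata.get("page")
    return f"{source}, page {page}" if page is not None else source


doc = SimpleDoc(
    "Generative AI is growing rapidly.",
    {"source": "genai.pdf", "page": 5},
)
print(f"{doc.page_content} [{format_citation(doc)}]")
```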

2. Loading Documents

Before processing documents, we must load them.

LangChain provides multiple loaders.

TextLoader

Used for:

.txt files
plain text files

Import

from langchain_community.document_loaders import TextLoader

Example

loader = TextLoader(
    "data/text/sample.txt",
    encoding="utf-8"
)

documents = loader.load()

print(documents)

DirectoryLoader

Loads multiple files from a folder.

Useful when:

You have:

100 PDFs
50 TXT files
many documents

Import

from langchain_community.document_loaders import DirectoryLoader

Example

loader = DirectoryLoader(
    "data/text",
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={
        "encoding":"utf-8"
    }
)

documents = loader.load()

print(documents)

PDF Loader

Most enterprise RAG systems use PDFs.

LangChain supports:

PyPDFLoader

Simple and fast.

Import

from langchain_community.document_loaders import PyPDFLoader

Example

loader = PyPDFLoader(
    "data/pdf/rag_guide.pdf"
)

documents = loader.load()

print(documents[0])

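Each page becomes one Document. The metadata includes the source path and a page number that starts at 0:

Document(
    page_content="Page text",
    metadata={"source": "data/pdf/rag_guide.pdf", "page": 0}
)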

3. Chunking Documents

Chunking is one of the most important parts of RAG.

Why?

Because LLMs have token limits.

You cannot send:

A 500-page PDF

to GPT.

Instead:

We split documents into smaller chunks.

Why Chunking Matters?

Bad chunking causes:

❌ poor retrieval

❌ hallucination

❌ context loss

Good chunking improves:

✅ retrieval quality

✅ relevance

✅ accuracy

RecursiveCharacterTextSplitter

Most commonly used splitter.

Import

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter
)

Code

text_splitter = (
    RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        length_function=len,
        separators=[
            "\n\n",
            "\n",
            " ",
            ""
        ]
    )
)

chunks = text_splitter.split_documents(
    documents
)

print(len(chunks))

Parameters Explained

chunk_size

How large each chunk should be.

Example:

chunk_size=500

means:

At most 500 characters per chunk, because length_function=len counts characters, not tokens.

chunk_overlap

Prevents context loss.

Example:

Chunk 1:

Artificial Intelligence is...

Chunk 2 starts with:

Intelligence is...

This preserves continuity.
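To see overlap in action, here is a minimal sliding-window splitter. It is a simplified sketch, not how RecursiveCharacterTextSplitter works internally (that splitter also tries the separators in order). The function name split_with_overlap is made up for this example.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where each window repeats the
    last chunk_overlap characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the continuity the overlap is meant to keep.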

Best Practices

Recommended starting point:

chunk_size = 300–800 characters
chunk_overlap = 30–100 characters

for most enterprise RAG systems.
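When comparing settings, a rough chunk count helps estimate embedding and storage cost. The sketch below assumes a plain sliding window over characters, so treat the result as an approximation; a separator-aware splitter will produce somewhat different numbers. The function name estimate_chunk_count is made up for this example.

```python
import math


def estimate_chunk_count(text_length, chunk_size, chunk_overlap):
    """Approximate number of chunks for a sliding window over text_length characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    # Each window after the first adds `step` new characters.
    return math.ceil((text_length - chunk_overlap) / step)


for size, overlap in [(300, 30), (500, 50), (800, 100)]:
    count = estimate_chunk_count(10_000, size, overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap}: about {count} chunks")
```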

4. Understanding Embeddings

Once chunking is completed, we need to convert text into a format machines can understand.

LLMs understand:

Numbers (Vectors)

Not raw text.

This is where Embeddings come in.

What are Embeddings?

Embeddings convert text into numerical vector representations.

Example:

Text:

"Artificial Intelligence"

becomes:

[0.24, -0.76, 0.88, ....]

These vectors help us find:

Semantic Meaning

Example:

What is AI?

and

Explain Artificial Intelligence

have similar meanings.

Embedding models place them close together in vector space.
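"Close together" is usually measured with cosine similarity. A minimal sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and these numbers do not come from any model):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up vectors, chosen so the two AI questions point in a similar direction.
what_is_ai = [0.9, 0.1, 0.2]
explain_ai = [0.8, 0.2, 0.3]
pizza_recipe = [0.1, 0.9, 0.1]

print(cosine_similarity(what_is_ai, explain_ai))    # high: similar meaning
print(cosine_similarity(what_is_ai, pizza_recipe))  # low: unrelated meaning
```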

Why Embeddings are Important in RAG?

Without embeddings:

Search becomes:

Keyword matching

Example:

Searching:

CEO

Only returns exact keyword matches.

With embeddings:

Search becomes:

Semantic Search

Meaning-based retrieval.

Even if wording differs.
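The keyword limitation is easy to reproduce. In this sketch (the two sentences are made up), a plain substring search for "CEO" misses the sentence that spells the title out in full, which is the kind of match an embedding-based search is meant to recover.

```python
def keyword_search(query, documents):
    """Return documents that contain the query as a case-insensitive substring."""
    needle = query.lower()
    return [doc for doc in documents if needle in doc.lower()]


documents = [
    "Our chief executive officer announced the new strategy.",
    "The CEO approved the budget.",
]

# Only the second sentence is found, although both are about the same role.
print(keyword_search("CEO", documents))
```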

NVIDIA Embeddings

We will use:

NVIDIA Llama Nemotron Embedding Model

Advantages:

✅ Fast

✅ High-quality embeddings

✅ Good semantic understanding

✅ Free developer tier

Import Required Libraries

import os

from dotenv import load_dotenv

from langchain_nvidia_ai_endpoints import (
    NVIDIAEmbeddings
)

Load Environment Variables

load_dotenv()

Initialize Embedding Model

embedding_model = (
    NVIDIAEmbeddings(
        model=
        "nvidia/llama-nemotron-embed-vl-1b-v2",

        nvidia_api_key=
        os.getenv(
            "NVIDIA_API_KEY"
        )
    )
)

Convert Chunks into Embeddings

Before embedding:

We only need:

page_content

from chunks.

Extract Text

texts = [
    chunk.page_content
    for chunk in chunks
]

Generate Embeddings

embedded_vectors = (
    embedding_model.embed_documents(
        texts
    )
)

Check Embedding Dimension

print(
    len(
        embedded_vectors
    )
)

print(
    len(
        embedded_vectors[0]
    )
)

Output:

50
2048

Meaning:

50 chunks
2048 dimensional vector

Query Embedding

User questions also need embeddings.

Example:

query = (
    "What is RAG?"
)

query_embedding = (
    embedding_model.embed_query(
        query
    )
)

Now query and document vectors can be compared.

5. Vector Databases (Milvus)

Imagine storing:

Millions of embeddings

in SQL.

Very slow, because finding the nearest vectors means comparing the query against every row.

Traditional databases are not optimized for:

Similarity Search

We need:

Vector Database

Examples:

Pinecone
FAISS
Chroma
Milvus
Weaviate

We will use:

Milvus

Why?

✅ Fast retrieval

✅ Open-source

✅ Enterprise-ready

✅ Optimized for vectors

Install Milvus

pip install pymilvus

Import Milvus

from pymilvus import (
    MilvusClient
)

Create Milvus Connection

# A local file path makes pymilvus use Milvus Lite (embedded, no server needed)
client = MilvusClient(
    uri="milvus_demo.db"
)

print(
    "Connected Successfully"
)

Create Collection

A collection is like:

SQL Table

for vector data.

Create Collection

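# Create the collection only if it does not exist yet,
# instead of catching and printing every exception
if not client.has_collection("rag_collection"):

    client.create_collection(
        collection_name="rag_collection",

        # Must equal len(embedded_vectors[0])
        dimension=2048
    )

    print("Collection Created")

else:

    print("Collection Already Exists")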

Why Dimension Matters?

The collection dimension must match the embedding vector size (2048 for the model used here).

Otherwise:

Insertion will fail
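A quick check before inserting can catch a mismatch early and point to the offending vector. A minimal sketch; validate_dimensions is a made-up helper, not part of pymilvus.

```python
def validate_dimensions(vectors, expected_dim):
    """Raise ValueError naming the first vector whose length differs from expected_dim.

    Returns the number of vectors checked when all of them match.
    """
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {index} has dimension {len(vector)}, expected {expected_dim}"
            )
    return len(vectors)


# Toy 3-dimensional vectors; in the pipeline this would be embedded_vectors and 2048.
print(validate_dimensions([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], expected_dim=3))
```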

Insert Data into Milvus

We store:

ID
Embedding vector
Chunk text

Prepare Data

data = []

for i, (
    chunk,
    embedding
) in enumerate(
    zip(
        chunks,
        embedded_vectors
    )
):

    data.append({

        "id": i,

        "vector":
        embedding,

        "text":
        chunk.page_content
    })

Insert into Collection

client.insert(
    collection_name=
    "rag_collection",

    data=data
)

print(
    "Inserted Successfully"
)

6. Similarity Retrieval

Now comes the real magic.

When a user asks:

"What is RAG?"

We do:

Convert query → embedding
Search similar vectors
Return relevant chunks
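Conceptually, those three steps amount to scoring every stored vector against the query and keeping the best ones. The sketch below does this by brute force with toy 2-dimensional vectors. It is only an illustration: a vector database such as Milvus uses indexes to avoid scanning every row, and the row shape here simply mirrors the id, vector, and text fields prepared earlier.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, rows, k=5):
    """Score every row against the query and return the k best, highest score first."""
    scored = ((cosine_similarity(query_vector, row["vector"]), row) for row in rows)
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [
        {"id": row["id"], "score": score, "text": row["text"]}
        for score, row in best
    ]


# Toy rows with made-up vectors.
rows = [
    {"id": 0, "vector": [1.0, 0.0], "text": "RAG combines retrieval with generation."},
    {"id": 1, "vector": [0.0, 1.0], "text": "Milvus stores vectors."},
    {"id": 2, "vector": [0.7, 0.7], "text": "Retrieval finds relevant chunks."},
]

for hit in top_k([0.9, 0.1], rows, k=2):
    print(hit["id"], round(hit["score"], 3), hit["text"])
```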

Generate Query Embedding

query = (
    "What is RAG?"
)

query_embedding = (
    embedding_model.embed_query(
        query
    )
)

Search in Milvus

results = client.search(

    collection_name=
    "rag_collection",

    data=[
        query_embedding
    ],

    limit=5,

    output_fields=[
        "text"
    ]
)

Understanding Parameters

limit

How many chunks to retrieve.

Example:

limit=5

returns:

Top 5 relevant chunks

output_fields

Fields to return.

Example:

"text"

returns chunk text.

View Retrieved Chunks

for result in results[0]:

    print(
        result["entity"]
        ["text"]
    )

    print(
        "----------------"
    )

Problem with Similarity Search

Sometimes the chunks closest to the query in vector space are not the ones that best answer it.

The fix:

Reranking

7. Reranking

Reranking improves retrieval quality.

Instead of trusting:

Top K vectors

We re-score chunks.

Why Reranking Matters?

Without reranking:

Bad chunks may enter context.

Result:

❌ hallucination

❌ irrelevant answers

With reranking:

Only the most relevant chunks are sent to the LLM.
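The idea is a second, more careful scoring pass over a small candidate list. In the sketch below, a simple word-overlap score stands in for the reranker model, so this only shows the shape of the step, not how the NVIDIA reranker scores text. The function names are made up for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, text):
    """Fraction of query words that appear in the text (stand-in for a model score)."""
    query_words = tokenize(query)
    if not query_words:
        return 0.0
    return len(query_words & tokenize(text)) / len(query_words)


def rerank(query, candidates, top_n=3, score_fn=overlap_score):
    """Re-score the retrieved candidates and keep the top_n best."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # sort() is stable, so candidates with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


candidates = [
    "Milvus is a vector database.",
    "RAG retrieves documents before generating an answer.",
    "What is RAG? RAG is retrieval-augmented generation.",
]
print(rerank("What is RAG?", candidates, top_n=2))
```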

Import Reranker

from langchain_nvidia_ai_endpoints import (
    NVIDIARerank
)

Initialize Reranker

reranker = (
    NVIDIARerank(
        nvidia_api_key=
        os.getenv(
            "NVIDIA_API_KEY"
        )
    )
)

Convert Milvus Results → Documents

Reranker expects:

LangChain Documents

not strings.

from langchain_core.documents import (
    Document
)

retrieved_docs = [

    Document(
        page_content=
        r["entity"]
        ["text"]
    )

    for r in results[0]
]

Run Reranking

reranked_docs = (
    reranker.compress_documents(

        documents=
        retrieved_docs,

        query=query
    )
)

View Reranked Results

for doc in reranked_docs:

    print(
        doc.page_content
    )

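The reranker returns the chunks ordered by its relevance score, which usually gives the LLM a cleaner context than raw similarity order.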

8. Azure OpenAI Response Generation

Finally:

We generate the answer.

Import Azure OpenAI

from langchain_openai import (
    AzureChatOpenAI
)

Initialize LLM

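llm = AzureChatOpenAI(

    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),

    api_key=os.getenv("AZURE_OPENAI_KEY"),

    # Read the deployment name from .env instead of hardcoding it
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT"),

    # An API version is required. Assumption: AZURE_OPENAI_API_VERSION is an
    # extra .env entry; set it to a version your Azure resource supports.
    api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2024-06-01"),

    temperature=0.2
)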

Why Low Temperature?

Lower:

temperature=0.2

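means:

More deterministic, less random output, which suits answers that must stay close to the retrieved context.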
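One way to picture the effect is the textbook softmax with temperature: the model's raw scores are divided by the temperature before being turned into probabilities. The scores below are made up, and how a given provider implements sampling internally is an assumption here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for temperature in (1.0, 0.2):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At the lower temperature almost all of the probability moves to the top-scoring token, so repeated runs tend to produce the same answer.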

Build Context

context = "\n".join([

    doc.page_content

    for doc in reranked_docs
])

Prompt Engineering

prompt = f"""

Answer ONLY
from context.

Context:

{context}

Question:

{query}

"""

Strict prompt:

Reduces hallucination by telling the model to stay within the retrieved context. It cannot fully prevent it.
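A slightly more defensive variant numbers each chunk, caps the amount of chunk text, and tells the model what to do when the answer is missing. This is a sketch under assumptions: the helper names, the 2000-character budget, and the exact prompt wording are illustrative choices, not a required format.

```python
def build_context(chunks, max_chars=2000):
    """Number each chunk and stop before the chunk text would exceed max_chars."""
    parts = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        entry = f"[{number}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)


def build_prompt(context, question):
    """Combine the instructions, the numbered context, and the question."""
    return (
        "Answer ONLY from the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n"
    )


chunks = [
    "RAG retrieves relevant documents before generation.",
    "Reranking re-scores the retrieved chunks.",
]
print(build_prompt(build_context(chunks), "What is RAG?"))
```

Numbered chunks also make it possible to ask the model to cite which chunk it used.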

Generate Answer

response = llm.invoke(
    prompt
)

print(
    response.content
)

9. Langfuse Observability

Production AI systems require monitoring.

Questions:

Did retrieval work?
Did hallucination happen?
Was response relevant?

Langfuse solves this.

Install

pip install langfuse

Import

from langfuse import (
    Langfuse
)

Initialize Langfuse

langfuse = Langfuse(

    public_key=
    os.getenv(
        "LANGFUSE_PUBLIC_KEY"
    ),

    secret_key=
    os.getenv(
        "LANGFUSE_SECRET_KEY"
    ),

    host=
    os.getenv(
        "LANGFUSE_BASE_URL"
    )
)

Log Retrieval

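# create_event is the method name in Langfuse Python SDK v3;
# SDK v2 exposed the same idea as langfuse.event(...)
langfuse.create_event(

    name="retrieval",

    input={"query": query},

    output={"chunks": context}
)

# Send buffered events before a short-lived script exits
langfuse.flush()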

10. RAG Evaluation

We evaluate:

Retrieval Quality

Were chunks relevant?

Faithfulness

Was answer grounded?

Hallucination Score

Did model invent information?

Answer Relevance

Did answer actually solve query?

Example evaluation prompt:

evaluation_prompt = f"""

Evaluate:

Question:
{query}

Answer:
{response.content}

Context:
{context}

Score:
1. faithfulness
2. hallucination
3. relevance
"""
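Retrieval quality can also be measured without an LLM when you have a small labeled set of questions with known relevant chunk IDs. A minimal sketch of precision@k and recall@k; the IDs below are made up.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Share of the top-k retrieved chunks that are actually relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    hits = sum(1 for chunk_id in top if chunk_id in relevant_ids)
    return hits / len(top)


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of all relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


retrieved = [3, 7, 1, 9, 4]   # chunk IDs returned by the search, best first
relevant = {3, 1, 8}          # chunk IDs a human marked as relevant

print(precision_at_k(retrieved, relevant, k=5))
print(recall_at_k(retrieved, relevant, k=5))
```

Tracking these two numbers before and after a chunking or reranking change shows whether retrieval actually improved.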

Production RAG Pipeline

PDFs
 ↓
Loaders
 ↓
Chunking
 ↓
Embeddings
 ↓
Milvus
 ↓
Retrieval
 ↓
Reranking
 ↓
Prompt Building
 ↓
GPT-4o
 ↓
Answer
 ↓
Langfuse Monitoring
 ↓
Evaluation

Common Challenges

Bad Retrieval

Fix:

✅ Better chunking

✅ Reranking

✅ Hybrid Search
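Hybrid search combines a keyword ranking with a vector ranking. One common way to merge the two lists is reciprocal rank fusion (RRF). That choice is an assumption for this sketch, since the article does not prescribe a fusion method. The document IDs are made up, and k=60 is the constant commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked ID lists; each appearance adds 1 / (k + rank) to an ID's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


keyword_ranking = ["doc_b", "doc_a", "doc_c"]   # e.g. from keyword search
vector_ranking = ["doc_a", "doc_d", "doc_b"]    # e.g. from vector search

print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
# ['doc_a', 'doc_b', 'doc_d', 'doc_c']
```

Documents that rank well in both lists rise to the top, even if neither search put them first.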

Hallucination

Fix:

✅ Strict prompts

✅ Low temperature

✅ Better retrieval

Large PDFs

Fix:

✅ Chunking strategy

✅ Metadata filtering

Advanced RAG Techniques

Multi-Vector Retrieval

One chunk → multiple embeddings.

Better retrieval.

HyDE

Generate hypothetical answer first.

Then search.
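The HyDE flow can be sketched with the three steps passed in as plain functions. Everything below is a stand-in: fake_generate, fake_embed, and fake_search are made-up stubs so the example runs on its own. In the real pipeline they would be the LLM call, the embedding model's embed_query, and the Milvus search.

```python
def hyde_retrieve(question, generate_answer, embed, search, top_k=5):
    """HyDE: embed a hypothetical answer instead of the raw question, then search."""
    hypothetical_answer = generate_answer(question)
    query_vector = embed(hypothetical_answer)
    return search(query_vector, top_k)


# Stubs that only imitate the shape of the real components.
def fake_generate(question):
    return "RAG retrieves documents and passes them to an LLM as context."


def fake_embed(text):
    return [float(len(text)), float(text.count(" "))]


def fake_search(vector, top_k):
    return [f"chunk-{i}" for i in range(top_k)]


print(hyde_retrieve("What is RAG?", fake_generate, fake_embed, fake_search, top_k=2))
```

The design point is that a full hypothetical answer often looks more like the stored chunks than a short question does, which can help the vector search.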

RAPTOR

Hierarchical retrieval tree.

Better long document understanding.

Semantic Routing

Route query dynamically.

ColBERT

Token-level retrieval.

Highly accurate.

Final Thoughts

Basic RAG:

Retrieve → Generate

Production RAG:

Retrieve
→ Rerank
→ Evaluate
→ Monitor
→ Improve

That is how enterprise AI systems are built 🚀
