In the indexing stage of RAG, the core idea is to transform raw documents into searchable index data that can be consumed later. I usually break it down into 6 steps:
- Load documents: Bring the data in first, such as local files, databases, CMS content, or parsed PDF results. This answers the question: “Where does the data come from?”
- Document preprocessing: Clean, deduplicate, and normalize the format. This solves the problem that raw data often cannot be used directly.
- Document chunking: Split long documents into chunks of appropriate granularity. This answers the question: “What unit of knowledge should be inserted into the vector database?”
- Metadata enrichment: Add metadata to each chunk, such as title, source, tags, and access permissions. This makes later filtering, tracing, and display easier.
- Vectorization: Use an embedding model to convert each chunk into a vector. This solves the problem of turning text into a searchable semantic representation.
- Write to index storage: Finally, write the vectors and metadata into a vector database so they can be used later during the retrieval stage.
In other words, the essence of indexing is to transform raw documents into structured, semantically searchable index data that can be directly consumed by the retrieval pipeline.
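To make the six steps concrete, here is a minimal sketch of the whole flow in Python. Everything in it is an assumption for illustration: `load_documents` returns hard-coded data, `toy_embed` is a hash-based stand-in and not a real embedding model, and the "vector database" is just an in-memory list.

```python
import hashlib
import re
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def load_documents() -> list[dict]:
    # Step 1: stand-in for reading files, database rows, or parsed PDFs.
    return [{
        "id": "doc-1",
        "title": "Refund policy",
        "text": "Refunds are issued within 14 days.\n\n\nContact support to start a refund.",
    }]


def preprocess(text: str) -> str:
    # Step 2: collapse runs of whitespace; real cleaning is usually richer.
    return re.sub(r"\s+", " ", text).strip()


def chunk(text: str, size: int = 40) -> list[str]:
    # Step 3: naive fixed-size split, used here only to keep the sketch short.
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_embed(text: str, dims: int = 8) -> list[float]:
    # Step 5: NOT a real embedding model; it maps hash bytes to a fixed-length vector.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dims]]


def build_index() -> list[IndexEntry]:
    index: list[IndexEntry] = []
    for doc in load_documents():
        clean = preprocess(doc["text"])
        for i, piece in enumerate(chunk(clean)):
            # Step 4: attach metadata to every chunk.
            meta = {"sourceDocumentId": doc["id"], "title": doc["title"], "chunkIndex": i}
            # Step 6: write vector + metadata to the (in-memory) index.
            index.append(IndexEntry(piece, toy_embed(piece), meta))
    return index


index = build_index()
```

In a real project each function would be swapped for a proper loader, chunker, embedding model, and vector store, but the order of the steps stays the same.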
Chunking directly affects RAG performance.
The core reason is that RAG does not retrieve the entire document. It retrieves the individual chunks.
Therefore, how we split the document largely determines what kind of context the model can access later.
- If the chunks are too large, relevant information may be diluted, and retrieval becomes less focused.
- If the chunks are too small, the surrounding context is easily lost.
- If the chunks do not follow semantic boundaries, for example if a sentence is forcibly split apart, the retrieved content will read unnaturally and be harder for the model to use.
So essentially, chunking affects two things:
- Retrieval granularity
- Context quality
Common chunking strategies include:
- Fixed-size chunking: The simplest approach. It is easy to implement and suitable for fast prototyping, but it can easily break semantic meaning.
- Fixed-size chunking with overlap: Adds overlap on top of fixed-size chunks to reduce the risk of information being cut off at chunk boundaries.
- Recursive chunking: First splits by larger structures such as headings and paragraphs, then continues splitting if a chunk is still too large. This is more suitable for structured documents.
- Semantic chunking: Splits based on semantic boundaries. It emphasizes the semantic completeness of each chunk and usually produces better results, but it is also more complex to implement.
Therefore, chunking determines the smallest retrieval unit of the knowledge that enters the vector database, so it directly affects retrieval quality and the final answer quality.
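As a reference point, fixed-size chunking with overlap can be sketched in a few lines. This is a character-based sketch under my own assumptions (the function name `chunk_with_overlap` and its parameters are hypothetical); production systems often count tokens instead of characters.

```python
def chunk_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
    return chunks


plain = chunk_with_overlap("abcdefghij", size=4, overlap=0)
overlapped = chunk_with_overlap("abcdefghij", size=4, overlap=2)
```

With `overlap=0` the result is `["abcd", "efgh", "ij"]`. With `overlap=2` it is `["abcd", "cdef", "efgh", "ghij"]`, so text near a boundary appears in two neighboring chunks.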
Fixed-size chunking is often not enough, because it only considers length, not semantics. It does not care whether a sentence is split in half, or whether a paragraph belongs to the same complete topic. So although it is simple, it can easily introduce two problems:
- Incomplete chunk semantics
- Retrieved context feels unnatural
This is especially true for technical documents, FAQs, and tutorials, which usually have a clear structure. If we always split them by fixed length, we may waste the original structural information.
- Semantic Chunking focuses more on splitting by semantic boundaries. The goal is to make each chunk a relatively complete semantic unit. It is more suitable for scenarios where context completeness matters, such as knowledge articles, FAQs, and concept explanations.
- Recursive Chunking is more like a practical engineering compromise. It first splits by larger structures such as headings and paragraphs. If the chunk is still too large, it continues splitting into smaller parts. It is more suitable for well-structured content such as Markdown, policy documents, and technical documentation.
In one sentence:
Fixed-size chunking is good as a baseline. Recursive Chunking is more suitable for most engineering scenarios. Semantic Chunking is better when semantic completeness and retrieval quality are more important.
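The recursive idea can be sketched as follows. This is a simplified version under my own assumptions: the separator list is hypothetical, the separator itself is dropped when splitting, and small pieces are not merged back together, which real splitters usually do.

```python
def recursive_chunk(text: str, max_len: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split by the coarsest separator first; recurse only into pieces that are still too long."""
    text = text.strip()
    if len(text) <= max_len:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep in text:
            chunks: list[str] = []
            for part in text.split(sep):
                # Pieces no longer contain `sep`, so only finer separators remain useful.
                chunks.extend(recursive_chunk(part, max_len, separators[i + 1:]))
            return chunks
    # No separator left: fall back to a hard fixed-size cut.
    return [text[j:j + max_len] for j in range(0, len(text), max_len)]


doc = (
    "# Title\n\n"
    "Short paragraph.\n\n"
    "This paragraph is a bit longer than the limit. It has two sentences."
)
pieces = recursive_chunk(doc, max_len=50)
```

Here the heading and the short paragraph stay intact, and only the long paragraph is split further, at the sentence boundary.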
I think metadata is very important in RAG.
The core reason is that vectors can only represent semantic similarity, but many retrieval decisions cannot rely on semantics alone.
For example, the system also needs to know:
- Where this piece of content comes from
- Whether it is the latest version
- Whether the current user has permission to access it
- Which knowledge base or category it belongs to
- How it should be displayed after retrieval
All of these depend on metadata.
So we can think of metadata as structured context attached to each chunk.
In the retrieval stage, metadata mainly has three roles:
- Filtering
For example, filtering content by knowledge base, language, permission, or time range.
- Display and explanation
For example, telling the user which document, chapter, or title this piece of content comes from.
- Ranking and governance
For example, prioritizing official documents, prioritizing the latest version, or down-ranking certain sources.
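A small sketch of these three roles working together is shown below. The field names (`kb`, `allowed_roles`, `official`), the `score` values, and the `official_boost` parameter are all assumptions for illustration; in practice the score comes from vector similarity and the filter is usually pushed down into the vector database query.

```python
candidates = [
    {"title": "Refund policy", "source": "policies/refund.md", "kb": "support",
     "allowed_roles": {"staff", "customer"}, "official": True, "score": 0.78},
    {"title": "Forum thread on refunds", "source": "forum/123", "kb": "support",
     "allowed_roles": {"staff", "customer"}, "official": False, "score": 0.83},
    {"title": "Internal escalation guide", "source": "internal/escalation.md", "kb": "support",
     "allowed_roles": {"staff"}, "official": True, "score": 0.90},
    {"title": "Sales playbook", "source": "sales/playbook.md", "kb": "sales",
     "allowed_roles": {"staff"}, "official": True, "score": 0.95},
]


def search(chunks: list[dict], kb: str, role: str, official_boost: float = 0.1) -> list[str]:
    # Filtering: keep only chunks from the right knowledge base that this role may see.
    visible = [c for c in chunks if c["kb"] == kb and role in c["allowed_roles"]]
    # Ranking and governance: give official sources a small boost over raw similarity.
    ranked = sorted(
        visible,
        key=lambda c: c["score"] + (official_boost if c["official"] else 0.0),
        reverse=True,
    )
    # Display and explanation: show where each result came from.
    return [f'{c["title"]} ({c["source"]})' for c in ranked]


customer_view = search(candidates, kb="support", role="customer")
staff_view = search(candidates, kb="support", role="staff")
```

The customer never sees the internal guide, and the official policy ranks above the forum thread even though its raw similarity score is lower.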
Common metadata fields usually include:
- Source information: document ID, source path, URL, knowledge base
- Structural information: title, chapter name, chunkIndex, sourceDocumentId
- Business information: tags, category, department, permission level
- Freshness information: created time, updated time, version number
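One possible way to keep these fields consistent is a small schema object. The sketch below is only an assumption about how the fields might be grouped; the names follow Python naming conventions (so `chunkIndex` becomes `chunk_index`), and the default values are placeholders.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ChunkMetadata:
    # Source information
    source_document_id: str
    source_path: str
    knowledge_base: str
    # Structural information
    title: str
    chunk_index: int
    # Business information
    tags: list[str] = field(default_factory=list)
    permission_level: str = "public"
    # Freshness information
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    version: int = 1


meta = ChunkMetadata(
    source_document_id="doc-42",
    source_path="docs/refund-policy.md",
    knowledge_base="support",
    title="Refund policy",
    chunk_index=0,
    tags=["billing"],
)
payload = asdict(meta)  # plain dict, ready to store next to the vector
```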
In short:
Metadata makes a chunk more than just “a piece of text”. It turns it into a knowledge unit with source, structure, and business context.
In real projects, capabilities such as filtering, permission control, display, and ranking all heavily depend on metadata.
If the content in a knowledge base keeps changing, then indexing should not be treated as a one-time setup. It should be treated as a continuously maintained process.
Usually, we need to handle three things at the same time: deduplication, cleaning, and incremental updates.
- Deduplication
Deduplication prevents repeated knowledge from being written into the vector database. It usually needs to be handled at two levels:
- Document-level deduplication: for example, deduplicating by document ID, source path, or content hash.
- Chunk-level deduplication: for example, removing repeated headers, footers, template text, or duplicated chunks.
- Cleaning
Cleaning prevents dirty data and noise from entering the index.
A common approach is to clean the raw document before chunking, such as removing empty content, garbled text, template text, and OCR noise.
After chunking, we can also apply lightweight filtering, such as removing empty chunks.
- Incremental updates
Incremental updates avoid rebuilding the entire index every time.
Usually, we use signals such as update time, version number, or content hash to detect which documents have changed. Then we handle them in three cases:
- New documents: create new index entries directly.
- Modified documents: re-chunk the document, regenerate embeddings, and overwrite the old index entries.
- Deleted documents: delete the corresponding vectors as well.
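The three concerns above can be combined into one planning step. The sketch below uses a content hash as the change signal; the function names and the shape of the stored state (a mapping from document ID to the hash saved at the last indexing run) are my own assumptions. Duplicates are simply skipped here; a fuller version would also remove a duplicate that had been indexed earlier.

```python
import hashlib
import re


def clean(text: str) -> str:
    # Cleaning: drop empty lines and collapse whitespace before hashing or chunking.
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_updates(indexed: dict[str, str], incoming: dict[str, str]) -> dict[str, list[str]]:
    """indexed: doc ID -> hash from the last run; incoming: doc ID -> current raw text."""
    plan: dict[str, list[str]] = {
        "new": [], "modified": [], "deleted": [], "unchanged": [], "duplicate": [],
    }
    seen_hashes: set[str] = set()
    for doc_id, raw in incoming.items():
        digest = content_hash(clean(raw))
        if digest in seen_hashes:
            plan["duplicate"].append(doc_id)  # same content under another ID
            continue
        seen_hashes.add(digest)
        if doc_id not in indexed:
            plan["new"].append(doc_id)            # create index entries
        elif indexed[doc_id] != digest:
            plan["modified"].append(doc_id)       # re-chunk, re-embed, overwrite
        else:
            plan["unchanged"].append(doc_id)      # nothing to do
    # Anything indexed before but missing now should have its vectors removed.
    plan["deleted"] = [doc_id for doc_id in indexed if doc_id not in incoming]
    return plan


previous = {
    "a": content_hash(clean("Hello   world")),
    "b": content_hash(clean("Old text")),
    "c": content_hash(clean("Gone")),
}
incoming = {"a": "Hello world\n\n", "b": "New text", "d": "Brand new", "e": "Brand  new"}
plan = plan_updates(previous, incoming)
```

Because the hash is computed on cleaned text, whitespace-only edits such as the one in document `a` do not trigger re-embedding.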