[GenAI] About Indexing
Zhentiw · 2026-05-21 · via 博客园 - Zhentiw

In the indexing stage of RAG, the core idea is to transform raw documents into searchable index data that can be consumed later. I usually break it down into 6 steps:

  1. Load documents: Bring the data in first, such as local files, databases, CMS content, or parsed PDF results. This answers the question: “Where does the data come from?”
  2. Document preprocessing: Clean, deduplicate, and normalize the format. This solves the problem that raw data often cannot be used directly.
  3. Document chunking: Split long documents into chunks of appropriate granularity. This answers the question: “What unit of knowledge should be inserted into the vector database?”
  4. Metadata enrichment: Add metadata to each chunk, such as title, source, tags, and access permissions. This makes later filtering, tracing, and display easier.
  5. Vectorization: Use an embedding model to convert each chunk into a vector. This solves the problem of turning text into a searchable semantic representation.
  6. Write to index storage: Finally, write the vectors and metadata into a vector database so they can be used later during the retrieval stage.

In other words, the essence of indexing is to transform raw documents into structured, semantically searchable index data that can be directly consumed by the retrieval pipeline.
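
The six steps can be sketched end to end as follows. This is only a minimal illustration, not a production pipeline: every name in it (`Chunk`, `load_documents`, `preprocess`, `chunk`, `toy_embed`, `build_index`) is hypothetical, the in-memory list stands in for a vector database, and `toy_embed` is a hash-based placeholder that carries no semantic meaning. A real system would call an actual embedding model there.

```python
import hashlib
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)
    vector: list[float] = field(default_factory=list)


def load_documents() -> list[dict]:
    # Step 1: stand-in loader; a real one would read files, a database, or a CMS.
    return [{"id": "doc-1", "title": "Intro",
             "text": "RAG  has two stages.\n\nIndexing comes first."}]


def preprocess(text: str) -> str:
    # Step 2: collapse whitespace inside each paragraph, keep paragraph breaks.
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in text.split("\n\n")]
    return "\n\n".join(p for p in paragraphs if p)


def chunk(text: str) -> list[str]:
    # Step 3: simplest possible chunking, one chunk per paragraph.
    return text.split("\n\n")


def toy_embed(text: str, dims: int = 8) -> list[float]:
    # Step 5: placeholder for an embedding model (assumption: NOT semantic).
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dims]]


def build_index() -> list[Chunk]:
    index: list[Chunk] = []  # Step 6: a list stands in for the vector database.
    for doc in load_documents():
        for i, piece in enumerate(chunk(preprocess(doc["text"]))):
            # Step 4: attach metadata to every chunk.
            metadata = {"sourceDocumentId": doc["id"],
                        "title": doc["title"],
                        "chunkIndex": i}
            index.append(Chunk(piece, metadata, toy_embed(piece)))
    return index
```

Calling `build_index()` returns one `Chunk` per paragraph, each carrying its text, its metadata, and its vector, which is the shape the retrieval stage would consume.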


Chunking directly affects RAG performance.

The core reason is that RAG does not retrieve the entire document. It retrieves the individual chunks.

Therefore, how we split the document largely determines what kind of context the model can access later.

  • If the chunks are too large, relevant information may be diluted, and retrieval becomes less focused.
  • If the chunks are too small, the surrounding context is easily lost.
  • If the chunks do not follow semantic boundaries, for example if a sentence is forcibly split apart, the retrieved content reaches the model as a fragment that reads unnaturally.

So essentially, chunking affects two things:

  1. Retrieval granularity
  2. Context quality

Common chunking strategies include:

  1. Fixed-size chunking: The simplest approach. It is easy to implement and suitable for fast prototyping, but it can easily break semantic meaning.
  2. Fixed-size chunking with overlap: Adds overlap on top of fixed-size chunks to reduce the risk of information being cut off.
  3. Recursive chunking: First splits by larger structures such as headings and paragraphs, then continues splitting if the chunk is still too large. This is more suitable for structured documents.
  4. Semantic chunking: Splits based on semantic boundaries. It emphasizes the semantic completeness of each chunk and usually produces better results, but it is also more complex to implement.

Therefore, chunking determines the smallest retrieval unit of the knowledge that enters the vector database, so it directly affects retrieval quality and the final answer quality.
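
As a rough illustration of the first two strategies, here is a minimal character-based sketch. The function name `chunk_fixed` and its parameters are hypothetical, and measuring size in characters is an assumption made for simplicity. Real pipelines often count tokens instead.

```python
def chunk_fixed(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters, so content near a
    boundary appears in both neighbours.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With `overlap=0` this is plain fixed-size chunking. Raising `overlap` trades extra storage and embedding cost for a lower chance that a sentence is only ever seen in two halves.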


Fixed-size chunking is often not enough, because it only considers length, not semantics. It does not care whether a sentence is split in half, or whether a paragraph belongs to the same complete topic. So although it is simple, it can easily introduce two problems:

  1. Incomplete chunk semantics
  2. Retrieved context feels unnatural

This is especially true for technical documents, FAQs, and tutorials, which usually have a clear structure. If we always split them by fixed length, we may waste the original structural information.

  • Semantic Chunking focuses more on splitting by semantic boundaries. The goal is to make each chunk a relatively complete semantic unit. It is more suitable for scenarios where context completeness matters, such as knowledge articles, FAQs, and concept explanations.
  • Recursive Chunking is more like a practical engineering compromise. It first splits by larger structures such as headings and paragraphs. If the chunk is still too large, it continues splitting into smaller parts. It is more suitable for well-structured content such as Markdown, policy documents, and technical documentation.
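
To make the recursive idea concrete, here is a minimal sketch. It assumes a hypothetical separator hierarchy (blank line, then line break, then space) and a character-based `max_len`. The names `SEPARATORS` and `chunk_recursive` are illustrative and not taken from any library.

```python
SEPARATORS = ("\n\n", "\n", " ")  # assumed hierarchy: paragraph, line, word


def chunk_recursive(text: str, max_len: int,
                    separators: tuple[str, ...] = SEPARATORS) -> list[str]:
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_len:
        return [text]
    if not separators:
        # No structure left to split on: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    parts = [p for p in text.split(sep) if p.strip()]
    if len(parts) == 1:
        # This separator does not occur; try the next, finer one.
        return chunk_recursive(text, max_len, finer)

    chunks, current = [], ""
    for part in parts:
        candidate = current + sep + part if current else part
        if len(candidate) <= max_len:
            current = candidate  # keep merging neighbours while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_len:
            current = part
        else:
            # A single part is still too large: split it at a finer level.
            chunks.extend(chunk_recursive(part, max_len, finer))
    if current:
        chunks.append(current)
    return chunks
```

The design choice here is to split at the coarsest boundary first and only descend to finer separators for pieces that are still too large, so headings and short paragraphs stay together whenever they fit.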

In one sentence:

Fixed-size chunking is good as a baseline. Recursive chunking suits most engineering scenarios. Semantic chunking is the better choice when semantic completeness and retrieval quality matter most.


I think metadata is very important in RAG.

The core reason is: vectors can only represent semantic similarity, but many retrieval decisions cannot rely on semantics alone.

For example, the system also needs to know:

  • Where this piece of content comes from
  • Whether it is the latest version
  • Whether the current user has permission to access it
  • Which knowledge base or category it belongs to
  • How it should be displayed after retrieval

All of these depend on metadata.

So we can understand metadata as structured context added to each chunk.

In the retrieval stage, metadata mainly has three roles:

  1. Filtering
    For example, filtering content by knowledge base, language, permission, or time range.
  2. Display and explanation
    For example, telling the user which document, chapter, or title this piece of content comes from.
  3. Ranking and governance
    For example, prioritizing official documents, prioritizing the latest version, or down-ranking certain sources.

Common metadata fields usually include:

  • Source information: document ID, source path, URL, knowledge base
  • Structural information: title, chapter name, chunkIndex, sourceDocumentId
  • Business information: tags, category, department, permission level
  • Freshness information: created time, updated time, version number
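
The three roles can be sketched on top of a small metadata record. Everything below is an assumption for illustration: the field names are snake_case counterparts of a few of the fields listed above, `permission_level` is modelled as an integer where a higher value means more restricted, and `official_boost` is an arbitrary number rather than a recommended value.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChunkMetadata:
    source_document_id: str
    title: str
    chunk_index: int
    knowledge_base: str
    permission_level: int  # hypothetical scale: higher = more restricted
    updated_at: date
    is_official: bool = False


@dataclass
class Hit:
    text: str
    score: float  # similarity score returned by the vector search
    metadata: ChunkMetadata


def filter_hits(hits: list[Hit], knowledge_base: str, user_level: int) -> list[Hit]:
    # Filtering: keep only chunks in the right knowledge base that the user may read.
    return [h for h in hits
            if h.metadata.knowledge_base == knowledge_base
            and h.metadata.permission_level <= user_level]


def rank_hits(hits: list[Hit], official_boost: float = 0.1) -> list[Hit]:
    # Ranking: boost official sources, then prefer more recently updated chunks.
    def key(h: Hit):
        boost = official_boost if h.metadata.is_official else 0.0
        return (h.score + boost, h.metadata.updated_at)
    return sorted(hits, key=key, reverse=True)


def cite(hit: Hit) -> str:
    # Display: tell the user where the content comes from.
    m = hit.metadata
    return f"{m.title} (doc {m.source_document_id}, chunk {m.chunk_index})"
```

In practice many vector databases can apply the filtering step inside the query itself. Doing it in application code here only keeps the sketch self-contained.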

In short:

Metadata makes a chunk more than just “a piece of text”. It turns it into a knowledge unit with source, structure, and business context.

In real projects, capabilities such as filtering, permission control, display, and ranking all heavily depend on metadata.


If the content in a knowledge base keeps changing, then indexing should not be treated as a one-time setup. It should be treated as a continuously maintained process.

Usually, we need to handle three things at the same time: deduplication, cleaning, and incremental updates.

  1. Deduplication
    Deduplication prevents repeated knowledge from being written into the vector database. It usually needs to be handled at two levels:
    • Document-level deduplication: for example, deduplicating by document ID, source path, or content hash.
    • Chunk-level deduplication: for example, removing repeated headers, footers, template text, or duplicated chunks.
  2. Cleaning
    Cleaning prevents dirty data and noise from entering the index.
    A common approach is to clean the raw document before chunking, such as removing empty content, garbled text, template text, and OCR noise.
    After chunking, we can also apply lightweight filtering, such as removing empty chunks.
  3. Incremental updates
    Incremental updates avoid rebuilding the entire index every time.
    Usually, we use signals such as update time, version number, or content hash to detect which documents have changed. Then we handle them in three cases:
    • New documents: create new index entries directly.
    • Modified documents: re-chunk the document, regenerate embeddings, and overwrite the old index entries.
    • Deleted documents: delete the corresponding vectors as well.
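
A minimal sketch of how content hashes can drive both chunk-level deduplication and incremental updates. It assumes the indexer stores one hash per document ID from the previous run. The names `content_hash`, `plan_updates`, and `dedupe_chunks` are hypothetical, and the normalization rule (collapse whitespace, lowercase) is just one possible choice.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_updates(indexed: dict[str, str],
                 current: dict[str, str]) -> dict[str, list[str]]:
    """Compare the previous run with the current documents.

    indexed: document ID -> content hash stored at the last indexing run
    current: document ID -> current document text
    """
    plan: dict[str, list[str]] = {"new": [], "modified": [], "deleted": []}
    for doc_id, text in current.items():
        if doc_id not in indexed:
            plan["new"].append(doc_id)        # create index entries
        elif indexed[doc_id] != content_hash(text):
            plan["modified"].append(doc_id)   # re-chunk, re-embed, overwrite
    # Anything indexed before but missing now must have its vectors removed.
    plan["deleted"] = [doc_id for doc_id in indexed if doc_id not in current]
    return plan


def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop empty chunks and repeats, keeping the first occurrence."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        normalized = " ".join(chunk.split()).lower()
        if not normalized:
            continue  # lightweight post-chunking filter: skip empty chunks
        key = content_hash(normalized)
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```

Unchanged documents appear in none of the three lists, so they are skipped entirely, which is what avoids a full rebuild.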