今日开源[第13期]turbovec

博客园 - zhang-yd

今日开源[第16期]soxoj/maigret 论文解读-《Dual-Kernel Graph Community Contrastive Learning》今日开源[第15期]agent-skills 论文解读-《Hyperbolic Continuous Structural Entropy for Hierarchical Clustering》今日开源[第14期]google/skills 今日开源[第12期]LiteParse 今日开源[第11期]OmniVoice-Studio 今日开源[第10期]ds4（DwarfStar）今日开源[第9期]graphify 今日开源[第8期]open-notebook 今日开源[第7期]spec-kit 今日开源[第6期]Production Agentic RAG Course 今日开源[第5期]Headroom 今日开源[第4期]OpenTalking 今日开源[第3期]train-llm-from-scratch 今日开源[第2期]Project N.O.M.A.D. 今日开源[第1期]MoneyPrinterTurbo LearningCell代码解读论文解读-《It Takes a Graph to Know a Graph Rewiring for Homophily with a Reference Graph》论文解读-《Mitigating Over-Squashing in Graph Neural Networks by Spectrum-Preserving Sparsification》论文解读-《Make Heterophily Graphs Better Fit GNN A Graph Rewiring Approach》论文解读-《Temporal Graph Rewiring with Expander Graphs 》论文解读-《Understanding Oversquashing in GNNs through the Lens of Effective Resistance》论文解读-《Homophily-oriented Heterogeneous Graph Rewiring》论文-Deep appearance modeling: A survey 代码阅读笔记-nanoclaw 代码阅读笔记-OpenManus 论文解读-《An Empirical Evaluation of Rewiring Approaches in Graph Neural Networks》论文解读-《Probabilistic Graph Rewiring via Virtual Nodes》论文解读-《Probabilistically Rewired Message-Passing Neural Networks》论文解读-《Joint Graph Rewiring and Feature Denoising via Spectral Resonance》代码阅读笔记-nanobot 论文解读-《Oversquashing in GNNs through the lens of information contraction and graph expansion》论文解读-《GNNs Getting ComFy Community and Feature Similarity Guided Rewiring》 - zhang-yd 论文解读-《PANDA Expanded Width-Aware Message Passing Beyond Rewiring》代码阅读笔记-AiPyApp 论文解读-《Deep Graph Contrastive Representation Learning》论文解读-《Community-Invariant Graph Contrastive Learning》论文解读-《DiffWire Inductive Graph Rewiring via the Lovász Bound》论文解读-《The Effectiveness of Curvature-Based Rewiring and the Role of Hyperparameters in GNNs Revisited》论文解读-《Over-Squashing in GNNs and Causal Inference of Rewiring Strategies》论文解读-《Uncertainty-Aware Graph Structure Learning》

今日开源[第13期]turbovec

zhang-yd · 2026-06-12 · via 博客园 - zhang-yd

turbovec 项目深度分析报告

分析日期：2026-06-11
分析对象：RyanCodrai/turbovec

1. 项目介绍

1.1 项目概览

turbovec 是一个基于 Rust 实现的高性能向量索引库，提供 Python 绑定。它构建在 Google Research 的 TurboQuant 算法之上——这是一种数据无关（data-oblivious）的量化器，其失真度逼近 Shannon 下界，且无需码本训练、无需单独的训练阶段。

核心价值：一个 1000 万向量的语料库，以 float32 存储需要 31 GB 内存；使用 turbovec 可压缩至 4 GB，且搜索速度比 FAISS 更快。

1.2 项目基础信息

项目	内容
项目名	turbovec
项目地址	https://github.com/RyanCodrai/turbovec
项目官网	https://github.com/RyanCodrai/turbovec
许可证	MIT
编程语言	Rust (核心库) + Python (绑定)
当前版本 (Rust)	v0.8.1
当前版本 (Python)	v0.7.1
关键词	vector-search, quantization, nearest-neighbor, ann, simd

1.3 项目架构示意图

┌─────────────────────────────────────────────────────────────┐
│                       Python 调用层                           │
│  (pip install turbovec)                                       │
│  ┌───────────────┐    ┌────────────────┐                    │
│  │ TurboQuantIndex │    │   IdMapIndex    │                    │
│  │ (位置索引)      │    │ (稳定ID映射)     │                    │
│  └──────┬────────┘    └────────┬─────────┘                    │
└─────────┼──────────────────────┼──────────────────────────────┘
          │ PyO3 / maturin       │
          ▼                      ▼
┌─────────────────────────────────────────────────────────────┐
│                      Rust 核心库 (turbovec-core)              │
│                                                               │
│  ┌──────────┐   ┌──────────┐   ┌───────────┐   ┌──────────┐  │
│  │ rotation │   │ codebook │   │  encode   │   │  search  │  │
│  │ 随机旋转 │   │ Lloyd-Max│   │  编码压缩 │   │ SIMD检索 │  │
│  └────┬─────┘   └────┬─────┘   └────┬──────┘   └────┬─────┘  │
│       │              │               │               │        │
│       └──────────────┼───────────────┼───────────────┘        │
│                      ▼               ▼                        │
│              ┌──────────────────────────────┐                 │
│              │     TurboQuantIndex           │                 │
│              │  (lib.rs 主结构体 + OnceLock) │                 │
│              └──────────────────────────────┘                 │
│                     │                                         │
│                     ▼                                         │
│              ┌───────────────┐  ┌─────────┐                   │
│              │     pack.rs    │  │   io.rs  │                  │
│              │  位打包模块    │  │  序列化 │                   │
│              └───────────────┘  └─────────┘                   │
└──────────────────────────────────────────────────────────────┘

1.4 向量压缩流程示意

float32 向量 v (d维, 4*d 字节)
       │
       ▼  归一化 (norm)  +  随机正交旋转 (d×d 矩阵)
单位向量 u (高维球面上)
       │
       ▼  TQ+ 坐标校准: u_calibrated[d] = (u[d] + shift[d]) * scale_tq[d]
校准后向量
       │
       ▼  Lloyd-Max 标量量化 (2-4 比特/坐标)
离散码本 code[0..d]  (每个坐标 ∈ [0, 2^bits-1])
       │
       ▼  位打包 bit-packing
packed_codes (d*bits/8 字节)
       │
       ▼  长度重归一化尺度: scale = ||v|| / ⟨u, x_hat⟩
scales[0..n] (float32, 每向量一个)
       │
       ▼
索引结构 (packed_codes + scales + TQ+ trailer)

2. 项目亮点

2.1 算法层面

亮点	描述
Data-Oblivious 量化	无需训练阶段、无需码本学习。向量化即插即用，增量添加无需重建索引
Shannon 下界逼近	Lloyd-Max 量化器的失真度在信息论下界的 2.7 倍以内
TQ+ 坐标级校准	首次 `add` 时为每个坐标拟合 2 个标量（shift/scale），将经验 5/95% 分位数映射到理论 Beta 分布分位数，显著改善低维情况的召回
长度重归一化	存储每向量一个 `scale =

2.2 工程层面

亮点	描述
手写 SIMD 内核	ARM 使用 NEON 指令集，x86 使用 AVX-512BW + AVX2 回退；通过 `is_x86_feature_detected!` 运行时门控
速度优于 FAISS	在 Apple M3 Max (ARM) 上，相较 FAISS IndexPQ FastScan 提升 12-20%；在 Xeon 8481C 上 4-bit 配置领先 1-6%
搜索时过滤	支持传入 `mask` 或 `allowlist`，在 SIMD 内核内以 32-向量块粒度短路，避免过度拉取（over-fetching）
Rust + Python 双层	纯 Rust 内核保证内存安全与裸金属性能；通过 PyO3 + maturin 提供 Python 绑定
LangChain / LlamaIndex / Haystack 集成	提供 drop-in 替换类，一行 import 即可替换现有向量存储
并发安全	核心索引结构使用 `std::sync::OnceLock` 延迟初始化缓存（旋转矩阵、Lloyd-Max 质心、SIMD 块布局），多个线程可并发调用 `search`
增量索引	支持增量 `add`，首次 `add` 锁定校准参数后，后续添加无需重建

3. 运行环境与条件

3.1 运行环境

环境	要求
操作系统	Linux / macOS (有 BLAS 支持)；Windows 可通过 `cargo build` 但无 BLAS 加速
Rust 版本	≥ 1.70
Python 版本	≥ 3.9 (PyO3 abi3-py39)
架构	aarch64 (NEON) 或 x86_64 (AVX2/AVX-512)
BLAS 依赖	macOS 使用 Accelerate Framework，Linux 使用 OpenBLAS

3.2 依赖库（主要）

Rust 侧：

依赖	版本	用途
`ndarray`	0.17	多维数组（旋转矩阵乘法）
`rayon`	1.10	数据并行（行级并行化归一化、编码）
`statrs`	0.17	Beta 分布 CDF / 分位数（Lloyd-Max 与 TQ+ 校准）
`faer`	0.20	线性代数辅助
`rand` / `rand_chacha` / `rand_distr`	0.8/0.3/0.4	生成确定性随机旋转矩阵（种子 42）
`pyo3`	0.27	Python 绑定
`numpy` (for Python)	0.27	与 NumPy 数组交互

Python 侧（框架集成）：

框架	安装
LangChain	`pip install turbovec[langchain]`
LlamaIndex	`pip install turbovec[llama-index]`
Haystack	`pip install turbovec[haystack]`
Agno	`pip install turbovec[agno]`

3.3 运行条件与约束

向量维度 必须是 8 的倍数（dim % 8 == 0），这是位打包与 SIMD 块布局的约束
位宽支持 2、3、4 比特每坐标
TQ+ 校准最小样本数：1000。首次 add 若少于 1000 向量则退化为 identity 校准（不影响功能，仅损失 ~1pp 召回）
输入值检查：拒绝 NaN / ±Inf / 绝对值 ≥ 1e16 的坐标，防止 f32 范数计算溢出
x86_64 构建 通过 .cargo/config.toml 统一目标为 x86-64-v3（AVX2 baseline, Haswell 2013+）

3.4 安装与运行示例

Python:

pip install turbovec

from turbovec import TurboQuantIndex, IdMapIndex

# 基础使用
index = TurboQuantIndex(dim=1536, bit_width=4)
index.add(vectors)              # vectors 为 (n, 1536) float32
scores, indices = index.search(queries, k=10)
index.write("my_index.tv")
loaded = TurboQuantIndex.load("my_index.tv")

# 带稳定 ID + 搜索时过滤
idx = IdMapIndex(dim=1536, bit_width=4)
idx.add_with_ids(vectors, ids)  # ids 为 uint64
scores, ids = idx.search(queries, k=10, allowlist=candidate_ids)

Rust:

[dependencies]
turbovec = "0.8"

use turbovec::TurboQuantIndex;

let mut index = TurboQuantIndex::new(1536, 4).unwrap();
index.add(&vectors);
let results = index.search(&queries, 10);
index.write("index.tv").unwrap();

4. 代码架构与核心模块解析

4.1 仓库目录结构

turbovec/
├── Cargo.toml                          # 工作空间根
├── README.md                           # 主文档 (含基准测试与工作原理)
├── CHANGELOG.md
├── CONTRIBUTING.md
├── .cargo/config.toml                  # x86-64-v3 目标配置
├── turbovec/                           # Rust 核心库 (crate)
│   ├── Cargo.toml
│   ├── build.rs                        # BLAS 链接指令发射
│   └── src/
│       ├── lib.rs                      # TurboQuantIndex 主实现
│       ├── id_map.rs                   # IdMapIndex 稳定 ID 封装
│       ├── codebook.rs                 # Lloyd-Max 量化器
│       ├── rotation.rs                 # 随机正交旋转矩阵
│       ├── encode.rs                   # 归一化 + 旋转 + TQ+ + 量化 + 打包
│       ├── pack.rs                     # 位打包工具
│       ├── search.rs                   # NEON/AVX2/AVX-512BW SIMD 搜索内核
│       ├── io.rs                       # .tv / .tvim 序列化格式
│       └── error.rs                    # 构造 / 添加错误类型
├── turbovec-python/                    # Python 绑定 (crate)
│   ├── Cargo.toml
│   ├── pyproject.toml
│   └── src/lib.rs                      # PyO3 封装层
├── docs/
│   ├── api.md                          # API 参考手册
│   └── integrations/{langchain,llama_index,haystack,agno}.md
├── benchmarks/                         # 基准测试脚本与数据
│   ├── download_data.py
│   ├── suite/{speed,recall,compression}_*.py
│   └── results/
└── examples/downstream-smoke/          # 独立 smoke test crate

4.2 代码架构图（模块依赖）

              ┌──────────┐
              │  users   │ (Python / Rust)
              └────┬─────┘
                   │
        ┌──────────┼──────────┐
        ▼          ▼          ▼
  ┌──────────┐ ┌──────────┐ ┌──────────┐
  │ IdMapIdx │ │ lib.rs   │ │ lib.rs   │  ← TurboQuantIndex
  │ (id_map) │ │ (Python) │ │ (Rust)   │
  └────┬─────┘ └────┬─────┘ └────┬─────┘
       │            │            │
       └────────────┼────────────┘
                    ▼
              ┌────────────────┐
              │ TurboQuantIndex│ ← 结构体核心
              │ (lib.rs)       │
              └───┬──┬───┬─────┘
                  │  │   │
     ┌────────────┘  │   └────────────┐
     ▼               ▼                ▼
┌─────────┐    ┌──────────┐    ┌──────────┐
│ rotation│    │ codebook │    │  encode  │
│ matrix  │    │Lloyd-Max │    │ pack+scale│
└─────────┘    └──────────┘    └────┬─────┘
                                     │
                                     ▼
                               ┌──────────┐
                               │  pack.rs  │
                               └──────────┘
              ┌───────────────────────────────┐
              │        search.rs              │
              │  NEON / AVX2 / AVX-512BW 内核│
              │  (per-query LUT + heap top-k) │
              └───────────────────────────────┘
              ┌───────────────────────────────┐
              │           io.rs               │
              │   .tv magic + version + codes │
              │   + scales + TQ+ trailer      │
              └───────────────────────────────┘

4.3 核心模块解析

4.3.1 `lib.rs` — TurboQuantIndex 主结构

位于 turbovec/src/lib.rs。这是整个库的核心数据结构，字段如下：

pub struct TurboQuantIndex {
    dim: Option<usize>,        // 向量维度; None = lazy 模式
    bit_width: usize,          // 每坐标位数: 2/3/4
    n_vectors: usize,          // 已索引向量数
    packed_codes: Vec<u8>,     // 位打包的量化码
    scales: Vec<f32>,          // 每向量的长度重归一化尺度
    tqplus_shift: Vec<f32>,    // TQ+ 每个坐标 shift (首次 add 锁定)
    tqplus_scale: Vec<f32>,    // TQ+ 每个坐标 scale
    rotation: OnceLock<Vec<f32>>,      // 延迟初始化: 旋转矩阵 d×d
    boundaries: OnceLock<Vec<f32>>,    // 延迟初始化: 量化边界
    centroids: OnceLock<Vec<f32>>,     // 延迟初始化: 量化质心
    blocked: OnceLock<BlockedCache>,   // 延迟初始化: SIMD 块布局
}

关键设计：

OnceLock 延迟缓存：旋转矩阵、Lloyd-Max 质心、SIMD 块布局只在首次 search 或显式 prepare() 时初始化一次。后续 add 仅重置 blocked 缓存（因为码本变化），其他缓存不变——这是并发安全的关键。
add 流程：校验输入 → 取/建旋转矩阵 → 取/建码本（边界+质心） → 调用 encode::encode 生成 packed_codes、scales、tqplus_shift/scale → 扩展内部存储 → 重置 blocked。
Lazy 模式：TurboQuantIndex::new_lazy(bit_width) 允许在构造时不确定维度，首次 add_2d 时锁定（从输入数组的列数推断）。

4.3.2 `codebook.rs` — Lloyd-Max 标量量化器

位于 turbovec/src/codebook.rs。该模块实现了对 Beta((d-1)/2, (d-1)/2) 分布的最优标量量化。

核心函数 codebook(bits, dim) -> (Vec<f32>, Vec<f32>)：

fn lloyd_max(bits: usize, dim: usize, max_iter: usize, tol: f64)
    -> (Vec<f32>, Vec<f32>)
{
    let a = (dim as f64 - 1.0) / 2.0;
    let beta = Beta::new(a, a).unwrap();  // 理论分布
    let n_levels = 1usize << bits;         // 2-bit = 4 级, 4-bit = 16 级

    // 迭代: 质心 → 边界中点 → 新质心(区间条件均值)
    for _ in 0..max_iter {
        let boundaries = (0..n_levels-1).map(|i|
            (centroids[i] + centroids[i+1]) / 2.0).collect();
        // 使用 adaptive_simpson 数值积分计算每个区间的条件均值
        let new_centroids = intervals.map(|(lo,hi)|
            integral(x*pdf(x), lo, hi) / (cdf(hi)-cdf(lo)));
        // 收敛检查
        if max_change < tol { break; }
    }
    (boundaries_f32, centroids_f32)
}

要点：使用 adaptive_simpson（自适应 Simpson 积分）精确计算每个量化区间的条件均值，这是 Lloyd-Max 算法的核心步骤；分布是对称 Beta(a,a) 在 [-1, 1] 上的平移/缩放。

4.3.3 `encode.rs` — 归一化 + 旋转 + TQ+ + 量化 + 打包

位于 turbovec/src/encode.rs。这是整个算法流程的主干。

关键函数 encode(vectors, n, dim, rotation, boundaries, centroids, bit_width, existing_calibration)：

步骤 1: 归一化 (SIMD norm) —— 对每向量 v 计算 norm = ||v||₂, u = v/norm
步骤 2: 随机旋转 —— u × Rᵀ, R 为确定性种子(42)构造的 d×d 正交矩阵
        旋转后每个坐标独立 ~ Beta((d-1)/2, (d-1)/2)
步骤 3: TQ+ 校准 (若 existing_calibration=None)
        对每个坐标 d, 取经验 5%/95% 分位数 qe_lo/qe_hi
        令 scale_tq[d] = (qc_hi - qc_lo) / (qe_hi - qe_lo)
        令 shift[d]   = qc_lo/scale_tq[d] - qe_lo
        u_calibrated[d] = (u[d] + shift[d]) * scale_tq[d]
        (首次 add 后锁定; 不足 1000 样本退化为 identity)
步骤 4: Lloyd-Max 量化 —— 对 u_calibrated 每个坐标扫描 boundaries 得 code ∈ [0, n_levels-1]
步骤 5: 位打包 —— n_levels ≤ 16 = 2^4, 每坐标 bits 位写入 bit-plane 布局
步骤 6: 长度重归一化 —— fused_quantize_scale_pack 同时计算
        inner = Σ u_original[d] * (centroids[code[d]] / scale_tq[d] - shift[d])
        scale = norm / inner  (存入 scales)

性能细节：

归一化/打包在 ARM 上使用手写 NEON intrinsics（simd_norm / fused_quantize_scale_pack 的 aarch64 分支）
行级并行通过 rayon::par_chunks_mut 展开
旋转矩阵乘法走 ndarray 的 BLAS 后端（macOS Accelerate / Linux OpenBLAS）

4.3.4 `search.rs` — SIMD 内核与 top-k 搜索

位于 turbovec/src/search.rs，约 1800 行，是工程复杂度最高的模块。

搜索主流程（search_with_mask）：

查询 q (nq × d)
  │
  ▼  旋转查询: q_rot = q × Rᵀ
  ▼  反校准: q_calib[d] = q_rot[d] / scale_tq[d]
  ▼  构造 nibble-split LUT: LUT[code] = Σ centroids[code[d]] * q_calib[d]
  ▼  SIMD 块扫描 packed_codes 与 LUT, 计算每向量得分
  ▼  乘以 vec_scales[d] (= 长度重归一化)
  ▼  维护 top-k heap (朴素比较 —— k 通常较小)

架构特有实现：

架构	文件	指令	布局
ARM aarch64	`search.rs` ~L40-150	NEON (`vld1q_u8`, `vqtbl1q_u8`, `vfmaq_f32`)	顺序 32 向量块
x86_64 AVX2	`search.rs` ~L200+	`_mm256_shuffle_epi8`, `_mm256_fmadd_ps`	FAISS-style perm0-interleaved
x86_64 AVX-512BW	`search.rs` ~L300+	`_mm512_broadcast_i64x4`, 512-bit 双块	并行 2×32 向量块

搜索时过滤（mask / allowlist）：

// 在 32-向量块粒度, 先测试整块是否全被过滤, 若是则跳过整个 SIMD 循环
if !block_has_allowed(mask, base_vec) {
    continue;  // 跳过该块 —— 零计算开销
}
// 块内部: 对每个 256/512 位 lane 做普通打分
// heap 插入前再做单槽检查 (mask_allows)
if let Some(m) = mask {
    if !mask_allows(m, base_vec + lane) { continue; }
}

这是 turbovec 与其他向量索引库的关键区别：过滤发生在 SIMD 循环内部而非搜索后，不会因 filter 严苛而显著变慢。

4.3.5 `id_map.rs` — IdMapIndex 稳定外部 ID 封装

IdMapIndex 在 TurboQuantIndex 之上添加了一个 HashMap<u64, usize>（外部 id → 内部 slot）与反向映射 Vec<u64>（slot → 外部 id）。

add_with_ids(vectors, ids)：内部调用 TurboQuantIndex::add_2d，然后填入 id↔slot 双向表
remove(id)：从 HashMap 取出 slot → 调用 swap_remove(slot) 并修正另一被移动向量的映射
search_with_allowlist(queries, k, allowlist)：将 u64 id 数组转为 slot 级 bool mask，然后走 TurboQuantIndex::search_with_mask

4.3.6 `io.rs` — 序列化格式

.tv 格式 (TurboQuantIndex):
┌──────────────┐
│ magic "TVPI" │ 4 字节
│ version u8   │ 1 字节  (=3)
├──────────────┤
│ bit_width u8 │
│ dim u32 LE   │    // 0 表示 lazy empty
│ n_vectors u32 LE │
├──────────────┤
│ packed_codes │ (dim/8)*bit_width*n_vectors 字节
├──────────────┤
│ scales       │ n_vectors * 4 字节 (f32 LE)
├──────────────┤
│ TQ+ trailer  │ n_calib u32 LE + shift[n_calib] + scale[n_calib]
└──────────────┘

.tvim 格式 (IdMapIndex):
┌──────────────┐
│ magic "TVIM" │
│ version u8   │
├──────────────┤
│ 同 .tv 的完整 payload │
├──────────────┤
│ slot_to_id   │ n_vectors * 8 字节 (u64 LE)
└──────────────┘

4.3.7 `turbovec-python/src/lib.rs` — Python 绑定层

使用 #[pyclass] 宏包装 TurboQuantIndex 与 IdMapIndex：

通过 numpy::PyReadonlyArray2<f32> 直接零拷贝访问 NumPy 数组（要求 C 连续）
search 返回 (PyArray2<f32>, PyArray2<i64/u64>) 的 (nq, effective_k) 形状
Python 层的错误（维度不匹配、非连续数组、allowlist 中含未知 id）转为 ValueError / KeyError / IndexError

4.4 核心代码节选

TurboQuantIndex::new 与 add 流程

// turbovec/src/lib.rs — 构造
pub fn new(dim: usize, bit_width: usize) -> Result<Self, ConstructError> {
    if !(2..=4).contains(&bit_width) { /* 校验 */ }
    if dim == 0 || dim % 8 != 0 { /* 校验 */ }
    Ok(Self {
        dim: Some(dim), bit_width, n_vectors: 0,
        packed_codes: Vec::new(), scales: Vec::new(),
        tqplus_shift: Vec::new(), tqplus_scale: Vec::new(),
        rotation: OnceLock::new(), boundaries: OnceLock::new(),
        centroids: OnceLock::new(), blocked: OnceLock::new(),
    })
}

// turbovec/src/lib.rs — 添加向量
pub fn add(&mut self, vectors: &[f32]) {
    let dim = self.dim.expect("dim must be set");
    let rotation = self.rotation
        .get_or_init(|| rotation::make_rotation_matrix(dim));
    let (boundaries, centroids) = codebook::codebook(self.bit_width, dim);
    // 首次 add 时, existing_calibration=None → 拟合 TQ+ shift/scale
    // 后续 add 时, existing_calibration=Some(已有的 shift/scale) → 复用
    let existing = if self.tqplus_shift.is_empty() { None }
        else { Some((self.tqplus_shift.as_slice(),
                     self.tqplus_scale.as_slice())) };
    let (packed, scales, shift, scale_tq) = encode::encode(
        vectors, n, dim, rotation, &boundaries, &centroids,
        self.bit_width, existing);
    self.packed_codes.extend_from_slice(&packed);
    self.scales.extend_from_slice(&scales);
    if self.n_vectors == 0 {
        self.tqplus_shift = shift;
        self.tqplus_scale = scale_tq;   // 仅首次锁定
    }
    self.n_vectors += n;
    self.blocked = OnceLock::new();     // 使块布局缓存失效
}

fused_quantize_scale_pack — 编码核心（ARM NEON 路径）

// turbovec/src/encode.rs — aarch64 NEON  fused 量化 + 打包 + 尺度
#[cfg(target_arch = "aarch64")]
#[inline(always)]
fn fused_quantize_scale_pack(
    rot_orig: &[f32], rot_calib: &[f32],
    shift: &[f32], inv_scale_tq: &[f32],
    boundaries: &[f32], centroids: &[f32], norm: f32,
    packed_row: &mut [u8], dim: usize, bits: usize,
    bytes_per_plane: usize) -> f32
{
    use std::arch::aarch64::*;
    let mut inner: f64 = 0.0;
    let chunks = dim / 8;
    unsafe {
        for c in 0..chunks {
            let offset = c * 8;
            let vals_lo = vld1q_f32(rot_calib.as_ptr().add(offset));
            let vals_hi = vld1q_f32(rot_calib.as_ptr().add(offset + 4));
            // NEON 边界扫描: 对每个边界值 b, 比较 8 个坐标
            // 大于 b 的 lane 贡献 1 到累积计数
            let mut acc_lo = vdupq_n_u32(0);
            let mut acc_hi = vdupq_n_u32(0);
            for &b in boundaries {
                let bv = vdupq_n_f32(b);
                acc_lo = vaddq_u32(acc_lo,
                    vshrq_n_u32::<31>(vcgtq_f32(vals_lo, bv)));
                acc_hi = vaddq_u32(acc_hi,
                    vshrq_n_u32::<31>(vcgtq_f32(vals_hi, bv)));
            }
            // 提取 8 个 code 并计算内积 inner (在 original space)
            let counts: [u8; 8] = [ /* vgetq_lane_u32 结果 */ ];
            for k in 0..8 {
                let d = offset + k;
                let centroid_in_orig =
                    (centroids[counts[k] as usize] as f64)
                    * (inv_scale_tq[d] as f64) - (shift[d] as f64);
                inner += (rot_orig[d] as f64) * centroid_in_orig;
            }
            // 打包 8 个 codes 到 bytes_per_plane × bits 个位平面
            let codes_vec = vld1_u8(counts.as_ptr());
            for p in 0..bits {
                let mask = vdup_n_u8(1u8 << p);
                let hit = vcgt_u8(vand_u8(codes_vec, mask),
                                   vdup_n_u8(0));
                packed_row[p*bytes_per_plane + offset/8]
                    = vaddv_u8(vand_u8(hit, weights));
            }
        }
    }
    let inner = inner.max(1e-10) as f32;
    norm / inner  // 返回该向量的 scale
}

NEON 4-bit 搜索块打分

// turbovec/src/search.rs — score_4bit_block_neon (简化示意)
#[cfg(target_arch = "aarch64")]
unsafe fn score_4bit_block_neon(
    blocked_codes: &[u8], uint8_luts: &[u8],
    block_offset: usize, n_byte_groups: usize,
    scale: f32, bias: f32, vec_scales: &[f32],
    base_vec: usize, n_vectors: usize,
    out: &mut [f32; BLOCK])  // BLOCK = 32
{
    let mask = vdupq_n_u8(0x0F);      // 低 nibble mask
    let v_scale = vdupq_n_f32(scale);
    let mut fa = [vdupq_n_f32(bias); 8];  // 8 × 4-float = 32 输出
    let codes_base = blocked_codes.as_ptr().add(block_offset);
    let luts_base = uint8_luts.as_ptr();

    for batch in 0..(n_byte_groups + FLUSH_EVERY - 1) / FLUSH_EVERY {
        let mut accum = [vdupq_n_u16(0); 4];
        // 对每个 byte_group (= 每坐标的一个 bit-plane) 做 LUT 查表
        for g in batch*FLUSH_EVERY .. end {
            let lp0 = luts_base.add(g * 32);
            let cp0 = codes_base.add(g * BLOCK);
            let lut_hi = vld1q_u8(lp0);
            let lut_lo = vld1q_u8(lp0.add(16));
            let c0 = vld1q_u8(cp0);
            // 查表并加总: 每个 8-bit code 的 hi/lo nibble 各贡献 LUT 一项
            let s0 = vaddq_u8(vqtbl1q_u8(lut_lo, vandq_u8(c0, mask)),
                              vqtbl1q_u8(lut_hi, vshrq_n_u8(c0, 4)));
            accum[0] = vaddw_u8(accum[0], vget_low_u8(s0));
            accum[1] = vaddw_u8(accum[1], vget_high_u8(s0));
        }
        // Flush: u16 → f32, 乘以 scale, 叠加 bias 到 fa
        for i in 0..4 {
            let lo = vcvtq_f32_u32(vmovl_u16(vget_low_u16(accum[i])));
            let hi = vcvtq_f32_u32(vmovl_u16(vget_high_u16(accum[i])));
            fa[i*2]   = vfmaq_f32(fa[i*2],   v_scale, lo);
            fa[i*2+1] = vfmaq_f32(fa[i*2+1], v_scale, hi);
        }
    }
    // 最终乘 vec_scales (= 长度重归一化) 写出 32 个得分
    for i in 0..8 {
        let n = vld1q_f32(vec_scales_ptr.add(i*4));
        vst1q_f32(out_ptr.add(i*4), vmulq_f32(fa[i], n));
    }
}

4.5 文件结构总结表

文件	角色	关键结构/函数
`turbovec/src/lib.rs`	核心索引结构	`TurboQuantIndex`, `SearchResults`, `new`, `add`, `search_with_mask`
`turbovec/src/codebook.rs`	量化器	`codebook()`, `lloyd_max()`, `adaptive_simpson()`
`turbovec/src/encode.rs`	编码流水线	`encode()`, `compute_tqplus_calibration()`, `fused_quantize_scale_pack()`
`turbovec/src/rotation.rs`	随机正交旋转	`make_rotation_matrix()`
`turbovec/src/search.rs`	SIMD 搜索内核	`search_with_mask`, NEON/AVX2/AVX-512BW 实现, `block_has_allowed`
`turbovec/src/pack.rs`	位打包工具	辅助 `encode` / `search` 的布局转换
`turbovec/src/io.rs`	序列化	`write`, `load`, .tv/.tvim 格式
`turbovec/src/error.rs`	错误类型	`ConstructError`, `AddError`
`turbovec/src/id_map.rs`	稳定 ID 封装	`IdMapIndex`, `add_with_ids`, `remove`, `search_with_allowlist`
`turbovec-python/src/lib.rs`	PyO3 绑定	`#[pyclass] TurboQuantIndex`, `#[pyclass] IdMapIndex`
`docs/api.md`	API 参考	`.tv/.tvim` 格式说明、方法表

5. 应用场景、优点与不足

5.1 典型应用场景

场景	为什么用 turbovec
RAG 系统内存敏感部署	10M 向量 float32 需 31 GB；turbovec 4-bit 压缩到 4 GB，单机即可承载千万级知识库
边缘 / 本地 RAG	纯本地、无托管服务，配合开源嵌入模型构建完全 air-gapped 的检索增强生成管线
混合检索（向量 + 关键词/ACL/时间窗）	`search(mask=...)` 或 `search(allowlist=...)` 支持在 SIMD 循环内做外部候选集 rerank
LangChain / LlamaIndex / Haystack 用户	drop-in 替换 `InMemoryVectorStore`，一行 import 切换，保留原使用方式
增量向量库	无需训练阶段，`index.add(new_batch)` 随时扩展，无需重建
需要稳定外部 ID 的场景	`IdMapIndex` 支持以 `u64` 作为稳定 ID（例如数据库主键），删除不会影响其他 ID 的映射

5.2 项目优点

算法扎实：直接基于顶会（ICML / NeurIPS 级）理论工作 TurboQuant，并在此基础上贡献了 TQ+ 校准与长度重归一化两项工程改进
性能领先：在 ARM 和 x86 上均持平或优于业界标准 FAISS，特别是 Apple Silicon 平台有 12-20% 显著优势
内存效率高：4-bit 压缩 ≈ 8× 内存缩减；2-bit ≈ 16×
零训练开销：没有代码本训练、没有离线 preprocess，add 直接可用
SIMD 与并发：手写三架构内核（NEON/AVX2/AVX-512BW），OnceLock + rayon 提供零锁并发搜索
搜索时过滤：独特的 mask/allowlist 在 SIMD 内短路实现，对 hybrid retrieval（先 BM25 再 dense rerank）特别友好
语言桥接成熟：PyO3 + maturin + abi3，跨平台 wheel 构建简单；LangChain/LlamaIndex/Haystack/Agno 四大框架官方集成
可序列化：.tv / .tvim 带 magic + version 字节，可跨版本兼容加载；TQ+ 参数作为 trailer 持久化

5.3 项目不足与改进空间

维度约束：dim 必须是 8 的倍数；对例如 dim=384 的 GloVe 向量需要补零，略有浪费
TQ+ 校准是一次性：首次 add 拟合后锁定。若后续批次数据分布显著偏移（例如不同语料），校准不再最优；需要用户手动重建
top-k heap 朴素实现：对较小 k（≤ 100）足够，但对大 k 场景性能可能不如专门的堆/选择算法
无 GPU 后端：当前仅 CPU SIMD；相比 FAISS-GPU / Milvus 等缺乏大规模（> 1 亿向量）GPU 支持
Windows 支持不完善：BLAS 链接指令仅对 macOS/Linux 发射，Windows 下仅能以无 BLAS 方式构建（旋转矩阵乘法会变慢）
磁盘索引规模：当前设计为内存索引，无 mmapped / 分层磁盘存储；对 1 亿+ 超大规模场景需与其他存储层配合
无距离函数选择：仅支持内积（通过归一化等价于余弦相似度）；无 L2 / Hamming 等选项
API 仍在演进：version 字节为 3，文件格式仍有 breaking change 可能（但已通过 magic+version 保障向后加载）
Python 层要求 C 连续数组：非连续 NumPy 会触发 ValueError，用户需 np.ascontiguousarray 转换
缺少生产级工具链：无内置的索引分片、多机分布式、向量更新（仅支持 add/remove）等企业特性

附录：参考链接

资源	链接
GitHub 仓库	https://github.com/RyanCodrai/turbovec
TurboQuant 论文	https://arxiv.org/abs/2504.19874
API 文档	https://github.com/RyanCodrai/turbovec/blob/main/docs/api.md
LangChain 集成	https://github.com/RyanCodrai/turbovec/blob/main/docs/integrations/langchain.md
LlamaIndex 集成	https://github.com/RyanCodrai/turbovec/blob/main/docs/integrations/llama_index.md
Haystack 集成	https://github.com/RyanCodrai/turbovec/blob/main/docs/integrations/haystack.md
crates.io	https://crates.io/crates/turbovec
PyPI	https://pypi.org/project/turbovec/

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

博客园 - zhang-yd

turbovec 项目深度分析报告

1. 项目介绍

1.1 项目概览

1.2 项目基础信息

1.3 项目架构示意图

1.4 向量压缩流程示意

2. 项目亮点

2.1 算法层面

2.2 工程层面

3. 运行环境与条件

3.1 运行环境

3.2 依赖库（主要）

3.3 运行条件与约束

3.4 安装与运行示例

4. 代码架构与核心模块解析

4.1 仓库目录结构

4.2 代码架构图（模块依赖）

4.3 核心模块解析

4.3.1 lib.rs — TurboQuantIndex 主结构

4.3.2 codebook.rs — Lloyd-Max 标量量化器

4.3.3 encode.rs — 归一化 + 旋转 + TQ+ + 量化 + 打包

4.3.4 search.rs — SIMD 内核与 top-k 搜索

4.3.5 id_map.rs — IdMapIndex 稳定外部 ID 封装

4.3.6 io.rs — 序列化格式

4.3.7 turbovec-python/src/lib.rs — Python 绑定层

4.4 核心代码节选

TurboQuantIndex::new 与 add 流程

fused_quantize_scale_pack — 编码核心（ARM NEON 路径）

NEON 4-bit 搜索块打分

4.5 文件结构总结表

5. 应用场景、优点与不足

5.1 典型应用场景

5.2 项目优点

5.3 项目不足与改进空间

附录：参考链接

4.3.1 `lib.rs` — TurboQuantIndex 主结构

4.3.2 `codebook.rs` — Lloyd-Max 标量量化器

4.3.3 `encode.rs` — 归一化 + 旋转 + TQ+ + 量化 + 打包

4.3.4 `search.rs` — SIMD 内核与 top-k 搜索

4.3.5 `id_map.rs` — IdMapIndex 稳定外部 ID 封装

4.3.6 `io.rs` — 序列化格式

4.3.7 `turbovec-python/src/lib.rs` — Python 绑定层