Lance Write Path: Merge Into, Compaction, and Stable Row ID

Aitozi · 2026-05-24 · via 博客园 - Aitozi

Lance's write path involves file layout, version commits, deletion marking, index maintenance, and compaction. Unlike a traditional database, Lance does not modify data in place in the original files; it produces a new table version by adding new files and updating metadata.

This article discusses several implementation issues:

How delete, update, and merge insert are applied to files and metadata.
The role of Deletion Vector in Lance's write path.
How Deletion Vector participates in write conflict detection.
The differences in Deletion Vector design between Lance and Apache Paimon.
How compaction selects fragments and why index remapping is needed.
How to reduce index maintenance costs after compaction.

Basic Assessment

Lance's basic write pattern is to write new data files or deletion files, and then commit a new version through a transaction and a manifest.

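Delete:
  do not rewrite the data file immediately
  -> record the row offsets deleted within the fragment
  -> filter by the deletion vector at read time

Update / Merge:
  write the updated rows or columns as new data
  -> write a deletion vector for the old rows
  -> commit Operation::Update

Compaction:
  read the live rows of the old fragments
  -> write them into new fragments
  -> old row addresses may become invalid
  -> indexes need to be remapped or rebuilt

Stable Row ID:
  make indexes point to stable logical row IDs
  -> after compaction, only the row_id -> row_address mapping needs maintenance
  -> lower index remap cost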

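The deletion vector goes by two names in Lance: in persisted metadata it is usually called DeletionFile, while in memory it is represented as DeletionVector:

DeletionVector:
  in-memory semantics of the set of deleted rows

DeletionFile:
  reference to the file under the table directory that persists a DeletionVector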

Unified Model for Write Paths

From the LanceDB API perspective, the user calls:

table.add(...)
table.delete("id = 42")
table.update(where="id = 42", values={"text": "new"})
table.merge_insert("id").when_matched_update_all().when_not_matched_insert_all()
table.optimize.compact_files()

At the lower level, however, what Lance sees is not "modify a few rows in a table" but a single dataset state transition:

read the current manifest
  -> run scan / join / filter
  -> write new data files / deletion files / index files
  -> build an Operation
  -> write the transaction
  -> commit the new manifest version

A version's manifest describes the complete state of the current table:

Manifest version N
  schema
  fragments
  indices
  version metadata

Fragment 1
  data files
  deletion file
  row id metadata

A transaction describes the change to table state made by one commit. Creating an index, compacting files, or updating configuration therefore produces a new version even when the row count has not changed.
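The version-to-version transition can be sketched as a toy model in Python. This is only an illustration under stated assumptions: `Manifest`, `Fragment`, `commit`, and the file names are hypothetical stand-ins, not Lance's actual types or on-disk names.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Fragment:
    id: int
    data_files: Tuple[str, ...]
    deletion_file: Optional[str] = None  # reference to a persisted deletion vector


@dataclass(frozen=True)
class Manifest:
    version: int
    fragments: Tuple[Fragment, ...]
    indices: Tuple[str, ...] = ()


def commit(current: Manifest, fragments=None, indices=None) -> Manifest:
    """Return version N+1; the manifest of version N is never mutated."""
    return Manifest(
        version=current.version + 1,
        fragments=current.fragments if fragments is None else tuple(fragments),
        indices=current.indices if indices is None else tuple(indices),
    )


# Hypothetical file names, used only for illustration.
v1 = Manifest(version=1, fragments=(Fragment(1, ("data-1.lance",)),))
# A delete only attaches a deletion file to fragment 1; no data file is rewritten.
v2 = commit(v1, fragments=[replace(v1.fragments[0], deletion_file="del-1.arrow")])
# Creating an index changes no rows but still produces a new version.
v3 = commit(v2, indices=["vector_index"])
```

Because every version is an immutable value, a reader that opened version 1 keeps seeing its original fragment list while later commits proceed.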

Delete: Marking Rows Invisible with a Deletion Vector

Delete does not immediately rewrite the data file but marks the targeted rows as deleted.

The simplified call chain is:

delete where id = 42
  -> scan + predicate finds the matching rows
  -> capture row addresses
  -> group by fragment into local row offsets
  -> update fragment.deletion_file
  -> Operation::Delete
  -> CommitBuilder.with_affected_rows(...)
  -> commit the new manifest

Assuming Fragment 7 has 1000 rows:

Fragment 7
  data file: 1000 physical rows
  deletion file: {3, 19, 42}

When reading Fragment 7:
  offsets 3, 19, and 42 are filtered out
  all other rows remain visible
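Read-time filtering for the Fragment 7 example can be sketched as follows. This is a minimal illustration, assuming a fragment is a plain Python list and the deletion vector is a set of local offsets; `scan_fragment` is a hypothetical helper, not a Lance API.

```python
def scan_fragment(rows, deleted_offsets):
    """Yield (offset, row) for live rows only; the stored rows are untouched."""
    for offset, row in enumerate(rows):
        if offset not in deleted_offsets:
            yield offset, row


fragment_7 = [{"id": i} for i in range(1000)]  # 1000 physical rows
deletion_vector = {3, 19, 42}                  # local row offsets marked deleted

live = list(scan_fragment(fragment_7, deletion_vector))
```

The data file still holds all 1000 physical rows; only the scan result shrinks.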

In the source code, the Fragment metadata contains an optional deletion_file field, whose meaning is "the local row offsets deleted within this fragment." DeletionFileType currently has two variants:

Array:
  suited to sparser deletion sets

Bitmap:
  suited to denser deletion sets

Source code entry:

rust/lance-table/src/format/fragment.rs
  DeletionFile
  DeletionFileType
  Fragment.deletion_file

rust/lance-table/src/io/deletion.rs
  write_deletion_file
  read_deletion_file

rust/lance/src/dataset/write/delete.rs
  apply_deletions
  DeleteBuilder

This design has several benefits:

Deleting a small number of rows does not require rewriting the entire fragment.
Old data files on object storage do not need to be rewritten either.
The deletion set can evolve independently; the manifest only needs to point to the new deletion file.
Subsequent compaction can further materialize the deletions.

The corresponding cost falls on the read path, which must load the deletion vector and skip tombstoned rows during scans.

Update: Write a New Row, Then Delete the Old Row

A common implementation of Update is to write the new data after the update, and then tombstone the old row.

The main path of an ordinary update is close to RewriteRows:

update set text = 'new' where id = 42
  -> scan the rows matching the where condition, carrying row id / row address
  -> apply the update expressions to each batch
  -> write_fragments_internal writes out new fragments
  -> apply a deletion vector to the old row addresses
  -> Operation::Update { update_mode: RewriteRows }
  -> CommitBuilder.with_affected_rows(...)

Take an update to just 1 column of a 10-column table as an example:

Before the update:
  Fragment 1
    row offset 42 = (c1, c2, c3, ..., c10)

UPDATE SET c3 = c3_new WHERE id = 42

After the update:
  Fragment 1
    deletion file marks offset 42 deleted

  Fragment 9
    newly written row = (c1, c2, c3_new, ..., c10)

RewriteRows does not rewrite the entire old fragment, but instead writes out the matched rows as new rows. For each updated row, it writes out the complete row.
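The RewriteRows behavior can be mimicked with plain dictionaries. This is a sketch under assumptions: fragments are lists of row dicts, deletion vectors are sets of offsets, and `update_rewrite_rows` is a hypothetical function rather than anything in Lance.

```python
def update_rewrite_rows(fragments, deletions, predicate, updates, new_fragment_id):
    """fragments: {fragment_id: [row dict, ...]}
    deletions: {fragment_id: set of deleted local offsets}
    Returns the affected row addresses as (fragment_id, offset) pairs."""
    new_rows, affected = [], []
    for frag_id, rows in fragments.items():
        deleted = deletions.get(frag_id, set())
        for offset, row in enumerate(rows):
            if offset in deleted or not predicate(row):
                continue
            new_rows.append({**row, **updates})  # the complete updated row
            affected.append((frag_id, offset))
    if new_rows:
        fragments[new_fragment_id] = new_rows    # new fragment; old data untouched
    for frag_id, offset in affected:
        deletions.setdefault(frag_id, set()).add(offset)  # tombstone the old row
    return affected


fragments = {1: [{"id": i, "c3": "old"} for i in range(100)]}
deletions = {}
affected = update_rewrite_rows(
    fragments, deletions, lambda r: r["id"] == 42, {"c3": "c3_new"}, new_fragment_id=9
)
```

The returned `affected` list plays the role of the affected rows handed to the commit layer; the old copy at offset 42 stays physically present in Fragment 1.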

Source code entry:

rust/lance/src/dataset/write/update.rs
  UpdateJob::execute_impl
  scanner.with_row_id()
  write_fragments_internal(...)
  apply_deletions(...)
  Operation::Update { update_mode: RewriteRows }

Lance also has a RewriteColumns mode, used mainly in merge/update scenarios that carry only a partial schema. It targets updates to "wide rows, few columns," but it increases the maintenance cost of fragments, column files, index coverage, and conflict detection.

Merge Insert: Upsert Semantics, Not a Primary Key Constraint

LanceDB's merge_insert can be used for upsert:

(
    table.merge_insert("id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(source)
)

Here id is the join key used to match source and target, not a strongly enforced primary key as in a database. The simplified process is:

source and target are joined by key
  matched:
    update target rows

  not matched:
    insert source rows

  not matched by source:
    keep or delete
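The three branches of the join can be sketched in plain Python. This is an illustrative model only: `merge_insert` below is a hypothetical function over lists of dicts, not the LanceDB builder, and it deliberately performs no uniqueness check on the key.

```python
def merge_insert(target, source, key, delete_not_matched_by_source=False):
    """target/source: lists of row dicts. Returns (result_rows, stats)."""
    source_by_key = {row[key]: row for row in source}
    target_keys = {row[key] for row in target}
    result = []
    stats = {"updated": 0, "inserted": 0, "deleted": 0}
    for row in target:
        if row[key] in source_by_key:        # matched: update all columns
            result.append(dict(source_by_key[row[key]]))
            stats["updated"] += 1
        elif delete_not_matched_by_source:   # not matched by source: delete
            stats["deleted"] += 1
        else:                                # not matched by source: keep
            result.append(row)
    for row in source:
        if row[key] not in target_keys:      # not matched: insert
            result.append(dict(row))
            stats["inserted"] += 1
    return result, stats


target = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
source = [{"id": 2, "text": "B"}, {"id": 3, "text": "c"}]
rows, stats = merge_insert(target, source, "id")
rows_strict, stats_strict = merge_insert(
    target, source, "id", delete_not_matched_by_source=True
)
```

Nothing in this sketch prevents two target rows from sharing the same key, which mirrors the point that the key is a join condition rather than a constraint.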

On the full schema path, merge insert works much like a regular update:

matched rows:
  write the updated complete rows to new fragments
  write a deletion vector for the old target rows

not matched rows:
  write them to new fragments as new rows

commit:
  Operation::Update { update_mode: RewriteRows }

On the partial schema path, it uses RewriteColumns:

source contains only some of the columns
  -> update_fragments(...)
  -> Operation::Update { update_mode: RewriteColumns, fields_modified }

Source code entry:

rust/lance/src/dataset/write/merge_insert.rs
  MergeInsertBuilder
  update_fragments
  Operation::Update { update_mode: RewriteRows | RewriteColumns }

Deletion Vector and write conflict detection

The Deletion Vector not only filters out deleted rows during reads but also allows Lance to reduce some write conflicts from "fragment-level conflicts" to "row-level conflicts".

First, consider the problem that arises without a Deletion Vector and affected rows:

T1:
  delete row (Fragment 1, offset 10)

T2:
  delete row (Fragment 1, offset 20)

If you only look at the fragment level, both transactions modified Fragment 1, so they are easily identified as conflicting. However, they actually deleted different rows, which can be merged:

T1 deletion vector: {10}
T2 deletion vector: {20}

After rebase:
  Fragment 1 deletion vector: {10, 20}

Lance passes the matched row addresses to the commit layer when committing a delete/update:

CommitBuilder.with_affected_rows(RowAddrTreeMap)

This allows conflict detection to determine:

Did the two concurrent transactions really modify the same rows?

If they only modified different rows of the same fragment, and the other transaction changed only the deletion file and not the data files, the current transaction has a chance to rebase. That is, it can write a merged deletion vector on top of the fragment's new deletion file.

Typical case:

Rebasable:
  T1 delete F1.offset 10
  T2 delete F1.offset 20
  -> affected_rows do not overlap
  -> merge the deletion vectors

Not rebasable:
  T1 update F1.offset 10
  T2 delete F1.offset 10
  -> affected_rows overlap
  -> semantic conflict

Not simply rebasable:
  T1 delete F1.offset 10
  T2 compaction/rewrite F1
  -> the fragment's data files are rewritten
  -> row addresses / fragment state change extensively
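The three cases can be expressed as a small decision function. This is a simplified sketch, not the logic in conflict_resolver.rs: `try_rebase` and its parameters are hypothetical, and affected rows are modeled as sets of (fragment_id, offset) pairs.

```python
def try_rebase(my_affected, other_affected, other_rewrote_fragments, current_deletions):
    """my_affected / other_affected: sets of (fragment_id, offset).
    other_rewrote_fragments: fragment ids whose data files the other txn rewrote.
    current_deletions: {fragment_id: set of offsets} as already committed.
    Returns merged deletions per fragment, or None if a rebase is not possible."""
    my_fragments = {frag for frag, _ in my_affected}
    if my_fragments & other_rewrote_fragments:
        return None  # data files were rewritten: captured row addresses are stale
    if my_affected & other_affected:
        return None  # both transactions touched the same rows: semantic conflict
    merged = {frag: set(offsets) for frag, offsets in current_deletions.items()}
    for frag, offset in my_affected:
        merged.setdefault(frag, set()).add(offset)
    return merged


disjoint = try_rebase({(1, 10)}, {(1, 20)}, set(), {1: {20}})
overlapping = try_rebase({(1, 10)}, {(1, 10)}, set(), {1: {10}})
rewritten = try_rebase({(1, 10)}, set(), {1}, {})
```

Only the first call succeeds, producing a deletion vector for Fragment 1 that contains both offsets.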

Source code entry:

rust/lance/src/dataset/write/commit.rs
  CommitBuilder.with_affected_rows

rust/lance/src/io/commit/conflict_resolver.rs
  TransactionRebase
  check_delete_txn
  check_update_txn

Therefore, the Deletion Vector does not eliminate all write conflicts. It provides the expressiveness of row-level affected rows, allowing Lance to avoid some fragment-level false conflicts.

For datasets containing a large number of samples, updates, deletions, and merges often only affect a small number of rows. If concurrency control can only be done at the fragment level, it will judge many non-overlapping row-level modifications as conflicts.

Differences between Lance and Paimon's Deletion Vector

Lance and Apache Paimon both have Deletion Vectors, but the problems they aim to solve are not entirely the same.

Dimension	Lance	Apache Paimon
Main data model	Columnar datasets built on Arrow / Lance fragments	Lakehouse tables with primary key tables, LSM, buckets, and snapshots
DV granularity	Local row offset within a fragment	Row position within a data file
Typical use cases	Delete, update, merge, conflict detection, read-time filtering, and materializing deletions during compaction	In primary key table MOW mode, generates DV files at write time to avoid read-time merge
Relationship with updates	The common update path writes new rows and tombstones the old rows	The primary key table relies on the LSM to find old data and generate the DV
Relationship with column-level evolution	A Lance fragment has multiple data files plus row id metadata, and updates are still maintained around fragment / row address	Paimon Data Evolution uses column overlay: files with the same `first row id` are merged at read time
Relationship with conflict detection	`affected_rows` lets concurrent delete/update perform a row-level rebase	Geared more toward the primary key table write path and file-level read filtering

The semantics in Paimon's deletion vector (DV) documentation are clear: a DV records the deleted row positions in a data file, and those rows are filtered out when the file is read. Paimon's Merge On Write (MOW) mode relies on the LSM (Log-Structured Merge-tree), which makes it possible to look up the primary key during the write phase and generate the deletion vector for the corresponding data file, so no full merge is needed at read time.

Paimon's Data Evolution table takes a different approach: it writes only the updated columns to a new file while keeping the original data file unchanged, and during reads, it merges multiple files with the same first row id into complete rows. Therefore, Paimon Data Evolution explicitly requires disabling deletion vectors and does not currently support regular Delete / Update statements.

Below is a simplified example of combining Data Evolution with DV:

base file:
  firstRowId = 100
  columns: id, a, b, c

update file:
  firstRowId = 100
  columns: b_new

At read time:
  base file + update file
  -> align by firstRowId
  -> produce complete rows

If file-level deletion vector is introduced:

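Delete a row from the base file:
  a / c are hidden too
  but how should the b_new overlay file be handled?

Delete a row from the update file:
  b_new becomes invisible
  but should the old b in the base file be restored?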

If both mechanisms are supported simultaneously, the read and write paths need to handle both "row deletion" and "column overlay merge". Paimon separates Data Evolution and Deletion Vector to avoid mutual interference between these two semantic sets.

Lance's typical update path is:

new rows / new columns are written to new fragments or new data files
old rows are tombstoned through the deletion vector
the manifest describes the fragment state of the current version in one place

Therefore, Lance's Deletion Vector is not just a read filtering tool but also participates in rebasing and conflict detection during concurrent writes.

Compaction: When Fragments Are Selected

Deletion Vector reduces the write amplification of delete/update but leaves two issues:

Small-batch appends may create a large number of small fragments.
After multiple delete/update operations, the proportion of deleted rows in the fragment may be high.

Compaction is used to handle these layout degradation issues.

Lance's compaction does not simply select by file size; it mainly looks at the fragment's row count and deletion ratio:

if deletion_percentage > materialize_deletions_threshold:
  select this fragment
  purpose: materialize the deletion vector and write only live rows

else if physical_rows < target_rows_per_fragment:
  select this fragment
  purpose: merge it with adjacent small fragments into a larger fragment

else:
  skip this round of compaction

Default configuration:

target_rows_per_fragment = 1024 * 1024
materialize_deletions_threshold = 0.1

The resulting behavior is:

A fragment whose deletion ratio exceeds 10% is worth rewriting, even on its own.
A fragment with fewer rows than the target is merged with adjacent candidate fragments where possible.
Compaction is planned by row count rather than selected directly by file byte size.
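The selection rule and its defaults can be written as a small classifier. This is a sketch that follows the order of checks described in this article; `classify` and the `Candidacy` enum are hypothetical Python stand-ins, with variant names borrowed from CompactionCandidacy, and the real planner may apply further conditions.

```python
from enum import Enum


class Candidacy(Enum):
    COMPACT_ITSELF = "compact_itself"
    COMPACT_WITH_NEIGHBORS = "compact_with_neighbors"
    SKIP = "skip"


def classify(
    physical_rows,
    deleted_rows,
    target_rows_per_fragment=1024 * 1024,
    materialize_deletions_threshold=0.1,
):
    """Classify one fragment by deletion ratio first, then by row count."""
    deletion_ratio = deleted_rows / physical_rows if physical_rows else 0.0
    if deletion_ratio > materialize_deletions_threshold:
        return Candidacy.COMPACT_ITSELF          # worth rewriting on its own
    if physical_rows < target_rows_per_fragment:
        return Candidacy.COMPACT_WITH_NEIGHBORS  # small: merge with neighbors
    return Candidacy.SKIP


heavy_deletes = classify(physical_rows=1000, deleted_rows=200)
small_fragment = classify(physical_rows=1000, deleted_rows=50)
healthy = classify(physical_rows=2 * 1024 * 1024, deleted_rows=0)
```

Note that no byte size appears anywhere in the decision; only row counts and the deletion ratio matter.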

Source code entry:

rust/lance/src/dataset/optimize.rs
  CompactionOptions
  plan_compaction
  CompactionCandidacy::CompactItself
  CompactionCandidacy::CompactWithNeighbors

There is an index constraint here: compaction will not mix "fragments covered by an index" and "fragments not covered by this index" in the same rewrite group.

The reason is that Lance's index metadata includes fragment_bitmap, which describes which fragments the index covers. If a rewrite group mixes indexed and unindexed fragments, the new fragment created after compaction cannot be clearly classified as indexed or unindexed.

Therefore, the compaction planner will categorize candidate fragments into bins based on index coverage:

F1 indexed by vector_index
F2 indexed by vector_index
F3 not indexed

Allowed:
  compact(F1, F2) -> F10

Not allowed:
  compact(F2, F3) -> F11
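The binning rule can be sketched with `itertools.groupby`, which only groups adjacent items with an equal key. This is an illustration under assumptions: `plan_groups` is a hypothetical helper, and index coverage is modeled as a frozenset of index names per fragment.

```python
from itertools import groupby


def plan_groups(candidates):
    """candidates: list of (fragment_id, frozenset of index names), in fragment order.
    Adjacent fragments share a rewrite group only while their coverage is identical."""
    groups = []
    for _, run in groupby(candidates, key=lambda c: c[1]):
        groups.append([frag_id for frag_id, _ in run])
    return groups


candidates = [
    (1, frozenset({"vector_index"})),  # F1 indexed by vector_index
    (2, frozenset({"vector_index"})),  # F2 indexed by vector_index
    (3, frozenset()),                  # F3 not indexed
]
groups = plan_groups(candidates)
```

F1 and F2 land in one group, while F3 forms its own, so no output fragment ever mixes indexed and unindexed rows.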

Why Compaction Leads to Index Remapping

Compaction is, in essence, a rewrite:

old fragments:
  F1, F2

new fragments:
  F10

If the index stores row addresses, after compaction, the addresses in the index need to be maintained.

Lance's row address is composed of fragment ID and row offset:

row_address = (fragment_id, row_offset)

Before compaction:

F1.offset 0
F1.offset 1
F2.offset 0

After compaction:

F10.offset 0
F10.offset 1
F10.offset 2

Logically, they are still the same batch of live rows, but the physical addresses have changed. The row addresses stored in the index must be updated accordingly, otherwise the index will point to an old fragment.
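The address change can be made concrete with a sketch of the old-to-new mapping. It assumes the 64-bit packing described in Lance's format documentation (fragment ID in the upper 32 bits, row offset in the lower 32 bits); `build_remap` and the helper names are hypothetical.

```python
def encode_row_address(fragment_id, row_offset):
    """Pack (fragment_id, row_offset) into one 64-bit integer."""
    return (fragment_id << 32) | row_offset


def decode_row_address(addr):
    return addr >> 32, addr & 0xFFFFFFFF


def build_remap(old_fragments, deletions, new_fragment_id):
    """old_fragments: list of (fragment_id, physical_rows) in rewrite order.
    Returns {old_address: new_address, or None for rows that were deleted}."""
    remap, next_offset = {}, 0
    for frag_id, physical_rows in old_fragments:
        deleted = deletions.get(frag_id, set())
        for offset in range(physical_rows):
            old = encode_row_address(frag_id, offset)
            if offset in deleted:
                remap[old] = None  # materialized deletion: the row is gone
            else:
                remap[old] = encode_row_address(new_fragment_id, next_offset)
                next_offset += 1
    return remap


# F1 has 2 rows, F2 has 1 row, nothing deleted; both are rewritten into F10.
remap = build_remap([(1, 2), (2, 1)], {}, new_fragment_id=10)
```

An index that stores raw addresses has to push every one of its entries through a mapping like this after compaction.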

Lance has several ways to handle the index after compaction:

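1. Synchronous index remap
   compaction produces an old row address -> new row address mapping
   the affected index files are rewritten

2. defer_index_remap
   compaction does not rewrite all indexes immediately
   a fragment reuse index is created
   the remap is done at query time or during later maintenance

3. stable row id
   the index no longer depends directly on volatile physical row addresses
   it points to stable row ids instead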

When applying Operation::Rewrite in a transaction, Lance updates two kinds of metadata:

fragments:
  old fragments are removed from the manifest
  new fragments are added to the manifest

indices:
  update fragment_bitmap
  replace the index uuid and index files when necessary

Source code entry:

rust/lance/src/dataset/transaction.rs
  Operation::Rewrite
  handle_rewrite_fragments
  recalculate_fragment_bitmap
  handle_rewrite_indices

rust/lance/src/dataset/optimize/remapping.rs
  fragment reuse index
  deferred index remap

The maintenance methods for different indexes vary. Whether remapping is possible depends on whether the index internally stores details that can map from old row addresses to new row addresses.

They can be distinguished based on the content recorded internally in the index:

Easier to remap:
  indexes that map to row addresses entry by entry or segment by segment

Harder to remap:
  indexes whose internal structure depends heavily on training results, clustering results, posting layout, or block layout

For example, a vector index is not just a key -> row address mapping. An index such as IVF contains internal structures like cluster centers, partitions, vector lists, and row ID lists. Although the vector values remain unchanged after compaction, the row addresses change, and the row ID/address payloads within the index must be updated consistently. If the index format does not provide a cheap, local, and reliable way to remap, the only options are to rewrite the index or to rely on the fragment reuse index for deferred processing.

Therefore, the challenge of compaction is not just rewriting data files:

After rewriting data files, the index must still be able to locate the same batch of logical rows.

Stable Row ID: Decoupling the index from physical addresses

Stable Row ID assigns a stable ID to each logical row, making the index no longer directly dependent on the changing row address.

Without stable row ID:

row id ~= row address

Index hit:
  vector index -> row address -> fetch the row

compaction:
  row address changes
  -> index payload needs remap

After enabling stable row ID:

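row id = stable logical row ID
row address = current physical location

Index hit:
  vector index -> stable row id
  -> RowIdIndex looks up the current row address for the row id
  -> fetch the row

compaction:
  stable row id unchanged
  row address changes
  -> update row_id_meta / RowIdIndex
  -> the index body can be reused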

It can be illustrated as follows:

                 without stable row id

Index payload ------------------> RowAddress(F1, 42)
                                      |
                                      | invalid after compaction
                                      v
                                  RowAddress(F10, 8)


                 with stable row id

Index payload ---> StableRowId(10086)
                         |
                         v
                   RowIdIndex(version N)
                         |
                         v
                   RowAddress(F10, 8)

Actual Representation of RowIdIndex

RowIdIndex is not an index that users create manually, nor does it store a separate row_id -> row_address record for every row. It is a version-level in-memory index that Lance builds from fragment metadata when a read needs it (with stable row id enabled), and it is cached in the metadata cache.

The construction entry point is:

rust/lance/src/dataset/rowids.rs
  get_row_id_index(dataset)
    -> if manifest.uses_stable_row_ids()
    -> get RowIdIndex from metadata_cache, or build it

  load_row_id_index(dataset)
    -> read each fragment's row_id_meta
    -> read that fragment's deletion vector
    -> build FragmentRowIdIndex
    -> RowIdIndex::new(...)

Each fragment stores row_id_meta, which describes the sequence of stable row ids in the fragment's physical row order:

Fragment 10
  row_id_meta:
    physical offset 0 -> stable row id 100
    physical offset 1 -> stable row id 101
    physical offset 2 -> stable row id 105
    physical offset 3 -> stable row id 106

  deletion vector:
    {1}

When constructing RowIdIndex, the deletion vector is applied:

offset 0 live     -> 100 -> RowAddress(F10, 0)
offset 1 deleted  -> skipped
offset 2 live     -> 105 -> RowAddress(F10, 2)
offset 3 live     -> 106 -> RowAddress(F10, 3)
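The construction step for Fragment 10 can be sketched as a dictionary comprehension. This is a deliberately uncompressed model: `build_row_id_index` is a hypothetical helper that keeps one entry per live row for readability, whereas the real structure is range-based.

```python
def build_row_id_index(fragment_id, row_id_meta, deletion_vector):
    """row_id_meta: stable row ids listed in physical row order.
    Returns {stable_row_id: (fragment_id, physical_offset)} for live rows only."""
    return {
        row_id: (fragment_id, offset)
        for offset, row_id in enumerate(row_id_meta)
        if offset not in deletion_vector
    }


index = build_row_id_index(
    fragment_id=10,
    row_id_meta=[100, 101, 105, 106],
    deletion_vector={1},
)
```

Row id 101 is absent from the result because its physical offset is tombstoned.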

The core structure in the source code is:

pub struct RowIdIndex(
    RangeInclusiveMap<u64, (U64Segment, U64Segment)>
);

It can be understood as:

RowIdIndex
  key:
    the range covered by this run of row ids

  value:
    row_id_segment:
      the row ids that actually exist in this run

    address_segment:
      physical row addresses aligned one-to-one with row_id_segment

That is to say, a chunk of RowIdIndex is not a single mapping but a range of mappings:

coverage range:
  100..=106

row_id_segment:
  [100, 105, 106]

address_segment:
  [RowAddress(F10, 0), RowAddress(F10, 2), RowAddress(F10, 3)]

To look up row_id = 105, the process is:

1. Use 105 to find the covering chunk in the RangeInclusiveMap
2. Find the position of 105 in row_id_segment
3. Suppose the position is 1
4. Take the address at position 1 from address_segment
5. The result is RowAddress(F10, 2)

The corresponding source code is:

rust/lance-table/src/rowids/index.rs
  RowIdIndex::get(row_id)
    -> self.0.get(&row_id)
    -> row_id_segment.position(row_id)
    -> address_segment.get(pos)
    -> RowAddress::from(address)
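The same lookup can be modeled in Python with a sorted list of chunks and a binary search over chunk start values. This is a toy stand-in: `ToyRowIdIndex` is hypothetical, it uses plain lists where Lance uses U64Segment, and addresses are shown as (fragment_id, offset) tuples.

```python
from bisect import bisect_right


class ToyRowIdIndex:
    def __init__(self, chunks):
        # chunks: (start, end_inclusive, row_ids, addresses), non-overlapping ranges
        self.chunks = sorted(chunks, key=lambda c: c[0])
        self.starts = [c[0] for c in self.chunks]

    def get(self, row_id):
        # Step 1: find the chunk whose range could cover row_id.
        i = bisect_right(self.starts, row_id) - 1
        if i < 0:
            return None
        _, end, row_ids, addresses = self.chunks[i]
        if row_id > end or row_id not in row_ids:
            return None  # falls in a gap: the row was deleted or never existed
        # Steps 2-5: position in row_ids selects the aligned address.
        return addresses[row_ids.index(row_id)]


idx = ToyRowIdIndex([(100, 106, [100, 105, 106], [(10, 0), (10, 2), (10, 3)])])
hit = idx.get(105)
miss = idx.get(101)
```

The lookup for 105 lands on position 1 of the row id list and returns the address at the same position, matching the walkthrough.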

U64Segment is the compressed representation here. It selects different structures based on the shape of the row id sequence:

Range:
  contiguous and sorted, for example 100..200

RangeWithHoles:
  mostly contiguous, with a small number of holes

RangeWithBitmap:
  mostly contiguous, with a bitmap marking which positions exist

SortedArray:
  sorted but relatively sparse

Array:
  unordered sequence

This explains why, with stable row id enabled, Lance does not need to write a metadata entry for every row. Normally, consecutively inserted row ids can be represented as a range; after updates, deletes, and compaction introduce gaps or out-of-order entries, the representation gradually degrades to forms such as bitmap or array.
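How a sequence's shape might map onto the five variants can be sketched with a toy heuristic. This is not Lance's selection logic: `choose_segment_kind`, the `max_holes` cutoff, and the density test are all assumptions made purely to illustrate the degradation from range to array.

```python
def choose_segment_kind(row_ids, max_holes=4):
    """Pick a variant name for a non-empty list of row ids.
    max_holes is an arbitrary illustrative cutoff, not a Lance constant."""
    is_sorted = all(a < b for a, b in zip(row_ids, row_ids[1:]))
    if not is_sorted:
        return "Array"                # out-of-order ids need the raw sequence
    span = row_ids[-1] - row_ids[0] + 1
    holes = span - len(row_ids)
    if holes == 0:
        return "Range"                # contiguous: start and end are enough
    if holes <= max_holes:
        return "RangeWithHoles"       # a few gaps listed explicitly
    if holes < len(row_ids):
        return "RangeWithBitmap"      # still mostly dense
    return "SortedArray"              # sparse but ordered


fresh_insert = choose_segment_kind(list(range(100, 200)))
after_deletes = choose_segment_kind([100, 105, 106])
out_of_order = choose_segment_kind([3, 1, 2])
```

A freshly appended batch stays a compact range, and only later gaps or reordering force a heavier representation.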

Source code entry:

docs/src/format/table/row_id_lineage.md
  Row Address vs Row ID
  stable row id behavior

docs/src/format/index/index.md
  Stable Row ID for Index

rust/lance/src/dataset/rowids.rs
  get_row_id_index
  load_row_id_index

rust/lance-table/src/rowids/index.rs
  RowIdIndex
  FragmentRowIdIndex

rust/lance-table/src/rowids/segment.rs
  U64Segment

Stable Row ID provides the following benefits:

After compaction, the index does not have to be rewritten in full because physical addresses changed.
After an update that leaves the indexed column unchanged, the scope of index invalidation can be reduced.
It suits large tables that need long-term maintenance, frequent compaction, and have high index build costs.

But it also has costs:

Queries pay for an additional stable row id -> row address lookup.
Each fragment needs to maintain row_id_meta.
Deletes and updates cause the row ID sequence to evolve from a continuous range into forms such as holes, bitmap, or array.
The feature must be enabled when the dataset is created and cannot be added later to an existing table that was created without it.

Therefore, Stable Row ID is not suitable for being enabled by default. For small tables with one-time imports, infrequent updates, and acceptable index rebuilds, it may not be worth it. For datasets with high index construction costs, frequent compactions, and requiring long-term maintenance, it is more valuable.

A complete example

Suppose there is a Lance table:

schema:
  id: int64
  text: string
  vector: fixed_size_list<float32>[768]

indices:
  vector index on vector
  scalar index on id

Initial state:

Manifest v1
  Fragment 1: rows 0..999
  Fragment 2: rows 1000..1999

Vector index:
  fragment_bitmap = {1, 2}

Perform one update:

UPDATE table
SET text = 'new text'
WHERE id = 42;

Processing steps:

1. scan finds the row address of id = 42
2. write the updated row to Fragment 3
3. write a deletion file for Fragment 1, marking the old offset deleted
4. commit Operation::Update
5. affected_rows = {(Fragment 1, offset 42)}

If at the same time another transaction deletes id = 43:

T1 affected_rows = {(F1, 42)}
T2 affected_rows = {(F1, 43)}

Both fall into the same fragment, but the rows do not overlap. As long as there is no other data file rewrite, Lance has the opportunity to complete the rebase by merging the deletion vector.

Compaction then runs later:

Fragment 1:
  deleted rows exceed the threshold

Fragment 2:
  fewer rows than target_rows_per_fragment

The planner will check:

Are these fragments adjacent?
Do they have the same index coverage?
Is the deletion ratio high enough to materialize on its own?

If the final rewrite is:

old:
  Fragment 1
  Fragment 2

new:
  Fragment 10

Then the index must handle:

fragment_bitmap:
  {1, 2} -> {10}

row address:
  old addresses -> new addresses

If stable row id is enabled:

the index payload still points to the stable row id
RowIdIndex updates the mapping from stable row id to the new RowAddress

This example covers the relationship between Deletion Vector, Compaction, Index Remap, and Stable Row ID.

Summary

Lance's write pipeline is neither the in-place update of a traditional database nor a simple append-only log. It is a versioned columnar dataset write mechanism:

delete makes old rows invisible through the deletion vector.
update and merge express modifications by writing new rows/columns and tombstoning the old rows.
The Deletion Vector serves both read filtering and row-level write conflict detection.
Paimon's DV leans more towards primary key table MOW and file-level read filtering; Paimon Data Evolution disables DV in favor of column overlay semantics.
Compaction is responsible for merging small fragments, materializing deletions, and improving layout.
Compaction changes row addresses, which in turn triggers index remapping.
Stable Row ID decouples the index from physical row addresses by introducing stable logical row IDs.

The design pressure of Lance's write pipeline can be summarized as:

After files are rewritten, deletions, versions, indexes, and row identities must still refer to the same set of logical rows.
