











Lance's write path involves file layout, version commits, deletion marking, index maintenance, and compaction. Unlike traditional databases, Lance does not modify data in place in the original files; it creates new table versions by adding new files and updating metadata.
This article discusses several implementation issues, chiefly how delete, update, and merge insert are applied to files and metadata.
Lance's basic write form is: write new data files or deletion files, then commit the new version through a transaction and a manifest.
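Delete:
Do not rewrite the data file immediately
-> record the row offsets deleted within the fragment
-> filter with the deletion vector at read time
Update / Merge:
Write out the updated new rows or new columns
-> write a deletion vector for the old rows
-> commit Operation::Update
Compaction:
Read the live rows of the old fragments
-> write them into new fragments
-> old row addresses may become invalid
-> indexes need to be remapped or rebuilt
Stable Row ID:
Let indexes point to stable logical row IDs
-> after compaction, only the row_id -> row_address mapping needs maintaining
-> lower index remap cost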
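In Lance's persistent metadata, the deletion vector is usually referred to as a DeletionFile; in memory it is represented as a DeletionVector:
DeletionVector:
The in-memory set of deleted rows
DeletionFile:
A reference to the file under the table directory that persists a DeletionVector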
From the LanceDB API perspective, the user calls:
table.add(...)
table.delete("id = 42")
table.update(where="id = 42", values={"text": "new"})
table.merge_insert("id").when_matched_update_all().when_not_matched_insert_all()
table.optimize.compact_files()
At the storage layer, however, Lance does not see "modifying a few rows in a table" but a single dataset state transition:
Read the current manifest
-> execute scan / join / filter
-> write new data files / deletion files / index files
-> construct an Operation
-> write the transaction
-> commit the new manifest version
A version's manifest describes the complete state of the current table:
Manifest version N
schema
fragments
indices
version metadata
Fragment 1
data files
deletion file
row id metadata
A transaction describes the changes to table state made by one commit. Therefore, creating indexes, compacting files, and updating configuration each produce a new version even if the number of rows has not changed.
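The idea that every commit yields a new immutable manifest version can be sketched in a few lines. This is a toy model, not Lance's implementation: the `Fragment`, `Manifest`, and `commit` names and the file names are hypothetical and only mirror the fields listed above.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Fragment:
    id: int
    data_files: Tuple[str, ...]
    deletion_file: Optional[str] = None  # reference to a persisted deletion vector


@dataclass(frozen=True)
class Manifest:
    version: int
    fragments: Tuple[Fragment, ...]
    indices: Tuple[str, ...] = ()


def commit(current, fragments=None, indices=None):
    """Return a new manifest version; the current one is never mutated."""
    return Manifest(
        version=current.version + 1,
        fragments=current.fragments if fragments is None else tuple(fragments),
        indices=current.indices if indices is None else tuple(indices),
    )


v1 = Manifest(version=1, fragments=(Fragment(1, ("data-1.lance",)),))
# Creating an index changes no rows but still produces a new version.
v2 = commit(v1, indices=("vector_index",))
# A delete swaps in a fragment entry that references a deletion file.
tombstoned = replace(v2.fragments[0], deletion_file="deletions-1.arrow")
v3 = commit(v2, fragments=[tombstoned])
```

Because old manifests stay untouched, a reader pinned to `v1` keeps seeing the table exactly as it was at that version.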
Delete does not immediately rewrite the data file but marks the targeted rows as deleted.
The simplified call chain is:
delete where id = 42
-> scan + predicate to find the matching rows
-> capture their row addresses
-> group them by fragment into local row offsets
-> update fragment.deletion_file
-> Operation::Delete
-> CommitBuilder.with_affected_rows(...)
-> commit the new manifest
Assuming Fragment 7 has 1000 rows:
Fragment 7
data file: 1000 physical rows
deletion file: {3, 19, 42}
When reading Fragment 7:
offsets 3, 19, and 42 are filtered out
all other rows remain visible
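The read-time filtering for Fragment 7 can be sketched as follows. This is a minimal illustration that models the deletion vector as a plain Python set; `scan_fragment` is a hypothetical helper, not a Lance API.

```python
def scan_fragment(rows, deleted_offsets):
    """Yield (offset, row) for rows whose local offset is not tombstoned."""
    for offset, row in enumerate(rows):
        if offset not in deleted_offsets:
            yield offset, row


# Fragment 7: 1000 physical rows, three of them marked deleted.
fragment_7_rows = [f"row-{i}" for i in range(1000)]
fragment_7_deletions = {3, 19, 42}

visible = list(scan_fragment(fragment_7_rows, fragment_7_deletions))
```

The data file still holds all 1000 physical rows; only the scan result shrinks.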
In the source code, the Fragment metadata contains an optional deletion_file field, which semantically means "the local row offsets deleted within this fragment." DeletionFileType currently has two variants:
Array:
Suited to sparse deletion sets
Bitmap:
Suited to dense deletion sets
Source code entry:
rust/lance-table/src/format/fragment.rs
DeletionFile
DeletionFileType
Fragment.deletion_file
rust/lance-table/src/io/deletion.rs
write_deletion_file
read_deletion_file
rust/lance/src/dataset/write/delete.rs
apply_deletions
DeleteBuilder
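The benefit of this design is that a delete does not have to rewrite data files, which keeps write amplification low.
The corresponding cost is paid on the read path: the deletion vector has to be loaded, and tombstoned rows have to be skipped during scanning.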
A common implementation of Update is to write the new data after the update, and then tombstone the old row.
The main path of an ordinary update is close to RewriteRows:
update set text = 'new' where id = 42
-> scan the rows matched by the where condition, carrying row id / row address
-> apply the update expressions to each batch
-> write_fragments_internal writes out the new fragments
-> apply a deletion vector to the old row addresses
-> Operation::Update { update_mode: RewriteRows }
-> CommitBuilder.with_affected_rows(...)
As an example, consider updating only 1 column of a 10-column table:
Before the update:
Fragment 1
row offset 42 = (c1, c2, c3, ..., c10)
UPDATE SET c3 = c3_new WHERE id = 42
After the update:
Fragment 1
deletion file marks offset 42 as deleted
Fragment 9
newly written row = (c1, c2, c3_new, ..., c10)
RewriteRows does not rewrite the entire old fragment, but instead writes out the matched rows as new rows. For each updated row, it writes out the complete row.
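A toy simulation of this "write the full new row, tombstone the old one" pattern is shown below. It is a sketch under simplifying assumptions: fragments are plain lists of dicts, deletion vectors are sets, and `rewrite_rows_update` is a hypothetical function rather than Lance's `UpdateJob`.

```python
def rewrite_rows_update(fragments, deletions, predicate, apply_update, new_fragment_id):
    """Copy matching live rows into a new fragment and tombstone the originals."""
    new_rows = []
    for frag_id, rows in fragments.items():
        tombstones = deletions.setdefault(frag_id, set())
        for offset, row in enumerate(rows):
            if offset in tombstones or not predicate(row):
                continue
            new_rows.append(apply_update(dict(row)))  # the complete row is rewritten
            tombstones.add(offset)                    # the old row is only marked deleted
    if new_rows:
        fragments[new_fragment_id] = new_rows
    return len(new_rows)


fragments = {1: [{"id": i, "c3": "c3"} for i in range(100)]}
deletions = {}
updated = rewrite_rows_update(
    fragments,
    deletions,
    predicate=lambda row: row["id"] == 42,
    apply_update=lambda row: {**row, "c3": "c3_new"},
    new_fragment_id=9,
)
```

Note that the old row at offset 42 of Fragment 1 is never modified; it simply becomes invisible once the deletion vector is applied.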
Source code entry:
rust/lance/src/dataset/write/update.rs
UpdateJob::execute_impl
scanner.with_row_id()
write_fragments_internal(...)
apply_deletions(...)
Operation::Update { update_mode: RewriteRows }
Lance also has a RewriteColumns mode, which mainly appears in partial-schema merge/update scenarios. It targets updates to "large rows, few columns," but it increases the maintenance cost of fragments, column files, index coverage, and conflict detection.
LanceDB's merge_insert can be used for upsert:
(
table.merge_insert("id")
.when_matched_update_all()
.when_not_matched_insert_all()
.execute(source)
)
Here id is the join key used to match source and target rows, not a strongly enforced primary key as in a database. The simplified process is:
join source and target on the key
matched:
update the target rows
not matched:
insert the source rows
not matched by source:
keep or delete
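The three-way classification behind merge insert can be sketched in plain Python. This only illustrates the join semantics; `classify_merge` is a hypothetical helper and does not reflect how Lance executes the join.

```python
def classify_merge(source, target, key):
    """Split rows into matched, not matched, and not matched by source."""
    target_keys = {row[key] for row in target}
    source_keys = {row[key] for row in source}
    matched = [row for row in source if row[key] in target_keys]
    not_matched = [row for row in source if row[key] not in target_keys]
    not_matched_by_source = [row for row in target if row[key] not in source_keys]
    return matched, not_matched, not_matched_by_source


target = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}, {"id": 3, "text": "c"}]
source = [{"id": 2, "text": "B"}, {"id": 3, "text": "C"}, {"id": 4, "text": "D"}]

matched, not_matched, not_matched_by_source = classify_merge(source, target, "id")
```

With `when_matched_update_all` and `when_not_matched_insert_all`, the first group becomes updates and the second becomes inserts, while the third group is kept unless a delete clause is configured.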
On the full schema path, the merge insert process is similar to a regular update:
matched rows:
write the updated complete rows to new fragments
write a deletion vector for the old target rows
not matched rows:
write them as new rows into new fragments
commit:
Operation::Update { update_mode: RewriteRows }
On the partial schema path, it follows RewriteColumns:
source contains only some of the columns
-> update_fragments(...)
-> Operation::Update { update_mode: RewriteColumns, fields_modified }
Source code entry:
rust/lance/src/dataset/write/merge_insert.rs
MergeInsertBuilder
update_fragments
Operation::Update { update_mode: RewriteRows | RewriteColumns }
The Deletion Vector not only filters out deleted rows during reads but also allows Lance to reduce some write conflicts from "fragment-level conflicts" to "row-level conflicts".
First, consider what happens without deletion vectors and affected rows:
T1:
delete row (Fragment 1, offset 10)
T2:
delete row (Fragment 1, offset 20)
If you only look at the fragment level, both transactions modified Fragment 1, so they are easily identified as conflicting. However, they actually deleted different rows, which can be merged:
T1 deletion vector: {10}
T2 deletion vector: {20}
After rebase:
Fragment 1 deletion vector: {10, 20}
Lance passes the matched row addresses to the commit layer when committing a delete or update:
CommitBuilder.with_affected_rows(RowAddrTreeMap)
This allows conflict detection to determine:
Did the two concurrent transactions really modify the same set of rows?
If it's just modifying different rows of the same fragment, and another transaction hasn't changed the data files but only the deletion file, then the current transaction has a chance to rebase. That is, it can write a merged deletion vector based on the new fragment deletion file.
Typical case:
Can rebase:
T1 delete F1.offset 10
T2 delete F1.offset 20
-> affected_rows do not overlap
-> merge the deletion vectors
Cannot rebase:
T1 update F1.offset 10
T2 delete F1.offset 10
-> affected_rows overlap
-> semantic conflict
Cannot simply rebase:
T1 delete F1.offset 10
T2 compaction/rewrite F1
-> the fragment's data files are rewritten
-> row addresses / fragment state change on a large scale
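The three cases above can be condensed into a small decision function. This is a simplified sketch, not the logic in `conflict_resolver.rs`: affected rows are modeled as sets of `(fragment_id, offset)` tuples, and `check_delete_rebase` and `Outcome` are hypothetical names.

```python
from enum import Enum


class Outcome(Enum):
    REBASE = "rebase"
    CONFLICT = "conflict"


def check_delete_rebase(mine, theirs, rewritten_fragments):
    """Decide whether a delete can be rebased onto an already committed transaction.

    mine / theirs: sets of (fragment_id, offset) affected by each transaction.
    rewritten_fragments: fragment ids whose data files the other commit rewrote.
    """
    if any(frag_id in rewritten_fragments for frag_id, _ in mine):
        return Outcome.CONFLICT, None  # row addresses are no longer valid
    if mine & theirs:
        return Outcome.CONFLICT, None  # both touched the same row
    return Outcome.REBASE, mine | theirs  # merged deletion vector


disjoint = check_delete_rebase({(1, 10)}, {(1, 20)}, set())
same_row = check_delete_rebase({(1, 10)}, {(1, 10)}, set())
compacted = check_delete_rebase({(1, 10)}, set(), {1})
```

Treating a rewritten fragment as a hard conflict is a simplification made here; the point is only that fragment-level overlap alone does not decide the outcome.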
Source code entry:
rust/lance/src/dataset/write/commit.rs
CommitBuilder.with_affected_rows
rust/lance/src/io/commit/conflict_resolver.rs
TransactionRebase
check_delete_txn
check_update_txn
Therefore, the Deletion Vector does not eliminate all write conflicts. It provides the expressiveness of row-level affected rows, allowing Lance to avoid some fragment-level false conflicts.
For datasets containing a large number of samples, updates, deletions, and merges often only affect a small number of rows. If concurrency control can only be done at the fragment level, it will judge many non-overlapping row-level modifications as conflicts.
Lance and Apache Paimon both have Deletion Vectors, but the problems they aim to solve are not entirely the same.
| Dimension | Lance | Apache Paimon |
|---|---|---|
| Main data model | Columnar datasets built on Arrow / Lance fragments | Lakehouse tables: primary key tables, LSM, buckets, snapshots |
| DV granularity | Local row offset within a fragment | Row position within a data file |
| Typical use cases | Delete, update, merge, conflict detection, read-time filtering, materializing deletions during compaction | Avoiding read-time merge by generating DV files at write time in primary key table MOW mode |
| Relationship with updates | The common update path writes new rows and tombstones old rows | Primary key tables rely on LSM to find old data and generate DVs |
| Relationship with column-level evolution | Fragments have multiple data files and row id metadata; after updates, state is still maintained around fragment / row address | Data Evolution uses column overlays: files with the same first row id are merged at read time |
| Relationship with conflict detection | affected_rows allows concurrent delete/update to perform a row-level rebase | More oriented toward primary key table write paths and file-level read filtering |
The semantics of Paimon's deletion vector (DV) are clear: it records the deleted row positions in a data file, and those rows are filtered out when the file is read. Paimon's Merge On Write mode relies on the LSM (Log-Structured Merge-tree), which makes it possible to look up the primary key during the write phase and generate the deletion vector for the corresponding data file, thus eliminating the need for a full merge during reads.
Paimon's Data Evolution table takes a different approach: it writes only the updated columns to a new file while keeping the original data file unchanged, and during reads, it merges multiple files with the same first row id into complete rows. Therefore, Paimon Data Evolution explicitly requires disabling deletion vectors and does not currently support regular Delete / Update statements.
The following simplified example shows what would happen if Data Evolution were combined with DV:
base file:
firstRowId = 100
columns: id, a, b, c
update file:
firstRowId = 100
columns: b_new
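At read time:
base file + update file
-> align by firstRowId
-> obtain the complete row
If file-level deletion vector is introduced:
Deleting a row of the base file:
a / c are hidden as well
but how should the overlay file holding b_new be handled?
Deleting a row of the update file:
b_new becomes invisible
but should the old b in the base file be restored?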
If both mechanisms are supported simultaneously, the read and write paths need to handle both "row deletion" and "column overlay merge". Paimon separates Data Evolution and Deletion Vector to avoid mutual interference between these two semantic sets.
Lance's typical update path is:
new rows / new columns are written to new fragments or new data files
old rows are tombstoned through the deletion vector
the manifest describes the fragment state of the current version in one place
Therefore, Lance's Deletion Vector is not just a read filtering tool but also participates in rebasing and conflict detection during concurrent writes.
Deletion vectors reduce the write amplification of delete and update, but they leave two issues: tombstoned rows accumulate in old data files and must still be filtered on every read, and updates keep producing small new fragments.
Compaction is used to handle these layout degradation issues.
Lance's compaction does not simply select files by size; it mainly looks at the number of rows in each fragment and its deletion ratio:
if deletion_percentage > materialize_deletions_threshold:
select this fragment
purpose: materialize the deletion vector by writing only the live rows
else if physical_rows < target_rows_per_fragment:
select this fragment
purpose: merge it with adjacent small fragments into a larger fragment
else:
skip this round of compaction
Default configuration:
target_rows_per_fragment = 1024 * 1024
materialize_deletions_threshold = 0.1
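The selection rule and its defaults can be written as a small runnable function. This is a sketch that follows the pseudocode in this article, not the code in `plan_compaction`; the function name and the string labels are hypothetical.

```python
from typing import Optional

TARGET_ROWS_PER_FRAGMENT = 1024 * 1024
MATERIALIZE_DELETIONS_THRESHOLD = 0.1


def compaction_candidacy(physical_rows: int, num_deletions: int) -> Optional[str]:
    """Classify one fragment; the labels are illustrative, not Lance enum names."""
    if physical_rows > 0:
        deletion_ratio = num_deletions / physical_rows
        if deletion_ratio > MATERIALIZE_DELETIONS_THRESHOLD:
            return "materialize_deletions"  # rewrite only the live rows
    if physical_rows < TARGET_ROWS_PER_FRAGMENT:
        return "merge_with_neighbors"  # too small, combine with adjacent fragments
    return None  # not part of this compaction round


heavy_deletes = compaction_candidacy(2_000_000, 300_000)  # ratio 0.15
small_fragment = compaction_candidacy(1_000, 0)
healthy = compaction_candidacy(2_000_000, 100_000)  # ratio 0.05
```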
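The two selected cases correspond to the planner's candidacy types CompactItself (rewrite a fragment on its own to materialize its deletions) and CompactWithNeighbors (merge a small fragment with adjacent ones).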
Source code entry:
rust/lance/src/dataset/optimize.rs
CompactionOptions
plan_compaction
CompactionCandidacy::CompactItself
CompactionCandidacy::CompactWithNeighbors
There is an index constraint here: compaction will not mix "fragments covered by an index" and "fragments not covered by this index" in the same rewrite group.
The reason is that Lance's index metadata includes fragment_bitmap, which describes which fragments the index covers. If a rewrite group mixes indexed and unindexed fragments, the new fragment created after compaction cannot be clearly classified as indexed or unindexed.
Therefore, the compaction planner will categorize candidate fragments into bins based on index coverage:
F1 indexed by vector_index
F2 indexed by vector_index
F3 not indexed
Allowed:
compact(F1, F2) -> F10
Not allowed:
compact(F2, F3) -> F11
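One way to picture the binning is to split the ordered candidate list into runs of adjacent fragments with identical index coverage. The sketch below assumes each index is described by a set of covered fragment ids (standing in for fragment_bitmap); `group_by_index_coverage` is a hypothetical helper.

```python
from itertools import groupby


def group_by_index_coverage(fragment_ids, index_bitmaps):
    """Split ordered fragments into runs that share the same set of covering indexes."""

    def coverage(frag_id):
        return frozenset(
            name for name, bitmap in index_bitmaps.items() if frag_id in bitmap
        )

    # groupby only merges consecutive items, so adjacency is preserved.
    return [list(run) for _, run in groupby(fragment_ids, key=coverage)]


groups = group_by_index_coverage([1, 2, 3], {"vector_index": {1, 2}})
```

F1 and F2 end up in one rewrite group and F3 in another, so no output fragment mixes indexed and unindexed rows.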
The essence of compaction is rewrite:
old fragments:
F1, F2
new fragments:
F10
If the index stores row addresses, after compaction, the addresses in the index need to be maintained.
Lance's row address is composed of fragment ID and row offset:
row_address = (fragment_id, row_offset)
Before compaction:
F1.offset 0
F1.offset 1
F2.offset 0
After compaction:
F10.offset 0
F10.offset 1
F10.offset 2
Logically, they are still the same batch of live rows, but the physical addresses have changed. The row addresses stored in the index must be updated accordingly, otherwise the index will point to an old fragment.
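The old-to-new address mapping produced by a rewrite can be sketched as below. The 64-bit packing (fragment id in the upper 32 bits, row offset in the lower 32 bits) is an assumption made for this illustration, and `build_remap` is a hypothetical helper rather than Lance's remapping code.

```python
def encode(fragment_id, offset):
    """Pack (fragment_id, offset) into one integer; the bit layout is assumed."""
    return (fragment_id << 32) | offset


def decode(address):
    return address >> 32, address & 0xFFFFFFFF


def build_remap(old_fragments, new_fragment_id):
    """old_fragments: ordered list of (fragment_id, physical_rows, deleted_offsets)."""
    remap = {}
    next_offset = 0
    for frag_id, physical_rows, deleted in old_fragments:
        for offset in range(physical_rows):
            old_address = encode(frag_id, offset)
            if offset in deleted:
                remap[old_address] = None  # row was dropped by compaction
            else:
                remap[old_address] = encode(new_fragment_id, next_offset)
                next_offset += 1
    return remap


# F1 has two live rows and F2 has one; all three move into F10.
remap = build_remap([(1, 2, set()), (2, 1, set())], new_fragment_id=10)
```

An index that stores row addresses would have to push every stored address through such a map, which is why remapping cost grows with index size.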
Lance has several ways to handle the index after compaction:
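1. Synchronous index remap
generate an old row address -> new row address mapping during compaction
rewrite the affected index files
2. defer_index_remap
do not rewrite all indexes immediately during compaction
build a fragment reuse index
perform the remap at query time or during later maintenance
3. stable row id
indexes no longer depend directly on volatile physical row addresses
they point to stable row ids instead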
When applying Operation::Rewrite in a transaction, Lance processes two types of metadata:
fragments:
old fragments are removed from the manifest
new fragments are added to the manifest
indices:
update fragment_bitmap
replace the index uuid and index files when necessary
Source code entry:
rust/lance/src/dataset/transaction.rs
Operation::Rewrite
handle_rewrite_fragments
recalculate_fragment_bitmap
handle_rewrite_indices
rust/lance/src/dataset/optimize/remapping.rs
fragment reuse index
deferred index remap
The maintenance methods for different indexes vary. Whether remapping is possible depends on whether the index internally stores details that can map from old row addresses to new row addresses.
They can be distinguished based on the content recorded internally in the index:
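Easier to remap:
indexes whose entries map to row addresses one by one or segment by segment
Harder to remap:
indexes whose internal structure depends heavily on training results, clustering results, posting layout, or block layout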
For example, vector indexes are not just a key -> row address mapping. Indexes like IVF internally have structures such as cluster centers, partitions, vector lists, and row ID lists. Although the vector values remain unchanged after compaction, the row addresses change, and the row ID/address payloads within the index need to be consistently updated. If the index format does not provide cheap, local, and reliable remapping capabilities, rewriting or relying on fragment reuse index for deferred processing is the only option.
Therefore, the challenge of compaction is not just rewriting data files:
After rewriting data files, the index must still be able to locate the same batch of logical rows.
Stable Row ID assigns a stable ID to each logical row, making the index no longer directly dependent on the changing row address.
Without stable row ID:
row id ~= row address
Index hit:
vector index -> row address -> fetch the row from the table
compaction:
row address changes
-> index payload needs to be remapped
After enabling stable row ID:
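row id = stable logical row ID
row address = current physical location
Index hit:
vector index -> stable row id
-> RowIdIndex looks up the row address currently mapped to the row id
-> fetch the row from the table
compaction:
stable row id stays the same
row address changes
-> update row_id_meta / RowIdIndex
-> the body of the index can be reused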
It can be illustrated as follows:
without stable row id
Index payload ------------------> RowAddress(F1, 42)
|
| invalid after compaction
v
RowAddress(F10, 8)
with stable row id
Index payload ---> StableRowId(10086)
|
v
RowIdIndex(version N)
|
v
RowAddress(F10, 8)
RowIdIndex is not a manually created index, nor does it store an independent row_id -> row_address record for each row. It is a version-level in-memory index that Lance builds from fragment metadata when a read requires it (once stable row ids are enabled), and it is cached in the metadata cache.
The construction entry point is:
rust/lance/src/dataset/rowids.rs
get_row_id_index(dataset)
-> if manifest.uses_stable_row_ids()
-> get the RowIdIndex from metadata_cache, or build it
load_row_id_index(dataset)
-> read each fragment's row_id_meta
-> read that fragment's deletion vector
-> construct a FragmentRowIdIndex
-> RowIdIndex::new(...)
Each fragment stores row_id_meta, which describes the sequence of stable row ids in the physical row order of that fragment:
Fragment 10
row_id_meta:
physical offset 0 -> stable row id 100
physical offset 1 -> stable row id 101
physical offset 2 -> stable row id 105
physical offset 3 -> stable row id 106
deletion vector:
{1}
When constructing RowIdIndex, the deletion vector is applied:
offset 0 live -> 100 -> RowAddress(F10, 0)
offset 1 deleted -> skipped
offset 2 live -> 105 -> RowAddress(F10, 2)
offset 3 live -> 106 -> RowAddress(F10, 3)
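Applying the deletion vector while pairing row ids with addresses can be sketched as follows. The function name is hypothetical, and row addresses are modeled as `(fragment_id, offset)` tuples for readability.

```python
def live_row_id_entries(fragment_id, row_id_meta, deleted_offsets):
    """Pair each live stable row id with its current (fragment_id, offset) address."""
    return [
        (row_id, (fragment_id, offset))
        for offset, row_id in enumerate(row_id_meta)
        if offset not in deleted_offsets
    ]


# Fragment 10 from the example: physical order of stable row ids, offset 1 deleted.
entries = live_row_id_entries(10, [100, 101, 105, 106], {1})
```

Deleted rows never enter the mapping, so a lookup for stable row id 101 finds nothing.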
The core structure in the source code is:
pub struct RowIdIndex(
RangeInclusiveMap<u64, (U64Segment, U64Segment)>
);
It can be understood as:
RowIdIndex
key:
the range of row ids covered by this chunk
value:
row_id_segment:
the row ids that actually exist in this chunk
address_segment:
the physical row addresses, aligned one-to-one with row_id_segment
That is to say, a chunk of RowIdIndex is not a single mapping but a range of mappings:
coverage range:
100..=106
row_id_segment:
[100, 105, 106]
address_segment:
[RowAddress(F10, 0), RowAddress(F10, 2), RowAddress(F10, 3)]
When row_id = 105 is queried, the process is:
1. Use 105 to find the covering chunk in the RangeInclusiveMap
2. Find the position of 105 in row_id_segment
3. Suppose the position is 1
4. Take the address at position 1 from address_segment
5. Obtain RowAddress(F10, 2)
The corresponding source code is:
rust/lance-table/src/rowids/index.rs
RowIdIndex::get(row_id)
-> self.0.get(&row_id)
-> row_id_segment.position(row_id)
-> address_segment.get(pos)
-> RowAddress::from(address)
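The two-step lookup (find the covering chunk, then the position inside it) can be imitated with the standard `bisect` module. This is a sketch, not the Rust implementation: it assumes sorted row ids within a chunk and non-overlapping chunks, and the class names are hypothetical.

```python
from bisect import bisect_left, bisect_right


class RowIdChunk:
    def __init__(self, row_ids, addresses):
        # row_ids must be sorted and aligned one-to-one with addresses.
        self.row_ids = row_ids
        self.addresses = addresses
        self.start, self.end = row_ids[0], row_ids[-1]


class SimpleRowIdIndex:
    def __init__(self, chunks):
        self.chunks = sorted(chunks, key=lambda chunk: chunk.start)
        self.starts = [chunk.start for chunk in self.chunks]

    def get(self, row_id):
        # Step 1: find the chunk whose inclusive range covers row_id.
        i = bisect_right(self.starts, row_id) - 1
        if i < 0 or row_id > self.chunks[i].end:
            return None
        chunk = self.chunks[i]
        # Step 2: find the position of row_id inside the chunk.
        pos = bisect_left(chunk.row_ids, row_id)
        if pos == len(chunk.row_ids) or chunk.row_ids[pos] != row_id:
            return None  # inside the range, but the row is deleted or absent
        # Step 3: read the aligned address.
        return chunk.addresses[pos]


index = SimpleRowIdIndex([RowIdChunk([100, 105, 106], [(10, 0), (10, 2), (10, 3)])])
address = index.get(105)
```

A row id that falls inside the covered range but is missing from the segment, such as 101, resolves to nothing rather than to a neighboring row.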
U64Segment is the compressed representation here. It selects different structures based on the shape of the row id sequence:
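Range:
contiguous and sorted, for example 100..200
RangeWithHoles:
mostly contiguous, with a small number of holes
RangeWithBitmap:
mostly contiguous, with a bitmap marking which positions exist
SortedArray:
sorted but fairly sparse
Array:
unordered sequence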
This explains why, with stable row id enabled, Lance does not write a metadata entry for every row. Under normal circumstances, consecutively inserted row ids can be represented as a range; after updates, deletes, and compaction, if gaps or out-of-order entries appear, the representation gradually degrades to forms such as a bitmap or an array.
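How a sequence's shape might drive the choice of representation can be sketched as follows. The two thresholds are hypothetical values chosen for this illustration, and the selection logic is an assumption; the actual rules in `segment.rs` may differ.

```python
MAX_HOLES = 4  # hypothetical cutoff for listing holes explicitly
MIN_BITMAP_DENSITY = 0.5  # hypothetical density needed for a bitmap to pay off


def choose_segment(row_ids):
    """Pick a representation name for a sequence of row ids (illustrative only)."""
    if len(row_ids) < 2:
        return "Range"
    strictly_increasing = all(a < b for a, b in zip(row_ids, row_ids[1:]))
    if not strictly_increasing:
        return "Array"
    span = row_ids[-1] - row_ids[0] + 1
    holes = span - len(row_ids)
    if holes == 0:
        return "Range"
    if holes <= MAX_HOLES:
        return "RangeWithHoles"
    if len(row_ids) / span >= MIN_BITMAP_DENSITY:
        return "RangeWithBitmap"
    return "SortedArray"


fresh_insert = choose_segment(list(range(100, 200)))
after_one_delete = choose_segment([100, 101, 103])
after_many_deletes = choose_segment([i for i in range(100) if i % 10 != 0])
sparse = choose_segment([1, 1000, 1_000_000])
shuffled = choose_segment([5, 3, 9])
```

The progression from the first case to the last mirrors the gradual degradation described above.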
Source code entry:
docs/src/format/table/row_id_lineage.md
Row Address vs Row ID
stable row id behavior
docs/src/format/index/index.md
Stable Row ID for Index
rust/lance/src/dataset/rowids.rs
get_row_id_index
load_row_id_index
rust/lance-table/src/rowids/index.rs
RowIdIndex
FragmentRowIdIndex
rust/lance-table/src/rowids/segment.rs
U64Segment
The benefits of Stable Row ID follow from the design above: indexes point to stable logical row IDs, so after compaction only the row_id -> row_address mapping needs maintaining, which lowers index remap cost.
But it also has costs: an extra stable row id -> row address lookup during queries, and the maintenance of row_id_meta.
Therefore, Stable Row ID is not suitable for being enabled by default. For small tables with one-time imports, infrequent updates, and acceptable index rebuilds, it may not be worth it. For datasets with high index construction costs, frequent compactions, and long-term maintenance requirements, it is more valuable.
Suppose there is a Lance table:
schema:
id: int64
text: string
vector: fixed_size_list<float32>[768]
indices:
vector index on vector
scalar index on id
Initial state:
Manifest v1
Fragment 1: rows 0..999
Fragment 2: rows 1000..1999
Vector index:
fragment_bitmap = {1, 2}
Perform one update:
UPDATE table
SET text = 'new text'
WHERE id = 42;
Processing steps:
1. scan to find the row address of id = 42
2. write the updated new row to Fragment 3
3. write a deletion file for Fragment 1, marking the old offset as deleted
4. commit Operation::Update
5. affected_rows = {(Fragment 1, offset 42)}
If at the same time another transaction deletes id = 43:
T1 affected_rows = {(F1, 42)}
T2 affected_rows = {(F1, 43)}
Both fall into the same fragment, but the rows do not overlap. As long as there is no other data file rewrite, Lance has the opportunity to complete the rebase by merging the deletion vector.
Later, compaction is performed:
Fragment 1:
deleted rows exceed the threshold
Fragment 2:
fewer rows than target_rows_per_fragment
The planner will check:
Are these fragments adjacent?
Do they have the same index coverage?
Is the deletion ratio high enough to materialize on its own?
If the final rewrite is:
old:
Fragment 1
Fragment 2
new:
Fragment 10
Then the index must handle:
fragment_bitmap:
{1, 2} -> {10}
row address:
old addresses -> new addresses
If stable row id is enabled:
the index payload still points to stable row ids
RowIdIndex updates the mapping from stable row id to the new RowAddress
This example covers the relationship between Deletion Vector, Compaction, Index Remap, and Stable Row ID.
Lance's write pipeline is neither in-place updates of traditional databases nor simple append-only logs. It is a versioned columnar dataset write mechanism:
delete makes old rows invisible through a deletion vector. update and merge express modifications by writing new rows or columns and tombstoning old rows. The design pressure on Lance's write pipeline can be summarized as:
After the file rewrite, deletion, version, index, and row identity still need to represent the same batch of logical rows.