VLDB '20 Paper Reading: Mainlining Databases: Supporting Fast Transactional Workloads on Universal Columnar Data File Formats

This paper is mainly about NoisePage, a system from Andy Pavlo's group at CMU.

Paper download: https://db.cs.cmu.edu/papers/2020/p534-li.pdf

Git repository: https://github.com/cmu-db/noisepage

The main technologies involved are Apache Arrow and LLVM.

At first glance it looks like HTAP built on a column store with PAX? (After reading it, that turns out not to be the case.)

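See also this Zhihu write-up on the paper: 「noisepage paper分享：基于column-storage实现的事务存储引擎」 (roughly, "NoisePage paper sharing: a transactional storage engine built on column storage").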

Introduction

If an OLTP DBMS stores data in a format used by downstream applications, the export cost is just the cost of network transmission.

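This is the motivation for HTAP: analyzing OLTP data normally means shipping it to a separate OLAP database, which costs a lot of network bandwidth, whereas an HTAP system can analyze the data in place.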

We leverage the natural cooling process of data, relaxing the columnar format for transactional throughput while the data is hot and transforming data back to the canonical format when write access becomes infrequent.

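This is a trade-off: cold data used for OLAP is kept in the canonical columnar format, while hot data under OLTP uses a relaxed layout that is friendlier to writes. (My first guess was a row layout that gets converted to columnar in memory, but the quote only says the columnar format is relaxed, not replaced.)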

We evaluate the Arrow-based storage engine of NoisePage and demonstrate its OLTP competitiveness and orders of magnitudes faster data export to downstream Arrow applications.

All right, but still, this is not really how Apache Arrow is meant to be used 😂

Background

To better understand this issue, we measured the time it takes to extract data from a DBMS and load it into a Pandas program. We first create a 8 GB CSV file containing the TPC-H LINEITEM table (scalefactor 10, 60M tuples), and then load it into PostgreSQL (v10.6) and SAP HANA (v2.0).

Recent work, however, has shown that column-stores can also support high-performance transactional processing [46, 50].

Let me see what it is that makes a column store suitable for transactional workloads:

Thomas Neumann, Tobias Mühlbauer, and Alfons Kemper. 2015. Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD ’15). 677–689.

Vishal Sikka, Franz Färber, Wolfgang Lehner, Sang Kyun Cha, Thomas Peh, and Christof Bornhövd. 2012. Efficient transaction processing in SAP HANA database: the end of a column store myth. In SIGMOD. 731–742.

Oh, it's the Germans' work, never mind then 😂🤩

Although Arrow’s design targets read-only analytical workloads, its alignment requirement and null bitmaps also benefit write-heavy workloads on fixed-length values.

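In other words, Arrow was designed for read-only analytical workloads, yet its alignment requirement and null bitmaps are said to help write-heavy workloads on fixed-length values too. (Is that how they justify it? 😅)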
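To see why fixed-length values are easy to update in place, here is a minimal sketch of an Arrow-style primitive column: a validity (null) bitmap with one bit per slot plus a contiguous values buffer. The class name `FixedWidthColumn` and its methods are illustrative assumptions, not Arrow's or NoisePage's API; only the layout idea (bit set means non-null, least significant bit first) follows the Arrow format.

```python
import struct


class FixedWidthColumn:
    """Toy int64 column: a validity bitmap plus a contiguous values buffer."""

    WIDTH = 8  # bytes per int64 value

    def __init__(self, num_slots):
        self.num_slots = num_slots
        # One bit per slot, LSB first; 0 means null, so every slot starts null.
        self.validity = bytearray((num_slots + 7) // 8)
        self.values = bytearray(num_slots * self.WIDTH)

    def set(self, i, value):
        # The byte offset of slot i is known from i alone, so a write
        # touches one value and one bit and never shifts other data.
        if value is None:
            self.validity[i // 8] &= ~(1 << (i % 8)) & 0xFF
            return
        struct.pack_into("<q", self.values, i * self.WIDTH, value)
        self.validity[i // 8] |= 1 << (i % 8)

    def get(self, i):
        if not (self.validity[i // 8] >> (i % 8)) & 1:
            return None
        return struct.unpack_from("<q", self.values, i * self.WIDTH)[0]
```

Variable-length values do not have this property, because changing one value's length would shift everything stored after it, which is presumably why the quote restricts the claim to fixed-length values.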

System Overview

The DBMS stores tuple deltas in transaction-local buffers instead of Arrow storage.

Right, because Arrow really is not good at writes.

Transactions interact with Arrow exclusively through the Data Table API that abstracts away the underlying storage.

As I see it, this buffer should live in memory.

We now discuss the concurrency control mechanism of NoisePage on top of our storage architecture. We implement a variant of the Optimistic Concurrency Control protocol

Transactions are implemented with optimistic concurrency control.

The system disallows write-write conflicts to avoid cascading rollbacks.

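So write-write conflicts are ruled out entirely rather than resolved later (reads, of course, never conflict with each other).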
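A minimal sketch of what "disallowing write-write conflicts" can look like. This is my simplified reading, not NoisePage's code: the names `Table`, `Txn`, and `WriteWriteConflict` are hypothetical, and the rule assumed here is that an update fails if the newest version of the tuple was written by another transaction that is still uncommitted or that committed after the updater started. Rollback of the failed transaction is omitted.

```python
import itertools


class WriteWriteConflict(Exception):
    pass


class Txn:
    def __init__(self, start_ts):
        self.start_ts = start_ts
        self.commit_ts = None  # stays None until the transaction commits


class Table:
    def __init__(self):
        self.clock = itertools.count(1)  # shared logical timestamp source
        self.values = {}       # key -> newest value
        self.last_writer = {}  # key -> Txn that installed the newest value

    def begin(self):
        return Txn(next(self.clock))

    def update(self, txn, key, value):
        writer = self.last_writer.get(key)
        if writer is not None and writer is not txn:
            uncommitted = writer.commit_ts is None
            if uncommitted or writer.commit_ts > txn.start_ts:
                # Fail immediately, so nobody ever builds on a version
                # that might later be rolled back.
                raise WriteWriteConflict(key)
        self.values[key] = value
        self.last_writer[key] = txn

    def commit(self, txn):
        txn.commit_ts = next(self.clock)
```

Because the second writer fails right away, no transaction ever depends on another transaction's uncommitted write, which is what makes cascading rollbacks impossible in this sketch.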

[...] the reader makes a copy is insufficient in this scenario as the DBMS can encounter the “A-B-A” problem. That is, an abort might occur between the checks and change the value of the tuple, but the reader cannot observe this through the version pointer.

“The reader cannot observe this through the version pointer.” Well, that is awkward.

With NoisePage’s storage scheme, its transaction engine only reasons about tuple visibility using delta records and the version column. This abstraction comes at a cost for readers, as they are forced to materialize tuples early, which degrades scan performance.

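Hmm, this one is hard to judge. People choose Arrow and Parquet precisely for efficient reads, yet adding transactions, which are not the core use case of these formats, costs scan performance.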
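The cost for readers is easier to see in code. Below is a minimal sketch, assuming a newest-to-oldest chain of delta records that each hold a before-image; `UndoRecord` and `materialize` are hypothetical names, and the real system's record layout and visibility rules are more involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UndoRecord:
    ts: int                               # commit timestamp of the write that created this record
    before_image: dict                    # column -> value before that write
    next: Optional["UndoRecord"] = None   # next older record in the chain


def materialize(latest: dict, head: Optional[UndoRecord], read_ts: int) -> dict:
    """Return the tuple version visible to a reader with timestamp read_ts."""
    # Early materialization: the reader copies the newest version out of the
    # block first, instead of scanning the columns in place.
    row = dict(latest)
    rec = head
    while rec is not None and rec.ts > read_ts:
        row.update(rec.before_image)  # undo a write the reader must not see
        rec = rec.next
    return row
```

Every read of a hot tuple pays for a copy plus a chain walk, which is the scan-performance degradation the quote refers to.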

Separating tuples and transactional metadata introduces another challenge: the system requires globally unique tuple identifiers to associate the two pieces that are not co-located. Physical identifiers (e.g., pointers) are ideal for performance but work poorly with column-stores because a tuple does not physically exist at a single location.

This is a decent explanation of why plain pointers or iterators cannot be used.

To pack both values into a single 64bit value, we use the C++11 keyword alignas to force the system to store all blocks at 1 MB boundaries within its address space.

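C++'s alignas keyword forces every block to start on a 1 MB boundary, so the low 20 bits of a block address are always zero and can hold the tuple's offset within the block.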
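The packing can be sketched with plain integer arithmetic. This is an illustration in Python rather than the C++ the paper uses; addresses are modeled as integers, and the function names are hypothetical.

```python
BLOCK_SIZE = 1 << 20          # 1 MB: every block is assumed to start at a multiple of this
OFFSET_MASK = BLOCK_SIZE - 1  # the low 20 bits of an aligned address are always zero


def pack_tuple_slot(block_addr: int, offset: int) -> int:
    """Combine an aligned block address and an in-block offset into one integer."""
    if block_addr & OFFSET_MASK:
        raise ValueError("block address is not 1 MB aligned")
    if not 0 <= offset < BLOCK_SIZE:
        raise ValueError("offset does not fit in 20 bits")
    return block_addr | offset


def unpack_tuple_slot(slot: int):
    """Split a packed slot back into (block address, offset)."""
    return slot & ~OFFSET_MASK, slot & OFFSET_MASK
```

For example, `pack_tuple_slot(0x7F1234500000, 1234)` still fits in 64 bits, and `unpack_tuple_slot` recovers both parts with two mask operations.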

The garbage collector (GC) [42, 43, 53, 56] is responsible for pruning version chains and freeing the associated memory. …… Because the DBMS stores versioning information in transactions’ buffers, the GC only examines transaction objects.

A GC mechanism takes care of MVCC versions: it examines only transaction objects, finds obsolete versions, and removes them.
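A minimal sketch of the pruning decision, assuming each committed transaction is represented as a dict with a `commit_ts` field (a hypothetical shape) and that its buffer becomes garbage once every active transaction started after it committed. The real GC also has to unlink records from version chains safely before freeing them, which is not modeled here.

```python
def collect_garbage(committed_txns, oldest_active_start_ts):
    """Split committed transactions into (kept, reclaimed).

    A transaction's delta records are only needed by readers that started
    before it committed. Once the oldest active transaction started after
    that commit, no reader can need them any more.
    """
    kept, reclaimed = [], []
    for txn in committed_txns:
        if txn["commit_ts"] < oldest_active_start_ts:
            reclaimed.append(txn)
        else:
            kept.append(txn)
    return kept, reclaimed
```

Note that the function only looks at transaction objects, never at the table itself, which mirrors the point made in the quote.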

This thing looks like a German-style string.

Log records identify tuples on disk using TupleSlots, even though pointers are invalid on reboot. The system maintains a mapping table between old tuple slots to their new physical locations in recovery mode.

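So the pointer-based TupleSlots are still kept, even in the log; recovery simply remaps them to the new physical locations.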
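The mapping table can be sketched as a dictionary built while the log is replayed. The log record shape `(op, old_slot, payload)` and the `allocate_slot` callback are assumptions made for illustration, not NoisePage's log format.

```python
def replay(log_records, allocate_slot):
    """Replay (op, old_slot, payload) records and return the recovered table."""
    slot_map = {}  # TupleSlot recorded in the log -> TupleSlot after restart
    table = {}     # new TupleSlot -> tuple contents
    for op, old_slot, payload in log_records:
        if op == "insert":
            # The old slot embeds a pointer that is meaningless after a
            # reboot, so the tuple gets a fresh slot and the pair is remembered.
            new_slot = allocate_slot()
            slot_map[old_slot] = new_slot
            table[new_slot] = dict(payload)
        elif op == "update":
            table[slot_map[old_slot]].update(payload)
        elif op == "delete":
            del table[slot_map.pop(old_slot)]
    return table
```

Old slots act purely as opaque identifiers during recovery: they are never dereferenced, only looked up in the map.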

Block Transformation

Typical OLTP workloads modify only a small portion of a database at any given time, while the other parts of the database are mostly accessed by read-only queries

This is the trick: since reads dominate and writes are rare, optimistic concurrency control is enough for everyday workloads? 😂😅

Therefore, for the hot portion, we can trade off read speed for write performance at only a small impact on the overall read performance of the DBMS. To achieve this, we modify the Arrow format for update performance in the hot portion. We detail these changes in this subsection

Data is split into cold and hot portions to serve OLAP and OLTP respectively.

A block in NoisePage can be in one of three states – hot, cooling, or frozen. Hot blocks are actively worked on by transactions, whereas frozen blocks are available for in-place scans in the Arrow format; cooling blocks are in the process of being transformed.

Cold (frozen) data is plain Arrow, while hot data is essentially data that still carries transactional metadata.
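The three states quoted above can be sketched as a small state machine. The transition set is my reading of the description (a writer can pull a cooling or frozen block back to hot); the real system performs these transitions with atomic operations and coordinates with the GC, which this single-threaded sketch ignores. `Block` and `ALLOWED` are hypothetical names.

```python
from enum import Enum


class BlockState(Enum):
    HOT = "hot"
    COOLING = "cooling"
    FROZEN = "frozen"


ALLOWED = {
    (BlockState.HOT, BlockState.COOLING),     # the transformer picks up a block that went cold
    (BlockState.COOLING, BlockState.FROZEN),  # the transformation finished undisturbed
    (BlockState.COOLING, BlockState.HOT),     # a writer preempted the transformation
    (BlockState.FROZEN, BlockState.HOT),      # a writer touched a frozen block
}


class Block:
    def __init__(self):
        self.state = BlockState.HOT

    def transition(self, new_state):
        if (self.state, new_state) not in ALLOWED:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

    def on_write(self):
        # Any transactional write makes the block hot again.
        if self.state is not BlockState.HOT:
            self.transition(BlockState.HOT)

    def scannable_in_place(self):
        # Only frozen blocks are guaranteed to be in canonical Arrow format.
        return self.state is BlockState.FROZEN
```

The one transition that is deliberately missing is hot to frozen: a block must pass through cooling so that concurrent writers have a chance to interrupt the transformation.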

We now provide an overview of our transformation algorithm, also illustrated in Fig. 7. There are two components of our transformation pipeline, the access observer and block transformer, shown as boxes with dashed lines in Fig. 7.

The paper then gives an algorithm for the hot-to-cold transition; the transformation works at block granularity.

NoisePage uses a hybrid two-phase approach that combines transactional tuple movement and raw operations under exclusive access, which is orchestrated with a novel multi-stage locking scheme that cooperates with GC to guard against races.

This hot-to-cold mechanism, called the block transformer, is purpose-built.

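In addition, some algorithms are used to compact the space within blocks, so that the gaps left by deleted tuples get filled.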
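A minimal sketch of the compaction idea, using a simple greedy two-pointer pass: tuples from the emptiest blocks are moved into gaps of the fullest blocks so that some blocks become completely empty. This is a stand-in for illustration, not the paper's algorithm, and in the real system each move would be a transactional delete plus insert. `compact` is a hypothetical name and `None` marks an empty slot.

```python
def compact(blocks):
    """Fill gaps in place; return the number of tuples that were moved."""
    # Visit the fullest blocks first so that they are the ones that get filled.
    order = sorted(range(len(blocks)),
                   key=lambda b: sum(v is None for v in blocks[b]))
    slots = [(b, i) for b in order for i in range(len(blocks[b]))]
    lo, hi, moved = 0, len(slots) - 1, 0
    while lo < hi:
        lo_block, lo_idx = slots[lo]
        hi_block, hi_idx = slots[hi]
        if blocks[lo_block][lo_idx] is not None:
            lo += 1   # already occupied, look for the next gap
        elif blocks[hi_block][hi_idx] is None:
            hi -= 1   # nothing to move from here
        else:
            blocks[lo_block][lo_idx] = blocks[hi_block][hi_idx]
            blocks[hi_block][hi_idx] = None
            moved += 1
    return moved
```

After the pass, all tuples sit in a contiguous prefix of the visiting order, so the blocks at the end are empty and could be recycled.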

External Access

Flight enables our DBMS to send a large amount of cold data to the client in a zero-copy fashion

Because the system uses Apache Arrow, it can also export data through Arrow Flight, Arrow's RPC framework built on gRPC.

Firstly, the DBMS loses control over access to its data as the client bypasses its CPU, which makes it difficult to lock the Arrow block to prevent updates.

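With client-side RDMA the transfer bypasses the server's CPU, so the DBMS cannot easily step in to lock a block against concurrent updates.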

Evaluation

Next comes the thrilling benchmark section (just kidding).

The TPC-C benchmark is run.

Throughput is also affected by the processing mode and by whether the data is variable-length or fixed-length.

We last evaluate the DBMS’s ability to export data to an external tool. We compare four methods from Sec. 5 in NoisePage: (1) clientside RDMA, (2) Arrow Flight RPC, (3) vectorized wire protocol from [47], and (4) PostgreSQL wire protocol. We implement (3) and (4) in NoisePage according to their specifications.

Reference [47] here looks interesting:

Mark Raasveldt and Hannes Mühleisen. 2017. Don’t Hold My Data Hostage: A Case for Client Protocol Redesign. Proc. VLDB Endow. 10, 10 (June 2017), 1022–1033.

RDMA performs slightly worse than Arrow Flight with a large number of hot blocks, because Flight has the materialized block in its CPU cache, whereas the NIC bypasses this cache when sending data.

Arrow Flight clearly holds up well here.

Universal Storage Format

Systems such as Apache Hive [6], Apache Impala [7], Dremio [12], and OmniSci [17] support data ingestion from universal storage formats to lower the data transformation cost.

Big data platforms are basically the territory of Apache Parquet, ORC, and Arrow.

There are also Apache Kudu and Databricks' Delta Lake engine.

OLTP on Column-Stores

PAX inevitably comes up here. PAX stores data column by column within each page, yet it can serve HTAP workloads effectively.

HYRISE, SAP HANA, and SingleStore are also capable of this.

And then there is HyPer.

NoisePage's predecessor, Peloton, also belongs to this category.

Thoughts

I originally thought this paper would cover query compilation, but it apparently does not; for that topic, Umbra is the one to read.

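As for HTAP: with OLAP and OLTP both heavily studied, HTAP is the area that still lacks research. NoisePage does support transactions, but with optimistic concurrency control, which in my view copes poorly with very heavy write workloads; I find that hard to accept.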

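This is probably also an early attempt at the self-driving database that Andy envisions. Very interesting, but whether it is actually practical, I am not so sure.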

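If there is one thing I do like, it is building a database on a universal data format. Compared with Arrow, though, I am hoping for a universal format that supports OLTP.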

Arrow Flight, which this paper mentions, looks good; its performance is comparable to RDMA, and I may look into it when I have time.
