惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

SecWiki News
SecWiki News
I
InfoQ
The Cloudflare Blog
人人都是产品经理
人人都是产品经理
博客园 - Franky
T
Tailwind CSS Blog
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
量子位
博客园_首页
罗磊的独立博客
V
V2EX
李成银的技术随笔
大猫的无限游戏
大猫的无限游戏
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
T
True Tiger Recordings
Vercel News
Vercel News
Cyberwarzone
Cyberwarzone
Cisco Talos Blog
Cisco Talos Blog
F
Fox-IT International blog
D
Darknet – Hacking Tools, Hacker News & Cyber Security
M
Microsoft Research Blog - Microsoft Research
Know Your Adversary
Know Your Adversary
爱范儿
爱范儿
The Register - Security
The Register - Security
G
Google Developers Blog
The Hacker News
The Hacker News
Malwarebytes
Malwarebytes
S
Securelist
博客园 - 三生石上(FineUI控件)
Jina AI
Jina AI
T
Threat Research - Cisco Blogs
T
The Exploit Database - CXSecurity.com
S
SegmentFault 最新的问题
博客园 - 叶小钗
F
Fortinet All Blogs
Apple Machine Learning Research
Apple Machine Learning Research
宝玉的分享
宝玉的分享
博客园 - 聂微东
T
Threatpost
博客园 - 【当耐特】
D
Docker
P
Privacy & Cybersecurity Law Blog
www.infosecurity-magazine.com
www.infosecurity-magazine.com
G
GRAHAM CLULEY
V
Visual Studio Blog
C
Cisco Blogs
IT之家
IT之家
S
Security Archives - TechRepublic
Latest news
Latest news
阮一峰的网络日志
阮一峰的网络日志

Mox的笔记库

细嗦下MLIR的环境搭建 | Mox的笔记库 博客重构:从Hexo到Astro | Mox的笔记库 2026PPoPP MLIR Tutorial学习 | Mox的笔记库 MacOS配置《明日方舟:终末地》 | Mox的笔记库 2025:向内生长 | Mox的笔记库 由mlir::ExecutionEngine引发的跨系统问题 | Mox的笔记库 WSL2配置Cuda-Tile环境记录(未完待续) | Mox的笔记库 Vibe Coding手搓项目记录 | Mox的笔记库 给Debian上包——以DuckDB为例 | Mox的笔记库 UCPD.sys事件存档 | Mox的笔记库 换新电脑之Mac mini M4从购买到配置 | Mox的笔记库 Mac配置MLX-C开发环境 | Mox的笔记库 RISC-V meets RDBMS——RISC-V架构上可运行数据库一览 | Mox的笔记库 DuckDB Sort实现调查 | Mox的笔记库 修复Redis在树莓派5上无法运行的问题 | Mox的笔记库 如何在MLIR中自定义类型并且输出运行 | Mox的笔记库 网站网络结构变更记录 | Mox的笔记库 EDBT25论文阅读:PhoebeDB——A Disk-Based RDBMS Kernel for High-Performance and Cost-Effective OLTP SIGMOD25论文阅读:BPF-DB:——A Kernel-Embedded Transactional Database Management System For eBPF Applications SIGMOD24文章阅读:Query Compilation Without Regrets | Mox的笔记库 论文阅读:Designing an Open Framework for Query Optimization and Compilation Apache Arrow Gandiva项目解析 | Mox的笔记库 VLDB24论文阅读:Cloud-Native Database Systems and Unikernels——Reimagining OS Abstractions for Modern Hardware NoisePage源码分析(未完待续) | Mox的笔记库 VLDB17论文阅读:Relaxed Operator Fusion for In-Memory Databases:Making Compilation, Vectorization, and Prefetching Work Together At Last 论文阅读:How not to structure your database-backed web applications——a study of performance bugs in the wild SIGMOD24阅读:ROME——Robust Query Optimization via Parallel Multi-Plan Execution 文章阅读:First Past the Post-Evaluating Query Optimization in MongoDB SIGMOD文章阅读:Apache Calcite——A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources VLDB23论文阅读:Analyzing the Impact of Cardinality Estimation on Execution Plans in Microsoft SQL Server SIGMOD22论文阅读:Efficient Massively Parallel Join Optimization for Large Queries VLDB论文阅读:Weaving Relations for Cache Performance VLDB22论文阅读:ConnectorX——Accelerating Data Loading From Databases to Dataframes 论文阅读:UniKraft-Fast, Specialized Unikernels the Easy Way 当DuckDB遇上RISC-V | Mox的笔记库 SIGMOD25论文阅读:An Elephant Under The Microscope——Analyzing The Interaction Of Optimizer Components In PostgreSQL 论文阅读:Compile-Time Analysis of Compiler Frameworks for Query Compilation VLDB23阅读:Bringing Compiling Databases to RISC Architectures LingoDB源码编译与分析 | Mox的笔记库 淦!MLIR输出Hello World不应该这么难! | Mox的笔记库 如何愉快的运行一个MLIR程序 | Mox的笔记库 2024:拥挤年代的想象与创造 | Mox的笔记库 如何给自己的博客添加MLIR和LLVM IR语法高亮 | Mox的笔记库 VLDB19-Parsing Gigabytes of JSON per Second论文阅读 CIDR25:Runtime-Extensible Parsers阅读 | Mox的笔记库 MLIR学习资料整理 | Mox的笔记库 SIGMOD24文章阅读:VeriTxn | Mox的笔记库 VLDB23文章阅读——Exploiting Cloud Object Storage for High-Performance Analytics VLDB24——OLAP on Modern Chiplet-Based Processors走马观花阅读 VLDB22:YeSQL文章阅读(已废弃) | Mox的笔记库 如何让数据库中的Python跑的更快-VLDB22-YeSQL文章阅读 | Mox的笔记库 你好,世界! | Mox的笔记库 让系统研究更有意义:HarmonyOS NEXT的教训和经验——讲座回顾 | Mox的笔记库 UNSW 24T3 COMP9336上课记录 | Mox的笔记库 Velox开发环境配置踩坑记录 | Mox的笔记库 MLIR Toy Tutorial实践记录 | Mox的笔记库 论文阅读:Declarative Sub-Operators for Universal Data Processing LLVM-Kaleidoscope实操踩坑记录 | Mox的笔记库 2024年7月RSSHub开发体验 | Mox的笔记库 澳洲大学计算机硕士比较 | Mox的笔记库 论文阅读——CDUL:CLIP-Driven Unsupervised Learning for Multi-Label Image Classification 论批量快速添加图片与视频水印的事 | Mox的笔记库 CVPR2023-CLIP算法调研 | Mox的笔记库 基于元信息写入的服务器压力测试 | Mox的笔记库 MjAyMw==,希望,前进与平庸之道 | Mox的笔记库 家庭组网IPv6+Mesh折腾 | Mox的笔记库 code-server初体验 | Mox的笔记库 从Nginx到Caddy | Mox的笔记库 Hexo部署安装全流程回顾 | Mox的笔记库 RMM观察与初探 | Mox的笔记库 计算机网络课设——UDP/TCP/TLS Socket实验 | Mox的笔记库 JQuery的XSS初探 | Mox的笔记库 生产实习记录 | Mox的笔记库 Fedora-CoreOS配置与试用(2023年) | Mox的笔记库 Electron学习笔记 | Mox的笔记库 ServerSentEvent学习 | Mox的笔记库 报告翻译:容器云的安全挑战 | Mox的笔记库 Arch Linux迁移计划 | Mox的笔记库 Vagrant配置Metarget靶场环境 | Mox的笔记库 OpenAI-whisper折腾 | Mox的笔记库 202202,困惑,混乱与未曾设想之路 | Mox的笔记库 2022年Hack the box:Tier1免费区全解 | Mox的笔记库 Navidrome部署记录 | Mox的笔记库 长安杯2021-snake复现 | Mox的笔记库 报告概要翻译:OBFUSCATING C++ PROGRAMS VIA CONTROL FLOW FLATTENING 从零开始的Django CVE-2022-28346复现 | Mox的笔记库 2022CISCN(西北区赛)-The shinning | Mox的笔记库 Docker+QEMU+Arm64(Ubuntu)+环境配置(2022版) | Mox的笔记库 Arch Linux运行树莓派系统(2022年) | Mox的笔记库 2022CISCN初赛-ez_usb-复盘WriteUp | Mox的笔记库 NodeMCU-MicroPython配置实录 | Mox的笔记库 Django事务使用 | Mox的笔记库 记录第一次EduSRC上报 | Mox的笔记库 Jetbrain问题应急处理 | Mox的笔记库 Celery5.2学习&配置 | Mox的笔记库 Waline部署记录 | Mox的笔记库 2021年12月 Vivo千镜杯回顾 | Mox的笔记库 Frida hook初次实战 | Mox的笔记库 Log4j2漏洞复现 | Mox的笔记库 Windows的WSL2+Docker初探 | Mox的笔记库
VLDB20论文阅读:Mainlining Databases——Supporting Fast Transactional Workloads on Universal Columnar Data File Formats
2025-06-21 · via Mox的笔记库

这篇文章主要关于NoisePage,是CMU Andy团队的工作

论文下载地址:https://db.cs.cmu.edu/papers/2020/p534-li.pdf

Git仓库:https://github.com/cmu-db/noisepage

主要涉及技术有Apache Arrow和LLVM

感觉好像是用列存使用PAX做HTAP?(读完以后发现其实并不是)

可以看看这位知乎老哥关于这篇文章的评析:noisepage paper分享:基于column-storage实现的事务存储引擎

Introduction

If an OLTP DBMS stores data in a format used by downstream applications, the export cost is just the cost of network transmission.

这就是要做HTAP的原因:OLTP需要进行分析的话,需要大量的网络带宽转到OLAP数据库,HTAP的话就可以就地解决

We leverage the natural cooling process of data, relaxing the columnar format for transactional throughput while the data is hot and transforming data back to the canonical format when write access becomes infrequent.

这里做了一个TradeOff,数据OLAP的时候是列状,进行OLTP的时候是行状(我猜应该是在内存中转为列状)

We evaluate the Arrow-based storage engine of NoisePage and demonstrate its OLTP competitiveness and orders of magnitudes faster data export to downstream Arrow applications.

All right,虽然但是,这并不是Apache Arrow的正确用法😂

BackGround

To better understand this issue, we measured the time it takes to extract data from a DBMS and load it into a Pandas program. We first create a 8 GB CSV file containing the TPC-H LINEITEM table (scalefactor 10, 60M tuples), and then load it into PostgreSQL (v10.6) and SAP HANA (v2.0).

image-20250621100358584

Recent work, however, has shown that column-stores can also support high-performance transactional processing [46, 50].

让我瞅瞅,是什么玩意,让列存也能适合TP类操作?

Thomas Neumann, Tobias Mühlbauer, and Alfons Kemper. 2015. Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD ’15). 677–689.

Vishal Sikka, Franz Färber, Wolfgang Lehner, Sang Kyun Cha, Thomas Peh, and Christof Bornhövd. 2012. Efficient transaction processing in SAP HANA database: the end of a column store myth. In SIGMOD. 731–742.

哦,德国人的工作,那没事了😂🤩

Although Arrow’s design targets read-only analytical workloads, its alignment requirement and null bitmaps also benefit write-heavy workloads on fixed-length values.

“尽管Arrow的设计目标是仅读取的分析工作负载,但其对齐要求和无效位图也使固定长度值的写入工作负载也有益”(这么找补的么😅)

System Overview

The DBMS stores tuple deltas in transaction-local buffers instead of Arrow storage.

对的,因为Arrow的写入确实不太行

Transactions interact with Arrow exclusively through the Data Table API that abstracts away the underlying storage.

image-20250621145706313

依我看,这个Buffer应该是在内存当中

We now discuss the concurrency control mechanism of NoisePage on top of our storage architecture. We implement a variant of the Optimistic Concurrency Control protocol

使用乐观并发锁是实现事务

The system disallows write-write conflicts to avoid cascading rollbacks.

甚至读-读冲突都需要避免

he reader makes a copy is insufficient in this scenario as the DBMS can encounter the “A-B-A” problem. That is, an abort might occur between the checks and change the value of the tuple, but the reader cannot observe this through the version pointer.

“Reader无法通过版本指针观察这一点”,啊这

With NoisePage’s storage scheme, its transaction engine only reasons about tuple visibility using delta records and the version column. This abstraction comes at a cost for readers, as they are forced to materialize tuples early, which degrades scan performance.

emmm,这就有些难评了,大家选择Arrow和Parquet不就是为了高效读取,但增加一个并不是核心的Transaction却需要降低性能

Separating tuples and transactional metadata introduces another challenge: the system requires globally unique tuple identifiers to associate the two pieces that are not co-located. Physical identifiers (e.g., pointers) are ideal for performance but work poorly with column-stores because a tuple does not physically exist at a single location.

这个对于不能使用指针/迭代器的解释还不错

To pack both values into a single 64bit value, we use the C++11 keyword alignas to force the system to store all blocks at 1 MB boundaries within its address space.

C++alignas关键实现构建Blocks为1MB大小

The garbage collector (GC) [42, 43, 53, 56] is responsible for pruning version chains and freeing the associated memory. …… Because the DBMS stores versioning information in transactions’ buffers, the GC only examines transaction objects.

设计了一种GC机制用于解决MVCC的版本,且仅检查事务对象,检查过时的版本并删除

image-20250621192041388

这玩意好像是German-Style String

Log records identify tuples on disk using TupleSlots, even though pointers are invalid on reboot. The system maintains a mapping table between old tuple slots to their new physical locations in recovery mode.

在行存的时候,还是保留了指针

Block Transformation

Typical OLTP workloads modify only a small portion of a database at any given time, while the other parts of the database are mostly accessed by read-only queries

这个其实就很Trick:既然读多写少,那乐观锁解决些日常就足够了?😂😅

Therefore, for the hot portion, we can trade off read speed for write performance at only a small impact on the overall read performance of the DBMS. To achieve this, we modify the Arrow format for update performance in the hot portion. We detail these changes in this subsection

image-20250621195705085

将数据分为冷数据和热数据,以适应OLTP和OLAP的情况

A block in NoisePage can be in one of three states – hot, cooling, or frozen. Hot blocks are actively worked on by transactions, whereas frozen blocks are available for in-place scans in the Arrow format; cooling blocks are in the process of being transformed.

冷数据相当于全是Arrow,热数据相等于就是带有Transaction的Metadata,

image-20250621202943182

We now provide an overview of our transformation algorithm, also illustrated in Fig. 7. There are two components of our transformation pipeline, the access observer and block transformer, shown as boxes with dashed lines in Fig. 7.

下面还给了一套用于冷热切换的算法,转换是以块为单位

image-20250621204016979

NoisePage uses a hybrid two-phase approach that combines transactional tuple movement and raw operations under exclusive access, which is orchestrated with a novel multi-stage locking scheme that cooperates with GC to guard against races.

这一套冷热切换机制,被称为Block Transformer,是经过特别设计过的

此外,还用到了些算法,用于压缩块内空间

image-20250621205940555

External access

Flight enables our DBMS to send a large amount of cold data to the client in a zero-copy fashion

由于使用了Apache Arrow,也因此可以使用Apache Flight实现gRPC

Firstly, the DBMS loses control over access to its data as the client bypasses its CPU, which makes it difficult to lock the Arrow block to prevent updates.

如果使用RDMA的话,CPU则很难进行运算

Evaluation

接下来是紧张刺激的BenchMark环节(bushi

image-20250621212734239

运行TPC-C测验

image-20250621213605269

吞吐量还受到不同的处理模式,以及变长,固定长度数据的影响

We last evaluate the DBMS’s ability to export data to an external tool. We compare four methods from Sec. 5 in NoisePage: (1) clientside RDMA, (2) Arrow Flight RPC, (3) vectorized wire protocol from [47], and (4) PostgreSQL wire protocol. We implement (3) and (4) in NoisePage according to their specifications.

这里的47,感觉会有意思

Mark Raasveldt and Hannes Mühleisen. 2017. Don’t Hold My Data Hostage: A Case for Client Protocol Redesign. Proc. VLDB Endow. 10, 10 (June 2017), 1022–1033.

image-20250621214733904

RDMA performs slightly worse than Arrow Flight with a large number of hot blocks, because Flight has the materialized block in its CPU cache, whereas the NIC bypasses this cache when sending data.

可以看到Arrow Flight效果不错

Universal storage Format

Systems such as Apache Hive [6], Apache Impala [7], Dremio [12], and OmniSci [17] support data ingestion from universal storage formats to lower the data transformation cost.

大数据平台基本都是Apache Parquet,ORC,Arrow的地盘

两外还有Apache Kudu和Databricks’ Delta Lake engine

OLTP on Column-Stores

这里就不可避免的提到PAX,PAX是列存,但能有效HTAP

此外HYRISE,SPA HANA以及Single Store都有能力做到

还有就是Hyper

NoisePage的前生Pleoton也是这个类型

思考

我当时以为这文章涉及Query Compilation,但似乎并没有,想要这方面内容需要看Umbra

另外就是HTAP的概念,只能说在OLAP和OLTP大量研究的当下,就只剩HTAP还缺乏研究。NoisePage虽然支持Transaction,但却是乐观锁,无法应对海量操作,这其实并不是很能接受。

这应该也是Andy理念的中的Self-Driving DataBase的首次尝试,很有意思,但真要说好用,我认为是未必

如果还有什么比较中意的地方就是使用通用数据格式构建数据库——相比较于Arrow,我期待一种能支持OLTP的的通用格式

而这篇文章提到的Apache Flight感觉不错,效果能和RDMA比一比,可能有空会去看看