惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

SecWiki News
SecWiki News
I
InfoQ
The Cloudflare Blog
人人都是产品经理
人人都是产品经理
博客园 - Franky
T
Tailwind CSS Blog
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
量子位
博客园_首页
罗磊的独立博客
V
V2EX
李成银的技术随笔
大猫的无限游戏
大猫的无限游戏
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
T
True Tiger Recordings
Vercel News
Vercel News
Cyberwarzone
Cyberwarzone
Cisco Talos Blog
Cisco Talos Blog
F
Fox-IT International blog
D
Darknet – Hacking Tools, Hacker News & Cyber Security
M
Microsoft Research Blog - Microsoft Research
Know Your Adversary
Know Your Adversary
爱范儿
爱范儿
The Register - Security
The Register - Security
G
Google Developers Blog
The Hacker News
The Hacker News
Malwarebytes
Malwarebytes
S
Securelist
博客园 - 三生石上(FineUI控件)
Jina AI
Jina AI
T
Threat Research - Cisco Blogs
T
The Exploit Database - CXSecurity.com
S
SegmentFault 最新的问题
博客园 - 叶小钗
F
Fortinet All Blogs
Apple Machine Learning Research
Apple Machine Learning Research
宝玉的分享
宝玉的分享
博客园 - 聂微东
T
Threatpost
博客园 - 【当耐特】
D
Docker
P
Privacy & Cybersecurity Law Blog
www.infosecurity-magazine.com
www.infosecurity-magazine.com
G
GRAHAM CLULEY
V
Visual Studio Blog
C
Cisco Blogs
IT之家
IT之家
S
Security Archives - TechRepublic
Latest news
Latest news
阮一峰的网络日志
阮一峰的网络日志

Mox的笔记库

细嗦下MLIR的环境搭建 | Mox的笔记库 博客重构:从Hexo到Astro | Mox的笔记库 2026PPoPP MLIR Tutorial学习 | Mox的笔记库 MacOS配置《明日方舟:终末地》 | Mox的笔记库 2025:向内生长 | Mox的笔记库 由mlir::ExecutionEngine引发的跨系统问题 | Mox的笔记库 WSL2配置Cuda-Tile环境记录(未完待续) | Mox的笔记库 Vibe Coding手搓项目记录 | Mox的笔记库 给Debian上包——以DuckDB为例 | Mox的笔记库 UCPD.sys事件存档 | Mox的笔记库 换新电脑之Mac mini M4从购买到配置 | Mox的笔记库 Mac配置MLX-C开发环境 | Mox的笔记库 RISC-V meets RDBMS——RISC-V架构上可运行数据库一览 | Mox的笔记库 DuckDB Sort实现调查 | Mox的笔记库 修复Redis在树莓派5上无法运行的问题 | Mox的笔记库 如何在MLIR中自定义类型并且输出运行 | Mox的笔记库 网站网络结构变更记录 | Mox的笔记库 EDBT25论文阅读:PhoebeDB——A Disk-Based RDBMS Kernel for High-Performance and Cost-Effective OLTP SIGMOD25论文阅读:BPF-DB:——A Kernel-Embedded Transactional Database Management System For eBPF Applications SIGMOD24文章阅读:Query Compilation Without Regrets | Mox的笔记库 论文阅读:Designing an Open Framework for Query Optimization and Compilation Apache Arrow Gandiva项目解析 | Mox的笔记库 VLDB24论文阅读:Cloud-Native Database Systems and Unikernels——Reimagining OS Abstractions for Modern Hardware NoisePage源码分析(未完待续) | Mox的笔记库 VLDB20论文阅读:Mainlining Databases——Supporting Fast Transactional Workloads on Universal Columnar Data File Formats 论文阅读:How not to structure your database-backed web applications——a study of performance bugs in the wild SIGMOD24阅读:ROME——Robust Query Optimization via Parallel Multi-Plan Execution 文章阅读:First Past the Post-Evaluating Query Optimization in MongoDB SIGMOD文章阅读:Apache Calcite——A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources VLDB23论文阅读:Analyzing the Impact of Cardinality Estimation on Execution Plans in Microsoft SQL Server SIGMOD22论文阅读:Efficient Massively Parallel Join Optimization for Large Queries VLDB论文阅读:Weaving Relations for Cache Performance VLDB22论文阅读:ConnectorX——Accelerating Data Loading From Databases to Dataframes 论文阅读:UniKraft-Fast, Specialized Unikernels the Easy Way 当DuckDB遇上RISC-V | Mox的笔记库 SIGMOD25论文阅读:An Elephant Under The Microscope——Analyzing The Interaction Of Optimizer Components In PostgreSQL 论文阅读:Compile-Time Analysis of Compiler Frameworks for Query Compilation VLDB23阅读:Bringing Compiling Databases to RISC Architectures LingoDB源码编译与分析 | Mox的笔记库 淦!MLIR输出Hello World不应该这么难! | Mox的笔记库 如何愉快的运行一个MLIR程序 | Mox的笔记库 2024:拥挤年代的想象与创造 | Mox的笔记库 如何给自己的博客添加MLIR和LLVM IR语法高亮 | Mox的笔记库 VLDB19-Parsing Gigabytes of JSON per Second论文阅读 CIDR25:Runtime-Extensible Parsers阅读 | Mox的笔记库 MLIR学习资料整理 | Mox的笔记库 SIGMOD24文章阅读:VeriTxn | Mox的笔记库 VLDB23文章阅读——Exploiting Cloud Object Storage for High-Performance Analytics VLDB24——OLAP on Modern Chiplet-Based Processors走马观花阅读 VLDB22:YeSQL文章阅读(已废弃) | Mox的笔记库 如何让数据库中的Python跑的更快-VLDB22-YeSQL文章阅读 | Mox的笔记库 你好,世界! | Mox的笔记库 让系统研究更有意义:HarmonyOS NEXT的教训和经验——讲座回顾 | Mox的笔记库 UNSW 24T3 COMP9336上课记录 | Mox的笔记库 Velox开发环境配置踩坑记录 | Mox的笔记库 MLIR Toy Tutorial实践记录 | Mox的笔记库 论文阅读:Declarative Sub-Operators for Universal Data Processing LLVM-Kaleidoscope实操踩坑记录 | Mox的笔记库 2024年7月RSSHub开发体验 | Mox的笔记库 澳洲大学计算机硕士比较 | Mox的笔记库 论文阅读——CDUL:CLIP-Driven Unsupervised Learning for Multi-Label Image Classification 论批量快速添加图片与视频水印的事 | Mox的笔记库 CVPR2023-CLIP算法调研 | Mox的笔记库 基于元信息写入的服务器压力测试 | Mox的笔记库 MjAyMw==,希望,前进与平庸之道 | Mox的笔记库 家庭组网IPv6+Mesh折腾 | Mox的笔记库 code-server初体验 | Mox的笔记库 从Nginx到Caddy | Mox的笔记库 Hexo部署安装全流程回顾 | Mox的笔记库 RMM观察与初探 | Mox的笔记库 计算机网络课设——UDP/TCP/TLS Socket实验 | Mox的笔记库 JQuery的XSS初探 | Mox的笔记库 生产实习记录 | Mox的笔记库 Fedora-CoreOS配置与试用(2023年) | Mox的笔记库 Electron学习笔记 | Mox的笔记库 ServerSentEvent学习 | Mox的笔记库 报告翻译:容器云的安全挑战 | Mox的笔记库 Arch Linux迁移计划 | Mox的笔记库 Vagrant配置Metarget靶场环境 | Mox的笔记库 OpenAI-whisper折腾 | Mox的笔记库 202202,困惑,混乱与未曾设想之路 | Mox的笔记库 2022年Hack the box:Tier1免费区全解 | Mox的笔记库 Navidrome部署记录 | Mox的笔记库 长安杯2021-snake复现 | Mox的笔记库 报告概要翻译:OBFUSCATING C++ PROGRAMS VIA CONTROL FLOW FLATTENING 从零开始的Django CVE-2022-28346复现 | Mox的笔记库 2022CISCN(西北区赛)-The shinning | Mox的笔记库 Docker+QEMU+Arm64(Ubuntu)+环境配置(2022版) | Mox的笔记库 Arch Linux运行树莓派系统(2022年) | Mox的笔记库 2022CISCN初赛-ez_usb-复盘WriteUp | Mox的笔记库 NodeMCU-MicroPython配置实录 | Mox的笔记库 Django事务使用 | Mox的笔记库 记录第一次EduSRC上报 | Mox的笔记库 Jetbrain问题应急处理 | Mox的笔记库 Celery5.2学习&配置 | Mox的笔记库 Waline部署记录 | Mox的笔记库 2021年12月 Vivo千镜杯回顾 | Mox的笔记库 Frida hook初次实战 | Mox的笔记库 Log4j2漏洞复现 | Mox的笔记库 Windows的WSL2+Docker初探 | Mox的笔记库
VLDB17论文阅读:Relaxed Operator Fusion for In-Memory Databases:Making Compilation, Vectorization, and Prefetching Work Together At Last
2025-05-23 · via Mox的笔记库

Andy团队出的,也算是非常经典的作品

Because disk accesses are no longer the principle bottleneck in such systems, the focus in designing query execution engines has shifted to optimizing CPU performance.

这个判断奠定了未来(哪怕是现在2025年)数据库开发的基调

文章测试使用的数据库为Peloton(已于2019年放弃)

Introduction

Given this trend, we anticipate that most future OLAP applications will use in-memory DBMSs. Because in-memory DBMSs are designed such that the bulk of the working dataset fits in memory,

就这么说吧,大量使用Memory是本世纪20年代后的数据库研究的主要特征😋(吐槽下,这点内存跟LLM比还是少很多的好吧)

A key challenge for both software prefetching and SIMD vectorization is that neither technique works well in a tuple-at-a-time model (i.e., Volcano [19]).

SIMD也是,数据库/大数据处理平台也就是从那时起开始重视这一块(要知道,SIMD最早适用于音视频处理加速的,但数据库大规模使用列存后基本都会用SIMD)

Until now, no DBMS has taken a holistic view on how to use all of the above techniques effectively together throughout the entire query plan.

这可太对了

Our experimental results show that ROF improves the performance of the DBMS for OLAP queries by up to 2.2X

两倍么,那还行

Background

To illustrate this approach, consider the plan shown in Fig. 2a for the TPC-H query in Fig. 1 annotated with its three pipelines P1, P2, and P3.

image-20250522223331051

Figure1是TPC-H 19(感兴趣的朋友可以手写下代码,我用LLM写的几个版本在我的开发板上运行,要么运行超时,要么Out-Of-Memory被系统强行kill掉),他这里拆了三条流水线,有意思

The DBMS’s optimizer generates one loop for every pipeline. Each loop iteration begins by reading a tuple from either a base table or an intermediate data structure (e.g., a hash-table used for a join).

这里使用的还是传统的火山模型

Combining the execution of multiple operators within a single loop iteration is known as operator fusion. Fusing together pipeline operators in this manner obviates the need to materialize tuple data to memory, and can instead pass tuple attributes through CPU registers, or the stack which will likely be in the cache.

可以在内存中实现算子融合

Returning to our example, Fig. 2c shows the colored blocks of code that correspond to the identically-colored pipeline in the query execution plan in Fig. 2a. The first block of code (P1) performs a scan on the Part table and the build-phase of the join. The second block (P2) scans the LineItem table, performs the join with Part, and computes the aggregate function.

Looks nice

A vectorized engine relies on the compiler to automatically detect loops that can be converted to SIMD, but modern compilers are only able to optimize simple loops involving computation over numeric columns.

这里在说自动向量化效果不行(但DuckDB后面采取的,就是向量自动化的这一种方式)

Fig. 3 shows pseudo-code for TPC-H Q19 using the vectorized processing model. First, P1 is almost identical to its counterpart in Fig. 2c, except tuples are read from Part in blocks. Moreover, since vectorized DBMSs employ late materialization, only the joinkey attribute and the corresponding tuple ID is stored in the hashtable. P2 reads blocks of tuples from LineItem and passes them through the predicate.

image-20250522230036930

这图也很棒,相当于增加了用于SIMD的Block用于Join

2.3讲的是数据预取(Prefetch)但这个方案更多依赖CPU的ISA,最多能做的就是给编译器进行Hint,所以这块就不贴了

Problem Overview

In the previous section, we described how query compilation, vectorized processing, and prefetching can optimize query execution for in-memory DBMSs. A natural question to ask is whether it is possible to combine these techniques into a single query processing model that embodies the benefits of them all?

将SIMD, Query Compilation和Prefetch结合起来到一个Query Processing Model?一个很灵魂的发问——文章下面给的回复是当时没有数据库做到

Part of the reason is that most systems that employ query compilation generate tuple-at-a-time code that avoids tuple materialization

火山模型基本都是每个Tuple进行LazyLoad,都会避免Tuple Materialization(如果用PushModel这些问题应该可以得到解决)

Hence, we contend that what is needed is a hybrid model that is able to tactfully materialize state to support both tuple-at-a-time processing, vectorized processing, and prefetching.

尝试做一个混合模型(Hybrid Model)验证下三混一行不行

image-20250523120722245

From the results shown in Fig. 4, we see that SIMD probes utilizing the vertical vectorization technique performs worse than tuple-ata-time probes with prefetching, even when the hash-table is cacheresident. This is because vectorization requires recomputing hash values on hash and key collisions.

用SIMD遇到了一些麻烦——有些HashKey需要重新计算,而且哈希表超过CPU缓存时,性能显著下降

其实这块有点没看懂,但结论是这三个方案合在一起需要进行trade off,中间流水线需要物化,有些数据选取需要动态选择

Example

这章主要讲文章的主要贡献ROF这个方案,这部分用的2b的图在前边

There are two key distinguishing characteristics between ROF and traditional vectorized processing; with the exception of the last stage iteration, ROF stages always deliver a full vector of input to the next stage in the pipeline, unlike vectorized processing that may deliver input vectors restricted by a selection vector.

emmm,没看懂😂他说的这个ROF是否就是Push-Model?或者应该就是存粹带有Pipeline的PushModel

image-20250523145711468

Fig. 2b shows a modified query plan using our ROF technique to introduce a single stage boundary after the first predicate ( 1). The ⌅ operator denotes an output vector that represents the boundary between stages.

boundary图标复制下来好像有问题,到是好奇Boundary是否需要什么特殊代码

Staging alone does not provide many benefits; however, it facilities two optimizations not possible otherwise: SIMD vectorization, and prefetching of non-cache-resident data.

OK,是为SIMD和Prefetch做准备

Vectorization

Given this, we greedily force a stage boundary on all SIMD operator outputs. The advantages of this are (1) SIMD operators always deliver a 100% full vector of only valid tuples, (2) it frees subsequent operators from performing validity checks, and (3) the DBMS can generate tight loops that are exclusively SIMD. We now describe how to implement a SIMD scan using these boundaries.

文章给了两种使用SIMD的方案,而最终选择了在Boundary上输出SIMD结果(原因是更容易优化)

The illustration in Fig. 6 is performing a SIMD scan over a 4-byte integer column attr_A and evaluating the predicate attr_A < 44. The DBMS first loads as many attribute values as possible (along with their tuple IDs) into the widest available SIMD register.

image-20250523162526682

使用bimask进行比较,然后predict,这里给的示例应该是AVX256

The permutation table stores an 8-byte value for each possible input bit-mask. A SIMD register storing n elements can produce 2n masks. With AVX2 256-bit registers operating on eight 4-byte integers, this results in 28 = 256 possible bit-masks. Thus, the size of the largest permutation table is at most 28 ⇥ 8 = 2 KB, small enough to fit in the CPU’s L1 cache.

这里感觉有点偷鸡😂那如果我是AVX512,有16个4-bytes,$2^{16} \times 16$=1MB,那L1缓存不就炸了?如果你考虑下RVV1.0那最长的2048bit的宽度那更不可能了

Prefetch

We use an open-addressing hash-table design with linear probing for hash and key-collisions. Previous work has shown this design to be robust and cache-friendly [35]. We use MurmurHash3 [9] as our primary hash function. This is differs from previous work [33] that prefers to use computationally simpler (and therefore faster) hash functions, such as multiply-add-shift.

重新设计了Hash Table,使用Murmurhash3算法,好处如下

  1. work on multiple different non-integer data types,
  2. provide a diverse hash distribution, and
  3. execute fast. MurmurHash3 satisfies these requirements and used in many popular systems

我在好多论文上都看到MurmurHash3

Our hash-table design is amenable to both software and hardware prefetching.

Since joins and aggregations operate on tuple vectors, software prefeteching will speed up the initial hash-table probe.

Secondly, the hardware prefetcher kicks in to accelerate the linear probing search sequence for hash collisions. By front-loading the status field and the key, the DBMS tries to ensure that at most one memory reference to a bucket is necessary to check both if the bucket is occupied and if the keys match.

相当于在Hash Table上进行Prefetch

其实我觉得这块是很Ticky的,你要控制一个CPU的缓存Prefetch,要么看编译器优化,要么修改硬件,而这一章节更多在讲构建Hash Table,感觉有点不太符合我的预期

Query Planning

A DBMS’s optimizer has to make two decisions per query when generating code with the ROF model: (1) whether to enable SIMD predicate evaluation and (2) whether to enable prefetching.

关于SIMD和Prefetch的控制

For operators that require random access to data structures whose size exceeds the cache size, the planner will install a stage boundary at the operator’s input to facilitate prefetching.

设计流水线边界的控制

这一块主要讲控制策略,可以是基于规则的启发式(有翻车可能),也可以是贪心(分两路,一路Prefetch,另一路不Prefetch然后对比寻找)

感觉这就是一种Query Optimization

Eexperiment Evaluation

大伙们喜闻乐见的跑分时间(这章节要比想象中的长)

For this evaluation, we implemented ROF in the Peloton in-memory DBMS [5]. Peloton is an HTAP DBMS that used interpretationbased execution engine for queries. We modified the system’s query planner to support JIT compilation using LLVM (v3.7)

上JIT了,以及好古早版本的LLVM

Although the TPC-H workload contains 22 queries, we select eight queries that cover all TPC-H choke-point query classifications [10] that vary from compute- to memory/join-intensive queries.

大伙们喜闻乐见的挑取几条Query比较😂文章引用的论文是TUM的TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark.

image-20250523173139609

效果来看,Q5,Q6有两倍的加速,Q19和Q14接近两倍,但Q1优化很少,我感觉有些怪——但就我分析来看,原因应该是只用SIMD优化Join,而Q1里面的Sum没用SIMD😅

image-20250523173200676

image-20250523174109087

后面的Case Study我就不发了,有空再看😂

We select the optimal staged plan for the eight TPC-H queries and fix the prefetch group size to 16. We then vary the size of all the output vectors in each plan from 64 to 256k tuples.

通过修改Batchsize看看Vector size对于数据处理的影响

image-20250523181231819

Q1 is also insensitive to the vector size, but for a slightly different reason. In this query, more than 98% of the tuples in the LineItem table qualify the predicate. As such, the primary bottleneck in the first pipeline is not in the SIMD scan, but in the aggregation. This aggregation does not benefit from larger vector sizes, which is why Q1 is not impacted by varying this configuration parameter.

这里说Q1对向量化不敏感,Aggregate要分配大锅——应该是对的

For Q13, the primary bottleneck in the query plan is the evaluation of the LIKE clause on the o_comment field of the Orders table. Since the DBMS cannot execute this predicate using SIMD, increasing the size of the predicate’s output vector does not help.

似乎对子句处理不太行?

However, larger vectors don’t improve performance for queries with joins. This is because modern CPUs support a limited number of outstanding memory references, making the query become memory-bound quicker than CPU-bound.

较大的向量块并不能提高连接查询的性能,因为内存足够快

Each pipeline in the query plan is executed using multiple threads that each modify only thread-local state. pipeline-breaking operators, a single coordinator thread coalesces data produced by each execution thread.

给了一个分Pipeline的方案,效果见Figure17

image-20250523191225582

We first note that the jump in execution time when moving from one thread to two threads for all the queries is due to the non-negligible bookkeeping and synchronization overhead that is necessary to support multi-threaded execution.

如果是多线程的话,线程切换有开销,这是不可避免的——

This corroborates our previous results in Sects. 5.3 to 5.5. Executing Q1 with multiple threads yields a consistent increase up to 20 cores at which point all CPUs are fully utilized and have saturated the memory bandwidth.

也就是说,多线程可以有效利用内存带宽

One final observation is the slight variance in execution times in Q3 and Q13 with more than 10 threads. This jitter is due to NUMA effects on our two CPU machine (10 cores per socket). …… Hence, CPUs in different NUMA regions experience different latency when accessing these counters stored in the global hash-table.

哦?还用了NUMA么,如果多线程运行Pipeline的话,NUMA也会成为影响因素。速度平均可以提升1.5倍左右

image-20250523202430493

这是一个和其他数据库比较的BenchMark,这个章节的剩余部分都是在分析每个Query为什么快,为什么慢——如果细看还是有很多细节值得关注的

The primary reason for this is because HyPer uses CRC32 for hashing, implemented using SIMD (SSE4), whereas Peloton uses the more computationally intensive MurmurHash3.

那这么说,是不是MurMurHash3能更好的利用SIMD?

结语

要是VLDB有时间检验奖的话,这篇文章在2027年绝对值得得到一个😍虽然有点小Bug,但Query Optimization,Query Compilation以及数据库对于SIMD的应用基本成为了2020年以后RDBMS的主要研究方向。

可能是最近在研究相关方向,这篇文章看起来就很舒服