惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

SecWiki News
SecWiki News
I
InfoQ
The Cloudflare Blog
人人都是产品经理
人人都是产品经理
博客园 - Franky
T
Tailwind CSS Blog
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
量子位
博客园_首页
罗磊的独立博客
V
V2EX
李成银的技术随笔
大猫的无限游戏
大猫的无限游戏
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
T
True Tiger Recordings
Vercel News
Vercel News
Cyberwarzone
Cyberwarzone
Cisco Talos Blog
Cisco Talos Blog
F
Fox-IT International blog
D
Darknet – Hacking Tools, Hacker News & Cyber Security
M
Microsoft Research Blog - Microsoft Research
Know Your Adversary
Know Your Adversary
爱范儿
爱范儿
The Register - Security
The Register - Security
G
Google Developers Blog
The Hacker News
The Hacker News
Malwarebytes
Malwarebytes
S
Securelist
博客园 - 三生石上(FineUI控件)
Jina AI
Jina AI
T
Threat Research - Cisco Blogs
T
The Exploit Database - CXSecurity.com
S
SegmentFault 最新的问题
博客园 - 叶小钗
F
Fortinet All Blogs
Apple Machine Learning Research
Apple Machine Learning Research
宝玉的分享
宝玉的分享
博客园 - 聂微东
T
Threatpost
博客园 - 【当耐特】
D
Docker
P
Privacy & Cybersecurity Law Blog
www.infosecurity-magazine.com
www.infosecurity-magazine.com
G
GRAHAM CLULEY
V
Visual Studio Blog
C
Cisco Blogs
IT之家
IT之家
S
Security Archives - TechRepublic
Latest news
Latest news
阮一峰的网络日志
阮一峰的网络日志

Mox的笔记库

细嗦下MLIR的环境搭建 | Mox的笔记库 博客重构:从Hexo到Astro | Mox的笔记库 2026PPoPP MLIR Tutorial学习 | Mox的笔记库 MacOS配置《明日方舟:终末地》 | Mox的笔记库 2025:向内生长 | Mox的笔记库 由mlir::ExecutionEngine引发的跨系统问题 | Mox的笔记库 WSL2配置Cuda-Tile环境记录(未完待续) | Mox的笔记库 Vibe Coding手搓项目记录 | Mox的笔记库 给Debian上包——以DuckDB为例 | Mox的笔记库 UCPD.sys事件存档 | Mox的笔记库 换新电脑之Mac mini M4从购买到配置 | Mox的笔记库 Mac配置MLX-C开发环境 | Mox的笔记库 RISC-V meets RDBMS——RISC-V架构上可运行数据库一览 | Mox的笔记库 DuckDB Sort实现调查 | Mox的笔记库 修复Redis在树莓派5上无法运行的问题 | Mox的笔记库 如何在MLIR中自定义类型并且输出运行 | Mox的笔记库 网站网络结构变更记录 | Mox的笔记库 EDBT25论文阅读:PhoebeDB——A Disk-Based RDBMS Kernel for High-Performance and Cost-Effective OLTP SIGMOD25论文阅读:BPF-DB:——A Kernel-Embedded Transactional Database Management System For eBPF Applications SIGMOD24文章阅读:Query Compilation Without Regrets | Mox的笔记库 论文阅读:Designing an Open Framework for Query Optimization and Compilation Apache Arrow Gandiva项目解析 | Mox的笔记库 VLDB24论文阅读:Cloud-Native Database Systems and Unikernels——Reimagining OS Abstractions for Modern Hardware NoisePage源码分析(未完待续) | Mox的笔记库 VLDB20论文阅读:Mainlining Databases——Supporting Fast Transactional Workloads on Universal Columnar Data File Formats VLDB17论文阅读:Relaxed Operator Fusion for In-Memory Databases:Making Compilation, Vectorization, and Prefetching Work Together At Last 论文阅读:How not to structure your database-backed web applications——a study of performance bugs in the wild SIGMOD24阅读:ROME——Robust Query Optimization via Parallel Multi-Plan Execution 文章阅读:First Past the Post-Evaluating Query Optimization in MongoDB SIGMOD文章阅读:Apache Calcite——A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources VLDB23论文阅读:Analyzing the Impact of Cardinality Estimation on Execution Plans in Microsoft SQL Server SIGMOD22论文阅读:Efficient Massively Parallel Join Optimization for Large Queries VLDB论文阅读:Weaving Relations for Cache Performance VLDB22论文阅读:ConnectorX——Accelerating Data Loading From Databases to Dataframes 论文阅读:UniKraft-Fast, Specialized Unikernels the Easy Way 当DuckDB遇上RISC-V | Mox的笔记库 SIGMOD25论文阅读:An Elephant Under The Microscope——Analyzing The Interaction Of Optimizer Components In PostgreSQL 论文阅读:Compile-Time Analysis of Compiler Frameworks for Query Compilation LingoDB源码编译与分析 | Mox的笔记库 淦!MLIR输出Hello World不应该这么难! | Mox的笔记库 如何愉快的运行一个MLIR程序 | Mox的笔记库 2024:拥挤年代的想象与创造 | Mox的笔记库 如何给自己的博客添加MLIR和LLVM IR语法高亮 | Mox的笔记库 VLDB19-Parsing Gigabytes of JSON per Second论文阅读 CIDR25:Runtime-Extensible Parsers阅读 | Mox的笔记库 MLIR学习资料整理 | Mox的笔记库 SIGMOD24文章阅读:VeriTxn | Mox的笔记库 VLDB23文章阅读——Exploiting Cloud Object Storage for High-Performance Analytics VLDB24——OLAP on Modern Chiplet-Based Processors走马观花阅读 VLDB22:YeSQL文章阅读(已废弃) | Mox的笔记库 如何让数据库中的Python跑的更快-VLDB22-YeSQL文章阅读 | Mox的笔记库 你好,世界! | Mox的笔记库 让系统研究更有意义:HarmonyOS NEXT的教训和经验——讲座回顾 | Mox的笔记库 UNSW 24T3 COMP9336上课记录 | Mox的笔记库 Velox开发环境配置踩坑记录 | Mox的笔记库 MLIR Toy Tutorial实践记录 | Mox的笔记库 论文阅读:Declarative Sub-Operators for Universal Data Processing LLVM-Kaleidoscope实操踩坑记录 | Mox的笔记库 2024年7月RSSHub开发体验 | Mox的笔记库 澳洲大学计算机硕士比较 | Mox的笔记库 论文阅读——CDUL:CLIP-Driven Unsupervised Learning for Multi-Label Image Classification 论批量快速添加图片与视频水印的事 | Mox的笔记库 CVPR2023-CLIP算法调研 | Mox的笔记库 基于元信息写入的服务器压力测试 | Mox的笔记库 MjAyMw==,希望,前进与平庸之道 | Mox的笔记库 家庭组网IPv6+Mesh折腾 | Mox的笔记库 code-server初体验 | Mox的笔记库 从Nginx到Caddy | Mox的笔记库 Hexo部署安装全流程回顾 | Mox的笔记库 RMM观察与初探 | Mox的笔记库 计算机网络课设——UDP/TCP/TLS Socket实验 | Mox的笔记库 JQuery的XSS初探 | Mox的笔记库 生产实习记录 | Mox的笔记库 Fedora-CoreOS配置与试用(2023年) | Mox的笔记库 Electron学习笔记 | Mox的笔记库 ServerSentEvent学习 | Mox的笔记库 报告翻译:容器云的安全挑战 | Mox的笔记库 Arch Linux迁移计划 | Mox的笔记库 Vagrant配置Metarget靶场环境 | Mox的笔记库 OpenAI-whisper折腾 | Mox的笔记库 202202,困惑,混乱与未曾设想之路 | Mox的笔记库 2022年Hack the box:Tier1免费区全解 | Mox的笔记库 Navidrome部署记录 | Mox的笔记库 长安杯2021-snake复现 | Mox的笔记库 报告概要翻译:OBFUSCATING C++ PROGRAMS VIA CONTROL FLOW FLATTENING 从零开始的Django CVE-2022-28346复现 | Mox的笔记库 2022CISCN(西北区赛)-The shinning | Mox的笔记库 Docker+QEMU+Arm64(Ubuntu)+环境配置(2022版) | Mox的笔记库 Arch Linux运行树莓派系统(2022年) | Mox的笔记库 2022CISCN初赛-ez_usb-复盘WriteUp | Mox的笔记库 NodeMCU-MicroPython配置实录 | Mox的笔记库 Django事务使用 | Mox的笔记库 记录第一次EduSRC上报 | Mox的笔记库 Jetbrain问题应急处理 | Mox的笔记库 Celery5.2学习&配置 | Mox的笔记库 Waline部署记录 | Mox的笔记库 2021年12月 Vivo千镜杯回顾 | Mox的笔记库 Frida hook初次实战 | Mox的笔记库 Log4j2漏洞复现 | Mox的笔记库 Windows的WSL2+Docker初探 | Mox的笔记库
VLDB23阅读:Bringing Compiling Databases to RISC Architectures
2025-03-02 · via Mox的笔记库

https://dl.acm.org/doi/10.14778/3583140.3583142

长期以来,数据库优化(Code Generation in database systems:“数据库的代码生成”)的关键点一直在X86-64架构的芯片上(Note:十分合理,毕竟再怎么吹X86-64芯片不行,但服务器CPU这块X86-64依然占据大多数),本篇论文想讨论数据库的代码生成在ARM与其他RISC架构的情况。

作者团队开发了名为FireARM的数据库代码生成器用于AArch64架构上,证明了Umbra的那套方案在不同架构上是可行的

Introduction

However, while performance increases of x86-64 chips have slowed down over the recent years, other architectures are picking up momentum.

正确的, x86-64 芯片的速度的增长速度确实是下降的

We find that while approaches using standard programming languages or compiler back-ends are comparably easy to develop, they also incur a higher latency for query execution.

我们发现,尽管使用标准编程语言或编译器后端的方法相当容易开发,但它们也会产生更高的查询执行延迟。

别问,问就是Query Compilation都会有这样的问题😂这可不是啥免费的午餐

We have also found that domain-specific IRs can be designed with a focus on databasespecific operations and allow for a more idiomatic expression of complex operations,

好,Domin Specific IR

Prior Work

The generation of machine code from query plans in database systems was first implemented in System R in the last century

这点可以去看看今年的 CMU15799

Currently, many different strategies for query compilation are implemented by modern databases, but there is still no consensus about which way to go

对于IR采用何种形式,确实没有达成共识

Design Space Analysis

三条转化的不同路径,你也可以在我之前的阅读中找到类似:SIGMOD24文章阅读:Query Compilation Without Regrets

image-20250301204253795

3.1 Metrics for Query Compilers

The key factor of a code generation system — and therefore the main focus of our analysis — is the (intermediate) code representation used to express query code, as it strongly impacts performance characteristics, the ability to effectively use hardware features, and the engineering effort for the implementation of the database itself.

“代码生成系统的关键因素(因此我们分析的主要重点)是用于表达查询代码的(中间)代码表示,因为它强烈影响性能特征,有效使用硬件功能的能力以及用于实施数据库本身的工程工作”——It’s Great

In our opinion, these metrics cover most interesting aspects of a query compilation strategy: performance, expressiveness for query translation, and usability.

Performance主要是关于Throughput(吞吐量),Latency(延迟)

For compiling databases, domain expressiveness, as well as architecture expressiveness, supports the quality of compiled query code.

Expressiveness主要是关于Domain Expressiveness(偏重于逻辑计划表达性),Architecture Expressiveness(偏重于物理计划表达性),涉及Query Compilation Code的生成质量

While this metric is certainly the most subjective, we nevertheless consider it as crucial for the development of query compilation systems.

Usability主要是关于开发上手难度的——这一块不考虑🤣(不要嘴硬,上手难度就是高)

3.2 Strategy: Programming Languages

While the generated machine code can generally achieve a high throughput due to compiler optimizations, the latency of the whole code generation process is typically also high

与这里说的一致:SIGMOD24文章阅读:Query Compilation Without Regrets

Compilers like GCC or Clang focus on highperformance machine code but are not designed for low-latency compilation.

正确的,这一块还可以把MLIR加上,其最后实现还是回到LLVM

Systems like Amazon Redshift mitigate this problem by extensively caching parts of previously compiled queries

Mark下

For example, to use SIMD operations provided by modern processors, compilers can apply auto-vectorization,

DuckDB就是采用这个方案Auto-vectorization

The integration of this approach is also rather simple: The database calls an external compiler and generated query programs are loaded as modules.

从使用角度上来说,也是最容易上手的

3.3 Strategy: General-Purpose IRs

most of the important optimizations (e.g., the removal of dead code) are performed at the IR level anyway.

我一直想找这个,但不知道哪篇论文会讲这个细节

Therefore, the expressiveness of compiler IRs is at least as powerful as the supported programming language.

这没啥问题

most IRs are given in Single Static Assignment (SSA) form, which incurs additional complexity for generating code

这块我没想过,但确实有道理

文章还提到LLVM JIT提供工具链不错,但理论上会增加额外的复杂度

3.4 Strategy: Domain-Specific IRs

Domain-specific IRs are tailored to the needs of a database system and can be optimized in different aspects.

The design and structuring of domain-specific IRs for databases is simpler and more expressive than compiler IRs.

这一类IR更专一,也更强

Flounder [13] and Umbra [26] both implement domain-specific IRs with a custom code generator.

哦?Flounder 提供的IR与X86-64架构绑定

Umbra IR is designed with a higher degree of domain-expressiveness compared to Flounder IR and has a higher-level structure that is closer to compiler IRs.

哦!

Umbra IR, as well as Flounder IR, is translated by single-pass code-generators

这一类Pass只需要Single Pass就能运转

Compiler IRs usually lack specialized instructions for representing database-related algorithmic details that domain-specific IRs are free to implement

嗯哼,对的

Defining domain-specific IRs and implementing customized code-generators is far more complex than re-using an existing compiler infrastructure and requires a deep understanding of IR design, compiler development, and computer architecture.

这一块的门槛只会更高

3.5 Analysis

image-20250301225607844

Nevertheless, generating programs with higher throughput than compilers is not easy to achieve and requires heavy engineering. In contrast to that, we suggest that low-latency code generation is only achievable using domain-specific IRs.

这个建议似乎在很多地方看过?

domain-specific IRs must not only be designed, but also code generation and multiple back-ends for machine code must be implemented.

对的

FireARM

Our analysis, however, showed that all approaches face challenges and have different trade-offs.

是Query Compiltion都有Trade-offs

As Kersten et al. [18] showed, Umbra IR allows the direct generation of machine code without using external compiler frameworks or other additional IRs, thereby avoiding the high compilation time of standard frameworks.

这个方案设计的Umbra X86-64执行后端被称为flying start,对应AArch64就是FireARM,是Direct Code Generator

image-20250301233132274

4.1 Compiling Umbra IR

it can also adaptively make use of a standard compiler IR (LLVM IR) for more optimized code generation for longer running tasks. Furthermore, Umbra IR can also be transformed into the C programming language for compilation with a standard compiler.

Umbra IR具备了生成C,生成LLVM IR,也可以绕过LLVM基础设施直接机器二进制码(“In signle pass without an additional, lower-level IR”)

Along with most modern IRs, Umbra IR uses SSA (Static Single Assignment)[35], so IR values (the IR counterpart of variables known from programming languages) are only set once.

Umbra IR还是SSA形式——就是前面说了这种方式会有冗余

Currently, there are no global optimizations, such as the inlining of other functions, as known from compilers.

啊这,FireARM目前没优化

Umbra IR addresses this special case using so-called ghost instructions that implicitly reference results of the previous instruction.

Ghost Instruction?

Listing 2 shows the ghost instruction OverflowResult, that allows using overflow information such as regular IR values.

……The simpler format permits a faster code generation, since the instructions are easier to resolve

image-20250302090235954

看起来确实会简洁

image-20250302090542617

One of the most important aspects that affect the performance of generated code and also the compilation time is register allocation, whose optimal solution is an NP-complete problem.

哦?寄存器分配确实是NP-Complete问题——跟图着色问题类似

FireARM benefits from the larger register set and the three operand principle of a RISC architecture like ARM during code generation.

4.2 Architectural Challenges

image-20250302091457059

主要是关于Memory Ordering,Alignment(内存对齐),Arithmetic Operations,Instruction Selection,Immediate operandsz这几个方面阐述

Figure 5 shows the machine instructions of x86-64 and ARM that are used to implement different C++ memory orderings

X86-64 : processorordering consistency model

ARM : weak-ordering consistency model

While modern x86-64 systems guarantee atomicity for atomic operations on unaligned data if the operation remains in a single cache line, most other architectures like ARM or RISC-V prohibit unaligned atomic accesses altogether

对的,这点可以在ISA手册上可以找到

On such architectures, CPUs often get significant performance benefits by exploiting the weaker memory model, as it allows for a more flexible execution order of memory accesses.

weaker memory mode有助于获得更灵活的内存访问

Table 2 shows the results on two different AArch64 platforms. While the impact is moderate for TPC-H query 3, the performance of queries 1 and 19 are much more sensitive to a stronger ordering

Table2 证明内存管理对Query执行是有影响的

image-20250302111824728

Evaluation

we run all 22 TPC-H queries with 5 code generation back-ends in Umbra

5.1 Experimental Setup

编译为C:使用参数为gcc -O3, GCC11.1

使用LLVM JIT: latency-optimized mode , FastISel , SelectionDAG ,LLVM14

使用FireARM : “ARM-specific back-end for direct code generation”

使用Flying Start : “This back-end is the x86-64-specific back-end for direct code generation”

实验很详细,具体请看论文原文

Several ARM platforms like the Thunder X2, the Graviton 2, and the Raspberry Pi have a weaker single-core performance compared to modern x86-64 processors, causing compilation-based query execution to suffer from higher latencies. …

recent ARM processors like the Apple M1 and the Graviton 3 offer comparable performance to modern x86-64 and are likely to change and increase the diversity in server hardware in the future.

开发板的性能终究赶不上Apple M1这个量级的芯片

For example, on the Fujitsu A64FX, designed for arithmetic-heavy workloads, such optimizations can yield minor performance improvements, whereas on the Graviton 2, a general-purpose CPU, there is no significant difference in throughput.

即便是ARM,服务器版的和General Purpose的也表现效果也不太一样

FireARM is up to 8x faster than DuckDB on ARM platforms and Flying Start is up to 12x times faster on the x86-64 machine.

这个看个乐就行了😆

image-20250302114935752

Discussion

FireARM and Flying Start have only a slightly larger latency than the VM back-end while generating machine code and therefore achieve a much higher throughput.

FireARM和Flying Start有higher throughput理所当然

In addition to that, all compilation approaches outperform the vectorized interpretation of DuckDB on different systems.

倒是好奇,为什么Umbra没有用SIMD?(通篇只说比DuckDB快,没说为什么不用这个)

image-20250302120000807

个人认为这张图有参考意义,但不多😅毕竟Code多的肯定是汇编like,中间IR肯定会少些

Conlusion

We conclude that compilation performance using domain-specific IRs on ARM will increase even further if architectural heterogeneity is taken into account from the beginning, possibly ending the dominance of x86-64 for databases.

“Possibly ending the dominance of x86-64 for databases” 😋

评论

这篇文章我是蛮喜欢的:这一套从Performance,Expressiveness和上手难度的分析方法论我觉得不错

就是小的方面有些瑕疵:如果是Apple M1测验的话,那肯定是在MacOS下测试,这里面与Linux一丢丢系统的差异没有讨论

另外,我觉得不同架构的最大差异应该是SIMD的方案不同,Umbra虽然强调他的速度比DuckDB快,但并未就不同架构的SIMD讨论,这确实有些遗憾。

如果从另外一个角度来看,我认为给数据库单独设计一门Domain Specific IR并不是一个好的建议,虽然拥有非常好的性能,但是每次硬件升级都需要大量人力调优,终究不够Gneral——但哪怕是现在热火朝天的AI,也不会为此单独去造一门语言轮子(也许是还没到,毕竟DeepSeek的JIT方案,都在用Python生成C++然后运行,而不是选择LLVM IR🤣)这一块,除非Umbra开源,不然Domain Specific IR确实不是好主意