VLDB23阅读：Bringing Compiling Databases to RISC Architectures

https://dl.acm.org/doi/10.14778/3583140.3583142

长期以来，数据库优化（Code Generation in database systems：“数据库的代码生成”）的关键点一直在X86-64架构的芯片上（Note：十分合理，毕竟再怎么吹X86-64芯片不行，但服务器CPU这块X86-64依然占据大多数），本篇论文想讨论数据库的代码生成在ARM与其他RISC架构的情况。

作者团队开发了名为FireARM的数据库代码生成器用于AArch64架构上，证明了Umbra的那套方案在不同架构上是可行的

Introduction

However, while performance increases of x86-64 chips have slowed down over the recent years, other architectures are picking up momentum.

正确的， x86-64 芯片的速度的增长速度确实是下降的

We find that while approaches using standard programming languages or compiler back-ends are comparably easy to develop, they also incur a higher latency for query execution.

我们发现，尽管使用标准编程语言或编译器后端的方法相当容易开发，但它们也会产生更高的查询执行延迟。

别问，问就是Query Compilation都会有这样的问题😂这可不是啥免费的午餐

We have also found that domain-specific IRs can be designed with a focus on databasespecific operations and allow for a more idiomatic expression of complex operations,

好，Domin Specific IR

Prior Work

The generation of machine code from query plans in database systems was first implemented in System R in the last century

这点可以去看看今年的 CMU15799

Currently, many different strategies for query compilation are implemented by modern databases, but there is still no consensus about which way to go

对于IR采用何种形式，确实没有达成共识

Design Space Analysis

三条转化的不同路径，你也可以在我之前的阅读中找到类似：SIGMOD24文章阅读：Query Compilation Without Regrets

3.1 Metrics for Query Compilers

The key factor of a code generation system — and therefore the main focus of our analysis — is the (intermediate) code representation used to express query code, as it strongly impacts performance characteristics, the ability to effectively use hardware features, and the engineering effort for the implementation of the database itself.

“代码生成系统的关键因素（因此我们分析的主要重点）是用于表达查询代码的（中间）代码表示，因为它强烈影响性能特征，有效使用硬件功能的能力以及用于实施数据库本身的工程工作”——It’s Great

In our opinion, these metrics cover most interesting aspects of a query compilation strategy: performance, expressiveness for query translation, and usability.

Performance主要是关于Throughput（吞吐量），Latency（延迟）

For compiling databases, domain expressiveness, as well as architecture expressiveness, supports the quality of compiled query code.

Expressiveness主要是关于Domain Expressiveness（偏重于逻辑计划表达性），Architecture Expressiveness（偏重于物理计划表达性），涉及Query Compilation Code的生成质量

While this metric is certainly the most subjective, we nevertheless consider it as crucial for the development of query compilation systems.

Usability主要是关于开发上手难度的——这一块不考虑🤣（不要嘴硬，上手难度就是高）

3.2 Strategy: Programming Languages

While the generated machine code can generally achieve a high throughput due to compiler optimizations, the latency of the whole code generation process is typically also high

与这里说的一致：SIGMOD24文章阅读：Query Compilation Without Regrets

Compilers like GCC or Clang focus on highperformance machine code but are not designed for low-latency compilation.

正确的，这一块还可以把MLIR加上，其最后实现还是回到LLVM

Systems like Amazon Redshift mitigate this problem by extensively caching parts of previously compiled queries

Mark下

For example, to use SIMD operations provided by modern processors, compilers can apply auto-vectorization,

DuckDB就是采用这个方案Auto-vectorization

The integration of this approach is also rather simple: The database calls an external compiler and generated query programs are loaded as modules.

从使用角度上来说，也是最容易上手的

3.3 Strategy: General-Purpose IRs

most of the important optimizations (e.g., the removal of dead code) are performed at the IR level anyway.

我一直想找这个，但不知道哪篇论文会讲这个细节

Therefore, the expressiveness of compiler IRs is at least as powerful as the supported programming language.

这没啥问题

most IRs are given in Single Static Assignment (SSA) form, which incurs additional complexity for generating code

这块我没想过，但确实有道理

文章还提到LLVM JIT提供工具链不错，但理论上会增加额外的复杂度

3.4 Strategy: Domain-Specific IRs

Domain-specific IRs are tailored to the needs of a database system and can be optimized in different aspects.

The design and structuring of domain-specific IRs for databases is simpler and more expressive than compiler IRs.

这一类IR更专一，也更强

Flounder [13] and Umbra [26] both implement domain-specific IRs with a custom code generator.

哦？Flounder 提供的IR与X86-64架构绑定

Umbra IR is designed with a higher degree of domain-expressiveness compared to Flounder IR and has a higher-level structure that is closer to compiler IRs.

哦！

Umbra IR, as well as Flounder IR, is translated by single-pass code-generators

这一类Pass只需要Single Pass就能运转

Compiler IRs usually lack specialized instructions for representing database-related algorithmic details that domain-specific IRs are free to implement

嗯哼，对的

Defining domain-specific IRs and implementing customized code-generators is far more complex than re-using an existing compiler infrastructure and requires a deep understanding of IR design, compiler development, and computer architecture.

这一块的门槛只会更高

3.5 Analysis

Nevertheless, generating programs with higher throughput than compilers is not easy to achieve and requires heavy engineering. In contrast to that, we suggest that low-latency code generation is only achievable using domain-specific IRs.

这个建议似乎在很多地方看过？

domain-specific IRs must not only be designed, but also code generation and multiple back-ends for machine code must be implemented.

对的

FireARM

Our analysis, however, showed that all approaches face challenges and have different trade-offs.

是Query Compiltion都有Trade-offs

As Kersten et al. [18] showed, Umbra IR allows the direct generation of machine code without using external compiler frameworks or other additional IRs, thereby avoiding the high compilation time of standard frameworks.

这个方案设计的Umbra X86-64执行后端被称为flying start，对应AArch64就是FireARM，是Direct Code Generator

4.1 Compiling Umbra IR

it can also adaptively make use of a standard compiler IR (LLVM IR) for more optimized code generation for longer running tasks. Furthermore, Umbra IR can also be transformed into the C programming language for compilation with a standard compiler.

Umbra IR具备了生成C，生成LLVM IR，也可以绕过LLVM基础设施直接机器二进制码（“In signle pass without an additional, lower-level IR”）

Along with most modern IRs, Umbra IR uses SSA (Static Single Assignment)[35], so IR values (the IR counterpart of variables known from programming languages) are only set once.

Umbra IR还是SSA形式——就是前面说了这种方式会有冗余

Currently, there are no global optimizations, such as the inlining of other functions, as known from compilers.

啊这，FireARM目前没优化

Umbra IR addresses this special case using so-called ghost instructions that implicitly reference results of the previous instruction.

Ghost Instruction?

Listing 2 shows the ghost instruction OverflowResult, that allows using overflow information such as regular IR values.

……The simpler format permits a faster code generation, since the instructions are easier to resolve

看起来确实会简洁

One of the most important aspects that affect the performance of generated code and also the compilation time is register allocation, whose optimal solution is an NP-complete problem.

哦？寄存器分配确实是NP-Complete问题——跟图着色问题类似

FireARM benefits from the larger register set and the three operand principle of a RISC architecture like ARM during code generation.

4.2 Architectural Challenges

主要是关于Memory Ordering，Alignment（内存对齐），Arithmetic Operations，Instruction Selection，Immediate operandsz这几个方面阐述

Figure 5 shows the machine instructions of x86-64 and ARM that are used to implement different C++ memory orderings

X86-64 : processorordering consistency model

ARM : weak-ordering consistency model

While modern x86-64 systems guarantee atomicity for atomic operations on unaligned data if the operation remains in a single cache line, most other architectures like ARM or RISC-V prohibit unaligned atomic accesses altogether

对的，这点可以在ISA手册上可以找到

On such architectures, CPUs often get significant performance benefits by exploiting the weaker memory model, as it allows for a more flexible execution order of memory accesses.

weaker memory mode有助于获得更灵活的内存访问

Table 2 shows the results on two different AArch64 platforms. While the impact is moderate for TPC-H query 3, the performance of queries 1 and 19 are much more sensitive to a stronger ordering

Table2 证明内存管理对Query执行是有影响的

Evaluation

we run all 22 TPC-H queries with 5 code generation back-ends in Umbra

5.1 Experimental Setup

编译为C：使用参数为gcc -O3， GCC11.1

使用LLVM JIT: latency-optimized mode , FastISel , SelectionDAG ，LLVM14

使用FireARM ： “ARM-specific back-end for direct code generation”

使用Flying Start ： “This back-end is the x86-64-specific back-end for direct code generation”

实验很详细，具体请看论文原文

Several ARM platforms like the Thunder X2, the Graviton 2, and the Raspberry Pi have a weaker single-core performance compared to modern x86-64 processors, causing compilation-based query execution to suffer from higher latencies. …

recent ARM processors like the Apple M1 and the Graviton 3 offer comparable performance to modern x86-64 and are likely to change and increase the diversity in server hardware in the future.

开发板的性能终究赶不上Apple M1这个量级的芯片

For example, on the Fujitsu A64FX, designed for arithmetic-heavy workloads, such optimizations can yield minor performance improvements, whereas on the Graviton 2, a general-purpose CPU, there is no significant difference in throughput.

即便是ARM，服务器版的和General Purpose的也表现效果也不太一样

FireARM is up to 8x faster than DuckDB on ARM platforms and Flying Start is up to 12x times faster on the x86-64 machine.

这个看个乐就行了😆

Discussion

FireARM and Flying Start have only a slightly larger latency than the VM back-end while generating machine code and therefore achieve a much higher throughput.

FireARM和Flying Start有higher throughput理所当然

In addition to that, all compilation approaches outperform the vectorized interpretation of DuckDB on different systems.

倒是好奇，为什么Umbra没有用SIMD？（通篇只说比DuckDB快，没说为什么不用这个）

个人认为这张图有参考意义，但不多😅毕竟Code多的肯定是汇编like，中间IR肯定会少些

Conlusion

We conclude that compilation performance using domain-specific IRs on ARM will increase even further if architectural heterogeneity is taken into account from the beginning, possibly ending the dominance of x86-64 for databases.

“Possibly ending the dominance of x86-64 for databases” 😋

这篇文章我是蛮喜欢的：这一套从Performance，Expressiveness和上手难度的分析方法论我觉得不错

就是小的方面有些瑕疵：如果是Apple M1测验的话，那肯定是在MacOS下测试，这里面与Linux一丢丢系统的差异没有讨论

另外，我觉得不同架构的最大差异应该是SIMD的方案不同，Umbra虽然强调他的速度比DuckDB快，但并未就不同架构的SIMD讨论，这确实有些遗憾。

如果从另外一个角度来看，我认为给数据库单独设计一门Domain Specific IR并不是一个好的建议，虽然拥有非常好的性能，但是每次硬件升级都需要大量人力调优，终究不够Gneral——但哪怕是现在热火朝天的AI，也不会为此单独去造一门语言轮子（也许是还没到，毕竟DeepSeek的JIT方案，都在用Python生成C++然后运行，而不是选择LLVM IR🤣）这一块，除非Umbra开源，不然Domain Specific IR确实不是好主意

推荐订阅源

Mox的笔记库