SIGMOD22论文阅读：Efficient Massively Parallel Join Optimization for Large Queries

这篇文章在2025年的CMU-15799的Lecture #09上被提到，提出了一套可以优化大量Join的SQL查询的方案

CMU提供的论文下载链接：https://15799.courses.cs.cmu.edu/spring2025/papers/09-parallelization1/mancini-sigmod2022.pdf

INTRODUCTION

Modern data analytical systems need to efficiently handle such large queries. To handle such scenarios, there have been significant advances in query processing technologies such as sophisticated data layouts, scale-out systems or JIT code generation, while query optimization module has received much lesser attention.

对的，无论是German Style String，还是Umbra的JIT，Query Optimization给人的感觉就非常久远，而且真正理解和运用Query Optimization的人真的很少（我想这也是Andy为什么开这门课的原因）

For instance, PostgreSQL takes as much as around 160 secs to find the optimal plan even for a 21-relation join query1, while SparkSQL takes 1000 secs to plan an 18-relation

有意思，Mark下

In this work, we focus on Dynamic Programming (DP) based join order optimization, which is typically used in current systems [2, 4].

使用DP算法优化Join？Get

Number of join pairs evaluated: DP algorithms typically follow an enumerate-and-evaluate approach …… The fewer the invalid Join-Pairs evaluated, the more efficient the algorithm is.

这里需要补充下enumerate-and-evaluate approach，现列举后评估，常见于暴力搜索，动态搜索和启发式搜索

)Parallelizability: Another way to reduce the optimization time is to perform the join order optimization in parallel. .. Note that not all DP algorithms are easily parallelizable due to dependency between Join-Pairs (detailed in Section 2.1).

哦？什么样的Join-Pairs不好并行？

A comparison of existing join order optimization techniques based on the above two parameters is shown in Figure 2. The Y-axis shows, for an input query, the number of Join-Pairs evaluated by different DP techniques normalized to the total number of valid Join-Pairs for the query. The X-axis shows the parallelizability of the techniques. The evaluation is performed on a 20-relation query from the real world MusicBrainz dataset.

数据效果看起来很Nice

eister et al. in [24] leverage the GPU parallelism to further reduce query optimization time. They propose GPU parallel versions of DPSIZE and DPSUB which scales better than the corresponding CPU parallel ones.

使用GPU大量优化Join？如果数据量足够大能掩盖GPU通信的延迟的话确实可行（但应该对显存要求很高吧）

In this paper, we discuss a novel parallel join order optimization algorithm, MPDP (Massively Parallel DP), which can be executed over GPUs (running with high degree of parallelism) or CPUs.

PROBLEM FRAMEWORK AND BACKGROUND

For a given query, we can represent the joins of the query as a graph G (R, E), where the vertices R = {R1, · · · , Rn } denote the set of all relations in the FROM clause of the query, while the edges, E, correspond to the inner join predicates in the query.

2.1 Valid Join-Pair (CCP-Pair)

Note that any Join-Pair (Sle f t , Sright ) that satisfies all the above conditions is a Connected-subgraph Complement Pair (CCP-Pair)

where Sle f t = {1, 2, 4} and Sright = {6, 7, 8}. Since, there is no edge between these two sets in the join graph, it is not a CCP-Pair. While Sle f t = {1, 2, 4} and Sright = {5, 6} is a CCP-Pair

In this paper, we do not consider cross products which is a well accepted thumb rule in query optimization …… This is primarily because cross products dramatically increase the search space but rarely produce better quality plans.

本篇不考虑叉乘和笛卡尔积

2.2 Generic DPSub Algorithm

We now present the generic DPSUB algorithm. The pseudo-code of DPSUB is shown in Algorithm 1.

2.3 Shortcomings of DPSUB

The results of this evaluation, as shown in Figure 4, suggests that the gap between EvaluatedCounter and CCP-Counter increases with larger queries. Further, EvaluatedCounter is around 2805 times larger (relatively) compared to CCP-Counter at 25 relations.

EvaluatedCounter和CCP-Counter的增长速度不一致（但似乎没解释是什么原因？）

这需要看其他文章，还有相关定义

EvaluatedCounter：统计所有被枚举并检查的 Join-Pairs 数量（不管是否有效），也就是说它反映了算法所做的总工作量。
CCP-Counter：统计所有满足条件的CCP-Pairs（Connected-subgraph Complement Pairs） 数量，也即是有效的 Join-Pairs 数量，用来生成实际子计划。

哦，明白了，证明DPSUB的工作任务量很大

2.4 Relevant Graph Theoretic Terminologies

这里开始补充图的知识（毕竟这套算法使用图进行Query Optimization）

Cut Vertex: A cut vertex in an undirected graph is a vertex whose removal (and corresponding removal of all the edges incident on that vertex) increases the number of connected components in the graph. For the example join graph in Figure 5, {4, 5, 9} are cut vertices.

在图论（Graph Theory）中，Cut Vertex（也称为割点）是指一个图中的顶点，如果删除它（连同与它相连的边）会导致图的连通性降低，即：删除这个顶点后，图的连通分量数增加——所以上图的{4，5，9}是割点

还有Nonseparable graph，即指没有割点（cut vertex）的连通图，简单例子就是个环

Biconnected Component or block: A biconnected component (or block) of a given graph is a maximal nonseparable subgraph. Note that a block contains some cut vertices of the graph, but does not have any cut vertex in the block itself. In our example, {1, 2, 3, 4}; {4, 5}; {5, 9}; {6, 7, 8, 9} are blocks.

一个双连通分量就是一个最大的 nonseparable 子图，{1, 2, 3, 4}; {4, 5}; {5, 9}; {6, 7, 8, 9}单独拆出来的话就是Nonseparable graph

Block-Cut Tree: From a given graph G, we can build a bipartite tree, called block-cut tree, as follows. (1) Its vertices are the blocks and the cut vertices of G. (2) There exist an edge between a block and a cut vertex if that cut vertex is included in the block. In our example, the block-cut tree would be a chain: {1, 2, 3, 4} − 4 − {4, 5} − 5 − {5, 9} − 9 − {6, 7, 8, 9}.

Block-Cut Tree 即块-割点树，这一块直接放GPT给的样例吧😂即图中的{1, 2, 3, 4} − 4 − {4, 5} − 5 − {5, 9} − 9 − {6, 7, 8, 9}.

MPDP: A NEW MASSIVELY PARALLEL OPTIMAL ALGORITHM

3.1 Tree Join Graphs

The pseudo-code of MPDP for the tree scenario is shown in Algorithm 2 …… Note that the pseudo-code only contains the main for-loop corresponding to evaluating set S. We have omitted the rest of the code since it is same as that used in DPSUB (Algorithm 1). Further, we have highlighted in red the difference in code between the two algorithms.

The main idea of the algorithm is the following: Since the join graph is a tree, then the subgraph induced by S is also a tree ……Thus, the algorithm do not incur any CCP conditions checking overheads. Resulting in EvaluatedCounter being equal to CCP-Counter.

这个MPDP树版本的方案（一个简单的MPDP方案）中EvaluatedCounter是等于CCP-Counter的（ ~~这又说明什么呢？~~ 说明MPDP能够有效构建子计划）

算法证明如下：

(1) Only the CCP-Pairs corresponding to connected set S are enumerated, i.e. any Join-Pair(Sle f t , Sright ) ∈ V alid − Join − Pairs (S) is a CCP-Pair (Lemma 1).

证明的逻辑是：

删除一条边将树分成两个不相交、非空、并集为整体的子集；
由于原图是连通的，删除一条边后的两个子图仍然各自连通（否则原图也不连通）；
所以该 Join-Pair 满足 CCP 的定义。

(2) All the CCP-Pairs corresponding to connected set S are enumerated (Lemma 2).

这个引理的目的是要说明：

尽管 MPDP:Tree 只枚举树上断一条边形成的 Join-Pairs（看似枚举空间更小），但它并不会遗漏任何合法的 CCP-Pairs；
也就是说，DPSUB 能评估的所有 CCP-Pairs，MPDP:Tree 也都能评估出来，二者在结果覆盖度上一致。

由此得出结论：

MPDP:Tree finds the optimal join order while evaluating only CCP-Pairs (meeting the CCP-Counter lower bound)

3.2 Generalization

Then, we perform:

a) edge-based enumerated along the cut edges between blocks;

b) vertex-based enumeration within the blocks.

这部分内容太干，建议看原文😅这部分内容在于证明MPDP有效的

HEURISTIC SOLUTIONS

Although MPDP is efficient, parallelizable and runs well on GPUs, join order optimization is an NP-Hard problem. The time taken for optimization increases exponentially.

Joint Order的优化是NP-Hard problem，优化花费的时间呈指数增加

We propose two heuristic solutions: 1) augmenting MPDP into an existing heuristic technique, IDP; 2) a novel join-graph conscious heuristic, UnionDP.

啊这😂😅说不清楚的事情就是Heuristic是吧

4.1 Iterative Dynamic Programming

IDP2 with MPDP. MPDP can be incorporated into IDP2 by replacing the dp algorithm with MPDP

这一块切实我没看太懂，但关于IDP，这里贴个表记录下：

特点	记忆化搜索（Top-down）	IDP（Bottom-up）
实现方式	递归 + 记忆	迭代 + 表格
调用开销	大（递归）	小（循环）
代码简洁性	高	中等
控制计算顺序	不容易控制	明确控制
适合复杂依赖关系	是	否

4.2 UnionDP

IDP2 might get stuck on a poor local optimum due to the initial plan choice and its greedy nature, resulting in suboptimal plan choices. Hence, we design a novel heuristic, UnionDP, that for the first-time leverages the graph topology for such large queries.

This recursive idea helps UnionDP scale to 1000s of relations. More details of the heuristic is presented in [21].

21指向了这本篇论文的预印本（这什么情况？）https://arxiv.org/abs/2202.13511

MPDP: GPU IMPLEMENTATION

We now present MPDP’s GPU implementation details. GPUs provide a much higher degree of parallelism compared to multi-core CPUs.

“a much higher degree of parallelism”这也是矩阵计算使用GPU的原因

In our implementation, sets of relations (including adjacency lists of base relations) are represented using a fixed-width bitmap sets. The memo table is implemented using the fast Murmur3 hashing algorithm (a simple open-addressing hash table). Algorithm 5 shows the general workflow of MPDP on GPUs.

哎，这有意思，这里可以介绍下Murmur3:

它使用了一些非常高效的操作：位操作（rotate、shift），模拟乘法混合（multiplicative mix），最小化缓存未命中，可以通过在实现中手动指定字节顺序（little endian），避免了在不同硬件架构上出现不同的哈希结果（或许利好GPU运行？）

The presented algorithm is inspired by the COMPGPU algorithm from Meister et. al [24] and could be used to implement on GPU.

24指向了一篇技术报告

还提到了两个优化：减少分支产生和全局变量读写——使用GPU的软件都会在这两个方面进行优化

文中将其分为两类：Optimal Algorithms和Heuristic Solutions

等于是回顾了下Join Order Optimization的历史😁

Optimal Algorithms

DPSIZE [28] – which is currently used in many open-source and commercial databases, like PostgreSQL and IBM DB2

DPSUB [35] uses a subset driven way of plan enumeration (detailed in Section 2)

Moerkette et al. [25] propose the DPCCP algorithm, which uses a join graph based enumeration, and evaluates only CCP-Pairs

支持更好并行化的

PDP [10] discusses a CPU parallel version of DPSIZE.

DPE [11] proposes a parallelization framework that can parallelize any DP algorithm, including DPCCP, based on a producerconsumer paradigm

也有上GPU的

More recently, Meister et al. [24] proposed GPU versions of DPSIZE and DPSUB algorithms.

Heuristic Solutions

IKKBZ [14, 18] limits the search space to left-deep plans

Trummer and Koch formulate the join order optimization problem as a Mixed Integer Linear Programming (MILP) problem

Techniques such as GOO [8] and min-sel [32] greedily choose the best subplans to find the query plan.

ubplans to find the query plan. IDP [17], introduces a new class of algorithms, called Iterative Dynamic Programming which we have discussed in Section 4.

More recently, Neumann et al. [27] proposed a new technique to reduce the search space, called linearized DP (LinDP)

There are also randomized algorithms proposed based on Simulated Annealing and Iterative Improvement [15], Genetic Algorithms [5, 13], Random Sampling [36].

EXPERIMENTAL RESULTS

实验主要集中在10个及更多的关系上

7.1 Experimental Setup

We use a server with dual Intel Xeon E5-2650L v3 CPU, with each CPU having 12 cores and 24 threads with 755GB of RAM. We use a Nvidia GTX 1080 GPU for running GPU based join algorithms. For the experiments in Section 7.5, we use Amazon AWS

看来是用不上最新的显卡😂不知道用老黄最新的GPU会不会有性能提升

Cost Model: The cost model used by a query optimizer plays an important role in determining the optimization time.

we use a more realistic cost model which is close to the one used by PostgreSQL.

Cost Model采用与PG类似的方案

7.2 Optimal Algorithm Evaluation

各类算法使用的数据参数如下：

• Postgres (1CPU): DPSIZE based join ordering implemented by PostgreSQL running on 1 CPU core.

• DPCCP (1CPU): State-of-the-art CPU Sequential DP algorithm, DPCCP [25], running on 1 CPU core.

• DPE (24CPU): State-of-the-art CPU parallel algorithm, DPE. We use the parallel version of DPCCP [11] running on 24 cores.

• DPSub (GPU): State-of-the-art GPU based DP algorithm [24] using DPSUB. The COMB-GPU version from [24] is used.

• DPSize (GPU): Another state-of-the-art GPU based DP algorithm [24] using DPSIZE. The H+F-GPU version is used here.

Other techniques such as sequential or parallel versions of DPSIZE and DPSUB on CPU run much slower than their GPU variants

哦？

运行了模拟测试（Synthetic Workload）和实际测试（Real-world Workload）

7.3 Heuristic Solution Evaluation

这里贴个BenchMark给大家看看，内容就不分析了😂

7.4 Scalability on CPU

MPDP（CPU）有着比DPE(CPU)更好的Scalability

7.5 Optimization Cost Comparison

Since the CPU algorithms do not scale linearly with large number of cores, we experimented with different AWS size instances and picked the one that is the most cost effective.

关于图的结论：

For smaller queries, PostgreSQL and DPCCP are cheaper. However, for larger queries (beyond 15 rels), MPDP (GPU) turns out to be the cheapest. For instance, MPDP is around an order of magnitude cheaper compared to next best DPE from 23 relations.

超过15个Relation，（MPDP）GPU就会有不错的效果（利好JIT？Realtion少用CPU，非常多的话就用GPU）

评价

通篇读下来感觉论文过于干货😂一方面算法伪代码多，二是并没有这篇Paper并没有公开算法源代码（也许有用Apache Calcite？要不然就只能纯手搓了）

对GPU做Join比较感兴趣，但不得不承认现在显卡都被做AI的拿去了😂所以说GPU做Join是比较花里胡哨的

文章在Section2 补充了有关图的知识，我以为后面MPDP算法有很多有关图的论述，但实际上却只是用来解释CCP-Pairs

如果只是想了解下的话，看CMU15799的PPT就够了（Andy课后提到AMD的APU运行Join或许是个不错的建议？）

推荐订阅源

Mox的笔记库