惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

SecWiki News
SecWiki News
I
InfoQ
The Cloudflare Blog
人人都是产品经理
人人都是产品经理
博客园 - Franky
T
Tailwind CSS Blog
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
量子位
博客园_首页
罗磊的独立博客
V
V2EX
李成银的技术随笔
大猫的无限游戏
大猫的无限游戏
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
T
True Tiger Recordings
Vercel News
Vercel News
Cyberwarzone
Cyberwarzone
Cisco Talos Blog
Cisco Talos Blog
F
Fox-IT International blog
D
Darknet – Hacking Tools, Hacker News & Cyber Security
M
Microsoft Research Blog - Microsoft Research
Know Your Adversary
Know Your Adversary
爱范儿
爱范儿
The Register - Security
The Register - Security
G
Google Developers Blog
The Hacker News
The Hacker News
Malwarebytes
Malwarebytes
S
Securelist
博客园 - 三生石上(FineUI控件)
Jina AI
Jina AI
T
Threat Research - Cisco Blogs
T
The Exploit Database - CXSecurity.com
S
SegmentFault 最新的问题
博客园 - 叶小钗
F
Fortinet All Blogs
Apple Machine Learning Research
Apple Machine Learning Research
宝玉的分享
宝玉的分享
博客园 - 聂微东
T
Threatpost
博客园 - 【当耐特】
D
Docker
P
Privacy & Cybersecurity Law Blog
www.infosecurity-magazine.com
www.infosecurity-magazine.com
G
GRAHAM CLULEY
V
Visual Studio Blog
C
Cisco Blogs
IT之家
IT之家
S
Security Archives - TechRepublic
Latest news
Latest news
阮一峰的网络日志
阮一峰的网络日志

Mox的笔记库

细嗦下MLIR的环境搭建 | Mox的笔记库 博客重构:从Hexo到Astro | Mox的笔记库 2026PPoPP MLIR Tutorial学习 | Mox的笔记库 MacOS配置《明日方舟:终末地》 | Mox的笔记库 2025:向内生长 | Mox的笔记库 由mlir::ExecutionEngine引发的跨系统问题 | Mox的笔记库 WSL2配置Cuda-Tile环境记录(未完待续) | Mox的笔记库 Vibe Coding手搓项目记录 | Mox的笔记库 给Debian上包——以DuckDB为例 | Mox的笔记库 UCPD.sys事件存档 | Mox的笔记库 换新电脑之Mac mini M4从购买到配置 | Mox的笔记库 Mac配置MLX-C开发环境 | Mox的笔记库 RISC-V meets RDBMS——RISC-V架构上可运行数据库一览 | Mox的笔记库 DuckDB Sort实现调查 | Mox的笔记库 修复Redis在树莓派5上无法运行的问题 | Mox的笔记库 如何在MLIR中自定义类型并且输出运行 | Mox的笔记库 网站网络结构变更记录 | Mox的笔记库 EDBT25论文阅读:PhoebeDB——A Disk-Based RDBMS Kernel for High-Performance and Cost-Effective OLTP SIGMOD25论文阅读:BPF-DB:——A Kernel-Embedded Transactional Database Management System For eBPF Applications SIGMOD24文章阅读:Query Compilation Without Regrets | Mox的笔记库 论文阅读:Designing an Open Framework for Query Optimization and Compilation Apache Arrow Gandiva项目解析 | Mox的笔记库 VLDB24论文阅读:Cloud-Native Database Systems and Unikernels——Reimagining OS Abstractions for Modern Hardware NoisePage源码分析(未完待续) | Mox的笔记库 VLDB20论文阅读:Mainlining Databases——Supporting Fast Transactional Workloads on Universal Columnar Data File Formats VLDB17论文阅读:Relaxed Operator Fusion for In-Memory Databases:Making Compilation, Vectorization, and Prefetching Work Together At Last 论文阅读:How not to structure your database-backed web applications——a study of performance bugs in the wild SIGMOD24阅读:ROME——Robust Query Optimization via Parallel Multi-Plan Execution 文章阅读:First Past the Post-Evaluating Query Optimization in MongoDB SIGMOD文章阅读:Apache Calcite——A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources VLDB23论文阅读:Analyzing the Impact of Cardinality Estimation on Execution Plans in Microsoft SQL Server VLDB论文阅读:Weaving Relations for Cache Performance VLDB22论文阅读:ConnectorX——Accelerating Data Loading From Databases to Dataframes 论文阅读:UniKraft-Fast, Specialized Unikernels the Easy Way 当DuckDB遇上RISC-V | Mox的笔记库 SIGMOD25论文阅读:An Elephant Under The Microscope——Analyzing The Interaction Of Optimizer Components In PostgreSQL 论文阅读:Compile-Time Analysis of Compiler Frameworks for Query Compilation VLDB23阅读:Bringing Compiling Databases to RISC Architectures LingoDB源码编译与分析 | Mox的笔记库 淦!MLIR输出Hello World不应该这么难! | Mox的笔记库 如何愉快的运行一个MLIR程序 | Mox的笔记库 2024:拥挤年代的想象与创造 | Mox的笔记库 如何给自己的博客添加MLIR和LLVM IR语法高亮 | Mox的笔记库 VLDB19-Parsing Gigabytes of JSON per Second论文阅读 CIDR25:Runtime-Extensible Parsers阅读 | Mox的笔记库 MLIR学习资料整理 | Mox的笔记库 SIGMOD24文章阅读:VeriTxn | Mox的笔记库 VLDB23文章阅读——Exploiting Cloud Object Storage for High-Performance Analytics VLDB24——OLAP on Modern Chiplet-Based Processors走马观花阅读 VLDB22:YeSQL文章阅读(已废弃) | Mox的笔记库 如何让数据库中的Python跑的更快-VLDB22-YeSQL文章阅读 | Mox的笔记库 你好,世界! | Mox的笔记库 让系统研究更有意义:HarmonyOS NEXT的教训和经验——讲座回顾 | Mox的笔记库 UNSW 24T3 COMP9336上课记录 | Mox的笔记库 Velox开发环境配置踩坑记录 | Mox的笔记库 MLIR Toy Tutorial实践记录 | Mox的笔记库 论文阅读:Declarative Sub-Operators for Universal Data Processing LLVM-Kaleidoscope实操踩坑记录 | Mox的笔记库 2024年7月RSSHub开发体验 | Mox的笔记库 澳洲大学计算机硕士比较 | Mox的笔记库 论文阅读——CDUL:CLIP-Driven Unsupervised Learning for Multi-Label Image Classification 论批量快速添加图片与视频水印的事 | Mox的笔记库 CVPR2023-CLIP算法调研 | Mox的笔记库 基于元信息写入的服务器压力测试 | Mox的笔记库 MjAyMw==,希望,前进与平庸之道 | Mox的笔记库 家庭组网IPv6+Mesh折腾 | Mox的笔记库 code-server初体验 | Mox的笔记库 从Nginx到Caddy | Mox的笔记库 Hexo部署安装全流程回顾 | Mox的笔记库 RMM观察与初探 | Mox的笔记库 计算机网络课设——UDP/TCP/TLS Socket实验 | Mox的笔记库 JQuery的XSS初探 | Mox的笔记库 生产实习记录 | Mox的笔记库 Fedora-CoreOS配置与试用(2023年) | Mox的笔记库 Electron学习笔记 | Mox的笔记库 ServerSentEvent学习 | Mox的笔记库 报告翻译:容器云的安全挑战 | Mox的笔记库 Arch Linux迁移计划 | Mox的笔记库 Vagrant配置Metarget靶场环境 | Mox的笔记库 OpenAI-whisper折腾 | Mox的笔记库 202202,困惑,混乱与未曾设想之路 | Mox的笔记库 2022年Hack the box:Tier1免费区全解 | Mox的笔记库 Navidrome部署记录 | Mox的笔记库 长安杯2021-snake复现 | Mox的笔记库 报告概要翻译:OBFUSCATING C++ PROGRAMS VIA CONTROL FLOW FLATTENING 从零开始的Django CVE-2022-28346复现 | Mox的笔记库 2022CISCN(西北区赛)-The shinning | Mox的笔记库 Docker+QEMU+Arm64(Ubuntu)+环境配置(2022版) | Mox的笔记库 Arch Linux运行树莓派系统(2022年) | Mox的笔记库 2022CISCN初赛-ez_usb-复盘WriteUp | Mox的笔记库 NodeMCU-MicroPython配置实录 | Mox的笔记库 Django事务使用 | Mox的笔记库 记录第一次EduSRC上报 | Mox的笔记库 Jetbrain问题应急处理 | Mox的笔记库 Celery5.2学习&配置 | Mox的笔记库 Waline部署记录 | Mox的笔记库 2021年12月 Vivo千镜杯回顾 | Mox的笔记库 Frida hook初次实战 | Mox的笔记库 Log4j2漏洞复现 | Mox的笔记库 Windows的WSL2+Docker初探 | Mox的笔记库
SIGMOD22论文阅读:Efficient Massively Parallel Join Optimization for Large Queries
2025-04-10 · via Mox的笔记库

这篇文章在2025年的CMU-15799的Lecture #09上被提到,提出了一套可以优化大量Join的SQL查询的方案

CMU提供的论文下载链接:https://15799.courses.cs.cmu.edu/spring2025/papers/09-parallelization1/mancini-sigmod2022.pdf

INTRODUCTION

Modern data analytical systems need to efficiently handle such large queries. To handle such scenarios, there have been significant advances in query processing technologies such as sophisticated data layouts, scale-out systems or JIT code generation, while query optimization module has received much lesser attention.

对的,无论是German Style String,还是Umbra的JIT,Query Optimization给人的感觉就非常久远,而且真正理解和运用Query Optimization的人真的很少(我想这也是Andy为什么开这门课的原因)

For instance, PostgreSQL takes as much as around 160 secs to find the optimal plan even for a 21-relation join query1, while SparkSQL takes 1000 secs to plan an 18-relation

有意思,Mark下

In this work, we focus on Dynamic Programming (DP) based join order optimization, which is typically used in current systems [2, 4].

使用DP算法优化Join?Get

Number of join pairs evaluated: DP algorithms typically follow an enumerate-and-evaluate approach …… The fewer the invalid Join-Pairs evaluated, the more efficient the algorithm is.

这里需要补充下enumerate-and-evaluate approach,现列举后评估,常见于暴力搜索,动态搜索和启发式搜索

)Parallelizability: Another way to reduce the optimization time is to perform the join order optimization in parallel. .. Note that not all DP algorithms are easily parallelizable due to dependency between Join-Pairs (detailed in Section 2.1).

哦?什么样的Join-Pairs不好并行?

A comparison of existing join order optimization techniques based on the above two parameters is shown in Figure 2. The Y-axis shows, for an input query, the number of Join-Pairs evaluated by different DP techniques normalized to the total number of valid Join-Pairs for the query. The X-axis shows the parallelizability of the techniques. The evaluation is performed on a 20-relation query from the real world MusicBrainz dataset.

image-20250411093847046

数据效果看起来很Nice

eister et al. in [24] leverage the GPU parallelism to further reduce query optimization time. They propose GPU parallel versions of DPSIZE and DPSUB which scales better than the corresponding CPU parallel ones.

使用GPU大量优化Join?如果数据量足够大能掩盖GPU通信的延迟的话确实可行(但应该对显存要求很高吧)

In this paper, we discuss a novel parallel join order optimization algorithm, MPDP (Massively Parallel DP), which can be executed over GPUs (running with high degree of parallelism) or CPUs.

PROBLEM FRAMEWORK AND BACKGROUND

For a given query, we can represent the joins of the query as a graph G (R, E), where the vertices R = {R1, · · · , Rn } denote the set of all relations in the FROM clause of the query, while the edges, E, correspond to the inner join predicates in the query.

2.1 Valid Join-Pair (CCP-Pair)

image-20250411101516127

Note that any Join-Pair (Sle f t , Sright ) that satisfies all the above conditions is a Connected-subgraph Complement Pair (CCP-Pair)

where Sle f t = {1, 2, 4} and Sright = {6, 7, 8}. Since, there is no edge between these two sets in the join graph, it is not a CCP-Pair. While Sle f t = {1, 2, 4} and Sright = {5, 6} is a CCP-Pair

image-20250411102325920

In this paper, we do not consider cross products which is a well accepted thumb rule in query optimization …… This is primarily because cross products dramatically increase the search space but rarely produce better quality plans.

本篇不考虑叉乘和笛卡尔积

2.2 Generic DPSub Algorithm

We now present the generic DPSUB algorithm. The pseudo-code of DPSUB is shown in Algorithm 1.

image-20250411103055007

2.3 Shortcomings of DPSUB

The results of this evaluation, as shown in Figure 4, suggests that the gap between EvaluatedCounter and CCP-Counter increases with larger queries. Further, EvaluatedCounter is around 2805 times larger (relatively) compared to CCP-Counter at 25 relations.

image-20250411121009423

EvaluatedCounterCCP-Counter的增长速度不一致(但似乎没解释是什么原因?)

这需要看其他文章,还有相关定义

  • EvaluatedCounter:统计所有被枚举并检查的 Join-Pairs 数量(不管是否有效),也就是说它反映了算法所做的总工作量
  • CCP-Counter:统计所有满足条件的CCP-Pairs(Connected-subgraph Complement Pairs) 数量,也即是有效的 Join-Pairs 数量,用来生成实际子计划。

哦,明白了,证明DPSUB的工作任务量很大

2.4 Relevant Graph Theoretic Terminologies

image-20250411143635398

这里开始补充图的知识(毕竟这套算法使用图进行Query Optimization)

Cut Vertex: A cut vertex in an undirected graph is a vertex whose removal (and corresponding removal of all the edges incident on that vertex) increases the number of connected components in the graph. For the example join graph in Figure 5, {4, 5, 9} are cut vertices.

在图论(Graph Theory)中,Cut Vertex(也称为割点)是指一个图中的顶点,如果删除它(连同与它相连的边)会导致图的连通性降低,即:删除这个顶点后,图的连通分量数增加——所以上图的{4,5,9}是割点

还有Nonseparable graph,即指没有割点(cut vertex)的连通图,简单例子就是个环

Biconnected Component or block: A biconnected component (or block) of a given graph is a maximal nonseparable subgraph. Note that a block contains some cut vertices of the graph, but does not have any cut vertex in the block itself. In our example, {1, 2, 3, 4}; {4, 5}; {5, 9}; {6, 7, 8, 9} are blocks.

一个双连通分量就是一个最大的 nonseparable 子图,{1, 2, 3, 4}; {4, 5}; {5, 9}; {6, 7, 8, 9}单独拆出来的话就是Nonseparable graph

Block-Cut Tree: From a given graph G, we can build a bipartite tree, called block-cut tree, as follows. (1) Its vertices are the blocks and the cut vertices of G. (2) There exist an edge between a block and a cut vertex if that cut vertex is included in the block. In our example, the block-cut tree would be a chain: {1, 2, 3, 4} − 4 − {4, 5} − 5 − {5, 9} − 9 − {6, 7, 8, 9}.

Block-Cut Tree 即 块-割点树,这一块直接放GPT给的样例吧😂即图中的{1, 2, 3, 4} − 4 − {4, 5} − 5 − {5, 9} − 9 − {6, 7, 8, 9}.

image-20250411145119398

MPDP: A NEW MASSIVELY PARALLEL OPTIMAL ALGORITHM

3.1 Tree Join Graphs

The pseudo-code of MPDP for the tree scenario is shown in Algorithm 2 …… Note that the pseudo-code only contains the main for-loop corresponding to evaluating set S. We have omitted the rest of the code since it is same as that used in DPSUB (Algorithm 1). Further, we have highlighted in red the difference in code between the two algorithms.

image-20250411145628957

The main idea of the algorithm is the following: Since the join graph is a tree, then the subgraph induced by S is also a tree ……Thus, the algorithm do not incur any CCP conditions checking overheads. Resulting in EvaluatedCounter being equal to CCP-Counter.

这个MPDP树版本的方案(一个简单的MPDP方案)中EvaluatedCounter是等于CCP-Counter的( 这又说明什么呢? 说明MPDP能够有效构建子计划)

算法证明如下:

(1) Only the CCP-Pairs corresponding to connected set S are enumerated, i.e. any Join-Pair(Sle f t , Sright ) ∈ V alid − Join − Pairs (S) is a CCP-Pair (Lemma 1).

image-20250411150608069

证明的逻辑是:

  1. 删除一条边将树分成两个不相交、非空、并集为整体的子集;
  2. 由于原图是连通的,删除一条边后的两个子图仍然各自连通(否则原图也不连通);
  3. 所以该 Join-Pair 满足 CCP 的定义。

(2) All the CCP-Pairs corresponding to connected set S are enumerated (Lemma 2).

image-20250411152256854

这个引理的目的是要说明:

  • 尽管 MPDP:Tree 只枚举树上断一条边形成的 Join-Pairs(看似枚举空间更小),但它并不会遗漏任何合法的 CCP-Pairs
  • 也就是说,DPSUB 能评估的所有 CCP-Pairs,MPDP:Tree 也都能评估出来,二者在结果覆盖度上一致。

由此得出结论:

MPDP:Tree finds the optimal join order while evaluating only CCP-Pairs (meeting the CCP-Counter lower bound)

3.2 Generalization

Then, we perform:

a) edge-based enumerated along the cut edges between blocks;

b) vertex-based enumeration within the blocks.

image-20250411184034514

这部分内容太干,建议看原文😅这部分内容在于证明MPDP有效的

HEURISTIC SOLUTIONS

Although MPDP is efficient, parallelizable and runs well on GPUs, join order optimization is an NP-Hard problem. The time taken for optimization increases exponentially.

Joint Order的优化是NP-Hard problem,优化花费的时间呈指数增加

We propose two heuristic solutions: 1) augmenting MPDP into an existing heuristic technique, IDP; 2) a novel join-graph conscious heuristic, UnionDP.

啊这😂😅说不清楚的事情就是Heuristic是吧

4.1 Iterative Dynamic Programming

IDP2 with MPDP. MPDP can be incorporated into IDP2 by replacing the dp algorithm with MPDP

这一块切实我没看太懂,但关于IDP,这里贴个表记录下:

特点记忆化搜索(Top-down)IDP(Bottom-up)
实现方式递归 + 记忆迭代 + 表格
调用开销大(递归)小(循环)
代码简洁性中等
控制计算顺序不容易控制明确控制
适合复杂依赖关系

4.2 UnionDP

IDP2 might get stuck on a poor local optimum due to the initial plan choice and its greedy nature, resulting in suboptimal plan choices. Hence, we design a novel heuristic, UnionDP, that for the first-time leverages the graph topology for such large queries.

image-20250411155728574

This recursive idea helps UnionDP scale to 1000s of relations. More details of the heuristic is presented in [21].

21指向了这本篇论文的预印本(这什么情况?)https://arxiv.org/abs/2202.13511

MPDP: GPU IMPLEMENTATION

We now present MPDP’s GPU implementation details. GPUs provide a much higher degree of parallelism compared to multi-core CPUs.

“a much higher degree of parallelism”这也是矩阵计算使用GPU的原因

In our implementation, sets of relations (including adjacency lists of base relations) are represented using a fixed-width bitmap sets. The memo table is implemented using the fast Murmur3 hashing algorithm (a simple open-addressing hash table). Algorithm 5 shows the general workflow of MPDP on GPUs.

哎,这有意思,这里可以介绍下Murmur3:

它使用了一些非常高效的操作:位操作(rotate、shift),模拟乘法混合(multiplicative mix),最小化缓存未命中,可以通过在实现中手动指定字节顺序(little endian),避免了在不同硬件架构上出现不同的哈希结果(或许利好GPU运行?)

image-20250411162253807

The presented algorithm is inspired by the COMPGPU algorithm from Meister et. al [24] and could be used to implement on GPU.

24指向了一篇 技术报告

还提到了两个优化:减少分支产生和全局变量读写——使用GPU的软件都会在这两个方面进行优化

文中将其分为两类:Optimal Algorithms和Heuristic Solutions

等于是回顾了下Join Order Optimization的历史😁

Optimal Algorithms

DPSIZE [28] – which is currently used in many open-source and commercial databases, like PostgreSQL and IBM DB2

DPSUB [35] uses a subset driven way of plan enumeration (detailed in Section 2)

Moerkette et al. [25] propose the DPCCP algorithm, which uses a join graph based enumeration, and evaluates only CCP-Pairs

支持更好并行化的

PDP [10] discusses a CPU parallel version of DPSIZE.

DPE [11] proposes a parallelization framework that can parallelize any DP algorithm, including DPCCP, based on a producerconsumer paradigm

也有上GPU的

More recently, Meister et al. [24] proposed GPU versions of DPSIZE and DPSUB algorithms.

Heuristic Solutions

IKKBZ [14, 18] limits the search space to left-deep plans

Trummer and Koch formulate the join order optimization problem as a Mixed Integer Linear Programming (MILP) problem

Techniques such as GOO [8] and min-sel [32] greedily choose the best subplans to find the query plan.

ubplans to find the query plan. IDP [17], introduces a new class of algorithms, called Iterative Dynamic Programming which we have discussed in Section 4.

More recently, Neumann et al. [27] proposed a new technique to reduce the search space, called linearized DP (LinDP)

There are also randomized algorithms proposed based on Simulated Annealing and Iterative Improvement [15], Genetic Algorithms [5, 13], Random Sampling [36].

EXPERIMENTAL RESULTS

实验主要集中在10个及更多的关系上

7.1 Experimental Setup

We use a server with dual Intel Xeon E5-2650L v3 CPU, with each CPU having 12 cores and 24 threads with 755GB of RAM. We use a Nvidia GTX 1080 GPU for running GPU based join algorithms. For the experiments in Section 7.5, we use Amazon AWS

看来是用不上最新的显卡😂不知道用老黄最新的GPU会不会有性能提升

Cost Model: The cost model used by a query optimizer plays an important role in determining the optimization time.

we use a more realistic cost model which is close to the one used by PostgreSQL.

Cost Model采用与PG类似的方案

7.2 Optimal Algorithm Evaluation

各类算法使用的数据参数如下:

• Postgres (1CPU): DPSIZE based join ordering implemented by PostgreSQL running on 1 CPU core.

• DPCCP (1CPU): State-of-the-art CPU Sequential DP algorithm, DPCCP [25], running on 1 CPU core.

• DPE (24CPU): State-of-the-art CPU parallel algorithm, DPE. We use the parallel version of DPCCP [11] running on 24 cores.

• DPSub (GPU): State-of-the-art GPU based DP algorithm [24] using DPSUB. The COMB-GPU version from [24] is used.

• DPSize (GPU): Another state-of-the-art GPU based DP algorithm [24] using DPSIZE. The H+F-GPU version is used here.

Other techniques such as sequential or parallel versions of DPSIZE and DPSUB on CPU run much slower than their GPU variants

哦?

运行了模拟测试(Synthetic Workload)和实际测试(Real-world Workload)

7.3 Heuristic Solution Evaluation

这里贴个BenchMark给大家看看,内容就不分析了😂

image-20250411201732170

7.4 Scalability on CPU

MPDP(CPU)有着比DPE(CPU)更好的Scalability

image-20250411203038062

7.5 Optimization Cost Comparison

Since the CPU algorithms do not scale linearly with large number of cores, we experimented with different AWS size instances and picked the one that is the most cost effective.

image-20250411203510452

关于图的结论:

For smaller queries, PostgreSQL and DPCCP are cheaper. However, for larger queries (beyond 15 rels), MPDP (GPU) turns out to be the cheapest. For instance, MPDP is around an order of magnitude cheaper compared to next best DPE from 23 relations.

超过15个Relation,(MPDP)GPU就会有不错的效果(利好JIT?Realtion少用CPU,非常多的话就用GPU)

评价

通篇读下来感觉论文过于干货😂一方面算法伪代码多,二是并没有这篇Paper并没有公开算法源代码(也许有用Apache Calcite?要不然就只能纯手搓了)

对GPU做Join比较感兴趣,但不得不承认现在显卡都被做AI的拿去了😂所以说GPU做Join是比较花里胡哨的

文章在Section2 补充了有关图的知识,我以为后面MPDP算法有很多有关图的论述,但实际上却只是用来解释CCP-Pairs

如果只是想了解下的话,看CMU15799的PPT就够了(Andy课后提到AMD的APU运行Join或许是个不错的建议?)

image-20250411210054869