SIGMOD Paper Reading - Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources
2025-04-21 · via Mox的笔记库

Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD.

This paper was mentioned in Lecture #20 of the 2025 CMU-15799 course, and Apache Calcite is also the software used for that course's Project #1.

Apache Calcite project website: https://calcite.apache.org/

Source code mirror: https://github.com/apache/calcite

Paper download link provided by CMU: https://15799.courses.cs.cmu.edu/spring2025/papers/20-calcite/p221-begoli.pdf

INTRODUCTION

As organizations have invested in data processing systems tailored towards their specific needs, two overarching problems have arisen:

• The developers of such specialized systems have encountered related problems, such as query optimization [4, 25] or the need to support query languages such as SQL and related extensions (e.g., streaming queries [26]) as well as language-integrated queries inspired by LINQ [33]. Without a unifying framework, having multiple engineers independently develop similar optimization logic and language support wastes engineering effort.

• Programmers using these specialized systems often have to integrate several of them together. An organization might rely on Elasticsearch, Apache Spark, and Druid. We need to build systems capable of supporting optimized queries across heterogeneous data sources [55].

In short: before Apache Calcite came along, query optimization was grunt work, everyone kept reinventing the wheel, and even then none of it was reusable across systems 🤣

Furthermore, Calcite enables cross-platform optimization by exposing a common interface to multiple systems. To be efficient, the optimizer needs to reason globally, e.g., make decisions across different systems about materialized view selection.

This is probably the earliest work to put the idea this way (Apache DataFusion and LingoDB both came later).

Building a common framework does not come without challenges. In particular, the framework needs to be extensible and flexible enough to accommodate the different types of systems requiring integration.

Making a query optimization framework flexible and extensible while also keeping it fast is probably not easy 😄

Flexible query optimizer. Each component of the optimizer is pluggable and extensible, ranging from rules to cost models. In addition, Calcite includes support for multiple planning engines. Hence, the optimization can be broken down into phases handled by different optimization engines depending on which one is best suited for the stage.

Optimization can be split into phases that you choose yourself. This idea has largely carried over into later systems.

Calcite is reliable, as its wide adoption over many years has led to exhaustive testing of the platform. Calcite also contains an extensive test suite validating all components of the system including query optimizer rules and integration with backend data sources.

Wait, having plenty of test cases is something you can put in a paper? 😅

Orca [45] is a modular query optimizer used in data management products such as Greenplum and HAWQ.

Andy said in class that Orca is pretty good too, but that it later stopped being open source because of an acquisition.

The paper also mentions Spark's Catalyst Optimizer, Garlic, FORWARD, BigDAWG, Myria, and other approaches.

ARCHITECTURE

Figure 1 outlines the main components of Calcite’s architecture. Calcite’s optimizer uses a tree of relational operators as its internal representation.

[Figure 1: the main components of Calcite's architecture]

As you can see, it is a data processing system built around Java.

First, Calcite contains a query parser and validator that can translate a SQL query to a tree of relational operators.

This SQL front end draws on JavaCC + FreeMarker.

QUERY ALGEBRA

For instance, it has become common for OLAP, decision making, and streaming applications to use window definitions to express complex analytic functions such as moving average of a quantity over a time period or number of rows.

So SQL window functions seem to get dedicated optimization?
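To make the "moving average over a number of rows" case concrete, here is a minimal Python sketch of what such a window computes: the average of the current row and up to N-1 preceding rows. The function name and the sample numbers are assumptions made for illustration; nothing here comes from Calcite itself.

```python
from collections import deque

def moving_average(values, window_rows):
    """Average of the current row and up to window_rows - 1 preceding rows."""
    if window_rows < 1:
        raise ValueError("window_rows must be at least 1")
    window = deque(maxlen=window_rows)  # the oldest row drops out automatically
    running_sum = 0.0
    result = []
    for value in values:
        if len(window) == window.maxlen:
            running_sum -= window[0]  # this row is about to be evicted
        window.append(value)
        running_sum += value
        result.append(running_sum / len(window))
    return result

# Hypothetical per-row quantities, window of 2 rows.
averages = moving_average([10, 20, 30, 40], window_rows=2)
```

The running sum keeps each step O(1), so the window never has to be re-scanned for every row.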

For example, consider joining a Products table held in MySQL to an Orders table held in Splunk (see Figure 2). Initially, the scan of Orders takes place in the splunk convention and the scan of Products is in the jdbc-mysql convention.

?? It can join across databases. Wow 🤩 (But how does it deal with the different read latencies of the different databases?)

[Figure 2: joining the Products table (MySQL) with the Orders table (Splunk)]

ADAPTERS

Figure 3 depicts its components. Essentially, an adapter consists of a model, a schema, and a schema factory.

The model is a specification of the physical properties of the data source being accessed.

A schema is the definition of the data (format and layouts) found in the model.

……

The schema factory component acquires the metadata information from the model and generates a schema.

The data itself is physically accessed via tables.

[Figure 3: the components of an adapter]

This design pattern feels very Java 😂
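The model / schema factory / schema / table split can be sketched in a few lines of Python. This is only a toy under stated assumptions: the class names (`CsvTable`, `Schema`, `CsvSchemaFactory`) and the dict-shaped model are hypothetical, and inline CSV text stands in for a real data source. It does not reproduce Calcite's actual Java interfaces.

```python
import csv
import io

class CsvTable:
    """Physical access path: rows are produced only when the table is scanned."""
    def __init__(self, text):
        self._text = text

    def scan(self):
        yield from csv.DictReader(io.StringIO(self._text))

class Schema:
    """Definition of the data found in the model: a name plus its tables."""
    def __init__(self, name, tables):
        self.name = name
        self.tables = tables

class CsvSchemaFactory:
    """Reads the metadata in the model and generates a schema from it."""
    def create(self, model):
        tables = {name: CsvTable(text) for name, text in model["tables"].items()}
        return Schema(model["name"], tables)

# Hypothetical model: describes where the data lives (here, inline CSV text).
model = {
    "name": "shop",
    "tables": {"products": "productId,name\n1,pen\n2,ink\n"},
}
schema = CsvSchemaFactory().create(model)
rows = list(schema.tables["products"].scan())
```

The point of the indirection is that a planner only ever sees `Schema` and `scan()`, so a different backend can be swapped in by writing another factory.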

Calcite uses a physical trait known as the calling convention to identify relational operators which correspond to a specific database backend.

Calling convention? Noting this one down.
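One way to picture a calling convention is as a tag on each operator that says which engine runs it, with a converter inserted wherever a parent and its input disagree. The sketch below is a simplified illustration, not Calcite's implementation: the `Rel` class and `insert_converters` are made up, and choosing `spark` as the join's convention is an assumption for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Rel:
    op: str
    convention: str  # which backend executes this operator
    inputs: list = field(default_factory=list)

def insert_converters(node):
    """Wrap every input whose convention differs from its parent's in a Convert node."""
    converted = []
    for child in node.inputs:
        child = insert_converters(child)
        if child.convention != node.convention:
            child = Rel("Convert", node.convention, [child])
        converted.append(child)
    return Rel(node.op, node.convention, converted)

# Scans stay in their native engines; the join runs elsewhere (assumed: spark).
plan = Rel("Join", "spark", [
    Rel("Scan Orders", "splunk"),
    Rel("Scan Products", "jdbc-mysql"),
])
physical = insert_converters(plan)
```

After the pass, both scans still carry their own convention and each sits under a `Convert` node that moves its rows into the join's convention.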

QUERY PROCESSING AND OPTIMIZATION

For an example of a rule with more complex effects, consider the following query:

SELECT products.name, COUNT(*)
FROM sales
JOIN products
USING (productId)
WHERE sales.discount IS NOT NULL
GROUP BY products.name
ORDER BY COUNT(*) DESC;

The query corresponds to the relational algebra expression presented in Figure 4a. Because the WHERE clause only applies to the sales table, we can move the filter before the join as in Figure 4b.

[Figure 4: the relational algebra expression before (4a) and after (4b) moving the filter below the join]

That is, filter pushdown 🤔
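The rewrite from Figure 4a to 4b can be sketched as a single rule on a toy plan tree. Everything below is an illustrative assumption (the `Scan`/`Filter`/`Join` classes, the `tables` field that records which tables a predicate touches), and it only covers inner joins, where moving the filter is safe.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Filter:
    input: object
    predicate: str
    tables: frozenset  # tables the predicate refers to

@dataclass(frozen=True)
class Join:  # inner join only in this sketch
    left: object
    right: object
    key: str

def tables_of(node):
    """Set of base tables that feed into this node."""
    if isinstance(node, Scan):
        return frozenset({node.table})
    if isinstance(node, Filter):
        return tables_of(node.input)
    return tables_of(node.left) | tables_of(node.right)

def push_filter_below_join(node):
    """Filter(Join(l, r)) -> Join(Filter(l), r) if the predicate only touches l (or r)."""
    if isinstance(node, Filter) and isinstance(node.input, Join):
        join = node.input
        if node.tables <= tables_of(join.left):
            return Join(Filter(join.left, node.predicate, node.tables), join.right, join.key)
        if node.tables <= tables_of(join.right):
            return Join(join.left, Filter(join.right, node.predicate, node.tables), join.key)
    return node  # rule does not match: leave the plan unchanged

before = Filter(
    Join(Scan("sales"), Scan("products"), "productId"),
    "sales.discount IS NOT NULL",
    frozenset({"sales"}),
)
after = push_filter_below_join(before)
```

Because the predicate only mentions `sales`, the filter ends up directly above the `sales` scan, so fewer rows reach the join.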

Metadata providers.

Metadata is an important part of Calcite’s optimizer, and it serves two main purposes:

(i) guiding the planner towards the goal of reducing the cost of the overall query plan, and (ii) providing information to the rules while they are being applied.

Metadata is what makes query optimization work at all: both filter pushdown and the cardinality estimation mentioned in earlier posts depend on it.

I think designing a separate metadata provider is the right call.
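Here is a toy Python sketch of why the planner wants metadata: a provider with one pluggable handler per operator kind supplies row-count estimates, and a cost function uses them to compare two equivalent plans. The table sizes, the 50% filter selectivity, the join estimate, and all names are made-up assumptions; Calcite's real providers are richer than this.

```python
class MetadataProvider:
    """Pluggable row-count estimates: one handler per operator kind."""
    def __init__(self):
        self._handlers = {}

    def register(self, op, handler):
        self._handlers[op] = handler

    def row_count(self, node):
        return self._handlers[node[0]](self, node)

def cumulative_cost(provider, node):
    """Toy cost: rows produced by this node plus the cost of all its inputs."""
    children = [part for part in node[1:] if isinstance(part, tuple)]
    return provider.row_count(node) + sum(cumulative_cost(provider, c) for c in children)

TABLE_ROWS = {"sales": 1000, "products": 100}  # made-up statistics

provider = MetadataProvider()
provider.register("scan", lambda md, n: TABLE_ROWS[n[1]])
provider.register("filter", lambda md, n: md.row_count(n[1]) // 2)  # assume 50% pass
provider.register("join", lambda md, n: md.row_count(n[1]) * md.row_count(n[2]) // 100)

# Two equivalent plans: filter above the join vs. filter pushed below it.
filter_above = ("filter", ("join", ("scan", "sales"), ("scan", "products")))
filter_below = ("join", ("filter", ("scan", "sales")), ("scan", "products"))

cost_above = cumulative_cost(provider, filter_above)
cost_below = cumulative_cost(provider, filter_below)
```

With these numbers the pushed-down plan is cheaper, which is the kind of signal a cost-based planner needs in order to prefer it.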

As the metadata providers are pluggable, they are compiled and instantiated at runtime using Janino [27], a Java lightweight compiler.

It's even pluggable? 😯

Janino project page: https://janino-compiler.github.io/janino/

Planner engines

Calcite provides a cost-based planner engine and an exhaustive planner.

You can switch freely between the two.
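As I understand it, the exhaustive planner keeps firing rules until the expression stops changing. A minimal Python sketch of that fixpoint loop follows, using nested tuples as a stand-in plan and two made-up rewrite rules; none of the names correspond to Calcite classes.

```python
def drop_true_filter(node):
    """Filter(TRUE, x) -> x"""
    if node[0] == "filter" and node[1] == "TRUE":
        return node[2]
    return node

def merge_filters(node):
    """Filter(p, Filter(q, x)) -> Filter(q AND p, x)"""
    if node[0] == "filter" and node[2][0] == "filter":
        inner = node[2]
        return ("filter", f"{inner[1]} AND {node[1]}", inner[2])
    return node

def rewrite_once(node, rules):
    """One bottom-up pass: rewrite the input first, then try each rule on this node."""
    if node[0] == "filter":
        node = ("filter", node[1], rewrite_once(node[2], rules))
    for rule in rules:
        node = rule(node)
    return node

def exhaustive_plan(node, rules):
    """Repeat passes until no rule changes the plan any more."""
    while True:
        rewritten = rewrite_once(node, rules)
        if rewritten == node:
            return node
        node = rewritten

plan = ("filter", "a > 1", ("filter", "TRUE", ("filter", "b < 2", ("scan", "t"))))
final = exhaustive_plan(plan, [drop_true_filter, merge_filters])
```

A cost-based planner differs in that it keeps several alternatives around and picks the cheapest, instead of committing to every rewrite as it fires.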

Materialized views

Since it can read across databases, materialized views are presumably feasible as well.

EXTENDING CALCITE

Calcite can also support many different kinds of data.

Why does this chapter feel like padding 😂

Semi-structured Data

For example, Calcite contains an adapter for MongoDB [36], a document store which stores documents consisting of data roughly equivalent to JSON documents.

Not surprising: if MongoDB can do it, there is no technical obstacle to Calcite doing it too.

Streaming

Calcite provides first-class support for streaming queries [26] based on a set of streaming-specific extensions to standard SQL, namely STREAM extensions, windowing extensions, implicit references to streams via window expressions in joins, and others.

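Stream processing is what Spark and a whole crowd of other big-data platforms optimize for.

I'm really not familiar with this area 😅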
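For the windowing part, a tumbling window is the simplest case to picture: events are grouped into fixed, non-overlapping time buckets, and a result is emitted as soon as a bucket closes. The Python sketch below assumes events arrive in timestamp order as `(timestamp_seconds, payload)` pairs; the function name and the sample events are hypothetical and unrelated to Calcite's streaming SQL syntax.

```python
def tumbling_counts(events, window_seconds):
    """Yield (window_start, count) each time a fixed-size window closes.

    Assumes events arrive ordered by timestamp.
    """
    current_start = None
    count = 0
    for timestamp, _payload in events:
        start = timestamp - timestamp % window_seconds
        if current_start is not None and start != current_start:
            yield current_start, count  # previous window is complete
            count = 0
        current_start = start
        count += 1
    if current_start is not None:
        yield current_start, count  # flush the last, still-open window

events = [(0, "a"), (30, "b"), (61, "c"), (119, "d"), (120, "e")]
per_minute = list(tumbling_counts(events, window_seconds=60))
```

Being a generator, it never holds more than one window's state, which is the property that lets the same logic run over an unbounded stream.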

Geospatial Queries

Geospatial support is preliminary in Calcite, but is being implemented using Calcite’s relational algebra. The core of this implementation consists in adding a new GEOMETRY data type which encapsulates different geometric objects such as points, curves

An example query finds the country which contains the city of Amsterdam:

SELECT name FROM (
  SELECT name,
    ST_GeomFromText('POLYGON((4.82 52.43, 4.97 52.43, 4.97 52.33, 4.82 52.33, 4.82 52.43))') AS "Amsterdam",
    ST_GeomFromText(boundary) AS "Country"
  FROM country )
WHERE ST_Contains("Country", "Amsterdam");

Looks a lot like that GEO extension for PG.

Language-Integrated Query for Java

Though SQL remains the primary database language, many programmers favour language-integrated languages like LINQ

……

Calcite provides Language-Integrated Query for Java (or LINQ4J, in short) which closely follows the convention set forth by Microsoft’s LINQ for the .NET languages.

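Calcite uses Linq4j to read data from Java or from databases.

I'd recommend looking into LINQ. It gave me a Prisma-like feeling, and it really is powerful (LINQ should have come first, with Prisma following later).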
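For anyone who has not seen the style, the core of a language-integrated query is that filtering and projection are ordinary method calls chained on a collection, evaluated lazily. A rough Python imitation follows; the `Query` class and its methods are invented for illustration and are not the Linq4j or LINQ API.

```python
class Query:
    """Tiny fluent query wrapper over any iterable; evaluation is lazy."""
    def __init__(self, source):
        self._source = source

    def where(self, predicate):
        return Query(item for item in self._source if predicate(item))

    def select(self, projection):
        return Query(projection(item) for item in self._source)

    def to_list(self):
        return list(self._source)  # nothing runs until this point

products = [
    {"name": "pen", "price": 3},
    {"name": "ink", "price": 12},
]
expensive_names = (
    Query(products)
    .where(lambda p: p["price"] > 5)
    .select(lambda p: p["name"])
    .to_list()
)
```

The same chain could in principle be translated into a query plan instead of being run in memory, which is the part that makes the approach interesting for a framework like Calcite.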

INDUSTRY AND ACADEMIA ADOPTION

image-20250422104250266

image-20250422104535254

The platforms and adapters supported at the time the paper was published.

FUTURE WORK

Though Calcite contains a performance testing module, it does not evaluate query execution. It would be useful to assess the performance of systems built with Calcite.

Uh 😅 so this is why the paper has no benchmark?

Based on real-world experience, we believe that more ambitious goals are possible for integrated multiple systems: they should be superior to the sum of their parts.

All right, be calm😆

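CONCLUSION

A rare paper that, surprisingly, has no benchmark.

It is not long and has no dizzying tables of numbers, so it is a comfortable read.

On the one hand, this is genuinely interesting work: one unified SQL adaptation layer for everything.

On the other hand, not many people actually work on query optimization 🤔. It takes real engineering investment, yet the payoff is quite limited.

If I had to nitpick: has Apache Calcite given up a lot of performance for the sake of generality, and could that lead to a weakest-link effect? It also still uses the Volcano model rather than the currently more fashionable push model.

If you want to study it systematically, take a look at this Zhihu article: Apache Calcite系列(一):整体流程解析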