惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

SecWiki News
SecWiki News
I
InfoQ
The Cloudflare Blog
人人都是产品经理
人人都是产品经理
博客园 - Franky
T
Tailwind CSS Blog
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
量子位
博客园_首页
罗磊的独立博客
V
V2EX
李成银的技术随笔
大猫的无限游戏
大猫的无限游戏
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
T
True Tiger Recordings
Vercel News
Vercel News
Cyberwarzone
Cyberwarzone
Cisco Talos Blog
Cisco Talos Blog
F
Fox-IT International blog
D
Darknet – Hacking Tools, Hacker News & Cyber Security
M
Microsoft Research Blog - Microsoft Research
Know Your Adversary
Know Your Adversary
爱范儿
爱范儿
The Register - Security
The Register - Security
G
Google Developers Blog
The Hacker News
The Hacker News
Malwarebytes
Malwarebytes
S
Securelist
博客园 - 三生石上(FineUI控件)
Jina AI
Jina AI
T
Threat Research - Cisco Blogs
T
The Exploit Database - CXSecurity.com
S
SegmentFault 最新的问题
博客园 - 叶小钗
F
Fortinet All Blogs
Apple Machine Learning Research
Apple Machine Learning Research
宝玉的分享
宝玉的分享
博客园 - 聂微东
T
Threatpost
博客园 - 【当耐特】
D
Docker
P
Privacy & Cybersecurity Law Blog
www.infosecurity-magazine.com
www.infosecurity-magazine.com
G
GRAHAM CLULEY
V
Visual Studio Blog
C
Cisco Blogs
IT之家
IT之家
S
Security Archives - TechRepublic
Latest news
Latest news
阮一峰的网络日志
阮一峰的网络日志

Mox的笔记库

细嗦下MLIR的环境搭建 | Mox的笔记库 博客重构:从Hexo到Astro | Mox的笔记库 2026PPoPP MLIR Tutorial学习 | Mox的笔记库 MacOS配置《明日方舟:终末地》 | Mox的笔记库 2025:向内生长 | Mox的笔记库 由mlir::ExecutionEngine引发的跨系统问题 | Mox的笔记库 WSL2配置Cuda-Tile环境记录(未完待续) | Mox的笔记库 Vibe Coding手搓项目记录 | Mox的笔记库 给Debian上包——以DuckDB为例 | Mox的笔记库 UCPD.sys事件存档 | Mox的笔记库 换新电脑之Mac mini M4从购买到配置 | Mox的笔记库 Mac配置MLX-C开发环境 | Mox的笔记库 RISC-V meets RDBMS——RISC-V架构上可运行数据库一览 | Mox的笔记库 DuckDB Sort实现调查 | Mox的笔记库 修复Redis在树莓派5上无法运行的问题 | Mox的笔记库 如何在MLIR中自定义类型并且输出运行 | Mox的笔记库 网站网络结构变更记录 | Mox的笔记库 EDBT25论文阅读:PhoebeDB——A Disk-Based RDBMS Kernel for High-Performance and Cost-Effective OLTP SIGMOD25论文阅读:BPF-DB:——A Kernel-Embedded Transactional Database Management System For eBPF Applications SIGMOD24文章阅读:Query Compilation Without Regrets | Mox的笔记库 论文阅读:Designing an Open Framework for Query Optimization and Compilation Apache Arrow Gandiva项目解析 | Mox的笔记库 VLDB24论文阅读:Cloud-Native Database Systems and Unikernels——Reimagining OS Abstractions for Modern Hardware NoisePage源码分析(未完待续) | Mox的笔记库 VLDB20论文阅读:Mainlining Databases——Supporting Fast Transactional Workloads on Universal Columnar Data File Formats VLDB17论文阅读:Relaxed Operator Fusion for In-Memory Databases:Making Compilation, Vectorization, and Prefetching Work Together At Last 论文阅读:How not to structure your database-backed web applications——a study of performance bugs in the wild SIGMOD24阅读:ROME——Robust Query Optimization via Parallel Multi-Plan Execution 文章阅读:First Past the Post-Evaluating Query Optimization in MongoDB SIGMOD文章阅读:Apache Calcite——A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources VLDB23论文阅读:Analyzing the Impact of Cardinality Estimation on Execution Plans in Microsoft SQL Server SIGMOD22论文阅读:Efficient Massively Parallel Join Optimization for Large Queries VLDB论文阅读:Weaving Relations for Cache Performance 论文阅读:UniKraft-Fast, Specialized Unikernels the Easy Way 当DuckDB遇上RISC-V | Mox的笔记库 SIGMOD25论文阅读:An Elephant Under The Microscope——Analyzing The Interaction Of Optimizer Components In PostgreSQL 论文阅读:Compile-Time Analysis of Compiler Frameworks for Query Compilation VLDB23阅读:Bringing Compiling Databases to RISC Architectures LingoDB源码编译与分析 | Mox的笔记库 淦!MLIR输出Hello World不应该这么难! | Mox的笔记库 如何愉快的运行一个MLIR程序 | Mox的笔记库 2024:拥挤年代的想象与创造 | Mox的笔记库 如何给自己的博客添加MLIR和LLVM IR语法高亮 | Mox的笔记库 VLDB19-Parsing Gigabytes of JSON per Second论文阅读 CIDR25:Runtime-Extensible Parsers阅读 | Mox的笔记库 MLIR学习资料整理 | Mox的笔记库 SIGMOD24文章阅读:VeriTxn | Mox的笔记库 VLDB23文章阅读——Exploiting Cloud Object Storage for High-Performance Analytics VLDB24——OLAP on Modern Chiplet-Based Processors走马观花阅读 VLDB22:YeSQL文章阅读(已废弃) | Mox的笔记库 如何让数据库中的Python跑的更快-VLDB22-YeSQL文章阅读 | Mox的笔记库 你好,世界! | Mox的笔记库 让系统研究更有意义:HarmonyOS NEXT的教训和经验——讲座回顾 | Mox的笔记库 UNSW 24T3 COMP9336上课记录 | Mox的笔记库 Velox开发环境配置踩坑记录 | Mox的笔记库 MLIR Toy Tutorial实践记录 | Mox的笔记库 论文阅读:Declarative Sub-Operators for Universal Data Processing LLVM-Kaleidoscope实操踩坑记录 | Mox的笔记库 2024年7月RSSHub开发体验 | Mox的笔记库 澳洲大学计算机硕士比较 | Mox的笔记库 论文阅读——CDUL:CLIP-Driven Unsupervised Learning for Multi-Label Image Classification 论批量快速添加图片与视频水印的事 | Mox的笔记库 CVPR2023-CLIP算法调研 | Mox的笔记库 基于元信息写入的服务器压力测试 | Mox的笔记库 MjAyMw==,希望,前进与平庸之道 | Mox的笔记库 家庭组网IPv6+Mesh折腾 | Mox的笔记库 code-server初体验 | Mox的笔记库 从Nginx到Caddy | Mox的笔记库 Hexo部署安装全流程回顾 | Mox的笔记库 RMM观察与初探 | Mox的笔记库 计算机网络课设——UDP/TCP/TLS Socket实验 | Mox的笔记库 JQuery的XSS初探 | Mox的笔记库 生产实习记录 | Mox的笔记库 Fedora-CoreOS配置与试用(2023年) | Mox的笔记库 Electron学习笔记 | Mox的笔记库 ServerSentEvent学习 | Mox的笔记库 报告翻译:容器云的安全挑战 | Mox的笔记库 Arch Linux迁移计划 | Mox的笔记库 Vagrant配置Metarget靶场环境 | Mox的笔记库 OpenAI-whisper折腾 | Mox的笔记库 202202,困惑,混乱与未曾设想之路 | Mox的笔记库 2022年Hack the box:Tier1免费区全解 | Mox的笔记库 Navidrome部署记录 | Mox的笔记库 长安杯2021-snake复现 | Mox的笔记库 报告概要翻译:OBFUSCATING C++ PROGRAMS VIA CONTROL FLOW FLATTENING 从零开始的Django CVE-2022-28346复现 | Mox的笔记库 2022CISCN(西北区赛)-The shinning | Mox的笔记库 Docker+QEMU+Arm64(Ubuntu)+环境配置(2022版) | Mox的笔记库 Arch Linux运行树莓派系统(2022年) | Mox的笔记库 2022CISCN初赛-ez_usb-复盘WriteUp | Mox的笔记库 NodeMCU-MicroPython配置实录 | Mox的笔记库 Django事务使用 | Mox的笔记库 记录第一次EduSRC上报 | Mox的笔记库 Jetbrain问题应急处理 | Mox的笔记库 Celery5.2学习&配置 | Mox的笔记库 Waline部署记录 | Mox的笔记库 2021年12月 Vivo千镜杯回顾 | Mox的笔记库 Frida hook初次实战 | Mox的笔记库 Log4j2漏洞复现 | Mox的笔记库 Windows的WSL2+Docker初探 | Mox的笔记库
VLDB22论文阅读:ConnectorX——Accelerating Data Loading From Databases to Dataframes
2025-03-26 · via Mox的笔记库

SFU的工作,关于解决DataBase和DataFrame之间的Gap问题

数据一般都是存在DataBase里,但数据科学的人员一般都是用DataFrame

Note: 从VLDB下载的文章似乎有问题——复制粘贴会有乱码

文章:https://dl.acm.org/doi/10.14778/3551793.3551847

Github仓库:https://github.com/sfu-db/connector-x

Introduction

Thus, the first step in most data science applications is to load data from the DBMS. Unfortunately, this data loading process is not only notoriously slow but also consumes inordinate amounts of client memory, which easily leads to out-of-memory errors or performance degradation.

如果是上课用的小数据这不算啥,但如果数据量大的话那可确实是“notoriously slow”,但”Out-of-memory”这就有点搞了😆

bridging the gap between databases and dataframes is of great interest to both academia and industry

真实的,确凿的

This paper describes ConnectorX, a fast and memory-efficient data loading library that supports many DBMSs (e.g., PostgreSQL, SQLite, MySQL, SQLServer, Oracle) to client dataframes (e.g., Pandas, PyArrow, Modin, Dask, and Polars).

这看起来不错👍

image-20250326173204981

从图表可以看出,确实有效降低了读取时间和使用的内存

文章提出了四个问题,这些问题会在后边得到解答

First, where are the actual data loading bottlenecks?

Second, how do we both reduce the runtime and memory, while also making the system extensible to new DBMSs?

Third, are widely used client query partitioning scheme good enough?

Fourth, does a new data loading library matter?

文章指出:Bottlenecks来自服务端和客户端

而对于第二个问题,他们设计了一套DSL解决

In Section 4, we design a succinct domain-specific language (DSL) for mapping DBMS result representations into dataframes—this reduces the lines of code by 1-2 orders of magnitude as compared to not using our DSL.

单独设计一门DSL解决这个问题?😯看看后面怎么说

It is also integrated into popular open source projects such as Polars [17], Modin [61] and DataPrep [14].

我写文章时ConnectX的stars数是2.2k,看来大家是认可这项工作的

Server层面

Arrow-Flight(零拷贝成本完成往Arrow IPC的转换)

各类云厂商的ETL工具

Client层面

Pandas,Modin, Dask, Spark各自有文件读取方案

Integrating DBMS with Data Science

Embedded analytical system DuckDB [64] is developed to avoid the bottlenecks of result set serialization and value-based APIs by making DBMS and analytic tools in the same address space.

DuckDB也是个不错的选项(Server端和Client端不会有太大的传输问题)

AN ANATOMY OF PANDAS.READ_SQL

Where Does the Time Go?

From the client’s perspective, the overall process has three major steps:

(1) Execution + Transfer: Server executes the query and sends the result back to the client through network in bytes following a specific wire protocol.

(2) Deserialization: Client parses the received bytes and returns the query result in the form of Python objects.

(3) Conversion to dataframe: Client converts the Python objects into NumPy [46] arrays and constructs the final dataframe.

列举Client可能出现的造成延时的问题

image-20250326181212276

A surprising finding is that the majority of the time is actually spent on the client side rather than on the query execution

Where Does the Memory Go?

Next, we inspect the memory footprint of running read_sql and show the results in Table 1. Raw Bytes, Python Objects, and Dataframe represent the size of the bytes the client received, the intermediate Python objects, and the final dataframe, respectively.

image-20250326182240237

有意思,虽然知道Raw Bytes转到Dataframe会数据膨胀,但能达到两到三倍,峰值可以到8-10倍确实没想到过

In Python, every object contains a header structure that maintains information like reference count and object’s type in addition to the data itself. This will add some overhead on the size of the data.

Get,文章列举了Python Object有各种数据膨胀

Specically, Pandas.read_sql keeps three copies of the entire data in memory, which are stored in three different formats: Raw Bytes, Python Objects, and Dataframe.

卧槽,Pandas.read_sql居然同时保留三样数据😫Get

How Much Can Chunking Help?

有意思,数据科学的人员遇到这类问题一般都会选择分块解决

We see that chunking is indeed very elective in reducing memory usage because it does not hold all the intermediate results in memory. …… However, it has little help in improving the running time of read_sql

Chunk有助于解决内存问题,但时间确实无法降低,还需要编写额外的代码

How to Speed Up?

image-20250326185100279

Learning from chunking, ConnectorX adopts a streaming work￿ow, where the client loads and processes a small batch of data at a time.

流式分块处理

Figure 4 illustrates the overall work￿ow, which consists of two two phases: Preparation Phase ( 1 - 3 ) and Execution Phase ( 4 - 6 ).

准备阶段和执行阶段,Get

ConnectorX automatically sets the range by issuing query SELECT MIN(ID), MAX(ID) FROM Students. Then, ConnectorX equally partitions the range into 3 splits and generate three subqueries

q1: SELECT … FROM Students WHERE ID < 1,000,000

q2: SELECT … FROM Students WHERE ID 2 [1,000,000, 2,000,000)

q3: SELECT … FROM Students WHERE ID 2,000,000

将SQL语句拆成三条进行并行化执行

However, the buffers that the string objects point to have to be allocated on-the-fly after knowing the actual length of each value.

数据库的字符串存储确实是一个问题

Moreover, constructing a string object is not thread-safe in Python. It needs to acquire the Python Global Interpreter Lock (GIL), which could slow down the whole process when the degree of parallelism is large

哦?Python还有这事?Mark

To alleviate this overhead, ConnectorX constructs a batch of strings at a time while acquiring the GIL instead of allocating each string object separately.

通过一次构造一批字符串从而缓解这个问题

We choose Rust since it provides efficient performance and guarantees memory safety. In addition, there is a variety of high-performance client drivers for different databases in Rust that ConnectorX can directly build on

对,ConnectorX是用Rust开发的,这点很棒

How to Extend?

Overall Architecture. ConnectorX consists of three main modules: Source (e.g. PostgreSQL, MySQL), Destination (e.g. Pandas, PyArrow) and a bridge module Type Mapper in the middle to de￿ne how to convert the physical representation for the data from Source to Destination.

image-20250326193158202

这个Type Mapper……看看后面怎么说?

A Type Mapper module consists of a set of rules that specify how to convert data from a specific Source type to a specific Destination type. During runtime, each subquery will be handled by a single thread.

这里的SubQuery没有明说,我猜是不是Bulk的一部分先从Source到Destination

To mitigate the aforementioned issues, ConnectorX de￿nes a domain specific language (DSL) to help the developers define the type mappings, leveraging the modern macro support in Rust

啧,定义了一套DSL用于应对复杂类型转换

Each line consists of three parts: logical type and corresponding physical type of Source, the matched logical type and physical type of Destination, and the conversion implementation choice including auto, none, and option.

看起来怎么那么像MLIR的Value的体系😂

image-20250326204539841

This simple DSL makes the relation of type mapping intuitive and easy to maintain. We found it has helped shorten code related to type mapping by 97% (from 37k to 1k lines of code).

看起来不错,达成DSL所应具备的(不知道DSL实现模板怎么样)

QUERY PARTITIONING

Client-Side Query Partitioning

Figure 7 shows the network utilization of ConnectorX by varying the number of partitions. It is clear that No Partition cannot saturate the network bandwidth at all.

image-20250326205207627

虽然Client-Side Query Partitioning被广泛应用,但会造成数据不一致性和用户负担

Server-Side Result Partitioning

To conclude, server-side partitioning allows the client to fully leverage the network and computation resources

他们在PG服务端上进行了测验,服务端的Partitioning基本不会有什么问题

EVALUATION

使用TPC-H(大小SF10)和一个名为DDoS的数据集

详情可见https://github.com/sfu-db/connectorx-bench

image-20250326212151576

image-20250326212247108

没什么争议,无论是想的,还是实际效果都很棒🤩

论文里还有Ablation Study,我对前面的结果很满意,这里就不放了

思考

这个数据库开发提了个醒:新设计的数据库应当具备良好的转为DataFrame的特性,这点我想也是DuckDB能走红的原因

那个Type Mapper给我一种Prisma的感觉,但细看又像MLIR的Value体系(其实MLIR就是个非常棒的DSL设计体系)