Using Apache Iceberg with Python and MPP Query Engines
Alex Merced · 2026-05-23 · via DEV Community

This is Part 12 of a 15-part Apache Iceberg Masterclass. Part 11 covered metadata tables. This article covers the two main ways to access Iceberg data: directly from Python libraries and through MPP (massively parallel processing) query engines.

Table of Contents

  1. What Are Table Formats and Why Were They Needed?
  2. The Metadata Structure of Current Table Formats
  3. Performance and Apache Iceberg's Metadata
  4. Technical Deep Dive on Partition Evolution
  5. Technical Deep Dive on Hidden Partitioning
  6. Writing to an Apache Iceberg Table
  7. What Are Lakehouse Catalogs?
  8. Embedded Catalogs: S3 Tables and MinIO AI Stor
  9. How Iceberg Table Storage Degrades Over Time
  10. Maintaining Apache Iceberg Tables
  11. Apache Iceberg Metadata Tables
  12. Using Iceberg with Python and MPP Engines
  13. Streaming Data into Apache Iceberg Tables
  14. Hands-On with Iceberg Using Dremio Cloud
  15. Migrating to Apache Iceberg

The Python Ecosystem for Iceberg

How Python libraries and MPP engines connect to Iceberg tables

PyIceberg: Native Python Access

PyIceberg is the official Python library for Apache Iceberg. It reads Iceberg metadata directly and can scan data files without an external query engine.

from pyiceberg.catalog import load_catalog

# Connect to a REST catalog
catalog = load_catalog("my_catalog", **{
    "type": "rest",
    "uri": "https://catalog.example.com",
})

# Load and scan a table
table = catalog.load_table("analytics.orders")
scan = table.scan(row_filter="amount > 100")
df = scan.to_pandas()


The five-step PyIceberg workflow from catalog connection to analysis

PyIceberg leverages Iceberg's metadata-driven pruning: the row_filter is evaluated against the partition values and column statistics recorded in the manifests, so data files that cannot contain matching rows are never opened. This keeps reads fast when you pull subsets of large tables into Python for analysis or ML training.
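To make the pruning step concrete, here is a minimal, library-free sketch of how per-file min/max statistics let a scan skip files for a filter like amount > 100. The DataFile class, the file names, and the statistics are hypothetical. PyIceberg's real evaluators work on manifest entries and cover many more cases (nulls, compound predicates, partition values), so treat this as an illustration of the idea rather than the library's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a manifest entry with column bounds."""
    path: str
    amount_min: float
    amount_max: float


def files_for_amount_greater_than(files, threshold):
    """Keep only files whose upper bound could satisfy amount > threshold."""
    return [f for f in files if f.amount_max > threshold]


# Made-up statistics for three data files of an orders table
manifest = [
    DataFile("orders-0001.parquet", amount_min=5.0, amount_max=80.0),
    DataFile("orders-0002.parquet", amount_min=40.0, amount_max=250.0),
    DataFile("orders-0003.parquet", amount_min=120.0, amount_max=900.0),
]

to_read = files_for_amount_greater_than(manifest, 100)
print([f.path for f in to_read])
```

The first file is skipped without being opened because its largest amount is below the threshold. The second file is still read even though some of its rows will not match, which is why the filter is applied again to the rows themselves.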

PyIceberg also supports writes (appending data from Arrow tables), schema evolution, and table management operations. It connects to any catalog that implements the REST protocol, including Dremio Open Catalog.

DuckDB: SQL-Based Python Analysis

DuckDB can read Iceberg tables through its Iceberg extension:

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg;")

df = conn.execute("""
    SELECT customer_id, SUM(amount) as total
    FROM iceberg_scan('s3://warehouse/orders')
    GROUP BY customer_id
""").fetchdf()


DuckDB processes the query locally using its vectorized columnar execution engine, which is typically much faster than pandas for analytical queries. It supports Iceberg's partition pruning and column statistics for file skipping. DuckDB runs entirely in-process, so there is no separate server to manage. This makes it a strong choice for local analysis, CI/CD data validation, and notebooks where starting a Spark cluster would be overkill.

DuckDB's Iceberg extension also exposes table metadata through functions such as iceberg_snapshots and iceberg_metadata, which means you can use it for table health diagnostics without standing up a full query engine.

Polars: High-Performance DataFrames

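Polars can read Iceberg tables through its scan_iceberg function, providing lazy evaluation and parallel processing:

import polars as pl

# scan_iceberg takes a PyIceberg table (here, the `table` loaded earlier)
# or a path to the table's current metadata JSON file
df = pl.scan_iceberg(table).filter(
    pl.col("amount") > 100
).collect()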


Polars uses a lazy evaluation model: the scan_iceberg call does not read data immediately. Instead, it builds an execution plan. When collect() is called, Polars optimizes the plan (predicate pushdown, column pruning, parallel reads) and executes it. For large Iceberg tables, Polars can scan data several times faster than pandas because it uses all available CPU cores and processes data in Apache Arrow columnar format.
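The lazy model can be illustrated with a toy example in plain Python. The LazyScan class below is hypothetical and is not how Polars is implemented. It only shows the core idea that filter and select record steps in a plan, and no rows are touched until collect() runs.

```python
class LazyScan:
    """Toy lazy frame: records operations, runs them only on collect()."""

    def __init__(self, rows, predicates=(), columns=None):
        self._rows = rows
        self._predicates = tuple(predicates)
        self._columns = columns

    def filter(self, predicate):
        # Nothing is read here; the predicate is only added to the plan.
        return LazyScan(self._rows, self._predicates + (predicate,), self._columns)

    def select(self, *columns):
        # Likewise, the projection is only recorded.
        return LazyScan(self._rows, self._predicates, columns)

    def collect(self):
        # Execute the plan: apply every filter, then keep only requested columns.
        out = []
        for row in self._rows:
            if all(p(row) for p in self._predicates):
                cols = self._columns or tuple(row)
                out.append({c: row[c] for c in cols})
        return out


# Made-up rows standing in for an orders table
orders = [
    {"order_id": 1, "amount": 50.0, "region": "EU"},
    {"order_id": 2, "amount": 150.0, "region": "US"},
    {"order_id": 3, "amount": 300.0, "region": "EU"},
]

plan = LazyScan(orders).filter(lambda r: r["amount"] > 100).select("order_id", "amount")
result = plan.collect()
print(result)
```

Because the whole plan is known before execution starts, a real engine can reorder it, for example by pushing the filter into the file scan and reading only the two selected columns.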

Writing from Python

PyIceberg supports writes through Apache Arrow tables:

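import pyarrow as pa

# Create an Arrow table with new data. Column names and types must be
# compatible with the Iceberg table's schema (for example, a date column
# needs Arrow date values rather than strings).
new_data = pa.table({
    "order_id": [1001, 1002, 1003],
    "amount": [150.00, 275.50, 89.99],
    "order_date": ["2024-03-15", "2024-03-15", "2024-03-16"],
})

# Append to the Iceberg table loaded earlier with catalog.load_table
table.append(new_data)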


This creates a new Iceberg commit with the data files, manifests, and metadata. PyIceberg handles the entire write lifecycle, including partition assignment based on the table's partition spec.
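As a rough illustration of partition assignment, the sketch below groups incoming rows by a day transform before they would be written to data files. It assumes the table is partitioned by day(order_date), which the example above does not state, and the helper names are hypothetical. PyIceberg performs this step internally on Arrow data.

```python
from collections import defaultdict
from datetime import date

EPOCH = date(1970, 1, 1)


def day_transform(value: date) -> int:
    """Iceberg's day transform: number of days since 1970-01-01."""
    return (value - EPOCH).days


def assign_partitions(rows, source_column, transform):
    """Group rows by the partition value derived from source_column."""
    groups = defaultdict(list)
    for row in rows:
        groups[transform(row[source_column])].append(row)
    return dict(groups)


rows = [
    {"order_id": 1001, "amount": 150.00, "order_date": date(2024, 3, 15)},
    {"order_id": 1002, "amount": 275.50, "order_date": date(2024, 3, 15)},
    {"order_id": 1003, "amount": 89.99, "order_date": date(2024, 3, 16)},
]

partitions = assign_partitions(rows, "order_date", day_transform)
for key, group in sorted(partitions.items()):
    # Each group would become at least one data file in that partition.
    print(key, [r["order_id"] for r in group])
```

The three rows land in two partitions, so a single append would produce at least two data files, each tracked in the new manifest with its partition value.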

For small and medium batch writes from Python, PyIceberg with Arrow is often simpler than setting up Spark. However, PyIceberg runs on a single machine, so it is not suitable for writing terabyte-scale datasets. For that, use an MPP engine.

MPP Query Engines

Comparison of MPP engines for Iceberg workloads showing read, write, and maintenance capabilities

For production workloads at scale, Python libraries running on a single machine are not sufficient. MPP engines distribute query execution across multiple nodes, which lets them handle petabyte-scale tables, often with sub-minute response times.
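The scatter-gather pattern behind distributed execution can be sketched on one machine. In the example below, threads stand in for cluster nodes and small lists stand in for data files. Both are simplifications, and the function names are hypothetical. Each worker aggregates only its own file, and a coordinator merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_totals(file_rows):
    """Work done by one 'node': aggregate only the rows it was assigned."""
    totals = Counter()
    for customer_id, amount in file_rows:
        totals[customer_id] += amount
    return totals


def distributed_sum(files, workers=3):
    """Coordinator: fan out one task per file, then merge partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_totals, files))
    merged = Counter()
    for part in partials:
        merged.update(part)  # Counter.update adds values per key
    return dict(merged)


# Three made-up "data files" of (customer_id, amount) rows
files = [
    [("c1", 100), ("c2", 40)],
    [("c1", 25), ("c3", 60)],
    [("c2", 10), ("c3", 5)],
]

totals = distributed_sum(files)
print(totals)
```

This mirrors a SUM ... GROUP BY customer_id query: only the small partial aggregates travel back to the coordinator, not the raw rows, which is what lets the approach scale with the number of nodes.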

Dremio

Dremio provides full Iceberg support with several unique capabilities: query federation across Iceberg and non-Iceberg sources, automatic table optimization through Open Catalog, a semantic layer for governed access, and AI-powered analytics through its built-in agent and MCP server.

For Python users, Dremio exposes data through Apache Arrow Flight, which is a high-performance data transfer protocol. Arrow Flight sends data in columnar Arrow format directly to the client, avoiding the serialization overhead of JDBC/ODBC. This can make it 10-100x faster than traditional database connectors for large result sets:

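from dremio_simple_query.connect import DremioConnection

# Personal access token and Arrow Flight endpoint; adjust for your deployment
token = "..."
arrow_endpoint = "grpc+tls://data.dremio.cloud:443"

conn = DremioConnection(token, arrow_endpoint)
df = conn.toPandas("SELECT * FROM analytics.orders WHERE amount > 100")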


The result is a pandas DataFrame populated via Arrow Flight. Because results stay in Arrow's columnar format from Dremio's execution engine through Arrow Flight to the client, the only conversion left is the final Arrow-to-pandas step, which avoids row-by-row serialization bottlenecks.

Dremio also provides a Columnar Cloud Cache that stores frequently accessed data on local NVMe drives, making subsequent queries against the same Iceberg data dramatically faster without requiring reflections or materialized views.

Spark

Apache Spark is the most mature Iceberg engine for both reads and writes. It handles batch ETL, streaming ingestion (Part 13), and all maintenance operations. Many Iceberg production pipelines use Spark for data ingestion because of its extensive connector ecosystem (Kafka, JDBC, file formats) and its ability to process large volumes across a distributed cluster.

Spark supports all Iceberg operations: CREATE, INSERT, MERGE, DELETE, UPDATE, schema evolution, partition evolution, and every maintenance procedure (compaction, snapshot expiry, orphan cleanup).

Trino

Trino (formerly PrestoSQL) is optimized for interactive, ad-hoc queries with low latency. It reads and writes Iceberg tables and supports the REST catalog protocol. Trino is popular for exploration and dashboarding workloads where sub-second response times matter and data is being read rather than written. Its architecture keeps no persistent state, making it easy to scale up and down based on query demand.

Other Engines

Several other engines provide Iceberg support: Amazon Athena (serverless, AWS-native), Snowflake (read-only for external Iceberg tables), StarRocks (sub-second analytics), and Apache Doris (real-time analytics). The Iceberg community maintains a compatibility matrix showing which engines support which operations.

Choosing the Right Approach


The key takeaway: Python libraries (PyIceberg, DuckDB, Polars) are best for local analysis and development. MPP engines (Dremio, Spark, Trino) are necessary for production-scale analytics. Many teams use both: PyIceberg for data science experimentation, and Dremio for production dashboards and governed access.
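As a rough way to encode this rule of thumb, the sketch below maps a workload description to one of the two approaches. The function name and both cutoffs are hypothetical and chosen only for illustration. The right thresholds depend on your hardware, data layout, and concurrency needs.

```python
# Hypothetical cutoffs, not recommendations from the Iceberg project
SINGLE_MACHINE_LIMIT_GB = 100
MAX_LOCAL_USERS = 5


def suggest_access_path(scan_size_gb, concurrent_users=1, needs_governance=False):
    """Return 'python-library' or 'mpp-engine' for a workload description."""
    if needs_governance or concurrent_users > MAX_LOCAL_USERS:
        return "mpp-engine"
    if scan_size_gb > SINGLE_MACHINE_LIMIT_GB:
        return "mpp-engine"
    return "python-library"


# A data scientist exploring a few gigabytes in a notebook
notebook = suggest_access_path(scan_size_gb=5)

# Governed dashboards over terabytes with many concurrent users
dashboards = suggest_access_path(
    scan_size_gb=5_000, concurrent_users=200, needs_governance=True
)
print(notebook, dashboards)
```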

Part 13 covers how to stream data into Iceberg tables.
