惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

博客园 - 司徒正美
aimingoo的专栏
aimingoo的专栏
MongoDB | Blog
MongoDB | Blog
云风的 BLOG
云风的 BLOG
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
酷 壳 – CoolShell
酷 壳 – CoolShell
博客园 - 聂微东
Y
Y Combinator Blog
T
Tailwind CSS Blog
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
S
SegmentFault 最新的问题
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
博客园 - 【当耐特】
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
J
Java Code Geeks
美团技术团队
Google DeepMind News
Google DeepMind News
博客园_首页
Apple Machine Learning Research
Apple Machine Learning Research
T
The Blog of Author Tim Ferriss

DEV Community

Hermes, The Self-Improving Agent You Can Actually Run Yourself Unity vs Unreal: 5 Things I Had to Relearn the Hard Way Building Agentic Commerce Infrastructure: Overcoming SQLite Concurrency for Autonomous Procurement Agents Solana Accounts vs Databases HTML Table Borders I built a skill that makes AI-generated AWS diagrams actually usable My first post! I'm kinda excited The Page Root Was the Wrong Unit How to audit what your IDE extension actually sends to the cloud I Migrated 23 Make.com Scenarios to n8n and Cut My Bill by 60% — Complete Migration Guide (2026) Solving a Logistics Problem Using Genetic Algorithms Claude Code Skills Explained: What They Are & When to Use Them (2026) Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup Zero-Idle Local LLMs: Running Llama 3 in AWS Lambda Containers We scanned 8 B2B SaaS companies across 5 categories. ChatGPT named the same 12 brands in every answer. How To "Market" Yourself As A Tech Pro We scanned 500 MCP servers on Smithery. Here is what we found. HTML Basics for Beginners – Markup Language, Elements and Types of CSS DiffWhisperer: How I Turned Cryptic Git Diffs into Architectural Stories with Gemma 4 I built a version manager for llama.cpp using nothing but vibe coding. Unit Testing vs System Testing: Key Differences, Use Cases, and Best Practices for 2026 A game design textbook explains why products with fewer features win How to Build a Raydium Launchpad Bonding Curve in 5 Minutes with forgekit How to turn an AI prototype into a production system How Data Lake Table Storage Degrades Over Time Partition and Sort Keys on DynamoDB: Modeling data for batch-and-stream convergence Auto-Generate Optimized GitHub Actions Workflows For Any Stack With This New CLI Tool Unchaining the African Creator Economy The Treasure Hunt Engine Gotcha - A Lesson in Constrained Performance great_cto v2.17 - no more tambourine dance When Catalogs Are Embedded in Storage SafeMind AI: Instant Health & Safety Intelligence What Is PKCE, How It Works & Flow Examples AI Agent Failure Modes Beyond Hallucination Fastest Way to Understand Stryker Solana Accounts Explained to a Web2 Developer TV Yayın Akışı Sitesi Geliştirirken Öğrendiğim Teknik Dersler $500 Challenge Drop My First Look at Google's Gemma 4: A Quick Introduction How I use an LLM as a translation judge Best Calendar and Scheduling API for Developers — 2026 Comparison Agentic AI in Travel: Why UCP Isn't Travel-Ready Yet — and What We Measured I Finished Machine Learning. And Then Changed The Plan. The Five-Thousand-Line File The AI Whirlwind: Why Your Local Agent Matters More Than Ever I Built an Oracle DBA That Lives in Telegram. It Cut a 500K-Row Scan to 5 - After Asking Permission. The Day 2 Reality of Running a Kubernetes Lab on Your Mac: Stop/Start, CKS Scenarios, and What I Learned Building It. n8n for Airtable Power Users: 5 Automations That Take Your Base to the Next Level Validating Gemma 4 for Industrial IoT: A Governance Pattern VS Code Now Credits Copilot on Every Commit by Default Astro and Islands Architecture: Why Your Portfolio Doesn't Need React for Everything Booting from FAT12: How I added file reading to my x86 kernel Unity’s AI agent went public: the developers of a static analysis tool on what that means for code quality Anna's Archive publica un llms.txt para los LLMs que rastrean su catálogo CRDTs for Offline-First Mobile Sync Why I Built Mneme HQ: Preventing AI Agent Architectural Drift Google Antigravity 2.0 Is the I/O 2026 Announcement You Should Actually Care About I Built a Pay-Per-Call Crypto Signal API with x402 — Heres the Architecture JWT Token Refresh Patterns in React 19: Avoiding the Silent Auth Death Spiral 🚀 “From Prompts to Autonomous Agents: What Google I/O 2026 Changed” The Power of Distributed Consensus in Autonomous SOCs Sixteen TUI components, copy-paste, no dependency The Boring Reliability Layer Every Autonomous Agent Needs Nven - Secret manager Building Multi-Tenant Row-Level Security in PostgreSQL: A Production Pattern The Hardest Part of Being a Developer Isn't Coding Building Vylo — Looking for Collaborators, Partners & Early Support I Thought Memory Fades With Time. It Actually Fades With Information. ORA-00064 오류 원인과 해결 방법 완벽 가이드 I registered an AI agent at 1 AM and something cracked open in my head Pitch: Nven - Sync secrets. Ship faster. Why y=mx+b is the heart of AI From Routines to a Crew — Building a System That Plans Its Own Work & executes it 25 React Interview Questions 2026 (With Answers) — Hooks, React 19, Concurrent Mode An open source LLM eval tool with two independent quality signals Using Dashboard Filtering to Get Customer Usage in Seconds from TBs of Data Skills, Java 17, And Theme Accents 4 Hard Lessons on Optimizing AI Coding Agents Arctype: Cross-Platform Database GUI for LLM Artifacts Your robots.txt says GPTBot is welcome. Your server says 403. Organizing How to Use AWS Glue Workflow 5 n8n Automations Every Digital Agency Should Be Running (Bill More, Work Less) Getting Started with TorchGeo — Remote Sensing with PyTorch Designing a Scalable Cross-Platform Appium Framework Google Antigravity 2.0 & Slash Commands Building a Unified Adaptive Learning Intelligence with Gemma 4, Flutter, and Multi-Model Orchestration Looking for beta testers for a £60 server management application The Disk-Pressure Incident That Taught Me to Always Set LimitRanges and Other Lessons from Mirroring EKS Locally. Why AI Should Not Write SQL Against ERP Databases Vibe coding works until it doesn't. The debt is real. Shipping at the Edge: Migrating a Coffee Subscription Platform to Cloudflare Workers Stop Tab-Switching: A Developer's Guide to Color Tools That Actually Fit the Workflow DevOps vs MLOps vs AIOps: What Changes, What Stays, and a Simple Roadmap to Get Started Run Powerful AI Coding Locally on a Normal Laptop 5 n8n Automations Every WooCommerce Store Needs (Save 10+ Hours/Week) What I Learned Building My Own AI Harness Hytale Servers Will Fail Treasure Hunts Until We Fix Our Event Handling Redux in React: Managing Global State Like a Pro Unfreezing Your GitHub Actions: Troubleshooting Stuck Deployments and Protecting Your Git Repo Statistics Unlocking Project Discoverability on GHES: A Key to Software Engineering Productivity
Apache Iceberg Metadata Tables: Querying the Internals
Alex Merced · 2026-05-22 · via DEV Community

This is Part 11 of a 15-part Apache Iceberg Masterclass. Part 10 covered maintenance operations. This article covers the metadata tables that let you inspect Iceberg table internals using standard SQL.

Iceberg exposes its internal metadata as queryable virtual tables. You can use them to check table health, debug performance issues, audit changes, and build monitoring dashboards. No special tools required, just SQL.

Table of Contents

  1. What Are Table Formats and Why Were They Needed?
  2. The Metadata Structure of Current Table Formats
  3. Performance and Apache Iceberg's Metadata
  4. Technical Deep Dive on Partition Evolution
  5. Technical Deep Dive on Hidden Partitioning
  6. Writing to an Apache Iceberg Table
  7. What Are Lakehouse Catalogs?
  8. Embedded Catalogs: S3 Tables and MinIO AI Stor
  9. How Iceberg Table Storage Degrades Over Time
  10. Maintaining Apache Iceberg Tables
  11. Apache Iceberg Metadata Tables
  12. Using Iceberg with Python and MPP Engines
  13. Streaming Data into Apache Iceberg Tables
  14. Hands-On with Iceberg Using Dremio Cloud
  15. Migrating to Apache Iceberg

The Seven Metadata Tables

The seven Iceberg metadata tables and what each reveals about your table

Snapshots

The $snapshots table lists every snapshot in the table's history. Each row represents a committed transaction.

-- Dremio syntax
SELECT * FROM TABLE(table_snapshot('analytics.orders'))

-- Spark syntax
SELECT * FROM analytics.orders.snapshots

Enter fullscreen mode Exit fullscreen mode

Key columns: snapshot_id, committed_at, operation (append, overwrite, delete), summary (files added/removed counts).

History

The $history table shows the timeline of which snapshot was current at each point in time.

SELECT * FROM TABLE(table_history('analytics.orders'))

Enter fullscreen mode Exit fullscreen mode

Files

The $files table lists every data file in the current snapshot with detailed statistics.

SELECT file_path, file_size_in_bytes, record_count, partition
FROM TABLE(table_files('analytics.orders'))

Enter fullscreen mode Exit fullscreen mode

This is the primary diagnostic table for checking file sizes and identifying the small file problem.

Manifests

The $manifests table lists the manifest files for the current snapshot.

SELECT path, length, added_data_files_count, existing_data_files_count
FROM TABLE(table_manifests('analytics.orders'))

Enter fullscreen mode Exit fullscreen mode

Partitions

The $partitions table provides statistics per partition: row counts, file counts, and size.

SELECT partition, record_count, file_count
FROM TABLE(table_partitions('analytics.orders'))

Enter fullscreen mode Exit fullscreen mode

Practical Use Cases

Three categories of metadata table use cases: monitoring, debugging, and auditing

Monitoring: Average File Size

SELECT
  AVG(file_size_in_bytes) / 1048576 AS avg_file_mb,
  MIN(file_size_in_bytes) / 1048576 AS min_file_mb,
  COUNT(*) AS total_files
FROM TABLE(table_files('analytics.orders'))

Enter fullscreen mode Exit fullscreen mode

If avg_file_mb drops below 64, schedule compaction.

Debugging: Files Per Partition

SELECT partition, COUNT(*) AS files, SUM(record_count) AS rows
FROM TABLE(table_files('analytics.orders'))
GROUP BY partition
ORDER BY files DESC
LIMIT 20

Enter fullscreen mode Exit fullscreen mode

Partitions with hundreds of files are compaction candidates. Use this query as a daily health check and pipe the results into your monitoring system.

Debugging: Sort Order Effectiveness

Column statistics in the files table reveal whether your sort order is effective:

SELECT
  file_path,
  lower_bounds['customer_id'] AS min_customer_id,
  upper_bounds['customer_id'] AS max_customer_id
FROM TABLE(table_files('analytics.orders'))

Enter fullscreen mode Exit fullscreen mode

If the min/max ranges overlap heavily across files, the sort order has decayed and compaction with sorting (Part 10) will restore effectiveness.

Monitoring: Commit Velocity

Track how frequently the table is being written to:

SELECT
  DATE_TRUNC('hour', committed_at) AS hour,
  COUNT(*) AS commits,
  SUM(CAST(summary['added-data-files'] AS INT)) AS files_added
FROM TABLE(table_snapshot('analytics.orders'))
WHERE committed_at > CURRENT_TIMESTAMP - INTERVAL '24' HOUR
GROUP BY DATE_TRUNC('hour', committed_at)
ORDER BY hour

Enter fullscreen mode Exit fullscreen mode

High commit velocity (hundreds of commits per hour) indicates a streaming workload that needs aggressive compaction.

Auditing: Recent Changes

SELECT committed_at, operation, summary
FROM TABLE(table_snapshot('analytics.orders'))
ORDER BY committed_at DESC
LIMIT 10

Enter fullscreen mode Exit fullscreen mode

This shows the last 10 operations: how many files were added or removed per commit.

Time Travel

How snapshots enable querying the table at any point in its history

Metadata tables enable time travel queries. Use the snapshot list to find the snapshot ID for a specific point in time, then query the table at that snapshot:

-- Query the table as it existed on February 15
SELECT * FROM analytics.orders
AT SNAPSHOT '1234567890123456789'

-- Or by timestamp
SELECT * FROM analytics.orders
AT TIMESTAMP '2024-02-15 00:00:00'

Enter fullscreen mode Exit fullscreen mode

Time travel is useful for debugging data issues ("what did this table look like before yesterday's pipeline ran?"), auditing ("what was the account balance at end-of-quarter?"), and reproducible analysis ("run this report against last month's data").

Incremental Reads

Metadata tables also enable incremental processing. By comparing two snapshots, you can identify which files were added between them and process only the new data:

-- Find files added in the last snapshot
SELECT file_path, record_count
FROM TABLE(table_files('analytics.orders'))
WHERE file_path NOT IN (
  SELECT file_path FROM TABLE(table_files('analytics.orders'))
  AT SNAPSHOT '1234567890'
)

Enter fullscreen mode Exit fullscreen mode

This pattern is the foundation for CDC (Change Data Capture) on Iceberg tables: read only what changed since the last processing run, rather than re-scanning the entire table.

Rollback

If a bad write corrupts your table, use the snapshot list to rollback:

-- Find the last good snapshot
SELECT snapshot_id, committed_at, operation
FROM TABLE(table_snapshot('analytics.orders'))
ORDER BY committed_at DESC

-- Rollback to it (Spark)
CALL system.rollback_to_snapshot('analytics.orders', 1234567890)

Enter fullscreen mode Exit fullscreen mode

Rollback does not delete data. It simply changes the current snapshot pointer to an earlier snapshot, making the table appear as it was at that point. The rolled-back data files remain in storage for potential recovery.

Dremio supports all Iceberg metadata table queries through its TABLE() function syntax and provides time travel in both SQL and its semantic layer.

Building a Health Dashboard

Combine metadata table queries into a scheduled monitoring job:

-- Table health summary
SELECT
  (SELECT COUNT(*) FROM TABLE(table_snapshot('analytics.orders'))) AS snapshots,
  (SELECT COUNT(*) FROM TABLE(table_files('analytics.orders'))) AS files,
  (SELECT AVG(file_size_in_bytes)/1048576 FROM TABLE(table_files('analytics.orders'))) AS avg_mb,
  (SELECT COUNT(*) FROM TABLE(table_manifests('analytics.orders'))) AS manifests

Enter fullscreen mode Exit fullscreen mode

Set alerts when snapshots exceed 1,000, average file size drops below 64 MB, or manifest count exceeds 500.

Engine Syntax Variations

Different engines use different syntax for metadata tables:

Engine Syntax Variations

The underlying data is identical; only the SQL syntax differs. Regardless of which engine you use, these metadata tables are the key diagnostic tool for understanding and maintaining Iceberg table health.

Automating Decisions with Metadata

You can use metadata table queries to drive automated maintenance decisions. For example, a scheduler can check whether compaction is needed before running it:

-- Only compact if average file size is below threshold
SELECT CASE
  WHEN AVG(file_size_in_bytes) / 1048576 < 64 THEN 'COMPACT_NEEDED'
  ELSE 'HEALTHY'
END AS table_status
FROM TABLE(table_files('analytics.orders'))

Enter fullscreen mode Exit fullscreen mode

This avoids running compaction on tables that are already well-organized, saving compute costs and preventing unnecessary data rewrites.

For production environments, integrate these checks into your orchestration tool (Airflow, Dagster, Prefect). Schedule a daily metadata scan across all tables, collect the health metrics, and trigger maintenance jobs only for tables that need them. This approach scales to hundreds of tables without manual oversight. Dremio's autonomous optimization automates this entire workflow for tables managed by Open Catalog.

Part 12 covers using Iceberg from Python and MPP query engines.

Books to Go Deeper

Free Resources