Partition Evolution: Change Your Partitioning Without Rewriting Data
Alex Merced · 2026-05-21 · via DEV Community

This is Part 4 of a 15-part Apache Iceberg Masterclass. Part 3 covered metadata-driven performance. This article explains how Iceberg handles the problem that has plagued data lakes for over a decade: what happens when your partition strategy needs to change.

Partitioning determines how data is physically organized in storage, and it is the single most impactful factor for query performance on large tables. Get it right and queries skip 95% of the data. Get it wrong and every query scans everything. The problem is that requirements change, data volumes grow, and the partition strategy that worked last year becomes a bottleneck this year.

Table of Contents

  1. What Are Table Formats and Why Were They Needed?
  2. The Metadata Structure of Current Table Formats
  3. Performance and Apache Iceberg's Metadata
  4. Technical Deep Dive on Partition Evolution
  5. Technical Deep Dive on Hidden Partitioning
  6. Writing to an Apache Iceberg Table
  7. What Are Lakehouse Catalogs?
  8. Embedded Catalogs: S3 Tables and MinIO AI Stor
  9. How Iceberg Table Storage Degrades Over Time
  10. Maintaining Apache Iceberg Tables
  11. Apache Iceberg Metadata Tables
  12. Using Iceberg with Python and MPP Engines
  13. Streaming Data into Apache Iceberg Tables
  14. Hands-On with Iceberg Using Dremio Cloud
  15. Migrating to Apache Iceberg

The Hive Problem: Partitioning Is Permanent

[Figure: Hive-style directory-based partitioning with its three core problems]

In Hive and other traditional data lake systems, partitions are physical directories. A table partitioned by year and month has a directory structure like:

s3://warehouse/orders/year=2023/month=01/part-0000.parquet
s3://warehouse/orders/year=2023/month=02/part-0000.parquet
...


This design has three fundamental problems:

Changing partitions requires rewriting all data. If a table is partitioned by month and you need daily partitions (because data volume grew and monthly partitions are now too large for efficient queries), you must read every file, re-partition it, and write it back. For a petabyte-scale table, this means reading and rewriting a petabyte of data, with hours of processing and downtime for consumers.

Users must know the physical layout. Queries must filter on the partition columns by name: WHERE year = 2024 AND month = 3. If a user writes WHERE order_date = '2024-03-15', Hive does not recognize that order_date maps to year = 2024, month = 3, and it scans the entire table. Every user therefore has to understand the physical layout and query it correctly.

Wrong filters produce silent full scans. There is no error, no warning. The query runs, it just reads every partition. Teams discover the problem only when they notice query times are 50x slower than expected.
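
To make the second and third problems concrete, here is a minimal Python sketch of directory-based pruning. It is a toy model, not Hive's planner: the `PARTITION_DIRS` listing and the `dirs_to_scan` helper are hypothetical names invented for this illustration.

```python
from datetime import date

# Hypothetical directory listing: two years of year=/month= partitions.
PARTITION_DIRS = [
    {"year": y, "month": m}
    for y in (2023, 2024)
    for m in range(1, 13)
]

def dirs_to_scan(filters: dict) -> list[dict]:
    """Return the directories a directory-based planner would read.

    Only filters on the partition columns can narrow the list. A filter on
    any other column (such as order_date) is ignored for pruning purposes.
    """
    usable = {k: v for k, v in filters.items() if k in ("year", "month")}
    return [d for d in PARTITION_DIRS if all(d[k] == v for k, v in usable.items())]

pruned = dirs_to_scan({"year": 2024, "month": 3})
full = dirs_to_scan({"order_date": date(2024, 3, 15)})
print(len(pruned), len(full))  # 1 24
```

The second call raises no error and returns correct rows; it just touches all 24 directories, which is exactly the silent full scan described above.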

How Iceberg Solves This

[Figure: Iceberg partition evolution showing how old and new partition specs coexist without rewriting data]

Iceberg separates the logical partition specification from the physical data layout through two mechanisms: hidden partitioning (covered in Part 5) and partition evolution.

The Partition Spec

Every Iceberg table has a partition spec that defines how source columns map to partition values. The spec does not create directories. Instead, it records partition values as metadata in manifest entries alongside each data file.

-- Create a table partitioned by month
CREATE TABLE orders (
  order_id BIGINT,
  order_date DATE,
  amount DECIMAL(10,2),
  status STRING
) PARTITIONED BY (month(order_date))


When data is written, the engine computes the partition value (month('2024-03-15') = 2024-03) and stores it in the manifest entry for that file. The file itself can live at any path; there is no requirement for a month=2024-03/ directory.
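
As a rough illustration of that write path, the sketch below computes a month partition value and attaches it to a simplified manifest entry. The Iceberg spec defines the month transform as whole months since 1970-01, with 2024-03 being the human-readable form. The `ManifestEntry` shape and the file path are assumptions made up for this example, not the real manifest schema.

```python
from dataclasses import dataclass
from datetime import date

def month_transform(d: date) -> int:
    """Month transform: whole months elapsed since 1970-01."""
    return (d.year - 1970) * 12 + (d.month - 1)

def month_label(months: int) -> str:
    """Human-readable form of a month ordinal, e.g. 2024-03."""
    return f"{1970 + months // 12}-{months % 12 + 1:02d}"

@dataclass(frozen=True)
class ManifestEntry:
    """Simplified, hypothetical stand-in for a manifest entry."""
    file_path: str
    spec_id: int
    partition: tuple

entry = ManifestEntry(
    # Made-up path: note that it contains no month=... directory.
    file_path="s3://warehouse/orders/data/00000-0-a1b2.parquet",
    spec_id=0,
    partition=(month_transform(date(2024, 3, 15)),),
)
print(entry.partition, month_label(entry.partition[0]))  # (650,) 2024-03
```

The partition value travels with the metadata entry rather than with the path, which is what lets the spec change later without moving any files.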

Evolving the Spec

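When data volume grows and monthly partitions become too coarse, you change the spec. The exact DDL varies by engine (Spark's Iceberg SQL extensions, for example, use ALTER TABLE with ADD, DROP, or REPLACE PARTITION FIELD):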

ALTER TABLE orders SET PARTITION SPEC (day(order_date))


This is a metadata-only operation. It takes milliseconds. No data is read or rewritten. What happens internally:

  1. The current partition spec (Spec 0: month(order_date)) is preserved in the table's metadata history.
  2. A new partition spec (Spec 1: day(order_date)) is set as the active spec.
  3. All existing data files retain their Spec 0 partition values in their manifest entries.
  4. All new data written to the table uses Spec 1.

The table now contains files with two different partition specs. This is not a broken state. It is the designed behavior.
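
The four steps above can be sketched with a small in-memory model. This is a simplification under stated assumptions: `TableMetadata`, `append`, and `evolve_spec` are hypothetical names, loosely mirroring the `partition-specs` list and `default-spec-id` field of real Iceberg table metadata.

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    """Toy stand-in for an Iceberg table's metadata file."""
    partition_specs: dict[int, str] = field(
        default_factory=lambda: {0: "month(order_date)"}
    )
    default_spec_id: int = 0
    files: list[tuple[str, int]] = field(default_factory=list)  # (path, spec_id)

    def append(self, path: str) -> None:
        # New files are always tagged with the currently active spec.
        self.files.append((path, self.default_spec_id))

    def evolve_spec(self, new_spec: str) -> None:
        # Metadata-only: register the new spec and make it the default.
        # Existing entries in self.files are not touched.
        new_id = max(self.partition_specs) + 1
        self.partition_specs[new_id] = new_spec
        self.default_spec_id = new_id

table = TableMetadata()
table.append("data/a.parquet")
table.evolve_spec("day(order_date)")
table.append("data/b.parquet")

print(table.partition_specs)  # {0: 'month(order_date)', 1: 'day(order_date)'}
print(table.files)            # [('data/a.parquet', 0), ('data/b.parquet', 1)]
```

Because `evolve_spec` never iterates over `files`, its cost does not depend on how much data the table holds.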

How Query Planning Handles Multiple Specs

[Figure: How engines resolve queries across multiple partition specs by evaluating each independently]

When a query filters on order_date, the engine must correctly prune files regardless of which spec they were written under. Here is the process:

SELECT * FROM orders
WHERE order_date BETWEEN '2023-12-01' AND '2024-01-31'


For Spec 0 files (monthly): The engine translates the date range into month values: 2023-12 and 2024-01. It checks manifest entries with Spec 0 partition values and keeps files where the month partition is either 2023-12 or 2024-01. All other months are skipped.

For Spec 1 files (daily): The engine translates the same date range into day values: 2023-12-01 through 2024-01-31. It checks manifest entries with Spec 1 partition values and keeps files where the day partition falls within that range. All other days are skipped.

Result: Both old and new files are correctly pruned using their respective specs. The query returns accurate results from files written under different partition strategies, without the user knowing or caring about the spec history.
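
A minimal sketch of this per-spec evaluation, assuming a simplified file list and a range filter on order_date. The `FILES` entries and the `plan` function are hypothetical; a real engine works on manifest entries and handles many more predicate types.

```python
from datetime import date

def month_ordinal(d: date) -> int:
    """Month transform: months since 1970-01."""
    return (d.year - 1970) * 12 + (d.month - 1)

def day_ordinal(d: date) -> int:
    """Day transform: days since 1970-01-01."""
    return (d - date(1970, 1, 1)).days

# Spec id -> transform applied to order_date under that spec.
TRANSFORMS = {0: month_ordinal, 1: day_ordinal}

# Hypothetical files: (path, spec_id, partition_value).
FILES = [
    ("f1", 0, month_ordinal(date(2023, 11, 1))),   # month 2023-11
    ("f2", 0, month_ordinal(date(2023, 12, 1))),   # month 2023-12
    ("f3", 1, day_ordinal(date(2024, 1, 15))),     # day 2024-01-15
    ("f4", 1, day_ordinal(date(2024, 2, 1))),      # day 2024-02-01
]

def plan(lo: date, hi: date) -> list[str]:
    """Keep files whose partition value can contain dates in [lo, hi]."""
    kept = []
    for path, spec_id, value in FILES:
        transform = TRANSFORMS[spec_id]
        # Project the date range through the file's own spec, then compare.
        if transform(lo) <= value <= transform(hi):
            kept.append(path)
    return kept

print(plan(date(2023, 12, 1), date(2024, 1, 31)))  # ['f2', 'f3']
```

Projecting the range bounds this way is safe because both transforms preserve ordering: a later date never maps to a smaller month or day value.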

Real-World Scenarios

Growing From Monthly to Daily

The most common evolution. A startup begins with monthly partitions when data volume is 10 GB/month. Two years later, data volume is 500 GB/month and monthly partitions produce files too large for efficient processing. Evolving to daily partitions makes new data more granular while old data remains accessible.

Adding a Partition Column

A table partitioned only by date starts receiving queries that heavily filter by region. Adding a partition on region (using bucket(16, region)) improves pruning for those queries:

ALTER TABLE orders SET PARTITION SPEC (day(order_date), bucket(16, region))


Old files remain partitioned by date only. New files are partitioned by both date and region. Queries that filter on date work correctly for both old and new files. Queries that filter on region get pruning benefits only for new files.
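
The asymmetry for region filters can be sketched as follows. Two assumptions are worth flagging: Iceberg's bucket transform hashes with 32-bit Murmur3, while this sketch substitutes CRC32 purely to stay within the standard library, and the `FILES` list and `files_for_region` helper are hypothetical.

```python
import zlib

def bucket16(region: str) -> int:
    # Stand-in hash: the real bucket transform uses Murmur3, not CRC32.
    return zlib.crc32(region.encode("utf-8")) % 16

eu_bucket = bucket16("EU")
other_bucket = (eu_bucket + 1) % 16  # any bucket that is not EU's

# Hypothetical files: (path, partition values recorded in metadata).
FILES = [
    ("old-1", {"day": "2024-01-10"}),                                  # date-only spec
    ("old-2", {"day": "2024-01-11"}),                                  # date-only spec
    ("new-eu", {"day": "2024-03-01", "region_bucket": eu_bucket}),     # date + region spec
    ("new-other", {"day": "2024-03-01", "region_bucket": other_bucket}),
]

def files_for_region(region: str) -> list[str]:
    """Files a planner must keep for WHERE region = <region>."""
    wanted = bucket16(region)
    kept = []
    for path, partition in FILES:
        bucket = partition.get("region_bucket")
        # Files written before the region field existed carry no bucket
        # value, so partition pruning cannot rule them out.
        if bucket is None or bucket == wanted:
            kept.append(path)
    return kept

print(files_for_region("EU"))  # ['old-1', 'old-2', 'new-eu']
```

Only the new file in a different bucket is skipped; both old files survive planning regardless of which regions they actually contain.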

Removing a Partition Column

If a partition column becomes irrelevant (e.g., a geographic region is no longer used for filtering), you can evolve the spec to remove it. Old files keep their partition values, but new files are no longer organized by that column. Dremio and other engines handle this transparently during query planning.

What About the Old Data?

After a partition evolution, old data continues to work correctly but may be organized suboptimally. The old monthly partitions are coarser than the new daily ones, so a query for a few historical days still has to consider every file in the matching months and may read more data than necessary.

Two options:

  1. Leave it alone. If historical data is queried infrequently, the cost of less-optimal pruning is minimal. This is the zero-effort approach.

  2. Compact old data. Run a compaction job that rewrites old files under the new spec. This produces daily-partitioned files for the historical data too, but requires compute resources. Dremio's automatic table optimization can handle this for tables managed by Open Catalog.
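
Conceptually, the second option regroups the rows of each old file by the new partition value. The sketch below is a toy model of that rewrite, assuming rows are held in memory as (order_id, order_date) tuples; `monthly_file` and `rewrite_under_daily_spec` are hypothetical names, and real compaction jobs also handle file sizing, snapshots, and concurrent writers.

```python
from collections import defaultdict
from datetime import date

# Hypothetical old file written under the monthly spec (spec 0).
monthly_file = {
    "spec_id": 0,
    "partition": "2023-12",
    "rows": [
        (1, date(2023, 12, 1)),
        (2, date(2023, 12, 1)),
        (3, date(2023, 12, 15)),
    ],
}

def rewrite_under_daily_spec(old_file: dict) -> list[dict]:
    """Split one monthly file into one output file per day (spec 1)."""
    by_day = defaultdict(list)
    for row in old_file["rows"]:
        order_date = row[1]
        by_day[order_date.isoformat()].append(row)
    return [
        {"spec_id": 1, "partition": day, "rows": rows}
        for day, rows in sorted(by_day.items())
    ]

daily_files = rewrite_under_daily_spec(monthly_file)
print([f["partition"] for f in daily_files])  # ['2023-12-01', '2023-12-15']
```

Unlike the spec change itself, this step reads and writes every row it touches, which is why it is optional and worth reserving for historical ranges that are queried often.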

How Other Formats Handle This


Delta Lake's Liquid Clustering is a different solution to the same problem. Instead of static partitions, it uses adaptive clustering that reorganizes data in the background. The tradeoff: Liquid Clustering requires ongoing background compute, while Iceberg's partition evolution is a one-time metadata operation with optional follow-up compaction.

Partition evolution is one of the features that makes Iceberg a safe long-term choice. It means the partitioning decision you make today is not permanent. Part 5 covers hidden partitioning, the other half of Iceberg's partitioning story.
