Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans
Alex Merced · 2026-05-21 · via DEV Community

This is Part 5 of a 15-part Apache Iceberg Masterclass. Part 4 covered partition evolution. This article covers hidden partitioning, the feature that ensures users never need to know how their data is physically organized.

The most expensive mistake in data lake querying is the accidental full table scan: a query that reads every file because the user did not correctly reference the partition columns. In Hive, this happens constantly. In Iceberg, this mistake is designed out: users filter on the source columns and never reference partition columns at all.

Table of Contents

  1. What Are Table Formats and Why Were They Needed?
  2. The Metadata Structure of Current Table Formats
  3. Performance and Apache Iceberg's Metadata
  4. Technical Deep Dive on Partition Evolution
  5. Technical Deep Dive on Hidden Partitioning
  6. Writing to an Apache Iceberg Table
  7. What Are Lakehouse Catalogs?
  8. Embedded Catalogs: S3 Tables and MinIO AI Stor
  9. How Iceberg Table Storage Degrades Over Time
  10. Maintaining Apache Iceberg Tables
  11. Apache Iceberg Metadata Tables
  12. Using Iceberg with Python and MPP Engines
  13. Streaming Data into Apache Iceberg Tables
  14. Hands-On with Iceberg Using Dremio Cloud
  15. Migrating to Apache Iceberg

The Accidental Full Scan Problem

Exposed partitioning in Hive versus hidden partitioning in Iceberg showing the same pruning with different user experience

In Hive, a table partitioned by year, month, and day requires queries to filter on those exact columns:

-- Hive: This prunes correctly
SELECT * FROM orders WHERE year = 2024 AND month = 3 AND day = 15

-- Hive: This scans EVERYTHING (no pruning)
SELECT * FROM orders WHERE order_date = '2024-03-15'


The second query reads every partition because Hive does not know that order_date maps to the year, month, and day partition columns. There is no error and no warning. The query simply reads far more data than it needs and can run orders of magnitude slower than it should.

This happens because Hive partitioning is "exposed." The physical partition columns (year, month, day) are separate from the source column (order_date). Users must understand this mapping and construct their filters accordingly.
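To make the failure mode concrete, here is a toy model of exposed partitioning in Python. It is not Hive's planner; the `hive_prune` helper and the in-memory partition list are hypothetical and only illustrate that a filter on a non-partition column gives the pruner nothing to work with.

```python
# Toy model of Hive-style (exposed) partition pruning; not Hive's real planner.
PARTITION_COLUMNS = ("year", "month", "day")

# One partition per day of March 2024, e.g. year=2024/month=3/day=15
partitions = [{"year": 2024, "month": 3, "day": d} for d in range(1, 32)]


def hive_prune(partitions, filters):
    """Keep the partitions that satisfy the filters on partition columns.

    Filters on any other column (such as order_date) cannot be used for
    pruning, so they are ignored and every partition survives.
    """
    usable = {col: val for col, val in filters.items() if col in PARTITION_COLUMNS}
    return [p for p in partitions if all(p[col] == val for col, val in usable.items())]


pruned = hive_prune(partitions, {"year": 2024, "month": 3, "day": 15})
unpruned = hive_prune(partitions, {"order_date": "2024-03-15"})
print(len(pruned), len(unpruned))  # 1 31
```

The second call returns all 31 partitions: the filter is valid, but it names a column the pruner does not recognize.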

How Iceberg Hides Partitioning

Iceberg flips this model. Users filter on the source column (order_date), and the engine automatically maps the filter to the partition values using transform functions.

-- Iceberg: This prunes correctly. Always.
SELECT * FROM orders WHERE order_date = '2024-03-15'


The table's partition spec declares: PARTITIONED BY (day(order_date)). When the engine processes this query, it:

  1. Reads the partition spec from the table metadata
  2. Applies the day() transform to the filter value: day('2024-03-15') = 2024-03-15
  3. Checks manifest entries for files with matching partition values
  4. Skips every file whose partition value is not 2024-03-15

The user writes natural SQL against the source columns. The engine handles the logical-to-physical mapping. This is why it is called "hidden" partitioning: the partition structure is invisible to the user.
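The four steps above can be sketched in a few lines of Python. This is a simplified illustration, not an engine implementation: the manifest entries and the `plan_scan` helper are hypothetical. The Iceberg spec describes the day value as the number of days from 1970-01-01, which engines usually display as a date.

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def day_transform(d: date) -> int:
    """Days from 1970-01-01, the integer form of a day partition value."""
    return (d - EPOCH).days


# Hypothetical manifest entries: (data file path, day partition value)
manifest_entries = [
    ("data/a.parquet", day_transform(date(2024, 3, 14))),
    ("data/b.parquet", day_transform(date(2024, 3, 15))),
    ("data/c.parquet", day_transform(date(2024, 3, 16))),
]


def plan_scan(entries, order_date: date):
    """Apply the transform to the filter value and keep only matching files."""
    target = day_transform(order_date)
    return [path for path, partition_value in entries if partition_value == target]


files = plan_scan(manifest_entries, date(2024, 3, 15))
print(files)  # ['data/b.parquet']
```

The caller passes only the order_date value; the partition value is derived inside `plan_scan`.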

The Six Transform Functions

Iceberg's six partition transform functions showing how each maps source values to partition values

Iceberg defines six partition transforms that map source column values to partition values:

Temporal Transforms

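Iceberg provides four temporal transforms: year, month, day, and hour. Each maps a date or timestamp value to a partition value at that granularity (hour applies to timestamps only).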

The temporal transforms are hierarchical. If a table is partitioned by day(ts) and a user filters WHERE ts >= '2024-03-01' AND ts < '2024-04-01', the engine recognizes this as a range of days and prunes to only the 31 matching partitions. Engines like Dremio handle this mapping automatically for equality, range, and IN-list predicates.
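As a rough sketch of how a range predicate becomes a range of day partitions, the Python below translates the filter from the paragraph above. The `day_partition_range` helper is hypothetical and assumes microsecond timestamp precision; real engines do this inside their planners.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def day_transform(ts: datetime) -> int:
    """Days from 1970-01-01 for a timestamp."""
    return (ts - EPOCH).days


def day_partition_range(lower_inclusive: datetime, upper_exclusive: datetime):
    """Translate ts >= lower AND ts < upper into an inclusive range of day values."""
    first = day_transform(lower_inclusive)
    # The last instant that still satisfies ts < upper decides the last day to read.
    last = day_transform(upper_exclusive - timedelta(microseconds=1))
    return first, last


first, last = day_partition_range(datetime(2024, 3, 1), datetime(2024, 4, 1))
print(last - first + 1)  # 31
```

Any file whose day value falls outside `first..last` can be skipped without being opened.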

Value Transforms

The two value transforms are truncate and bucket.

truncate(N, col) takes the first N characters of a string (or truncates a number to a width). This is useful when you want to group data by a string prefix without creating one partition per unique value.
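A minimal sketch of both truncate behaviors, assuming the rules in the Iceberg spec: strings keep their first N characters, and integers are rounded down to a multiple of the width. The function names are illustrative, not part of any Iceberg library.

```python
def truncate_str(width: int, value: str) -> str:
    """Keep the first `width` characters of a string."""
    return value[:width]


def truncate_int(width: int, value: int) -> int:
    """Round an integer down to the nearest multiple of `width`.

    Python's % is already non-negative for a positive width, so negative
    values round toward negative infinity.
    """
    return value - (value % width)


print(truncate_str(3, "iceberg"))  # ice
print(truncate_int(100, 1234))     # 1200
print(truncate_int(10, -1))        # -10
```

All strings sharing a prefix, and all integers in the same width-sized range, land in the same partition.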

bucket(N, col) applies a hash function (32-bit Murmur3 in the Iceberg spec) and takes the result mod N to produce a bucket number from 0 to N-1. This distributes data roughly evenly across a fixed number of buckets, regardless of the column's value distribution. It is the go-to transform for high-cardinality columns like user_id or order_id where identity partitioning would create millions of tiny partitions.
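The sketch below shows the hash-then-mod idea for a 64-bit integer column. It follows my reading of the Iceberg spec (32-bit Murmur3 over the value's 8 little-endian bytes, sign bit cleared, then mod N), but it has not been checked against the spec's test vectors, so treat it as an illustration rather than a reference implementation.

```python
import struct

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32_long(value: int, seed: int = 0) -> int:
    """32-bit Murmur3 (x86 variant) of a 64-bit integer."""
    data = struct.pack("<q", value)  # 8 bytes, so exactly two 4-byte blocks
    h = seed & MASK32
    for (k,) in struct.iter_unpack("<I", data):
        k = (k * 0xCC9E2D51) & MASK32
        k = _rotl32(k, 15)
        k = (k * 0x1B873593) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    h ^= len(data)
    # Final avalanche mix
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def bucket(n: int, user_id: int) -> int:
    """Clear the sign bit, then take the remainder: a value in 0..n-1."""
    return (murmur3_32_long(user_id) & 0x7FFFFFFF) % n


# Sequential ids should spread across all 32 buckets.
counts = [0] * 32
for user_id in range(100_000):
    counts[bucket(32, user_id)] += 1
print(min(counts), max(counts))
```

Because the bucket depends only on the value and N, an equality filter on user_id maps to exactly one bucket, which is what makes pruning possible.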

The Identity Transform

The identity transform (identity(col)) uses the raw column value as the partition value. This is equivalent to Hive-style partitioning, but the column is still "hidden" because the engine handles the mapping. It is useful for low-cardinality columns like region or status where each unique value should be its own partition.

How Pruning Works Under the Hood

Step-by-step flow showing how the engine maps a user query through the partition spec to prune files

The pruning process works in three phases:

Phase 1: Predicate translation. The engine examines each WHERE clause predicate and checks if the filtered column is part of the partition spec. If order_date is the source column for day(order_date), the engine can translate order_date = '2024-03-15' into a partition filter.

Phase 2: Manifest list evaluation. The manifest list stores partition value ranges per manifest. The engine checks if each manifest's range includes the target partition value. Manifests whose range does not overlap are skipped entirely.

Phase 3: Manifest entry evaluation. For each surviving manifest, the engine checks individual file entries. Only files whose partition value matches 2024-03-15 are included in the scan plan.

This is the same pruning cascade described in Part 3, but now the partition values were derived automatically from the user's filter on a source column.
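The three phases can be sketched as follows. The `Manifest` and `DataFile` classes and the `PARTITION_SPEC` dictionary are simplified stand-ins invented for this example; real Iceberg metadata carries far more detail.

```python
from dataclasses import dataclass
from datetime import date

PARTITION_SPEC = {"order_date": "day"}  # source column -> transform (hypothetical)


@dataclass
class DataFile:
    path: str
    day: date  # partition value recorded in the manifest entry


@dataclass
class Manifest:
    files: list

    @property
    def lower(self) -> date:
        return min(f.day for f in self.files)

    @property
    def upper(self) -> date:
        return max(f.day for f in self.files)


def translate(column: str, value: date):
    """Phase 1: turn a filter on a source column into a partition value."""
    if PARTITION_SPEC.get(column) == "day":
        return value  # day() of a date is that same calendar day
    return None  # not a partition source column, so nothing to prune on


def plan(manifests, column: str, value: date):
    target = translate(column, value)
    if target is None:
        return [f.path for m in manifests for f in m.files]
    # Phase 2: skip whole manifests whose partition range excludes the target.
    surviving = [m for m in manifests if m.lower <= target <= m.upper]
    # Phase 3: within surviving manifests, keep only matching file entries.
    return [f.path for m in surviving for f in m.files if f.day == target]


manifests = [
    Manifest([DataFile("feb-27.parquet", date(2024, 2, 27)), DataFile("feb-28.parquet", date(2024, 2, 28))]),
    Manifest([DataFile("mar-14.parquet", date(2024, 3, 14)), DataFile("mar-15.parquet", date(2024, 3, 15))]),
]
scan = plan(manifests, "order_date", date(2024, 3, 15))
print(scan)  # ['mar-15.parquet']
```

The February manifest is rejected in phase 2 without looking at its entries, and phase 3 drops the March 14 file.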

Choosing the Right Transform

The choice of partition transform depends on data volume and query patterns.

The goal is to create partitions that are large enough to contain optimally sized files (128-512 MB each) but small enough that partition pruning eliminates most files for typical queries.

Over-partitioning (too many small partitions) creates the small file problem: thousands of tiny files that bloat metadata and slow query planning. Under-partitioning (too few large partitions) reduces pruning effectiveness because each partition contains too much data.
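A quick back-of-the-envelope check can flag over-partitioning before a table is created. The daily volume and candidate layouts below are hypothetical, and the estimate assumes data is spread evenly across partitions, which real data rarely is.

```python
TARGET_FILE_MB = (128, 512)  # file size range quoted above


def avg_partition_mb(daily_volume_mb: float, partitions_per_day: int) -> float:
    """Average data per partition, assuming an even spread."""
    return daily_volume_mb / partitions_per_day


daily_volume_mb = 50 * 1024  # hypothetical: 50 GB of new data per day

# Candidate partition specs and how many partitions each creates per day
layouts = {
    "day(ts)": 1,
    "day(ts), bucket(32, user_id)": 32,
    "hour(ts), bucket(32, user_id)": 24 * 32,
}

for spec, partitions_per_day in layouts.items():
    size = avg_partition_mb(daily_volume_mb, partitions_per_day)
    if size < TARGET_FILE_MB[0]:
        verdict = "too small to hold one target-size file"
    else:
        verdict = "can hold target-size files"
    print(f"{spec}: {size:.0f} MB per partition, {verdict}")
```

With these numbers the hourly layout averages about 67 MB per partition, below the 128 MB floor, so it would produce small files no matter how the writer is tuned.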

Combining Transforms

Iceberg supports multi-column partition specs:

CREATE TABLE events (
  event_id BIGINT,
  event_time TIMESTAMP,
  user_id BIGINT,
  event_type STRING
) PARTITIONED BY (day(event_time), bucket(32, user_id))


This creates a two-dimensional partition space: each combination of day and user bucket is a separate partition. Queries filtering on event_time get day-level pruning. Queries filtering on user_id get bucket-level pruning. Queries filtering on both get pruning from both dimensions.
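The two-dimensional pruning can be illustrated with a small grid of partitions. To keep the example short, `toy_bucket` uses a plain modulo as a stand-in for Iceberg's hash-based bucket transform, and the grid uses 4 buckets instead of 32; both are simplifications for this sketch.

```python
from datetime import date
from itertools import product

NUM_BUCKETS = 4  # kept small for illustration; the table above uses 32


def toy_bucket(user_id: int) -> int:
    """Stand-in for the bucket transform (the real one hashes before the mod)."""
    return user_id % NUM_BUCKETS


days = [date(2024, 3, d) for d in (14, 15, 16)]
partitions = list(product(days, range(NUM_BUCKETS)))  # 3 days x 4 buckets = 12


def prune(partitions, event_day=None, user_id=None):
    """Prune on whichever dimensions the query filters on."""
    keep = partitions
    if event_day is not None:
        keep = [p for p in keep if p[0] == event_day]
    if user_id is not None:
        target_bucket = toy_bucket(user_id)
        keep = [p for p in keep if p[1] == target_bucket]
    return keep


by_day = prune(partitions, event_day=date(2024, 3, 15))
by_user = prune(partitions, user_id=42)
by_both = prune(partitions, event_day=date(2024, 3, 15), user_id=42)
print(len(by_day), len(by_user), len(by_both))  # 4 3 1
```

Each filter narrows one dimension independently, so a query that supplies both lands on a single partition.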

Dremio supports all Iceberg transform functions and automatically applies pruning for any combination of partition columns in the query's WHERE clause.

Why This Matters for Teams

Hidden partitioning changes the operational model for data teams:

Data engineers define the partition strategy once in the table's partition spec. They can change it later through partition evolution without breaking anything.

Analysts and data scientists write natural SQL against the business columns they understand. They never need to know whether the table is partitioned by day, month, or bucket. Their queries are automatically optimized.

BI tools and dashboards connect to Iceberg tables and issue standard SQL. The tools do not need to understand Iceberg's partitioning; the engine handles the optimization. This is why hidden partitioning is essential for self-service analytics platforms like Dremio.

The net result: no accidental full table scans, no partition-aware query patterns required from users, and the ability to change the physical layout without impacting any downstream consumer. Part 6 covers what happens when data is written to an Iceberg table.
