惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

W
WeLiveSecurity
D
DataBreaches.Net
cs.AI updates on arXiv.org
cs.AI updates on arXiv.org
T
The Exploit Database - CXSecurity.com
D
Darknet – Hacking Tools, Hacker News & Cyber Security
腾讯CDC
PCI Perspectives
PCI Perspectives
阮一峰的网络日志
阮一峰的网络日志
S
Security Archives - TechRepublic
Hugging Face - Blog
Hugging Face - Blog
U
Unit 42
IT之家
IT之家
T
Troy Hunt's Blog
P
Proofpoint News Feed
www.infosecurity-magazine.com
www.infosecurity-magazine.com
F
Full Disclosure
V
V2EX
Stack Overflow Blog
Stack Overflow Blog
C
Comments on: Blog
V
Vulnerabilities – Threatpost
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
V
V2EX - 技术
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
N
News | PayPal Newsroom
MyScale Blog
MyScale Blog
Google DeepMind News
Google DeepMind News
Application and Cybersecurity Blog
Application and Cybersecurity Blog
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
李成银的技术随笔
P
Privacy & Cybersecurity Law Blog
大猫的无限游戏
大猫的无限游戏
V
Visual Studio Blog
T
ThreatConnect
WordPress大学
WordPress大学
Security Latest
Security Latest
C
Cybersecurity and Infrastructure Security Agency CISA
Recent Announcements
Recent Announcements
Google DeepMind News
Google DeepMind News
SecWiki News
SecWiki News
Recorded Future
Recorded Future
小众软件
小众软件
K
Kaspersky official blog
T
Tor Project blog
Last Week in AI
Last Week in AI
GbyAI
GbyAI
人人都是产品经理
人人都是产品经理
Jina AI
Jina AI
S
SegmentFault 最新的问题
MongoDB | Blog
MongoDB | Blog
Simon Willison's Weblog
Simon Willison's Weblog

DEV Community

Choosing the Right Treasure Map to Avoid Data Decay in Veltrix Stop Reviewing Every Line of AI Code - Build the Trust Stack Instead Implementation of AI in mobile applications: Comparative analysis of On-Device and On-Server approaches on Native Android and Flutter Should you use Gemma 4 for your Development? A Multiversal Analysis to Determine if Gemma 4 is Right for You! The Rising Trend of Creative Interview Questions in Tech I Spent Hours Fighting a Silent Subnet Conflict to Build an Isolated ICS Security Lab (And What It Taught Me About the Linux Kernel) It Worked When I Closed the Laptop. I Swear. We Built an Agent That Flags Fake Internships #kryx Your Personal AI Stack Is the New Dotfiles Your LLM Bill Is Exploding Because of Architecture, Not Pricing -- Here's the Fix How We Prevent Attendance Fraud Using GPS Verification AI Code Review in 2026: How the Tools Actually Differ (A Builder's Field Guide) From Problems to Patterns: Generative AI in .Net (C#) GemmaOps Edge: From 373 Alarms to 1 Root Cause Using Local AI (Gemma 4) Building an Amazon EKS Security Baseline Hands-On with Apache Iceberg Using Dremio Cloud 🤫 Firebase Is Quietly Preparing for an Offline-First AI Future Should Angular Apps Still Rely on RxJS in 2025? Gaslighting Gemma 4: Can Open-Weight Reasoning Models Withstand a Confident Liar? AI Workflow Automation Needs More Than Another Script Reviving Cineverse: From Local Storage to Firebase 🚀 Approaches to Streaming Data into Apache Iceberg Tables How to Add Rounded Corners to an Image Online The subtle impact of AI (&amp; IT) on jobs Made a Rust based AI agent Your AI is not bad, your instructions are What Clicked for Me After Building on Solana for a Few Days WhatsApp's Encryption Stack: What It Covers, What It Doesn't, and What a Federal Agent Spent 10 Months Investigating Building CogniPlan: A Local-First Task Planning System Using Apache Iceberg with Python and MPP Query Engines How I Built AegisDesk: A Zero-Token Semantic IT Agent with <5ms Latency I built CodeArchy: an open-source that turns any codebase into a visual, explainable architectural experience, powered by Gemma 4. The Day Our Bot Ran Out of Money How we're using Gemini Embeddings to build a smarter, community-driven feed on DEV The Speculative Decoding Pattern The PKCE "Gotcha" in Expo’s exchangeCodeAsync TharVA : Keeping India's Desert Heritage Alive with Offline AI (Gemma4) n8n for Healthcare: 5 Automations for Clinics, Practices, and Health Tech Teams (Free Workflow JSON) How I Built an OWASP Memory Guard for AI Agents (ASI06) Condition-Based vs Time-Based Maintenance: Making the Switch I Tested Spam Protection on Formspree vs Formgrid. The Results Were Surprising. May 27 - Video Understanding Workshop Beyond Keywords: How Google's 2026 Algorithms are Redefining SEO From Click to Cart: Ensuring an Accessible Customer Journey in WooCommerce Your company won't replace you with good AI. They'll replace you with bad AI. How to Use an SVG Icon Search Engine as a Claude Custom Connector O fim do “modelo que faz tudo”? Conheça o Conductor, a IA que orquestra outras IAs 10 First-Principles Strategies to Learn Any Programming Language Deeply 10 First-Principles Strategies to Learn Any Programming Language Deeply Understanding Embeddings easily. The Hidden Cost of “Move Fast and Break Things” Why Your Logs Are Useless Without Traces DressCode: Your AI Stylist for Tomorrow The Documented Shortcoming of Our Production Treasure Hunt Engine I'm 16, and I Built an AI Tool That Audits Your Technical Debt Without Ever Touching code Building Your Own Crypto Poker Bot: A Developer's Guide to Blockchain Gaming Logic Apache Iceberg Metadata Tables: Querying the Internals Hermes, The Self-Improving Agent You Can Actually Run Yourself Unity vs Unreal: 5 Things I Had to Relearn the Hard Way Building Agentic Commerce Infrastructure: Overcoming SQLite Concurrency for Autonomous Procurement Agents Solana Accounts vs Databases HTML Table Borders I built a skill that makes AI-generated AWS diagrams actually usable My first post! I'm kinda excited The Page Root Was the Wrong Unit How to audit what your IDE extension actually sends to the cloud I Migrated 23 Make.com Scenarios to n8n and Cut My Bill by 60% — Complete Migration Guide (2026) Solving a Logistics Problem Using Genetic Algorithms Claude Code Skills Explained: What They Are & When to Use Them (2026) Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup Zero-Idle Local LLMs: Running Llama 3 in AWS Lambda Containers We scanned 8 B2B SaaS companies across 5 categories. ChatGPT named the same 12 brands in every answer. How To "Market" Yourself As A Tech Pro We scanned 500 MCP servers on Smithery. Here is what we found. HTML Basics for Beginners – Markup Language, Elements and Types of CSS DiffWhisperer: How I Turned Cryptic Git Diffs into Architectural Stories with Gemma 4 I built a version manager for llama.cpp using nothing but vibe coding. Unit Testing vs System Testing: Key Differences, Use Cases, and Best Practices for 2026 A game design textbook explains why products with fewer features win How to Build a Raydium Launchpad Bonding Curve in 5 Minutes with forgekit How to turn an AI prototype into a production system How Data Lake Table Storage Degrades Over Time Partition and Sort Keys on DynamoDB: Modeling data for batch-and-stream convergence Auto-Generate Optimized GitHub Actions Workflows For Any Stack With This New CLI Tool Unchaining the African Creator Economy The Treasure Hunt Engine Gotcha - A Lesson in Constrained Performance great_cto v2.17 - no more tambourine dance When Catalogs Are Embedded in Storage SafeMind AI: Instant Health & Safety Intelligence What Is PKCE, How It Works & Flow Examples AI Agent Failure Modes Beyond Hallucination Fastest Way to Understand Stryker Solana Accounts Explained to a Web2 Developer TV Yayın Akışı Sitesi Geliştirirken Öğrendiğim Teknik Dersler $500 Challenge Drop My First Look at Google's Gemma 4: A Quick Introduction How I use an LLM as a translation judge Best Calendar and Scheduling API for Developers — 2026 Comparison Agentic AI in Travel: Why UCP Isn't Travel-Ready Yet — and What We Measured I Finished Machine Learning. And Then Changed The Plan.
Migrating to Apache Iceberg: Strategies for Every Source System
Alex Merced · 2026-05-23 · via DEV Community

This is Part 15, the final article of a 15-part Apache Iceberg Masterclass. Part 14 covered hands-on Dremio Cloud. This article covers the three migration strategies and how to execute a zero-downtime migration using the view swap pattern.

Most organizations do not start with Iceberg. They have years of data in Hive tables, data warehouses, CSV files, databases, and Parquet directories. Moving this data to Iceberg is not an all-or-nothing project. The best migrations happen incrementally, one dataset at a time, with no disruption to existing consumers.

Table of Contents

  1. What Are Table Formats and Why Were They Needed?
  2. The Metadata Structure of Current Table Formats
  3. Performance and Apache Iceberg's Metadata
  4. Technical Deep Dive on Partition Evolution
  5. Technical Deep Dive on Hidden Partitioning
  6. Writing to an Apache Iceberg Table
  7. What Are Lakehouse Catalogs?
  8. Embedded Catalogs: S3 Tables and MinIO AI Stor
  9. How Iceberg Table Storage Degrades Over Time
  10. Maintaining Apache Iceberg Tables
  11. Apache Iceberg Metadata Tables
  12. Using Iceberg with Python and MPP Engines
  13. Streaming Data into Apache Iceberg Tables
  14. Hands-On with Iceberg Using Dremio Cloud
  15. Migrating to Apache Iceberg

Three Migration Strategies

Three paths to Iceberg: in-place migration, full rewrite, and shadow migration

1. In-Place Migration (Metadata Only)

In-place migration creates Iceberg metadata over existing Parquet or ORC files without copying or moving them. The data files stay exactly where they are; only new Iceberg metadata is created to track them.

Spark example:

CALL system.migrate('db.existing_hive_table')

Enter fullscreen mode Exit fullscreen mode

This converts a Hive table to Iceberg by scanning its files and creating the Iceberg metadata tree (metadata.json, manifest list, manifest files) that references them. The Parquet files are untouched.

Pros: Fast. No data movement. The table becomes queryable as Iceberg immediately.

Cons: The existing file layout (sizes, partitioning, sort order) is inherited. If the original files are poorly organized, you inherit those problems. Requires the original files to be in Parquet or ORC format.

2. Full Rewrite (CTAS)

A full rewrite reads data from any source and writes it as a new Iceberg table with optimal partitioning and file sizes:

-- Spark
CREATE TABLE iceberg_catalog.analytics.orders
USING iceberg
PARTITIONED BY (day(order_date))
AS SELECT * FROM hive_catalog.legacy.orders

-- Dremio
CREATE TABLE analytics.orders
PARTITION BY (day(order_date))
AS SELECT * FROM legacy_source.public.orders

Enter fullscreen mode Exit fullscreen mode

Pros: Best result. Optimal file sizes, correct sort order, proper partitioning. The table is perfectly organized from day one.

Cons: Requires reading and writing all data, which takes time and compute resources. The source system must be available during the migration.

3. Shadow Migration (Build and Swap)

Shadow migration builds the Iceberg table alongside the existing source, then swaps consumers from old to new when ready:

  1. Create a new Iceberg table with the desired schema and partitioning
  2. Backfill historical data from the legacy source
  3. Set up incremental sync to keep the Iceberg table current
  4. Validate data quality between old and new
  5. Swap consumer views from legacy to Iceberg

Pros: Zero downtime. Consumers never see a disruption. You can validate the migration before committing to it.

Cons: Temporarily doubles storage costs. Requires maintaining two copies during the transition.

Choosing the Right Strategy

Decision tree for selecting the right migration strategy based on downtime tolerance and layout changes

Source Recommended Strategy
Hive table (Parquet files) In-place migration, then compact
Data warehouse (Snowflake, Redshift) Full rewrite via Dremio federation
CSV/JSON files in S3 Full rewrite with COPY INTO
PostgreSQL/MySQL Full rewrite or shadow migration
Delta Lake tables In-place conversion or rewrite
Production system (no downtime) Shadow migration with view swap

The View Swap Pattern

The zero-downtime view swap pattern: views point to legacy first, then switch to Iceberg

The view swap pattern is the recommended approach for production migrations. It uses Dremio's semantic layer to create an abstraction between consumers and the underlying data:

Phase 1: Federation

Create views in Dremio that point to the legacy data source:

CREATE VIEW analytics.orders AS
SELECT order_id, customer_id, order_date, amount, status, region
FROM postgres_source.public.orders

Enter fullscreen mode Exit fullscreen mode

All consumers (dashboards, reports, notebooks) query through these views. They do not know or care where the data physically lives.

Phase 2: Build Iceberg

Create and populate the Iceberg table:

-- Create the Iceberg table
CREATE TABLE iceberg_data.analytics.orders (
    order_id BIGINT, customer_id BIGINT,
    order_date DATE, amount DECIMAL(10,2),
    status VARCHAR, region VARCHAR
) PARTITION BY (day(order_date))

-- Backfill from the legacy source
INSERT INTO iceberg_data.analytics.orders
SELECT * FROM postgres_source.public.orders

Enter fullscreen mode Exit fullscreen mode

Phase 3: Validate

Compare the two datasets to confirm data integrity:

SELECT
  (SELECT COUNT(*) FROM postgres_source.public.orders) AS legacy_count,
  (SELECT COUNT(*) FROM iceberg_data.analytics.orders) AS iceberg_count

Enter fullscreen mode Exit fullscreen mode

Beyond row counts, validate aggregates (total amounts, distinct customer counts) and spot-check individual records. A comprehensive validation script should compare:

  • Total row count
  • Column-level checksums or hash aggregates
  • Distinct value counts for key columns
  • Boundary values (MIN/MAX) for numeric and date columns
  • Sample of specific records matched by primary key

Only proceed to the swap after all validation checks pass.

Phase 4: Swap

Update the view to point to the Iceberg table:

CREATE OR REPLACE VIEW analytics.orders AS
SELECT order_id, customer_id, order_date, amount, status, region
FROM iceberg_data.analytics.orders

Enter fullscreen mode Exit fullscreen mode

Consumers notice nothing. The view name is the same. The query interface is the same. But now the data is served from Iceberg with all of its advantages: time travel, hidden partitioning, metadata-driven pruning, and automatic optimization.

Migrating One Table at a Time

The view swap pattern enables incremental migration. You do not need to migrate everything at once:

  1. Week 1: Migrate the highest-value table (e.g., orders)
  2. Week 2: Migrate the next table (e.g., customers)
  3. Continue until all critical tables are on Iceberg

During the transition, Dremio's federation queries legacy and Iceberg tables together. A join between a PostgreSQL table and an Iceberg table works the same as a join between two Iceberg tables. The migration is invisible to consumers.

Post-Migration Checklist

After migrating each table:

Common Migration Pitfalls

Migrating without testing query performance: Always benchmark critical queries against the new Iceberg table before switching production traffic. Iceberg's partition layout and file organization affect performance, and a migration can make some queries faster but others slower if the partition strategy is wrong.

Skipping the validation phase: Data discrepancies between the old and new systems are more common than expected. Schema differences, timezone handling, null semantics, and data type precision can all cause subtle mismatches. Validate thoroughly.

Migrating everything at once: Large "big bang" migrations carry high risk. If something goes wrong, rolling back is complex and time-consuming. Migrate one table at a time, validate each one, and build confidence incrementally.

This completes the Apache Iceberg Masterclass. The series covered table formats, metadata, performance, partitioning, writes, catalogs, maintenance, tooling, and migration. For hands-on practice, start a Dremio Cloud trial and follow the workflow in Part 14.

Books to Go Deeper

Free Resources