Testing Database Migrations in CI: Catching Broken Schema Changes Before They Hit Production

The Practical Developer

The Practica · 2026-06-13 · via The Practical Developer

You wrote a migration. It passed code review. The reviewer checked the SQL syntax, confirmed the column name is not a typo, and approved the PR. CI was green. Then it ran on the staging database and everything looked fine.

Then it hit production and locked a 30-million-row table for four minutes while the ALTER TABLE rewrote every row. The deploy timed out. The rollback took another two minutes because the reverse migration was never tested either.

Database migrations are the highest-risk code you deploy. They touch production data directly, they cannot be rolled back with a simple git revert, and the consequences of a bad one range from “five-minute read-only outage” to “the backup restore from last night.” Despite this, most teams test migrations with nothing more than “did the SQL parse?”

This post is the CI pipeline I wish every team had before their first migration incident. It covers the four things you should validate for every migration: syntax, rollback, data preservation, and performance impact. Each check runs automatically on every PR, and the whole thing takes about an hour to set up.

Why code review is not enough for migrations

A migration PR looks like one file with a few lines of SQL. It is easy to review for typos and logic errors. It is almost impossible to review for the things that actually cause production incidents:

Lock duration: Most forms of ALTER TABLE take an ACCESS EXCLUSIVE lock, which blocks both reads and writes on the table. Will it finish before your connection timeout?
Data loss: Does the down migration actually restore the data the up migration transformed? Or does it just drop the column and hope nobody noticed?
Hidden dependencies: Does some other team’s cron job query a column you are dropping? Does a materialized view reference a table you are renaming?
Performance regression: Does adding that index cause a write-throughput drop? Does dropping that index make a query that used to run in 5ms run in 5 seconds?
Partial failure: The migration runs five ALTER TABLE statements. The third one fails. Is the database in a consistent state?

None of these are visible from reading the SQL. A CI pipeline that actually runs the migration surfaces most of them before the PR merges.

The four-layer migration test pipeline

Every migration PR triggers four stages, each building on the last:

Stage 1: Syntax validation         ---  5 seconds
Stage 2: Rollback verification      --- 30 seconds
Stage 3: Data-preservation audit    ---  2 minutes
Stage 4: Performance impact check   ---  5 minutes

The pipeline spins up a fresh Postgres container for every PR branch, applies the migration, runs the checks, and tears it down. No shared state. No leftover schemas from a previous run.

Stage 1: Syntax validation

This is the floor. It catches typos, missing semicolons, and reference errors. Most teams run this. Some teams skip it and then wonder why the migration failed with ERROR: syntax error at or near "AERT".

The implementation is trivial with any CI runner. Here is a GitHub Actions job that validates every migration file against a fresh Postgres:

migration-validate:
  runs-on: ubuntu-latest
  services:
    postgres:
      image: postgres:16
      env:
        POSTGRES_DB: test_migrations
        POSTGRES_PASSWORD: test
      options: >-
        --health-cmd pg_isready
        --health-interval 5s
        --health-timeout 5s
        --health-retries 5
      # Map the port so steps running on the runner can reach localhost:5432.
      ports:
        - 5432:5432
  steps:
    - uses: actions/checkout@v4
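    - name: Validate migrations
      run: |
        for f in migrations/*.sql; do
          echo "Validating $f..."
          # ON_ERROR_STOP makes psql exit non-zero on the first SQL error;
          # without it, psql -f keeps going and still exits 0.
          if ! psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f "$f" > /dev/null; then
            echo "FAILED: $f"
            exit 1
          fi
        done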
      env:
        DATABASE_URL: postgres://postgres:test@localhost:5432/test_migrations

This runs every SQL file in order against a fresh database. If any file has a syntax error, the job fails and the PR cannot merge.

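One refinement: run each migration file inside a transaction and roll it back afterward, so the validation does not leave artifacts that affect the next file. The BEGIN, the file, and the ROLLBACK have to go through a single psql invocation, because every psql call opens its own session and a BEGIN sent in one call has no effect on the next. This validates each file against the same starting schema, so a migration that depends on an earlier one still needs that earlier one applied first:

for f in migrations/*.sql; do
  echo "Validating $f..."
  # -c and -f run in order within one session. ON_ERROR_STOP aborts on the
  # first error, and closing the session discards the open transaction.
  if ! psql "$DATABASE_URL" -v ON_ERROR_STOP=1 \
      -c "BEGIN" -f "$f" -c "ROLLBACK" > /dev/null; then
    echo "FAILED: $f"
    exit 1
  fi
done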
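
Wrapping a file in BEGIN and ROLLBACK only works for statements Postgres allows inside a transaction block. CREATE INDEX CONCURRENTLY and VACUUM are two that it rejects. A minimal sketch of a pre-check that decides which files can be validated this way follows; the function names and the short pattern list are my own assumptions, and the check is a regex heuristic rather than a SQL parser:

```typescript
// Statements Postgres refuses to run inside a transaction block.
// Deliberately small list (an assumption), not an exhaustive catalogue.
const NON_TRANSACTIONAL_PATTERNS: RegExp[] = [
  /\bCONCURRENTLY\b/i, // CREATE/DROP INDEX CONCURRENTLY, REINDEX ... CONCURRENTLY
  /^\s*VACUUM\b/im, // VACUUM at the start of a line
];

// Remove "/* block */" and "-- line" comments so commented-out SQL is ignored.
function stripSqlComments(sql: string): string {
  return sql.replace(/\/\*[\s\S]*?\*\//g, " ").replace(/--.*$/gm, " ");
}

// True when the file can be validated as BEGIN; <file>; ROLLBACK.
function canWrapInTransaction(sql: string): boolean {
  const body = stripSqlComments(sql);
  return !NON_TRANSACTIONAL_PATTERNS.some((pattern) => pattern.test(body));
}
```

Files that fail this check need a different validation path, for example applying them for real against the throwaway database.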

Stage 2: Rollback verification

Up migrations get all the attention. Down migrations get an “I will write it later” that never comes. When the incident hits and the team needs to roll back, the down migration either does not exist or was never tested.

Rollback verification applies the up migration, then applies the down migration, then checks that the database schema matches the original state. The exact schema is checked, not just “did the SQL run without errors.”

# Capture the schema before the migration.
pg_dump --schema-only "$DATABASE_URL" > schema_before.sql

# Apply the up migration. ON_ERROR_STOP makes psql exit non-zero on a SQL error.
psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f migrations/20260613_add_status_column.up.sql

# Apply the down migration.
psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f migrations/20260613_add_status_column.down.sql

# Capture the schema after rollback.
pg_dump --schema-only "$DATABASE_URL" > schema_after.sql

# Compare. They should be identical.
diff schema_before.sql schema_after.sql || {
  echo "ROLLBACK TEST FAILED: schema does not match original"
  exit 1
}

A diff on pg_dump --schema-only catches the common rollback bugs: the up migration changes a column's default and the down migration never restores it, or the up migration creates two indexes and the down migration drops only one.

This test saved my team when a migration added a CHECK constraint with a down migration that only dropped the constraint by name, but the up migration had run after the constraint was renamed by an earlier migration. The schema diff caught the mismatch immediately.
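
A raw diff can also trip on lines that carry no schema information, such as comments or blank lines. The sketch below is one possible complement to the shell diff, not a replacement for it: it normalizes two dumps and reports which lines disappeared or were left behind. The function and field names are hypothetical, and treating lines that start with a backslash as ignorable psql meta-commands is an assumption about the dump format:

```typescript
// Drop lines that carry no schema information: blank lines, SQL comments,
// and psql meta-commands (lines starting with a backslash, an assumption).
function normalizeSchemaDump(dump: string): string[] {
  return dump
    .split(/\r?\n/)
    .map((line) => line.trimEnd())
    .filter((line) => line !== "" && !line.startsWith("--") && !line.startsWith("\\"));
}

interface SchemaDrift {
  missingAfterRollback: string[]; // in the original schema, gone after the down migration
  leftBehind: string[]; // not in the original schema, still present after the down migration
}

// Line-level, order-insensitive comparison of two schema dumps.
function compareSchemaDumps(before: string, after: string): SchemaDrift {
  const beforeLines = normalizeSchemaDump(before);
  const afterLines = normalizeSchemaDump(after);
  const beforeSet = new Set(beforeLines);
  const afterSet = new Set(afterLines);
  return {
    missingAfterRollback: beforeLines.filter((line) => !afterSet.has(line)),
    leftBehind: afterLines.filter((line) => !beforeSet.has(line)),
  };
}
```

Because the comparison ignores order, it will not notice a column that was dropped and re-added in a different position. Keep the plain diff as the gate and use a report like this to make the failure message easier to read.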

Stage 3: Data-preservation audit

The first two stages validate the schema. They do not validate the data. A migration that transforms data (splitting a column, backfilling values, moving data between tables) can run without errors and silently corrupt or lose data.

Data-preservation testing starts from a seed database containing representative data. This is not a full production restore (that is expensive and may contain PII). It is a purpose-built fixture set that exercises every edge case your migration might encounter.

// test/migrations/data-preservation.test.ts
import { readFile } from 'node:fs/promises';
import { describe, it, expect, beforeAll, afterAll } from 'vitest';
import { Client } from 'pg';

const client = new Client({
  connectionString: process.env.DATABASE_URL,
});

// Hypothetical paths for the migration under test; point these at your own files.
const UP_FILE = 'migrations/merge_email_addresses.up.sql';
const DOWN_FILE = 'migrations/merge_email_addresses.down.sql';

// Send a whole SQL file in one query call. Without bind parameters, pg uses
// the simple query protocol, which accepts multiple statements.
async function applySqlFile(path: string): Promise<void> {
  await client.query(await readFile(path, 'utf8'));
}

// Seed data that exercises edge cases.
const USERS = [
  { id: 1, email: 'alice@example.com', email_address: null },
  { id: 2, email: null, email_address: 'bob@example.com' },
  { id: 3, email: 'carol@example.com', email_address: 'carol-work@example.com' },
  { id: 4, email: '', email_address: null },
];

beforeAll(async () => {
  await client.connect();
  // Seed the pre-migration table; the migration itself runs inside the tests.
  await client.query(`
    INSERT INTO users (id, email, email_address)
    VALUES ${USERS.map((_, i) => `($${i * 3 + 1}, $${i * 3 + 2}, $${i * 3 + 3})`).join(', ')}
  `, USERS.flatMap(u => [u.id, u.email, u.email_address]));
});

afterAll(async () => {
  await client.end();
});

describe('merge_email_addresses migration', () => {
  it('combines email and email_address into a single field', async () => {
    // Apply the migration that merges the two columns.
    await applySqlFile(UP_FILE);

    const result = await client.query('SELECT id, email FROM users ORDER BY id');

    expect(result.rows[0].email).toBe('alice@example.com');     // only had email
    expect(result.rows[1].email).toBe('bob@example.com');       // only had email_address
    expect(result.rows[2].email).toBe('carol-work@example.com');// email_address takes priority
    expect(result.rows[3].email).toBe('');                       // empty string is preserved
  });

  it('keeps every row after rollback and re-apply', async () => {
    // count(*) comes back from pg as a string, so this compares strings.
    const before = await client.query('SELECT count(*) as cnt FROM users');
    // Roll down, then up again.
    await applySqlFile(DOWN_FILE);
    await applySqlFile(UP_FILE);
    const after = await client.query('SELECT count(*) as cnt FROM users');
    expect(after.rows[0].cnt).toBe(before.rows[0].cnt);
  });
});

The fixture data covers four cases: only the old column populated, only the new column populated, both columns populated (conflict resolution), and the empty-string edge case. Most migration bugs hide in these edge cases, not in the happy path.
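
It helps to write the intended merge rule down as a plain function, so the expected values in the test come from one place instead of being typed by hand. The sketch below infers the rule from the four fixture expectations: email_address wins when it is non-NULL, otherwise the value falls back to email. The function name is hypothetical, and the rule is my reading of the fixtures, not a statement of what the migration SQL contains:

```typescript
type NullableText = string | null;

// Expected merge rule, inferred from the fixture expectations:
// email_address wins when it is non-NULL, otherwise fall back to email.
// Empty strings count as real values, the same way SQL COALESCE treats them.
function expectedMergedEmail(email: NullableText, emailAddress: NullableText): NullableText {
  return emailAddress ?? email;
}

// The four fixture rows from the test above, reduced to the two columns.
const FIXTURE_CASES: { email: NullableText; emailAddress: NullableText }[] = [
  { email: "alice@example.com", emailAddress: null },
  { email: null, emailAddress: "bob@example.com" },
  { email: "carol@example.com", emailAddress: "carol-work@example.com" },
  { email: "", emailAddress: null },
];

const expectedEmails = FIXTURE_CASES.map((row) => expectedMergedEmail(row.email, row.emailAddress));
```

A test can then compare each migrated row against expectedEmails, and a new edge case only has to be added to the fixture list.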

Stage 4: Performance impact check

The most expensive thing you can do in a CI pipeline is guess how long a migration will take on production. The second most expensive thing is not checking at all.

A migration that runs fine on an empty test database can take twenty minutes on a production table with 50 million rows. The CI pipeline cannot replicate your production data volume, but it can flag migrations that will obviously cause problems.

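EXPLAIN does not accept DDL such as ALTER TABLE, so Postgres will not tell you up front whether a statement rewrites the table. A cheap proxy is to warn whenever the migration targets a table the planner already considers large:

# pg_class.reltuples is the planner's row estimate, refreshed by ANALYZE and
# autovacuum, so run this against a database with realistic statistics.
# Since Postgres 11, ADD COLUMN with a constant default no longer rewrites the
# table. For other ALTER TABLE forms, check the ALTER TABLE docs.
psql "$DATABASE_URL" -At -c "
  SELECT 'Possible table rewrite on large table: ' || relname AS warning
  FROM pg_class
  WHERE relname = 'users'
    AND relkind = 'r'
    AND reltuples > 100000
" | grep 'rewrite' || echo "No large tables affected"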
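
Some of the "obviously a problem" cases can be flagged from the SQL text alone, before anything runs. The sketch below is a small static lint under stated assumptions: the rule list is mine, it uses regexes rather than a SQL parser, and it will produce both false positives and false negatives. Treat a hit as a prompt for a closer look, not a verdict:

```typescript
interface MigrationWarning {
  statement: string;
  reason: string;
}

// Heuristic rules only: regexes over SQL text, not a real parser.
const RISK_RULES: { test: (stmt: string) => boolean; reason: string }[] = [
  {
    test: (s) =>
      /\bALTER\s+TABLE\b[\s\S]*\bALTER\s+(COLUMN\s+)?\w+\s+(SET\s+DATA\s+)?TYPE\b/i.test(s),
    reason: "changing a column type usually rewrites the table",
  },
  {
    test: (s) => /\bCREATE\s+(UNIQUE\s+)?INDEX\b/i.test(s) && !/\bCONCURRENTLY\b/i.test(s),
    reason: "CREATE INDEX without CONCURRENTLY blocks writes while it builds",
  },
  {
    test: (s) => /\bADD\s+COLUMN\b[\s\S]*\bNOT\s+NULL\b/i.test(s) && !/\bDEFAULT\b/i.test(s),
    reason: "adding a NOT NULL column without a default fails on a populated table",
  },
  {
    test: (s) => /\bSET\s+NOT\s+NULL\b/i.test(s),
    reason: "SET NOT NULL scans the whole table while holding the lock",
  },
];

// Split the file into statements and collect every rule that matches.
function lintMigration(sql: string): MigrationWarning[] {
  const statements = sql
    .replace(/--.*$/gm, "")
    .split(";")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  const warnings: MigrationWarning[] = [];
  for (const statement of statements) {
    for (const rule of RISK_RULES) {
      if (rule.test(statement)) warnings.push({ statement, reason: rule.reason });
    }
  }
  return warnings;
}
```

Splitting on semicolons breaks on function bodies and string literals that contain one, which is acceptable for a warning-only check but not for anything that gates a merge on its own.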

A more practical approach is to run the migration against a table with a realistic number of rows. Restore a subset of production data (without PII) into the CI database, or generate synthetic rows as below, then time the migration and fail if it exceeds a threshold.

# Insert 500k rows to create a realistic table size for benchmarking.
psql "$DATABASE_URL" -c "
  INSERT INTO users (email, created_at)
  SELECT
    'user' || generate_series(1, 500000) || '@example.com',
    now() - random() * interval '365 days'
"

# Time the migration.
START=$(date +%s%N)
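# ON_ERROR_STOP keeps a failed (and therefore fast) migration from passing the timing check.
psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f migrations/20260613_add_index.up.sql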
END=$(date +%s%N)
DURATION=$(( (END - START) / 1000000 ))

echo "Migration took ${DURATION}ms"

if [ "$DURATION" -gt 10000 ]; then
  echo "WARNING: Migration took >10s on 500k rows. Check production impact."
fi

The 500k-row benchmark does not tell you exactly how long the migration will take on a 50-million-row production table, but it gives you an order-of-magnitude estimate. Scaling linearly, 8 seconds on 500k rows becomes roughly 800 seconds on 50 million rows. Index builds tend to scale worse than linearly, so treat that figure as a floor. That is enough information to block the merge and ask the author to refactor the migration using the expand-migrate-contract pattern.
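
The extrapolation is simple enough to put in a helper so the CI log prints the projected production duration next to the measured one. This is a sketch with hypothetical function names. The linear model is the one used in the paragraph above, and the n log n variant is my assumption for sort-heavy work such as building a B-tree index:

```typescript
// Linear scaling: assumes cost grows in proportion to row count.
function extrapolateLinearMs(measuredMs: number, benchRows: number, prodRows: number): number {
  return measuredMs * (prodRows / benchRows);
}

// n log n scaling: a rough model for sort-heavy work such as an index build.
function extrapolateNLogNMs(measuredMs: number, benchRows: number, prodRows: number): number {
  const scale = (prodRows * Math.log(prodRows)) / (benchRows * Math.log(benchRows));
  return measuredMs * scale;
}

// 8 seconds measured on 500k rows, projected onto 50 million rows.
const linearEstimateMs = extrapolateLinearMs(8000, 500000, 50000000);
const nLogNEstimateMs = extrapolateNLogNMs(8000, 500000, 50000000);
console.log(`linear: ${Math.round(linearEstimateMs / 1000)}s`);
console.log(`n log n: ${Math.round(nLogNEstimateMs / 1000)}s`);
```

Neither number accounts for production hardware, concurrent load, or cache state, so the useful output is the order of magnitude, not the seconds.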

Putting it together: a full PR pipeline

Here is the complete GitHub Actions workflow that combines all four stages:

name: Migration CI

on:
  pull_request:
    paths:
      - 'migrations/**'

jobs:
  migrate:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: test_migrations
          POSTGRES_PASSWORD: test
        options: >-
          --health-cmd pg_isready
          --health-interval 5s
          --health-timeout 5s
          --health-retries 5
        # Map the port so steps running on the runner can reach localhost:5432.
        ports:
          - 5432:5432

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4

      - name: Stage 1 - Syntax validation
        run: |
          for f in migrations/*.up.sql; do
            # One psql session per file, so BEGIN and ROLLBACK share a connection.
            # ON_ERROR_STOP makes psql exit non-zero on the first SQL error.
            psql "$DATABASE_URL" -v ON_ERROR_STOP=1 \
              -c "BEGIN" -f "$f" -c "ROLLBACK" > /dev/null
            echo "OK: $f"
          done
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/test_migrations

      - name: Stage 2 - Rollback verification
        run: |
          pg_dump --schema-only "$DATABASE_URL" > schema_before.sql
          for f in migrations/*.up.sql; do
            psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f "$f" > /dev/null
          done
          # Apply down migrations in reverse order.
          for f in $(ls migrations/*.down.sql | sort -r); do
            psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f "$f" > /dev/null
          done
          pg_dump --schema-only "$DATABASE_URL" > schema_after.sql
          # diff -u prints the mismatch once; a non-zero exit fails the step.
          if ! diff -u schema_before.sql schema_after.sql; then
            echo "ROLLBACK FAILED"
            exit 1
          fi
          echo "Rollback test passed"
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/test_migrations

      - name: Stage 3 - Data preservation
        run: |
          npm ci
          npm run test:migrations
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/test_migrations

      - name: Stage 4 - Performance check
        run: |
          # Seed 500k rows for a realistic table size.
          psql "$DATABASE_URL" -c "
            INSERT INTO users (email, created_at)
            SELECT 'benchmark' || generate_series(1, 500000) || '@test.com', now();
          "
          START=$(date +%s%N)
          psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f migrations/20260613_add_index.up.sql
          END=$(date +%s%N)
          DURATION=$(( (END - START) / 1000000 ))
          echo "Migration duration: ${DURATION}ms"
          if [ "$DURATION" -gt 10000 ]; then
            echo "FAIL: Migration took >10s on 500k rows. Refactor required."
            exit 1
          fi
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/test_migrations

This pipeline runs in about eight minutes on a fresh Postgres container. The syntax check is nearly instant. The rollback test takes about thirty seconds. The data-preservation tests take a minute or two. The performance check takes five minutes including the seed step.
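
Stage 2 of the workflow applies every down file in reverse order, which quietly assumes each up file has a partner. A missing down file would simply be skipped. The sketch below checks the pairing and computes the rollback order from a list of file names. It assumes the name.up.sql and name.down.sql convention used in this post; the function names and the create_users file in the usage note are hypothetical:

```typescript
const UP_SUFFIX = ".up.sql";
const DOWN_SUFFIX = ".down.sql";

// Returns the up migrations that have no matching down file.
function findMissingDownMigrations(files: string[]): string[] {
  const downNames = new Set(
    files
      .filter((f) => f.endsWith(DOWN_SUFFIX))
      .map((f) => f.slice(0, -DOWN_SUFFIX.length)),
  );
  return files
    .filter((f) => f.endsWith(UP_SUFFIX))
    .filter((f) => !downNames.has(f.slice(0, -UP_SUFFIX.length)))
    .sort();
}

// Down migrations run newest-first. With timestamp-prefixed names,
// reverse lexicographic order gives exactly that.
function rollbackOrder(files: string[]): string[] {
  return files.filter((f) => f.endsWith(DOWN_SUFFIX)).sort().reverse();
}
```

Given 20260601_create_users.up.sql, 20260601_create_users.down.sql, and 20260613_add_status_column.up.sql, the first function reports the add_status_column file as unpaired, which is a reasonable thing to fail the PR on before the rollback stage even starts.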

What this pipeline catches that reviews miss

In the six months since my team adopted this pipeline, it caught:

A migration that dropped a default value the down migration did not restore (stage 2).
A data backfill that used COALESCE(email, email_address), which returns NULL when both columns are NULL and treats an empty string as a real value, so the NULL and empty-string cases did not come out as intended (stage 3).
An index migration on a table with 500k synthetic rows that took 12 seconds, extrapolating to a 20-minute lock on production (stage 4).
A migration that added a NOT NULL column without a default on a table that already had rows, which would have failed immediately on a populated table (stage 1 passed it as syntax-valid against an empty table; the error only appears at runtime once the table holds rows, which the seeded stages exercise).

Each of these was caught in CI, not in an incident postmortem.

When to skip these tests

Not every migration needs the full pipeline. An ALTER TABLE that adds a nullable column with no default on a table with 100 rows is low risk. Running the performance benchmark on it is noise.

Use a convention-based approach: if the migration touches a table in the critical_tables list (defined in a config file at the repo root), run the full pipeline. Otherwise, run stages 1 and 2 only.

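- name: Decide whether to run the performance check
  id: critical
  run: |
    # PCRE lookbehinds must be fixed-length, so use \K to drop the "TABLE " prefix.
    table_names=$(grep -ohiP 'TABLE\s+\K\w+' migrations/*.up.sql | sort -u || true)
    if grep -qxFf <(echo "$table_names") critical_tables.txt 2>/dev/null; then
      echo "Running full pipeline for: $table_names"
      echo "run_perf=true" >> "$GITHUB_OUTPUT"
    else
      echo "Skipping performance check for: $table_names"
      echo "run_perf=false" >> "$GITHUB_OUTPUT"
    fi
    # "exit 0" only ends this step. Gate stages 3 and 4 with:
    #   if: steps.critical.outputs.run_perf == 'true'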
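
If the pipeline already runs Node for stage 3, the same decision can live in a small function that is easier to unit test than a grep one-liner. This is a sketch with hypothetical names. It is a regex heuristic that skips IF EXISTS, IF NOT EXISTS, and ONLY, and it does not handle quoted identifiers or schema-qualified names:

```typescript
// Pull table names out of ALTER/CREATE/DROP TABLE statements.
function extractTableNames(sql: string): string[] {
  const pattern =
    /\b(?:ALTER|CREATE|DROP)\s+TABLE\s+(?:IF\s+(?:NOT\s+)?EXISTS\s+)?(?:ONLY\s+)?(\w+)/gi;
  const names = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(sql)) !== null) {
    const name = match[1];
    if (name) names.add(name.toLowerCase());
  }
  return Array.from(names).sort();
}

// True when the migration touches any table on the critical list,
// for example the lines of critical_tables.txt.
function needsFullPipeline(sql: string, criticalTables: string[]): boolean {
  const critical = new Set(criticalTables.map((t) => t.trim().toLowerCase()));
  return extractTableNames(sql).some((name) => critical.has(name));
}
```

When in doubt the function should err toward running the full pipeline, so a migration it cannot parse is better treated as critical than skipped.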

A CI pipeline for migrations costs about an hour of setup time and eight minutes per PR run. The alternative is a production incident that costs hours of emergency debugging, a rollback that might not work, and a postmortem that says “we should have tested the migration.” Hour for hour, testing migrations in CI is one of the highest-return investments you can make in your deployment pipeline.

The next time someone opens a PR with a single ALTER TABLE ADD COLUMN and says “it’s just one line, it’s fine,” point them at this pipeline. One line of SQL that runs on a table with 50 million rows is the most dangerous change you will merge all week. Test it like it is.

A note from Yojji

The discipline of testing database migrations before they reach production is the kind of engineering rigor that separates teams who ship confidently from teams who treat every deploy as a prayer. Yojji’s teams build CI pipelines that validate schema changes, test rollbacks against real data shapes, and catch performance regressions before they lock a production table. It is the unglamorous, high-return work that keeps production stable and deploys boring.
