
























You wrote a migration. It passed code review. The reviewer checked the SQL syntax, confirmed the column name is not a typo, and approved the PR. CI was green. Then it ran on the staging database and everything looked fine.
Then it hit production and locked a 30-million-row table for four minutes while the ALTER TABLE rewrote every row. The deploy timed out. The rollback took another two minutes because the reverse migration was never tested either.
Database migrations are the highest-risk code you deploy. They touch production data directly, they cannot be rolled back with a simple git revert, and the consequences of a bad one range from “five-minute read-only outage” to “the backup restore from last night.” Despite this, most teams test migrations with nothing more than “did the SQL parse?”
This post is the CI pipeline I wish every team had before their first migration incident. It covers the four things you should validate for every migration: syntax, rollback, data preservation, and performance impact. Each check runs automatically on every PR, and the whole thing takes about an hour to set up.
A migration PR looks like one file with a few lines of SQL. It is easy to review for typos and logic errors. It is almost impossible to review for the things that actually cause production incidents:
ALTER TABLE on a large table acquires an ACCESS EXCLUSIVE lock. Will it finish before your connection timeout?None of these are visible from reading the SQL. A CI pipeline that actually tests migrations catches all of them.
Every migration PR triggers four stages, each building on the last:
Stage 1: Syntax validation --- 5 seconds
Stage 2: Rollback verification --- 30 seconds
Stage 3: Data-preservation audit --- 2 minutes
Stage 4: Performance impact check --- 5 minutes
The pipeline spins up a fresh Postgres container for every PR branch, applies the migration, runs the checks, and tears it down. No shared state. No leftover schemas from a previous run.
This is the floor. It catches typos, missing semicolons, and reference errors. Most teams run this. Some teams skip it and then wonder why the migration failed with ERROR: syntax error at or near "AERT".
The implementation is trivial with any CI runner. Here is a GitHub Actions job that validates every migration file against a fresh Postgres:
migration-validate:
runs-on: ubuntu-latest
services:
postgres:
image: postgres:16
env:
POSTGRES_DB: test_migrations
POSTGRES_PASSWORD: test
options: >-
--health-cmd pg_isready
--health-interval 5s
--health-timeout 5s
--health-retries 5
steps:
- uses: actions/checkout@v4
- name: Validate migrations
run: |
for f in migrations/*.sql; do
echo "Validating $f..."
psql "$DATABASE_URL" -f "$f" > /dev/null
if [ $? -ne 0 ]; then
echo "FAILED: $f"
exit 1
fi
done
env:
DATABASE_URL: postgres://postgres:test@localhost:5432/test_migrations
This runs every SQL file in order against a fresh database. If any file has a syntax error, the job fails and the PR cannot merge.
One refinement: run each migration file in a separate transaction and roll it back afterward, so the validation does not leave artifacts that affect the next file. This catches errors in individual migrations without depending on ordering:
for f in migrations/*.sql; do
echo "Validating $f..."
psql "$DATABASE_URL" -c "BEGIN;"
psql "$DATABASE_URL" -f "$f" > /dev/null
if [ $? -ne 0 ]; then
psql "$DATABASE_URL" -c "ROLLBACK;"
echo "FAILED: $f"
exit 1
fi
psql "$DATABASE_URL" -c "ROLLBACK;"
done
Up migrations get all the attention. Down migrations get a “I will write it later” that never comes. When the incident hits and the team needs to roll back, the down migration either does not exist or was never tested.
Rollback verification applies the up migration, then applies the down migration, then checks that the database schema matches the original state. The exact schema is checked, not just “did the SQL run without errors.”
# Capture the schema before the migration.
pg_dump --schema-only "$DATABASE_URL" > schema_before.sql
# Apply the up migration.
psql "$DATABASE_URL" -f migrations/20260613_add_status_column.up.sql
# Apply the down migration.
psql "$DATABASE_URL" -f migrations/20260613_add_status_column.down.sql
# Capture the schema after rollback.
pg_dump --schema-only "$DATABASE_URL" > schema_after.sql
# Compare. They should be identical.
diff schema_before.sql schema_after.sql || {
echo "ROLLBACK TEST FAILED: schema does not match original"
exit 1
}
A diff on pg_dump --schema-only catches the common rollback bugs: the down migration drops a column the up migration created but forgets to restore the original default value, or it drops the index but the up migration created two indexes and the down only drops one.
This test saved my team when a migration added a CHECK constraint with a down migration that only dropped the constraint by name, but the up migration had run after the constraint was renamed by an earlier migration. The schema diff caught the mismatch immediately.
The first two stages validate the schema. They do not validate the data. A migration that transforms data (splitting a column, backfilling values, moving data between tables) can run without errors and silently corrupt or lose data.
Data-preservation testing starts from a seed database containing representative data. This is not a full production restore (that is expensive and may contain PII). It is a purpose-built fixture set that exercises every edge case your migration might encounter.
// test/migrations/data-preservation.test.ts
import { describe, it, expect, beforeAll, afterAll } from 'vitest';
import { Client } from 'pg';
const client = new Client({
connectionString: process.env.DATABASE_URL,
});
// Seed data that exercises edge cases.
const USERS = [
{ id: 1, email: 'alice@example.com', email_address: null },
{ id: 2, email: null, email_address: 'bob@example.com' },
{ id: 3, email: 'carol@example.com', email_address: 'carol-work@example.com' },
{ id: 4, email: '', email_address: null },
];
beforeAll(async () => {
await client.connect();
// Apply the migration under test.
await client.query(`
INSERT INTO users (id, email, email_address)
VALUES ${USERS.map((u, i) => `($${i * 3 + 1}, $${i * 3 + 2}, $${i * 3 + 3})`).join(', ')}
`, USERS.flatMap(u => [u.id, u.email, u.email_address]));
});
afterAll(async () => {
await client.end();
});
describe('merge_email_addresses migration', () => {
it('combines email and email_address into a single field', async () => {
// Apply the migration that merges the two columns.
// ... apply migration logic ...
const result = await client.query('SELECT id, email FROM users ORDER BY id');
expect(result.rows[0].email).toBe('alice@example.com'); // preferred field
expect(result.rows[1].email).toBe('bob@example.com'); // only had email_address
expect(result.rows[2].email).toBe('carol-work@example.com');// email_address takes priority
expect(result.rows[3].email).toBe(''); // both were empty/null
});
it('produces the same result after rollback and re-apply', async () => {
// Roll back, seed fresh, apply again, verify idempotency.
const before = await client.query('SELECT count(*) as cnt FROM users');
// Roll down then up again...
const after = await client.query('SELECT count(*) as cnt FROM users');
expect(after.rows[0].cnt).toBe(before.rows[0].cnt);
});
});
The fixture data covers four cases: only the old column populated, only the new column populated, both columns populated (conflict resolution), and the empty-string edge case. Most migration bugs hide in these edge cases, not in the happy path.
The most expensive thing you can do in a CI pipeline is guess how long a migration will take on production. The second most expensive thing is not checking at all.
A migration that runs fine on an empty test database can take twenty minutes on a production table with 50 million rows. The CI pipeline cannot replicate your production data volume, but it can flag migrations that will obviously cause problems.
Use EXPLAIN to check for table rewrites before the migration runs:
# For ALTER TABLE statements, check if they require a rewrite.
# Postgres 11+ logs this. For older versions, check the ALTER TABLE docs.
# Use EXPLAIN to estimate the work involved:
psql "$DATABASE_URL" -c "
SELECT 'Table rewrite required' AS warning
FROM pg_class
WHERE relname = 'users'
AND reltuples > 100000
" | grep 'rewrite'
A more practical approach is to run the migration against a table with a realistic number of rows. Restore a subset of production data (without PII) into the CI database, time the migration, and fail if it exceeds a threshold.
# Insert 500k rows to create a realistic table size for benchmarking.
psql "$DATABASE_URL" -c "
INSERT INTO users (email, created_at)
SELECT
'user' || generate_series(1, 500000) || '@example.com',
now() - random() * interval '365 days'
"
# Time the migration.
START=$(date +%s%N)
psql "$DATABASE_URL" -f migrations/20260613_add_index.up.sql
END=$(date +%s%N)
DURATION=$(( (END - START) / 1000000 ))
echo "Migration took ${DURATION}ms"
if [ "$DURATION" -gt 10000 ]; then
echo "WARNING: Migration took >10s on 500k rows. Check production impact."
fi
The 500k-row benchmark does not tell you exactly how long it will take on a 50-million-row production table. But it gives you an order-of-magnitude estimate. If it takes 8 seconds on 500k rows, it will take roughly 800 seconds on 50 million rows. That is enough information to block the merge and ask the author to refactor the migration using the expand-migrate-contract pattern.
Here is the complete GitHub Actions workflow that combines all four stages:
name: Migration CI
on:
pull_request:
paths:
- 'migrations/**'
jobs:
migrate:
runs-on: ubuntu-latest
services:
postgres:
image: postgres:16
env:
POSTGRES_DB: test_migrations
POSTGRES_PASSWORD: test
options: >-
--health-cmd pg_isready
--health-interval 5s
--health-timeout 5s
--health-retries 5
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
- name: Stage 1 - Syntax validation
run: |
for f in migrations/*.up.sql; do
psql "$DATABASE_URL" -c "BEGIN;" > /dev/null
psql "$DATABASE_URL" -f "$f" > /dev/null
psql "$DATABASE_URL" -c "ROLLBACK;" > /dev/null
echo "OK: $f"
done
env:
DATABASE_URL: postgres://postgres:test@localhost:5432/test_migrations
- name: Stage 2 - Rollback verification
run: |
pg_dump --schema-only "$DATABASE_URL" > schema_before.sql
for f in migrations/*.up.sql; do
psql "$DATABASE_URL" -f "$f" > /dev/null
done
# Apply down migrations in reverse order.
for f in $(ls migrations/*.down.sql | sort -r); do
psql "$DATABASE_URL" -f "$f" > /dev/null
done
pg_dump --schema-only "$DATABASE_URL" > schema_after.sql
diff schema_before.sql schema_after.sql || {
echo "ROLLBACK FAILED"
diff schema_before.sql schema_after.sql
exit 1
}
echo "Rollback test passed"
env:
DATABASE_URL: postgres://postgres:test@localhost:5432/test_migrations
- name: Stage 3 - Data preservation
run: |
npm ci
npm run test:migrations
env:
DATABASE_URL: postgres://postgres:test@localhost:5432/test_migrations
- name: Stage 4 - Performance check
run: |
# Seed 500k rows for a realistic table size.
psql "$DATABASE_URL" -c "
INSERT INTO users (email, created_at)
SELECT 'benchmark' || generate_series(1, 500000) || '@test.com', now();
"
START=$(date +%s%N)
psql "$DATABASE_URL" -f migrations/20260613_add_index.up.sql
END=$(date +%s%N)
DURATION=$(( (END - START) / 1000000 ))
echo "Migration duration: ${DURATION}ms"
if [ "$DURATION" -gt 10000 ]; then
echo "FAIL: Migration took >10s on 500k rows. Refactor required."
exit 1
fi
env:
DATABASE_URL: postgres://postgres:test@localhost:5432/test_migrations
This pipeline runs in about eight minutes on a fresh Postgres container. The syntax check is nearly instant. The rollback test takes about thirty seconds. The data-preservation tests take a minute or two. The performance check takes five minutes including the seed step.
In the six months since my team adopted this pipeline, it caught:
COALESCE(email, email_address) but null coalesces to null when both are null, losing the empty-string distinction (stage 3).Each of these was caught in CI, not in an incident postmortem.
Not every migration needs the full pipeline. An ALTER TABLE that adds a nullable column with no default on a table with 100 rows is low risk. Running the performance benchmark on it is noise.
Use a convention-based approach: if the migration touches a table in the critical_tables list (defined in a config file at the repo root), run the full pipeline. Otherwise, run stages 1 and 2 only.
- name: Skip performance check for non-critical tables
run: |
table_name=$(grep -oP '(?<=TABLE\s+)(\w+)' migrations/*.up.sql)
if grep -q "$table_name" critical_tables.txt 2>/dev/null; then
echo "Running full pipeline for table: $table_name"
else
echo "Skipping performance check for: $table_name"
exit 0
fi
A CI pipeline for migrations costs about an hour of setup time and eight minutes per PR run. The alternative is a production incident that costs hours of emergency debugging, a rollback that might not work, and a postmortem that says “we should have tested the migration.” Hour for hour, testing migrations in CI is one of the highest-return investments you can make in your deployment pipeline.
The next time someone opens a PR with a single ALTER TABLE ADD COLUMN and says “it’s just one line, it’s fine,” point them at this pipeline. One line of SQL that runs on a table with 50 million rows is the most dangerous change you will merge all week. Test it like it is.
The discipline of testing database migrations before they reach production is the kind of engineering rigor that separates teams who ship confidently from teams who treat every deploy as a prayer. Yojji’s teams build CI pipelines that validate schema changes, test rollbacks against real data shapes, and catch performance regressions before they lock a production table. It is the unglamorous, high-return work that keeps production stable and deploys boring.
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。