PostgreSQL Backups and Point-in-Time Recovery in Production

The Practical Developer

The Libuv Thread Pool Trap: Why Node.js Async APIs Stall Under Load Postgres Covering Indexes with INCLUDE: Eliminate Heap Fetches on Read-Heavy Workloads Postgres DISTINCT ON: The Fastest Way to Get the Latest Row Per Group Postgres Transaction Isolation: The Anomalies Your App Actually Faces in Production Linux TCP Tuning for Node.js Microservices: The Kernel Settings That Stop Silent Connection Drops Under Load Postgres HOT Updates and Fillfactor: Why Not All Writes Are Created Equal Database Connection Pool Leaks: Finding the Promise That Never Returns Its Seat Linux OOM Killer in Production: Why Your Node.js Containers Die Without a Stack Trace Postgres Materialized Views: Refresh Strategies That Do Not Lock Your Dashboards API Dependency Health Checks: Why /health Is Not Enough Authorization with Zanzibar Tuples: How Google Manages Permissions and How To Build the Same Check in Node.js Postgres Advisory Locks: The 20-Character Primitive That Replaces Redis for Coordination Dead Letter Queues: The Message Queue Pattern That Saves You at 2 a.m. File Descriptor Exhaustion: The Kernel Limit That Silently Drops Node.js Connections Graceful Degradation: The Pattern That Turns Total Outages into Partial Success PostgreSQL Full-Text Search: Dropping Elasticsearch for 90% of Use Cases S3 Presigned Multipart Uploads: Stop Your API Server from Being a File Upload Bottleneck MessagePack vs JSON: The Binary Serialization Switch That Cut Our Internal RPC Overhead by 40% DNS Caching in Node.js: The Silent Cause of Production Latency Spikes Reliable Cron Jobs: The Pattern That Stops Double Runs, Missed Executions, And The 2 AM Page GraphQL Query Complexity: Stop the OOM Query Before It Reaches Your Resolver Node.js Event Loop Lag: The Hidden Metric Behind Random Latency Spikes API Request Validation with Zod: The Schema That Catches Bad Input Before It Corrupts Your Database Load Shedding in Node.js: How to Reject Traffic Before You Drown Request Hedging: Cut Tail Latency In Half Without Overprovisioning Git Bisect: The Automated Binary Search That Finds Breaking Commits in Minutes Node.js Garbage Collection Tuning: Stop Letting V8 Pause Your Event Loop Node.js Server Timeouts: The Settings That Stop Slow Clients from Holding Sockets Hostage Postgres BRIN Indexes: The Time-Series Secret That Shrinks Indexes by 99% Event Sourcing with PostgreSQL: The Pragmatic 80% Solution Node.js Cluster Mode: Scaling the Event Loop Across CPU Cores Postgres Partial Indexes: Stopping Soft Deletes from Ruining Your Query Performance Request Coalescing with the Singleflight Pattern: Stop Drowning Your Database on Every Cache Miss The Bulkhead Pattern: Why One Slow Endpoint Should Not Drown Your Whole Service Node.js AsyncLocalStorage: End-to-End Request Context Without the Propagation Hell Postgres Deadlocks: Logging the Victim, Reproducing the Race, and Fixing the Lock Order Your Node.js HTTP Client Is the Bottleneck: Connection Pool Tuning That Works Optimistic Locking in Postgres: Stop Losing Data to Race Conditions Postgres Read Replicas: Stop Serving Stale Data to Your Users Cursor Pagination: Why Offset Queries Explode at Scale and How to Fix Them Node.js Worker Threads: 60 Lines That Stop a CSV Upload from Timing Out Every Other Request Reliable Webhook Delivery: Architecture for Outbound HTTP You Can Trust Request Timeouts and Deadline Propagation: Stop the Chain of Slowness Advanced Security Practices in Node.js Graceful Shutdown in Node.js: The 40 Lines That Stop 502s During Deploys Finding Node.js Memory Leaks with Heap Snapshots Idempotency Keys in 30 Lines: Stop Your Webhook From Charging Customers Twice Backpressure In Node.js: The Fix For Slow-Motion Queue Meltdowns Retries Done Right: Jitter, Budgets, and the Stampede You Did Not See Coming The Cache Stampede: Why Your "Just Add Redis" Layer Crashes Postgres at 3 a.m. Postgres SKIP LOCKED: An 80-Line Job Queue You Can Run Without Redis Stop Doing Work Nobody Wants: AbortController in Node.js, Done Right The N+1 Query Problem: We Found 23 In One Codebase And Killed Every One I Tried 5 AI Coding Tools for a Month. Here Is What I Actually Use CI/CD From Zero to Production in 30 Minutes With GitHub Actions Node.js vs Bun vs Deno: Which Runtime Should You Pick in 2025? Kubernetes Resource Requests And Limits: The Numbers That Decide If Your Cluster Is Stable The Three Pillars of Observability Are A Myth: What Actually Matters In Production pnpm Vs npm Vs yarn Vs Bun For Monorepos: Which One Earns The Migration In 2024 JSONB Indexing In Postgres: GIN Vs Expression Indexes, And When Each Is The Right Choice A Code Review Checklist That Ends The Same Three Arguments Every Sprint gRPC Vs REST In 2024: When The Switch Pays For Itself React Suspense For Data Fetching: The Pattern That Replaces Half Your Loading State Code The Five-Stage Rollout: How To Ship A Risky Change Without Holding Your Breath GitHub Actions In A Monorepo: Caching, Path Filters, And Secret Boundaries That Actually Work The Blameless Postmortem That Actually Improves Things: A Template And Six Hard-Won Rules Recursive CTEs In Postgres: How To Query A Tree Without N Round Trips Node.js Streams: When They Actually Help, And When They Just Add Complexity Playwright Vs Cypress In 2024: The Honest Comparison Of Which One Earns The Test Time React Server Components: The Mental Model That Makes The "use client" Boundary Obvious Pod Disruption Budgets: The K8s Object That Keeps Your Service Up During Cluster Maintenance Postgres LISTEN/NOTIFY: The Pub/Sub You Already Have And Are Not Using Chaos Engineering Starter Kit: The Five Drills That Don't Need Netflix-Scale Spec-Driven API Development With OpenAPI: How To Stop Drifting From Your Docs Kubernetes Autoscaling Beyond CPU: The Custom-Metric HPA Pattern That Actually Works Postgres Partitioning For Time-Series: The Boring Setup That Saves Your Database Distributed Locks With Redis: An Honest Look At Redlock And When You Don't Need It HTTP/2 vs HTTP/3: What Actually Changes For Your App, And What Doesn't Image Optimization For The Web In 2023: srcset, AVIF, And The Lighthouse Score You Actually Want Kafka vs RabbitMQ: A Decision Tree That Doesn't Hate You UUID vs Bigint Primary Keys In Postgres: The Index Math That Decides For You Flame Graphs: How To Find The Slow Function In 30 Seconds Without Profiling Theatre Postgres Streaming Vs. Logical Replication: Which One Solves Your Actual Problem ESLint Rules That Earn Their Keep: The Twelve I Enable On Every Project Pre-Commit Hooks That Pay For Themselves: Husky, lint-staged, And The Five Rules That Stick Zero-Downtime Database Migrations: The Six-Step Pattern That Rules Them All Circuit Breakers In Node.js: 50 Lines That Stop A Failing Dependency From Taking Down Your Service Postgres VACUUM Is Not Magic: How Your Hot Table Bloats To 80GB And How To Fix It Kubernetes Liveness And Readiness Probes: The Difference That Causes Half Your Outages Rate Limiting In Production: A Token Bucket In 30 Lines Of Redis The Outbox Pattern: How To Stop Losing Events When Postgres And Kafka Disagree Load Testing With k6: The Three Scenarios That Find Real Bugs (Not Synthetic Numbers) Postgres Row-Level Security For Multi-Tenant Apps: The Pattern That Stops You From Leaking Data Rebase vs. Merge: The Team Policy That Ends The Argument Forever OpenTelemetry in Node.js: Distributed Tracing That Actually Helps During an Incident Feature Flags That Pay Rent: The 4 Flag Types And When To Delete Each ETag, Last-Modified, and the Caching Headers Most APIs Get Wrong Connection Pooling Without the Cargo Cult: pgbouncer in 100 Lines of Config JSONB Is Not a Schema: When To Reach For It in Postgres, And When To Stop Bash Strict Mode: The Three Lines That Stop Your Deploy Script From Lying To You

The Practica · 2026-06-04 · via The Practical Developer

The incident report starts like this: “At 14:32 UTC, a developer ran UPDATE orders SET status = 'canceled' WHERE status = 'pending' inside a transaction block on the production database. The WHERE clause was missing. The migration script had been reviewed, approved, and merged. It had also never been run against a copy of production data.”

Twenty-seven thousand order records updated in a single query. The application noticed immediately. Support tickets flooded in within three minutes. The developer froze. The DBA was on PTO. Someone suggested restoring from the nightly pg_dump, but that backup was fourteen hours stale. All the orders placed, shipped, and delivered in those fourteen hours would be gone. The company lost two days reconstructing state from application logs, payment provider records, and email confirmations.

This is not a story about bad code reviews. It is a story about believing a backup exists without proving the restore works. Every Postgres deployment I have audited has some form of backup configured. Almost none have a tested point-in-time recovery (PITR) pipeline. The gap between “we run pg_dump every night” and “we can restore to any second within the last week” is the difference between a Friday-morning coffee spill and a Friday-afternoon data-loss postmortem.

This post walks through the full backup-and-recovery stack: automated nightly snapshots, continuous WAL archiving, PITR to any transaction, and the restore test that validates the whole thing actually works.

Why pg_dump alone is not enough

pg_dump is excellent for schema migrations, database upgrades, and seeding a development environment. It is not a production disaster-recovery strategy, for two reasons.

First, a pg_dump is a point-in-time snapshot. If you run it at 2 AM, and a rogue query runs at 2 PM, you lose twelve hours of data. The only way to reduce that window is to run dumps more frequently, and at some point the dump itself becomes a production load problem. A full dump of a 500 GB database on a busy primary is not a background task. It generates I/O, fills the WAL, and competes for memory with your application queries.

Second, pg_dump takes time to restore. Restoring a multi-hundred-gigabyte dump means replaying every INSERT, every index build, every constraint check. A database that takes four hours to dump will take six to eight hours to restore. Your SLA will not survive an eight-hour restore window.

The correct production strategy is a combination:

Physical base backups (using pg_basebackup or a tool like pgBackRest) that capture the entire database cluster at the filesystem level.
Continuous WAL archiving that streams every write-ahead log segment to durable object storage (S3, GCS, or a local archive server).
Point-in-time recovery that replays WAL segments from a base backup up to any target timestamp or transaction ID.

This combination gives you RPO (Recovery Point Objective) measured in seconds, not hours, and RTO (Recovery Time Objective) limited by how fast you can copy files and replay WAL segments.

Setting up WAL archiving

PostgreSQL writes every change to a write-ahead log before it touches the data files. Those WAL segments are the foundation of PITR. You configure the server to archive completed segments to a destination outside the cluster.

Edit postgresql.conf:

wal_level = replica                    # minimum for archiving; 'logical' for CDC
archive_mode = on
archive_command = 'aws s3 cp %p s3://your-backup-bucket/wal/%f --storage-class STANDARD_IA'
archive_timeout = 60                   # force a segment switch every 60s during idle periods

The %p placeholder is the full path of the WAL file to archive. The %f is just the filename. The archive_command must return exit code 0 for success; Postgres retries failed archives automatically, and logs warnings when a segment cannot be archived.

Set archive_timeout to 60 seconds (or lower for high-traffic databases). Without it, WAL segments are only archived when they fill up (typically 16 MB each). On an idle database, that could take hours. Your RPO cannot be better than your archiving interval.

Verify archiving is working:

SELECT * FROM pg_stat_archiver;

The last_archived_wal column should show a recent timestamp. If it shows NULL or an old value, check the Postgres logs for archive command failures.

Taking a base backup

With archiving running, take an initial base backup. The tool pg_basebackup ships with PostgreSQL and creates a consistent snapshot without locking the database for writes:

pg_basebackup \
  --host=your-primary-host \
  --port=5432 \
  --username=backup_user \
  --pgdata=/backups/base/$(date +%Y-%m-%d_%H%M%S) \
  --format=tar \
  --gzip \
  --wal-method=stream \
  --progress \
  --label="weekly_base_$(date +%Y-%m-%d)"

Key flags:

--wal-method=stream streams WAL segments alongside the backup so the snapshot is internally consistent.
--format=tar --gzip produces a single compressed tarball per tablespace.
--label lets you identify this backup later in pg_history.

On a production primary, pg_basebackup reads the data files while the database continues serving writes. It has minimal impact on query performance, but it does generate additional WAL. Schedule it during low-traffic windows.

For large databases (multiple terabytes), consider pgBackRest or WAL-G instead of pg_basebackup. These tools support incremental backups, parallel restore, and S3-native storage without intermediate files.

Here is the cron job that runs the base backup weekly and removes backups older than 30 days:

#!/bin/bash
# /usr/local/bin/postgres-backup.sh
set -euo pipefail

BACKUP_DIR="/backups/base"
RETENTION_DAYS=30
BACKUP_LABEL="weekly_base_$(date +%Y-%m-%d)"
BACKUP_PATH="${BACKUP_DIR}/${BACKUP_LABEL}"

pg_basebackup \
  --host=/var/run/postgresql \
  --username=backup_user \
  --pgdata="${BACKUP_PATH}" \
  --format=tar \
  --gzip \
  --wal-method=stream \
  --progress \
  --label="${BACKUP_LABEL}" 2>&1 | logger -t postgres-backup

# Upload to S3
aws s3 sync "${BACKUP_DIR}/" "s3://your-backup-bucket/base/" \
  --storage-class STANDARD_IA

# Clean old local backups
find "${BACKUP_DIR}" -maxdepth 1 -type d -mtime +${RETENTION_DAYS} -exec rm -rf {} \;

Point-in-time recovery step by step

When disaster strikes, this is the recovery procedure. The goal is to stand up a new Postgres instance that contains every committed transaction up to the exact moment before the bad query executed.

Step 1: Identify the target time

From the application logs or incident timeline, determine the precise timestamp of the destructive query. If the query was inside an explicit transaction, you can also use the transaction ID from pg_stat_activity (logged before the incident).

Step 2: Restore the base backup

Create a new data directory from the most recent base backup:

# On the recovery server (could be the same machine, different port)
BACKUP_DATE="2026-06-03"
tar -xzf /backups/base/weekly_base_${BACKUP_DATE}.tar.gz -C /var/lib/postgresql/16/recovery/

Step 3: Configure recovery settings

Create a recovery.signal file (Postgres 12+) in the data directory and edit postgresql.conf:

restore_command = 'aws s3 cp s3://your-backup-bucket/wal/%f %p'
recovery_target_time = '2026-06-04 14:31:59 UTC'
recovery_target_action = 'promote'

The restore_command fetches archived WAL segments from S3. Postgres replays them in order until it reaches the target time, then stops recovery and promotes the instance to a writable primary.

Step 4: Start PostgreSQL in recovery mode

pg_ctl -D /var/lib/postgresql/16/recovery/ start

Watch the logs:

tail -f /var/log/postgresql/postgresql-16-recovery.log

Postgres replays WAL segments and eventually prints:

LOG:  recovery stopping before commit of transaction at 2026-06-04 14:31:59 UTC
LOG:  recovery has paused
LOG:  recovery ended at 2026-06-04 14:32:01 UTC
LOG:  database system is ready to accept connections

Step 5: Verify the data

Connect to the recovered instance and run sanity checks:

SELECT COUNT(*) FROM orders WHERE status = 'canceled';
-- Should show only the legitimate cancellations before the incident

SELECT current_timestamp;
-- Server clock should be close to the target time, not real time

Compare row counts, recent record timestamps, and a few known reference records against your observability dashboard or read replica.

Step 6: Export and backfill

If you are recovering a subset of tables (the bad query affected only one table), export just those rows and import them into production:

pg_dump -h recovery-host -t orders \
  --data-only \
  --disable-triggers \
  --on-conflict-do-nothing \
  stale_orders.sql

psql -h production-host -d yourdb -f stale_orders.sql

If the scope is wider (schema corruption, dropped table, entire database), promote the recovery instance to become the new primary and redirect traffic.

Automating restore testing

The single most important practice: test your restore regularly. An untested backup is a prayer, not a plan.

Automate a restore test in CI or a weekly cron job. Here is a script that spins up a temporary Postgres instance, restores the latest backup, replays WAL to the end, and validates data integrity:

#!/bin/bash
# /usr/local/bin/test-restore.sh
set -euo pipefail

TEST_DIR=$(mktemp -d)
trap 'rm -rf ${TEST_DIR}' EXIT

# Download latest base backup
LATEST_BACKUP=$(aws s3 ls s3://your-backup-bucket/base/ | sort | tail -1 | awk '{print $4}')
aws s3 cp "s3://your-backup-bucket/base/${LATEST_BACKUP}" "${TEST_DIR}/base.tar.gz"
tar -xzf "${TEST_DIR}/base.tar.gz" -C "${TEST_DIR}/data"

# Configure recovery to the end of WAL
cat >> "${TEST_DIR}/data/postgresql.conf" << 'CONF'
restore_command = 'aws s3 cp s3://your-backup-bucket/wal/%f %p'
recovery_target_action = 'promote'
CONF

touch "${TEST_DIR}/data/recovery.signal"

# Start temporary instance on port 5433
pg_ctl -D "${TEST_DIR}/data" -o "-p 5433" -l "${TEST_DIR}/pg.log" start

# Wait for recovery to complete
sleep 10

# Run validation queries
psql -p 5433 -d yourdb -c "
  SELECT 'table_count' as check, count(*)::text as value FROM information_schema.tables WHERE table_schema = 'public';
  SELECT 'recent_order' as check, MAX(created_at)::text as value FROM orders;
  SELECT 'user_count' as check, count(*)::text as value FROM users;
" || {
  echo "RESTORE TEST FAILED"
  cat "${TEST_DIR}/pg.log"
  exit 1
}

# Check that recovery completed
RECOVERY_STATUS=$(psql -p 5433 -d yourdb -t -c "SELECT pg_is_in_recovery()")
if [ "${RECOVERY_STATUS}" = "t" ]; then
  echo "ERROR: Instance did not complete recovery and promote"
  exit 1
fi

echo "RESTORE TEST PASSED"
pg_ctl -D "${TEST_DIR}/data" -o "-p 5433" stop

Run this weekly and alert if it fails. The alert should be P1 priority. A failed restore test means your backup pipeline is broken, and you will not know until you need it.

Retention policy and tiered backups

Not every backup needs to be kept for the same duration. Use a tiered retention policy:

Tier	Frequency	Retention	Purpose
Daily	pg_dump	7 days	Quick single-table restores, dev seeds
Weekly	Base backup	30 days	Full cluster recovery with PITR
Monthly	Base backup	12 months	Compliance and legal holds
WAL	Continuous	matches weekly base retention	PITR within the retention window

Expired base backups make their corresponding WAL segments unrecoverable in a PITR sense. If you keep WAL segments but delete the base backup they chain to, that WAL is useless. Keep base backups and WAL segments on the same retention schedule.

Implement lifecycle policies in S3 to transition old backups to cheaper storage classes:

{
  "Rules": [
    {
      "Id": "wal-to-glacier",
      "Status": "Enabled",
      "Filter": {"Prefix": "wal/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "GLACIER_IR"}
      ],
      "Expiration": {"Days": 90}
    },
    {
      "Id": "base-to-glacier",
      "Status": "Enabled",
      "Filter": {"Prefix": "base/"},
      "Transitions": [
        {"Days": 90, "StorageClass": "GLACIER_IR"}
      ],
      "Expiration": {"Days": 365}
    }
  ]
}

What about pgBackRest and WAL-G?

For production deployments larger than a few hundred gigabytes, drop pg_basebackup and use a dedicated backup tool.

pgBackRest is the most mature option for PostgreSQL backups. It supports incremental and differential backups, parallel backup and restore (splits files into stripes), S3 and Azure blob storage natively, checksum validation on every backup, and backup retention management with archive expiration rules. It is maintained by the PostgreSQL community and used by most serious Postgres shops.

WAL-G (a fork of the earlier WAL-E) is lighter weight and tightly integrated with cloud object stores. It compresses WAL segments with LZ4, zstd, or brotli and supports per-database or per-table restore from a backup. It is the default choice for managed PostgreSQL on cloud services like Google Cloud SQL and is also the backup tool used by TimescaleDB.

Both tools follow the same architectural pattern as the manual setup above: base backup + WAL archiving + PITR. They just handle the mechanics better, with retry logic, parallel transfers, and built-in encryption.

Whichever tool you choose, the non-negotiable practice is the same: automate the restore test.

A note from Yojji

The difference between a team that survives a production database incident and one that does not often comes down to whether the backup-and-restore pipeline has been tested end-to-end before the panic sets in. Building a properly instrumented WAL archive with automated restore verification is the kind of infrastructure discipline that separates a prototype from a platform that can survive a bad migration or a midnight incident. Yojji’s teams have walked through enough production recoveries to know that the restore test is not a checkbox exercise; it is the single most important validation you will ever run. They bring that same operational rigor to the full-cycle applications they build, from cloud infrastructure to CI/CD pipelines.

Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK. Their senior engineering teams specialize in the JavaScript ecosystem, cloud-native infrastructure on AWS, Azure, and Google Cloud, and the full cycle of product delivery from discovery through DevOps.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

The Practical Developer

Why pg_dump alone is not enough

Setting up WAL archiving

Taking a base backup

Point-in-time recovery step by step

Step 1: Identify the target time

Step 2: Restore the base backup

Step 3: Configure recovery settings

Step 4: Start PostgreSQL in recovery mode

Step 5: Verify the data

Step 6: Export and backfill

Automating restore testing

Retention policy and tiered backups

What about pgBackRest and WAL-G?

A note from Yojji