Postgres VACUUM Is Not Magic: How Your Hot Table Bloats To 80GB And How To Fix It

The Practical Developer

The Libuv Thread Pool Trap: Why Node.js Async APIs Stall Under Load Postgres Covering Indexes with INCLUDE: Eliminate Heap Fetches on Read-Heavy Workloads Postgres DISTINCT ON: The Fastest Way to Get the Latest Row Per Group Postgres Transaction Isolation: The Anomalies Your App Actually Faces in Production Linux TCP Tuning for Node.js Microservices: The Kernel Settings That Stop Silent Connection Drops Under Load Postgres HOT Updates and Fillfactor: Why Not All Writes Are Created Equal Database Connection Pool Leaks: Finding the Promise That Never Returns Its Seat Linux OOM Killer in Production: Why Your Node.js Containers Die Without a Stack Trace Postgres Materialized Views: Refresh Strategies That Do Not Lock Your Dashboards API Dependency Health Checks: Why /health Is Not Enough Authorization with Zanzibar Tuples: How Google Manages Permissions and How To Build the Same Check in Node.js Postgres Advisory Locks: The 20-Character Primitive That Replaces Redis for Coordination Dead Letter Queues: The Message Queue Pattern That Saves You at 2 a.m. File Descriptor Exhaustion: The Kernel Limit That Silently Drops Node.js Connections Graceful Degradation: The Pattern That Turns Total Outages into Partial Success PostgreSQL Full-Text Search: Dropping Elasticsearch for 90% of Use Cases S3 Presigned Multipart Uploads: Stop Your API Server from Being a File Upload Bottleneck MessagePack vs JSON: The Binary Serialization Switch That Cut Our Internal RPC Overhead by 40% DNS Caching in Node.js: The Silent Cause of Production Latency Spikes Reliable Cron Jobs: The Pattern That Stops Double Runs, Missed Executions, And The 2 AM Page GraphQL Query Complexity: Stop the OOM Query Before It Reaches Your Resolver Node.js Event Loop Lag: The Hidden Metric Behind Random Latency Spikes API Request Validation with Zod: The Schema That Catches Bad Input Before It Corrupts Your Database Load Shedding in Node.js: How to Reject Traffic Before You Drown Request Hedging: Cut Tail Latency In Half Without Overprovisioning Git Bisect: The Automated Binary Search That Finds Breaking Commits in Minutes Node.js Garbage Collection Tuning: Stop Letting V8 Pause Your Event Loop Node.js Server Timeouts: The Settings That Stop Slow Clients from Holding Sockets Hostage Postgres BRIN Indexes: The Time-Series Secret That Shrinks Indexes by 99% Event Sourcing with PostgreSQL: The Pragmatic 80% Solution Node.js Cluster Mode: Scaling the Event Loop Across CPU Cores Postgres Partial Indexes: Stopping Soft Deletes from Ruining Your Query Performance Request Coalescing with the Singleflight Pattern: Stop Drowning Your Database on Every Cache Miss The Bulkhead Pattern: Why One Slow Endpoint Should Not Drown Your Whole Service Node.js AsyncLocalStorage: End-to-End Request Context Without the Propagation Hell Postgres Deadlocks: Logging the Victim, Reproducing the Race, and Fixing the Lock Order Your Node.js HTTP Client Is the Bottleneck: Connection Pool Tuning That Works Optimistic Locking in Postgres: Stop Losing Data to Race Conditions Postgres Read Replicas: Stop Serving Stale Data to Your Users Cursor Pagination: Why Offset Queries Explode at Scale and How to Fix Them Node.js Worker Threads: 60 Lines That Stop a CSV Upload from Timing Out Every Other Request Reliable Webhook Delivery: Architecture for Outbound HTTP You Can Trust Request Timeouts and Deadline Propagation: Stop the Chain of Slowness Advanced Security Practices in Node.js Graceful Shutdown in Node.js: The 40 Lines That Stop 502s During Deploys Finding Node.js Memory Leaks with Heap Snapshots Idempotency Keys in 30 Lines: Stop Your Webhook From Charging Customers Twice Backpressure In Node.js: The Fix For Slow-Motion Queue Meltdowns Retries Done Right: Jitter, Budgets, and the Stampede You Did Not See Coming The Cache Stampede: Why Your "Just Add Redis" Layer Crashes Postgres at 3 a.m. Postgres SKIP LOCKED: An 80-Line Job Queue You Can Run Without Redis Stop Doing Work Nobody Wants: AbortController in Node.js, Done Right The N+1 Query Problem: We Found 23 In One Codebase And Killed Every One I Tried 5 AI Coding Tools for a Month. Here Is What I Actually Use CI/CD From Zero to Production in 30 Minutes With GitHub Actions Node.js vs Bun vs Deno: Which Runtime Should You Pick in 2025? Kubernetes Resource Requests And Limits: The Numbers That Decide If Your Cluster Is Stable The Three Pillars of Observability Are A Myth: What Actually Matters In Production pnpm Vs npm Vs yarn Vs Bun For Monorepos: Which One Earns The Migration In 2024 JSONB Indexing In Postgres: GIN Vs Expression Indexes, And When Each Is The Right Choice A Code Review Checklist That Ends The Same Three Arguments Every Sprint gRPC Vs REST In 2024: When The Switch Pays For Itself React Suspense For Data Fetching: The Pattern That Replaces Half Your Loading State Code The Five-Stage Rollout: How To Ship A Risky Change Without Holding Your Breath GitHub Actions In A Monorepo: Caching, Path Filters, And Secret Boundaries That Actually Work The Blameless Postmortem That Actually Improves Things: A Template And Six Hard-Won Rules Recursive CTEs In Postgres: How To Query A Tree Without N Round Trips Node.js Streams: When They Actually Help, And When They Just Add Complexity Playwright Vs Cypress In 2024: The Honest Comparison Of Which One Earns The Test Time React Server Components: The Mental Model That Makes The "use client" Boundary Obvious Pod Disruption Budgets: The K8s Object That Keeps Your Service Up During Cluster Maintenance Postgres LISTEN/NOTIFY: The Pub/Sub You Already Have And Are Not Using Chaos Engineering Starter Kit: The Five Drills That Don't Need Netflix-Scale Spec-Driven API Development With OpenAPI: How To Stop Drifting From Your Docs Saga Pattern vs Two-Phase Commit: Distributed Transactions Without The Lies Kubernetes Autoscaling Beyond CPU: The Custom-Metric HPA Pattern That Actually Works Postgres Partitioning For Time-Series: The Boring Setup That Saves Your Database Distributed Locks With Redis: An Honest Look At Redlock And When You Don't Need It HTTP/2 vs HTTP/3: What Actually Changes For Your App, And What Doesn't Image Optimization For The Web In 2023: srcset, AVIF, And The Lighthouse Score You Actually Want Kafka vs RabbitMQ: A Decision Tree That Doesn't Hate You UUID vs Bigint Primary Keys In Postgres: The Index Math That Decides For You Flame Graphs: How To Find The Slow Function In 30 Seconds Without Profiling Theatre Postgres Streaming Vs. Logical Replication: Which One Solves Your Actual Problem ESLint Rules That Earn Their Keep: The Twelve I Enable On Every Project Pre-Commit Hooks That Pay For Themselves: Husky, lint-staged, And The Five Rules That Stick Zero-Downtime Database Migrations: The Six-Step Pattern That Rules Them All Circuit Breakers In Node.js: 50 Lines That Stop A Failing Dependency From Taking Down Your Service Kubernetes Liveness And Readiness Probes: The Difference That Causes Half Your Outages Rate Limiting In Production: A Token Bucket In 30 Lines Of Redis The Outbox Pattern: How To Stop Losing Events When Postgres And Kafka Disagree Load Testing With k6: The Three Scenarios That Find Real Bugs (Not Synthetic Numbers) Postgres Row-Level Security For Multi-Tenant Apps: The Pattern That Stops You From Leaking Data Rebase vs. Merge: The Team Policy That Ends The Argument Forever OpenTelemetry in Node.js: Distributed Tracing That Actually Helps During an Incident Feature Flags That Pay Rent: The 4 Flag Types And When To Delete Each ETag, Last-Modified, and the Caching Headers Most APIs Get Wrong Connection Pooling Without the Cargo Cult: pgbouncer in 100 Lines of Config JSONB Is Not a Schema: When To Reach For It in Postgres, And When To Stop Bash Strict Mode: The Three Lines That Stop Your Deploy Script From Lying To You

The Practica · 2023-02-03 · via The Practical Developer

The events table has 4 GB of live rows. The on-disk size is 80 GB. Queries that used to take 50 ms now take 4 seconds. Somebody runs VACUUM events;, no change. Somebody runs VACUUM FULL events;, table is locked for 20 minutes and shrinks to 4 GB. Two weeks later, it is back at 60 GB.

This is bloat, and it is the single most misunderstood operational issue in Postgres. Autovacuum is on by default, autovacuum is running according to the logs, and the table is bloating anyway. The reason is almost always the same: a long-running transaction is preventing VACUUM from freeing dead row versions, and dead rows are accumulating faster than VACUUM can keep up. This post is what bloat actually is, the queries that diagnose it, and the four interventions in order of cost.

What VACUUM does, briefly

Postgres uses MVCC: when you UPDATE or DELETE a row, the old version stays on disk, marked invisible to new transactions but still readable by older ones that need it. New row versions are written elsewhere on the page. Over time, dead row versions pile up. VACUUM scans the table, identifies dead row versions whose xmin (transaction ID) is older than the oldest active transaction, and marks their disk space as reusable.

Two things to internalize:

VACUUM does not return disk space to the OS. It marks pages as having free space inside them; future inserts/updates can reuse that space. The on-disk file size shrinks only with VACUUM FULL (locks the table, rewrites it).
VACUUM cannot remove a dead row whose xmin is greater than the oldest running transaction’s xmin. A long-running transaction holds the floor and freezes VACUUM’s effective horizon.

That second point is where 95% of bloat issues come from.

The query that shows the problem

SELECT
  pid,
  age(backend_xmin) AS xmin_age,
  state,
  query,
  now() - xact_start AS xact_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 5;

xmin_age is the number of transactions that have elapsed since the connection’s snapshot was taken. Anything over a few hundred thousand is concerning. Over a few million is the smoking gun.

The candidates for “long-running transaction blocking VACUUM”:

Idle transactions (state = 'idle in transaction'). The connection started a transaction, did one query, and is now sitting idle. Common cause: an app forgot to commit, or a debugger paused execution.
Long analytics queries. A 30-minute reporting query is fine in isolation, but it pins the VACUUM horizon for everyone.
Replication slots with stale confirmed_flush_lsn. A replication consumer that has stopped reading WAL holds back VACUUM via the xmin of its slot.
Prepared transactions (two-phase commit) that nobody finished.

Run the query above on any bloated database and 90% of the time you find the culprit on the first row.

The query that shows the bloat

SELECT
  schemaname || '.' || relname AS table,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
  n_dead_tup,
  n_live_tup,
  round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
  last_autovacuum,
  last_vacuum
FROM pg_stat_all_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY n_dead_tup DESC
LIMIT 20;

dead_pct over 20% on a hot table is bloat. Over 50% is severe. The last_autovacuum and last_vacuum columns tell you when Postgres last cleaned the table. If it has been hours and the dead count is high, something is preventing autovacuum from working.

For a sharper bloat estimate (which accounts for index bloat too), the pgstattuple extension is more accurate but more expensive:

CREATE EXTENSION pgstattuple;
SELECT * FROM pgstattuple('events');

This reads every page, so for a huge table it is slow; run it during off hours.

The four interventions, cheapest to most disruptive

1. Kill the long-running transaction. If pg_stat_activity shows an idle-in-transaction connection holding back VACUUM, terminate it:

SELECT pg_terminate_backend(pid) FROM pg_stat_activity
 WHERE state = 'idle in transaction'
   AND xact_start < now() - interval '10 minutes';

Cost: zero, except killing whatever the app was in the middle of. Set idle_in_transaction_session_timeout = '5min' in postgresql.conf so future connections cannot pin VACUUM by accident. This single change prevents most bloat incidents.

2. Tune autovacuum for hot tables. Default thresholds are too lax for tables with millions of writes per day:

ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.05,    -- vacuum at 5% dead, default 20%
  autovacuum_vacuum_threshold = 1000,       -- minimum 1000 dead rows
  autovacuum_analyze_scale_factor = 0.02
);

The defaults assume small tables. A 100M-row table at 20% bloat is 20M dead rows; that is a lot. For hot tables, drop the scale factor.

Also worth raising autovacuum’s parallelism:

# postgresql.conf
autovacuum_max_workers = 6
autovacuum_vacuum_cost_limit = 2000  # less throttling
maintenance_work_mem = 1GB           # more memory per VACUUM

The cost-limit and work-mem changes let VACUUM run faster at the price of more I/O. On modern SSDs the trade is usually worth it.

3. VACUUM (VERBOSE, ANALYZE) on demand. When you have caught the table mid-bloat, manually run:

VACUUM (VERBOSE, ANALYZE) events;

This compacts dead row space without rewriting the table. On-disk size does not shrink, but future inserts reuse the space, so the table stops growing.

For specific row classes, VACUUM (FREEZE) early-freezes old rows so VACUUM has less work later, which is useful for append-mostly tables.

4. pg_repack to reclaim disk. When you actually need the disk space back without VACUUM FULL’s exclusive lock, pg_repack rewrites the table online:

pg_repack -h db.example.com -d appdb -t events

It builds a new copy concurrently, syncs writes via triggers, and swaps the new table in atomically. Takes a long time on huge tables but the production impact is small. The cost: temporarily 2× the disk space (during rebuild).

VACUUM FULL does the same job but holds an ACCESS EXCLUSIVE lock; no one can read or write the table during the operation. Use it only for tables small enough that ~1 minute of downtime is acceptable.

Index bloat, separately

VACUUM cleans tables. Indexes have their own bloat. When rows are updated, the index entries pointing to old row versions become dead and need cleaning. VACUUM does some of this; on heavily-updated indexes, it does not keep up.

-- Detect index bloat
SELECT
  schemaname || '.' || tablename AS table,
  indexrelname AS index,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan
FROM pg_stat_user_indexes
WHERE pg_relation_size(indexrelid) > 1000000000  -- > 1GB
ORDER BY pg_relation_size(indexrelid) DESC;

Fix:

REINDEX INDEX CONCURRENTLY events_user_id_idx;

CONCURRENTLY rebuilds without taking a write lock. Available since Postgres 12.

Replication slots: the silent bloat cause

A logical replication slot whose consumer has fallen behind keeps xmin pinned. Check for slots:

SELECT slot_name, active, age(xmin) AS xmin_age,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;

If a slot has active = false and xmin_age in the millions, the consumer has been gone for a long time. Either reconnect the consumer or drop the slot:

SELECT pg_drop_replication_slot('orphaned_slot');

Dropping a slot that another system thinks it owns will mean re-bootstrapping that consumer, so make sure you know what you are dropping.

Monitoring you should set up

Three alerts catch bloat issues before they become outages:

-- 1. Alert if any backend xmin is older than 1M transactions.
SELECT max(age(backend_xmin)) FROM pg_stat_activity WHERE backend_xmin IS NOT NULL;
-- threshold: > 1000000

-- 2. Alert on dead-tuple ratio per hot table.
SELECT relname, n_dead_tup::float / NULLIF(n_live_tup, 0) AS ratio
FROM pg_stat_user_tables
WHERE n_live_tup > 100000;
-- threshold: > 0.3

-- 3. Alert on table growth rate beyond expected.
-- (Compare pg_relation_size now vs 24h ago for the same table.)

Pipe them through Prometheus / DataDog / your dashboard. The alert cost is about 10 lines of config per check.

What `idle_in_transaction_session_timeout` is worth

If you set one parameter as a result of reading this post, set this:

idle_in_transaction_session_timeout = '5min'

In postgresql.conf. Reload. This kills any transaction that has been idle for 5 minutes. The downside is application bugs may now manifest as terminated transactions; the upside is you cannot accidentally pin VACUUM for hours because somebody hit a debugger breakpoint.

Pair with statement_timeout (kills statements running longer than the threshold) and lock_timeout (kills queries waiting too long for a lock). Three timeouts; one-line each; eliminate a class of incidents.

The takeaway

Postgres bloat is not “VACUUM is broken.” It is “VACUUM is being prevented from working by something else.” The diagnosis is two pg_stat_activity and pg_stat_all_tables queries. The four fixes (kill the long transaction, tune autovacuum thresholds, manual VACUUM, pg_repack) are progressive in cost. idle_in_transaction_session_timeout is the single setting that prevents 80% of the cases.

Most “the database is slow” incidents that turn out to be bloat-related are one query and one config change away from being closed. The next time you see the on-disk table size diverge from the live row count, run the queries above before you reach for VACUUM FULL.

A note from Yojji

The kind of database operations work that keeps a Postgres instance healthy at year five (autovacuum tuning, slot monitoring, the boring details that decide whether disk usage stays linear or explodes) is the kind of long-haul backend engineering Yojji’s teams have shipped across hundreds of client projects.

Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK. They specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, GCP), and the kind of database and infrastructure work that decides whether a product feels good a year later or grinds slowly to a halt.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

The Practical Developer

What VACUUM does, briefly

The query that shows the problem

The query that shows the bloat

The four interventions, cheapest to most disruptive

Index bloat, separately

Replication slots: the silent bloat cause

Monitoring you should set up

What idle_in_transaction_session_timeout is worth

The takeaway

A note from Yojji

What `idle_in_transaction_session_timeout` is worth