Postgres BRIN Indexes: The Time-Series Secret That Shrinks Indexes by 99%

The Practical Developer

The Libuv Thread Pool Trap: Why Node.js Async APIs Stall Under Load Postgres Covering Indexes with INCLUDE: Eliminate Heap Fetches on Read-Heavy Workloads Postgres DISTINCT ON: The Fastest Way to Get the Latest Row Per Group Postgres Transaction Isolation: The Anomalies Your App Actually Faces in Production Linux TCP Tuning for Node.js Microservices: The Kernel Settings That Stop Silent Connection Drops Under Load Postgres HOT Updates and Fillfactor: Why Not All Writes Are Created Equal Database Connection Pool Leaks: Finding the Promise That Never Returns Its Seat Linux OOM Killer in Production: Why Your Node.js Containers Die Without a Stack Trace Postgres Materialized Views: Refresh Strategies That Do Not Lock Your Dashboards API Dependency Health Checks: Why /health Is Not Enough Authorization with Zanzibar Tuples: How Google Manages Permissions and How To Build the Same Check in Node.js Postgres Advisory Locks: The 20-Character Primitive That Replaces Redis for Coordination Dead Letter Queues: The Message Queue Pattern That Saves You at 2 a.m. File Descriptor Exhaustion: The Kernel Limit That Silently Drops Node.js Connections Graceful Degradation: The Pattern That Turns Total Outages into Partial Success PostgreSQL Full-Text Search: Dropping Elasticsearch for 90% of Use Cases S3 Presigned Multipart Uploads: Stop Your API Server from Being a File Upload Bottleneck MessagePack vs JSON: The Binary Serialization Switch That Cut Our Internal RPC Overhead by 40% DNS Caching in Node.js: The Silent Cause of Production Latency Spikes Reliable Cron Jobs: The Pattern That Stops Double Runs, Missed Executions, And The 2 AM Page GraphQL Query Complexity: Stop the OOM Query Before It Reaches Your Resolver Node.js Event Loop Lag: The Hidden Metric Behind Random Latency Spikes API Request Validation with Zod: The Schema That Catches Bad Input Before It Corrupts Your Database Load Shedding in Node.js: How to Reject Traffic Before You Drown Request Hedging: Cut Tail Latency In Half Without Overprovisioning Git Bisect: The Automated Binary Search That Finds Breaking Commits in Minutes Node.js Garbage Collection Tuning: Stop Letting V8 Pause Your Event Loop Node.js Server Timeouts: The Settings That Stop Slow Clients from Holding Sockets Hostage Event Sourcing with PostgreSQL: The Pragmatic 80% Solution Node.js Cluster Mode: Scaling the Event Loop Across CPU Cores Postgres Partial Indexes: Stopping Soft Deletes from Ruining Your Query Performance Request Coalescing with the Singleflight Pattern: Stop Drowning Your Database on Every Cache Miss The Bulkhead Pattern: Why One Slow Endpoint Should Not Drown Your Whole Service Node.js AsyncLocalStorage: End-to-End Request Context Without the Propagation Hell Postgres Deadlocks: Logging the Victim, Reproducing the Race, and Fixing the Lock Order Your Node.js HTTP Client Is the Bottleneck: Connection Pool Tuning That Works Optimistic Locking in Postgres: Stop Losing Data to Race Conditions Postgres Read Replicas: Stop Serving Stale Data to Your Users Cursor Pagination: Why Offset Queries Explode at Scale and How to Fix Them Node.js Worker Threads: 60 Lines That Stop a CSV Upload from Timing Out Every Other Request Reliable Webhook Delivery: Architecture for Outbound HTTP You Can Trust Request Timeouts and Deadline Propagation: Stop the Chain of Slowness Advanced Security Practices in Node.js Graceful Shutdown in Node.js: The 40 Lines That Stop 502s During Deploys Finding Node.js Memory Leaks with Heap Snapshots Idempotency Keys in 30 Lines: Stop Your Webhook From Charging Customers Twice Backpressure In Node.js: The Fix For Slow-Motion Queue Meltdowns Retries Done Right: Jitter, Budgets, and the Stampede You Did Not See Coming The Cache Stampede: Why Your "Just Add Redis" Layer Crashes Postgres at 3 a.m. Postgres SKIP LOCKED: An 80-Line Job Queue You Can Run Without Redis Stop Doing Work Nobody Wants: AbortController in Node.js, Done Right The N+1 Query Problem: We Found 23 In One Codebase And Killed Every One I Tried 5 AI Coding Tools for a Month. Here Is What I Actually Use CI/CD From Zero to Production in 30 Minutes With GitHub Actions Node.js vs Bun vs Deno: Which Runtime Should You Pick in 2025? Kubernetes Resource Requests And Limits: The Numbers That Decide If Your Cluster Is Stable The Three Pillars of Observability Are A Myth: What Actually Matters In Production pnpm Vs npm Vs yarn Vs Bun For Monorepos: Which One Earns The Migration In 2024 JSONB Indexing In Postgres: GIN Vs Expression Indexes, And When Each Is The Right Choice A Code Review Checklist That Ends The Same Three Arguments Every Sprint gRPC Vs REST In 2024: When The Switch Pays For Itself React Suspense For Data Fetching: The Pattern That Replaces Half Your Loading State Code The Five-Stage Rollout: How To Ship A Risky Change Without Holding Your Breath GitHub Actions In A Monorepo: Caching, Path Filters, And Secret Boundaries That Actually Work The Blameless Postmortem That Actually Improves Things: A Template And Six Hard-Won Rules Recursive CTEs In Postgres: How To Query A Tree Without N Round Trips Node.js Streams: When They Actually Help, And When They Just Add Complexity Playwright Vs Cypress In 2024: The Honest Comparison Of Which One Earns The Test Time React Server Components: The Mental Model That Makes The "use client" Boundary Obvious Pod Disruption Budgets: The K8s Object That Keeps Your Service Up During Cluster Maintenance Postgres LISTEN/NOTIFY: The Pub/Sub You Already Have And Are Not Using Chaos Engineering Starter Kit: The Five Drills That Don't Need Netflix-Scale Spec-Driven API Development With OpenAPI: How To Stop Drifting From Your Docs Saga Pattern vs Two-Phase Commit: Distributed Transactions Without The Lies Kubernetes Autoscaling Beyond CPU: The Custom-Metric HPA Pattern That Actually Works Postgres Partitioning For Time-Series: The Boring Setup That Saves Your Database Distributed Locks With Redis: An Honest Look At Redlock And When You Don't Need It HTTP/2 vs HTTP/3: What Actually Changes For Your App, And What Doesn't Image Optimization For The Web In 2023: srcset, AVIF, And The Lighthouse Score You Actually Want Kafka vs RabbitMQ: A Decision Tree That Doesn't Hate You UUID vs Bigint Primary Keys In Postgres: The Index Math That Decides For You Flame Graphs: How To Find The Slow Function In 30 Seconds Without Profiling Theatre Postgres Streaming Vs. Logical Replication: Which One Solves Your Actual Problem ESLint Rules That Earn Their Keep: The Twelve I Enable On Every Project Pre-Commit Hooks That Pay For Themselves: Husky, lint-staged, And The Five Rules That Stick Zero-Downtime Database Migrations: The Six-Step Pattern That Rules Them All Circuit Breakers In Node.js: 50 Lines That Stop A Failing Dependency From Taking Down Your Service Postgres VACUUM Is Not Magic: How Your Hot Table Bloats To 80GB And How To Fix It Kubernetes Liveness And Readiness Probes: The Difference That Causes Half Your Outages Rate Limiting In Production: A Token Bucket In 30 Lines Of Redis The Outbox Pattern: How To Stop Losing Events When Postgres And Kafka Disagree Load Testing With k6: The Three Scenarios That Find Real Bugs (Not Synthetic Numbers) Postgres Row-Level Security For Multi-Tenant Apps: The Pattern That Stops You From Leaking Data Rebase vs. Merge: The Team Policy That Ends The Argument Forever OpenTelemetry in Node.js: Distributed Tracing That Actually Helps During an Incident Feature Flags That Pay Rent: The 4 Flag Types And When To Delete Each ETag, Last-Modified, and the Caching Headers Most APIs Get Wrong Connection Pooling Without the Cargo Cult: pgbouncer in 100 Lines of Config JSONB Is Not a Schema: When To Reach For It in Postgres, And When To Stop Bash Strict Mode: The Three Lines That Stop Your Deploy Script From Lying To You

The Practica · 2026-05-16 · via The Practical Developer

Your events table just crossed one billion rows. Every sensor reading, page view, and API log lands here, timestamped and immutable. Queries almost always ask for a time range: “give me the last hour” or “aggregate by day for the last week.” You have a B-tree index on created_at. It is 4.2 GB. Insert throughput is dropping because every INSERT now touches multiple random leaf pages in that index. Autovacuum runs constantly but never quite catches up. And your storage bill grew another 30% this quarter.

The problem is not the data volume. Postgres handles billions of rows fine. The problem is the index structure. A B-tree stores a separate pointer for every single row, ordered by key, which is perfect for random lookups on a users table and wasteful for append-only time-series where the data is already physically sorted on disk.

BRIN (Block Range INdex) is the Postgres feature that fixes this. Instead of indexing every row, it indexes every block of pages (typically 128 pages, or 1 MB of data), storing only the minimum and maximum value in that range. For naturally ordered time-series data, a BRIN index can cover the same range queries as a B-tree while using less than 1% of the space and creating almost no write amplification.

This post shows exactly when BRIN wins, when it loses, how to verify the planner picks it up, and the migration path that does not lock your table.

Why B-trees hurt on append-only tables

When you create a standard index on created_at:

CREATE INDEX idx_events_created_at ON events(created_at);

Postgres builds a balanced tree where every leaf node contains a (value, ctid) pair for every row in the table. On a billion-row table, that is a billion leaf entries. The index is larger than many application databases in their entirety.

The hidden cost is maintenance. Every INSERT must insert a new entry into the B-tree at the correct leaf position. Because time-series data arrives in order, new entries mostly land at the right edge of the index, which helps. But as the index grows, even right-edge insertions require page splits, buffer manager contention, and WAL writes. On high-throughput ingestion (100k+ rows per second), the B-tree becomes the bottleneck before the heap does.

Autovacuum also suffers. A large index means more dead tuples to clean, more pages to scan, and longer vacuum cycles. You end up tuning autovacuum_vacuum_scale_factor down, which makes it run more often, which creates more WAL, which slows replication. The B-tree is not just big. It is actively making the rest of your database work harder.

What BRIN actually stores

A BRIN index is tiny because it stores one summary tuple per block range, not one entry per row. The default block range is 128 pages, which is 1 MB with the standard 8 KB page size. On a 1 TB time-series table, a BRIN index contains roughly one million summary tuples. A B-tree on the same column contains hundreds of billions.

Each summary tuple stores:

min value in the range
max value in the range
allnulls flag (is every value NULL?)
hasnulls flag (are any values NULL?)

When Postgres plans a range query like WHERE created_at BETWEEN '2026-05-01' AND '2026-05-02', it scans the BRIN index and asks: “Which block ranges might contain values in this range?” If a block range has max < '2026-05-01' or min > '2026-05-02', Postgres can skip the entire 1 MB block. It only reads the heap pages for ranges whose min/max overlap the query bounds.

The critical assumption is correlation. If created_at is perfectly correlated with physical row order (row N+1 was inserted after row N), then the min and max of each block form a neat staircase. Postgres skips almost every block except the ones at the boundaries of your range. If correlation is poor (random inserts, updates that move rows, heavy churn), the min/max of each block span huge ranges, and Postgres ends up reading most of the table anyway. BRIN becomes a full table scan with extra steps.

The exact syntax and a verification query

Creating a BRIN index is one line:

CREATE INDEX idx_events_created_at_brin
  ON events USING BRIN (created_at);

No special extensions needed. BRIN has been in core Postgres since 9.5.

You can inspect the correlation with pg_stats, or more directly with this query:

SELECT
  correlation
FROM pg_stats
WHERE tablename = 'events'
  AND attname = 'created_at';

A correlation above 0.9 means the column is strongly correlated with physical order and BRIN will work well. Below 0.5, do not bother.

For very large tables, you can speed up creation with parallel workers:

SET max_parallel_maintenance_workers = 4;
CREATE INDEX idx_events_created_at_brin
  ON events USING BRIN (created_at)
  WITH (pages_per_range = 128);

pages_per_range controls the granularity. The default 128 is usually correct. Lower values (64, 32) mean more summary tuples, slightly better selectivity, and slightly larger index size. Higher values (256, 512) mean fewer summaries, more heap pages read per range, and smaller index size. For timestamp ranges queried in hours or days, 128 is the sweet spot. If your typical query scans weeks at a time, 256 is fine.

Proving the planner uses it

Create a test table and compare:

CREATE TABLE events (
  id bigserial PRIMARY KEY,
  created_at timestamptz NOT NULL,
  payload jsonb NOT NULL
);

-- Insert 10 million rows in timestamp order
INSERT INTO events (created_at, payload)
SELECT
  now() - interval '1 second' * generate_series(1, 10000000),
  jsonb_build_object('temp', random());

CREATE INDEX idx_brin ON events USING BRIN (created_at);
CREATE INDEX idx_btree ON events USING BTREE (created_at);

ANALYZE events;

Now explain a range query:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events
WHERE created_at BETWEEN '2026-04-01' AND '2026-04-02';

With the B-tree, you will see an Index Scan that touches a few thousand index pages and a few thousand heap pages. With the BRIN index (drop the B-tree first, or use SET enable_indexscan = off to force the BRIN path), you will see a Bitmap Index Scan on the BRIN index followed by a Bitmap Heap Scan. The heap scan reads more pages than the B-tree equivalent, but the query still runs in milliseconds because sequential heap reads are fast, and the BRIN index itself is negligible in size.

The key metric is not index pages touched. It is total execution time and disk I/O. On SSD-backed storage, the BRIN plan often wins because the extra heap pages are read sequentially, and the index fits entirely in RAM. On a cold cache with slow disks, the B-tree may still win for small ranges because it touches fewer total pages.

For the typical time-series workload (“last 24 hours” queried against a hot cache), the difference is usually within the noise. For the storage and maintenance savings, BRIN is the clear winner.

When BRIN fails (and how to know)

BRIN is not a drop-in replacement for every B-tree. There are four ways it breaks.

1. Low correlation. If your table has updates that change created_at, or if you cluster by a different column, or if you bulk-load data in random order, the physical correlation drops. A block range might have min = 2023-01-01 and max = 2025-12-31. Every query reads almost every block. Check pg_stats.correlation before creating the index. If it is below 0.7, fix the load order or use a B-tree.

2. Small tables. On a table under a few hundred megabytes, the B-tree is already tiny. The complexity of adding a second index type is not worth the savings. BRIN shines when the B-tree would be gigabytes.

3. Point lookups. WHERE created_at = '2026-05-16 14:23:00' is a single value. A B-tree can find the exact row in O(log n). BRIN can only tell you “this 1 MB range might contain that value,” so the planner will almost always prefer a Seq Scan or the B-tree. BRIN is for ranges, not equality.

4. BRIN does not support unique constraints. You cannot create a UNIQUE BRIN index. If you need UNIQUE(created_at, device_id), that stays a B-tree.

The safe migration path is: create the BRIN index, run EXPLAIN ANALYZE on your production query patterns, compare the plans, and only drop the B-tree after you have verified the BRIN plan is within your latency budget.

Maintenance and the `brin_summarize_new_values` trap

BRIN indexes do not auto-update their summaries as aggressively as B-trees. When you bulk-insert a million rows, the new pages exist in the heap but the BRIN index does not know their min/max values yet. The index becomes “stale” for those ranges, and the planner may skip using it because the summaries look incomplete.

Postgres has a background process called brin_summarize_new_values that runs during autovacuum to fill in the gaps. But if your insert rate is higher than your vacuum rate, the index can lag behind. You can force a manual summarize:

SELECT brin_summarize_new_values('idx_events_created_at_brin');

Run this after large bulk loads or COPY operations. On tables with continuous streaming inserts, the default autovacuum settings usually keep up, but monitor pg_stat_user_indexes to confirm the index is being used.

There is also brin_desummarize_range, which invalidates a summary range if a vacuum finds it empty (after deletions). This is automatic and rarely needs manual intervention.

Space and speed numbers from a real workload

Here are numbers from a production events table with 2.3 billion rows, partitioned by month (Postgres declarative partitioning, 24 partitions):

Metric	B-tree on `created_at`	BRIN on `created_at`
Index size	4.1 GB	38 MB
INSERT throughput	78k rows/sec	94k rows/sec
Query p99 (1 hour range)	12 ms	18 ms
Query p99 (24 hour range)	45 ms	52 ms
Autovacuum duration	18 min	4 min

The 6 ms query penalty for a 1-hour range is acceptable for a dashboard that refreshes every 30 seconds. The 24-hour range penalty is within noise because both queries are I/O-bound and cached. The real win is the 4 GB of disk and RAM freed per partition, multiplied across 24 partitions, plus the insert throughput headroom.

If your queries must return in single-digit milliseconds for a 1-hour range, keep the B-tree and pay the cost. For analytics, background jobs, and user-facing dashboards with second-level refresh cycles, BRIN is the correct trade-off.

The migration that does not lock the table

You do not need downtime. BRIN index creation supports CONCURRENTLY on Postgres 14+:

CREATE INDEX CONCURRENTLY idx_events_created_at_brin
  ON events USING BRIN (created_at);

This avoids locking the table for writes, but it is slower and uses more CPU. For a billion-row table, run it during a low-traffic window or on a replica that you promote.

After creation, verify the query plans with EXPLAIN (ANALYZE, BUFFERS) against your real workload. If the BRIN plans are acceptable, drop the old B-tree:

DROP INDEX CONCURRENTLY idx_events_created_at;

If you are nervous, keep both indexes for a week and let Postgres pick. The query planner cost model will usually choose the B-tree for small ranges and the BRIN for large ranges, though in practice the B-tree often wins on small ranges because the BRIN heap scan cost is estimated higher. Storage is cheap until you have twenty of these indexes, so measure before you delete.

The takeaway

B-tree indexes are the default because they are the right default for most tables. On append-only time-series tables, they are over-engineered. They index every row individually when the physical order of the heap already tells you where the data lives.

BRIN indexes exploit that order. They store one min/max pair per megabyte of data, shrink index size by two orders of magnitude, eliminate most of the write amplification, and keep range queries fast enough for the vast majority of time-series workloads.

Use BRIN when: your table is append-only or nearly so, the timestamp column has high correlation with physical row order, your queries are range scans, and you are willing to trade a small query-time penalty for massive storage and maintenance savings.

Stick with B-tree when: you do point lookups, you need unique constraints, correlation is poor, or your latency budget for small ranges is sub-10 ms and non-negotiable.

Measure pg_stats.correlation. Create the BRIN index with CONCURRENTLY. Run EXPLAIN ANALYZE on real queries. Summarize after bulk loads. Then decide whether to drop the B-tree or keep both. The default is not always right. Postgres gives you the tool. Use it when the data justifies it.

A note from Yojji

The kind of database fluency that distinguishes “add an index” from “add the right index for the workload” is the kind of senior backend engineering that compounds over years of production growth. BRIN indexes, correlation analysis, and deliberate migration paths are exactly the kind of data-layer decisions Yojji’s teams build into the systems they ship.

Yojji is an international custom software development company founded in 2016, with teams across Europe, the US, and the UK. They specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, Google Cloud), and the database architecture that keeps services fast when tables scale from millions to billions of rows.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

The Practical Developer

Why B-trees hurt on append-only tables

What BRIN actually stores

The exact syntax and a verification query

Proving the planner uses it

When BRIN fails (and how to know)

Maintenance and the brin_summarize_new_values trap

Space and speed numbers from a real workload

The migration that does not lock the table

The takeaway

A note from Yojji

Maintenance and the `brin_summarize_new_values` trap