Postgres Read Replicas: Stop Serving Stale Data to Your Users

The Practical Developer

The Libuv Thread Pool Trap: Why Node.js Async APIs Stall Under Load Postgres Covering Indexes with INCLUDE: Eliminate Heap Fetches on Read-Heavy Workloads Postgres DISTINCT ON: The Fastest Way to Get the Latest Row Per Group Postgres Transaction Isolation: The Anomalies Your App Actually Faces in Production Linux TCP Tuning for Node.js Microservices: The Kernel Settings That Stop Silent Connection Drops Under Load Postgres HOT Updates and Fillfactor: Why Not All Writes Are Created Equal Database Connection Pool Leaks: Finding the Promise That Never Returns Its Seat Linux OOM Killer in Production: Why Your Node.js Containers Die Without a Stack Trace Postgres Materialized Views: Refresh Strategies That Do Not Lock Your Dashboards API Dependency Health Checks: Why /health Is Not Enough Authorization with Zanzibar Tuples: How Google Manages Permissions and How To Build the Same Check in Node.js Postgres Advisory Locks: The 20-Character Primitive That Replaces Redis for Coordination Dead Letter Queues: The Message Queue Pattern That Saves You at 2 a.m. File Descriptor Exhaustion: The Kernel Limit That Silently Drops Node.js Connections Graceful Degradation: The Pattern That Turns Total Outages into Partial Success PostgreSQL Full-Text Search: Dropping Elasticsearch for 90% of Use Cases S3 Presigned Multipart Uploads: Stop Your API Server from Being a File Upload Bottleneck MessagePack vs JSON: The Binary Serialization Switch That Cut Our Internal RPC Overhead by 40% DNS Caching in Node.js: The Silent Cause of Production Latency Spikes Reliable Cron Jobs: The Pattern That Stops Double Runs, Missed Executions, And The 2 AM Page GraphQL Query Complexity: Stop the OOM Query Before It Reaches Your Resolver Node.js Event Loop Lag: The Hidden Metric Behind Random Latency Spikes API Request Validation with Zod: The Schema That Catches Bad Input Before It Corrupts Your Database Load Shedding in Node.js: How to Reject Traffic Before You Drown Request Hedging: Cut Tail Latency In Half Without Overprovisioning Git Bisect: The Automated Binary Search That Finds Breaking Commits in Minutes Node.js Garbage Collection Tuning: Stop Letting V8 Pause Your Event Loop Node.js Server Timeouts: The Settings That Stop Slow Clients from Holding Sockets Hostage Postgres BRIN Indexes: The Time-Series Secret That Shrinks Indexes by 99% Event Sourcing with PostgreSQL: The Pragmatic 80% Solution Node.js Cluster Mode: Scaling the Event Loop Across CPU Cores Postgres Partial Indexes: Stopping Soft Deletes from Ruining Your Query Performance Request Coalescing with the Singleflight Pattern: Stop Drowning Your Database on Every Cache Miss The Bulkhead Pattern: Why One Slow Endpoint Should Not Drown Your Whole Service Node.js AsyncLocalStorage: End-to-End Request Context Without the Propagation Hell Postgres Deadlocks: Logging the Victim, Reproducing the Race, and Fixing the Lock Order Your Node.js HTTP Client Is the Bottleneck: Connection Pool Tuning That Works Optimistic Locking in Postgres: Stop Losing Data to Race Conditions Cursor Pagination: Why Offset Queries Explode at Scale and How to Fix Them Node.js Worker Threads: 60 Lines That Stop a CSV Upload from Timing Out Every Other Request Reliable Webhook Delivery: Architecture for Outbound HTTP You Can Trust Request Timeouts and Deadline Propagation: Stop the Chain of Slowness Advanced Security Practices in Node.js Graceful Shutdown in Node.js: The 40 Lines That Stop 502s During Deploys Finding Node.js Memory Leaks with Heap Snapshots Idempotency Keys in 30 Lines: Stop Your Webhook From Charging Customers Twice Backpressure In Node.js: The Fix For Slow-Motion Queue Meltdowns Retries Done Right: Jitter, Budgets, and the Stampede You Did Not See Coming The Cache Stampede: Why Your "Just Add Redis" Layer Crashes Postgres at 3 a.m. Postgres SKIP LOCKED: An 80-Line Job Queue You Can Run Without Redis Stop Doing Work Nobody Wants: AbortController in Node.js, Done Right The N+1 Query Problem: We Found 23 In One Codebase And Killed Every One I Tried 5 AI Coding Tools for a Month. Here Is What I Actually Use CI/CD From Zero to Production in 30 Minutes With GitHub Actions Node.js vs Bun vs Deno: Which Runtime Should You Pick in 2025? Kubernetes Resource Requests And Limits: The Numbers That Decide If Your Cluster Is Stable The Three Pillars of Observability Are A Myth: What Actually Matters In Production pnpm Vs npm Vs yarn Vs Bun For Monorepos: Which One Earns The Migration In 2024 JSONB Indexing In Postgres: GIN Vs Expression Indexes, And When Each Is The Right Choice A Code Review Checklist That Ends The Same Three Arguments Every Sprint gRPC Vs REST In 2024: When The Switch Pays For Itself React Suspense For Data Fetching: The Pattern That Replaces Half Your Loading State Code The Five-Stage Rollout: How To Ship A Risky Change Without Holding Your Breath GitHub Actions In A Monorepo: Caching, Path Filters, And Secret Boundaries That Actually Work The Blameless Postmortem That Actually Improves Things: A Template And Six Hard-Won Rules Recursive CTEs In Postgres: How To Query A Tree Without N Round Trips Node.js Streams: When They Actually Help, And When They Just Add Complexity Playwright Vs Cypress In 2024: The Honest Comparison Of Which One Earns The Test Time React Server Components: The Mental Model That Makes The "use client" Boundary Obvious Pod Disruption Budgets: The K8s Object That Keeps Your Service Up During Cluster Maintenance Postgres LISTEN/NOTIFY: The Pub/Sub You Already Have And Are Not Using Chaos Engineering Starter Kit: The Five Drills That Don't Need Netflix-Scale Spec-Driven API Development With OpenAPI: How To Stop Drifting From Your Docs Saga Pattern vs Two-Phase Commit: Distributed Transactions Without The Lies Kubernetes Autoscaling Beyond CPU: The Custom-Metric HPA Pattern That Actually Works Postgres Partitioning For Time-Series: The Boring Setup That Saves Your Database Distributed Locks With Redis: An Honest Look At Redlock And When You Don't Need It HTTP/2 vs HTTP/3: What Actually Changes For Your App, And What Doesn't Image Optimization For The Web In 2023: srcset, AVIF, And The Lighthouse Score You Actually Want Kafka vs RabbitMQ: A Decision Tree That Doesn't Hate You UUID vs Bigint Primary Keys In Postgres: The Index Math That Decides For You Flame Graphs: How To Find The Slow Function In 30 Seconds Without Profiling Theatre Postgres Streaming Vs. Logical Replication: Which One Solves Your Actual Problem ESLint Rules That Earn Their Keep: The Twelve I Enable On Every Project Pre-Commit Hooks That Pay For Themselves: Husky, lint-staged, And The Five Rules That Stick Zero-Downtime Database Migrations: The Six-Step Pattern That Rules Them All Circuit Breakers In Node.js: 50 Lines That Stop A Failing Dependency From Taking Down Your Service Postgres VACUUM Is Not Magic: How Your Hot Table Bloats To 80GB And How To Fix It Kubernetes Liveness And Readiness Probes: The Difference That Causes Half Your Outages Rate Limiting In Production: A Token Bucket In 30 Lines Of Redis The Outbox Pattern: How To Stop Losing Events When Postgres And Kafka Disagree Load Testing With k6: The Three Scenarios That Find Real Bugs (Not Synthetic Numbers) Postgres Row-Level Security For Multi-Tenant Apps: The Pattern That Stops You From Leaking Data Rebase vs. Merge: The Team Policy That Ends The Argument Forever OpenTelemetry in Node.js: Distributed Tracing That Actually Helps During an Incident Feature Flags That Pay Rent: The 4 Flag Types And When To Delete Each ETag, Last-Modified, and the Caching Headers Most APIs Get Wrong Connection Pooling Without the Cargo Cult: pgbouncer in 100 Lines of Config JSONB Is Not a Schema: When To Reach For It in Postgres, And When To Stop Bash Strict Mode: The Three Lines That Stop Your Deploy Script From Lying To You

The Practica · 2026-05-13 · via The Practical Developer

Your user clicks Create Project. The API writes to the primary Postgres, returns 201 Created, and the frontend redirects to /projects/42. The project detail endpoint reads from the replica. It gets back zero rows. The UI renders a 404 page. The user refreshes twice. On the third refresh, the project appears. They open a support ticket that reads “your app is broken.”

This is the read replica tax. You added replicas to take load off the primary, a good decision. Then you pointed SELECT queries at them, also correct. What most teams miss is that replication is asynchronous. Even under normal load, a read replica can lag by 1–50 ms. Under load, batch replication, or vacuum activity, that lag can stretch to hundreds of milliseconds or seconds. The user does not care about your architecture diagram. They care that their data is gone.

This post shows the routing pattern that keeps the performance win without the stale-data bugs. It is three pieces: measuring lag, request-scoped write tracking, and a fallback gate. All of it is boring infrastructure code that pays rent every day.

The naive split and why it fails

The first implementation is almost always a connection pool split:

// pools.js
import pg from 'pg';
export const primaryPool = new pg.Pool({ connectionString: process.env.PRIMARY_URL });
export const replicaPool = new pg.Pool({ connectionString: process.env.REPLICA_URL });

// usage
export async function getProject(id) {
  const { rows } = await replicaPool.query('SELECT * FROM projects WHERE id = $1', [id]);
  return rows[0];
}

export async function createProject(data) {
  const { rows } = await primaryPool.query(
    'INSERT INTO projects (name) VALUES ($1) RETURNING *',
    [data.name]
  );
  return rows[0];
}

This works in unit tests because latency is zero. In production it fails whenever:

A write is followed by a read in the same request (create-then-fetch, update-then-list).
A background job enqueues work that reads its own write.
A webhook handler writes state, then a provider immediately queries for it.
Replication lag spikes because of a long-running transaction on the primary or a replica vacuum.

The bug is not intermittent. It is deterministic under the right timing, which means it will show up in demos, user onboarding flows, and integration tests running in CI with real infrastructure.

Measure before you route

You cannot fix what you do not measure. Postgres exposes replication lag in two useful ways. On the replica:

SELECT
  now() - pg_last_xact_replay_timestamp() AS lag;

This gives you wall-clock lag, but pg_last_xact_replay_timestamp() returns null if no transaction has replayed since the replica started. A more robust metric, if you have pg_stat_replication access on the primary, is:

-- run on the primary
SELECT
  client_addr,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;

For application-layer routing, the simplest approach is a heartbeat table. Create it on the primary:

CREATE TABLE replication_heartbeat (
  id INT PRIMARY KEY,
  tick TIMESTAMPTZ NOT NULL DEFAULT now()
);
INSERT INTO replication_heartbeat (id) VALUES (1)
ON CONFLICT (id) DO NOTHING;

Update it every few seconds from a background process:

setInterval(async () => {
  await primaryPool.query(
    'UPDATE replication_heartbeat SET tick = now() WHERE id = 1'
  );
}, 5000);

Query it from the replica and compare:

async function getReplicaLagMs() {
  const primary = await primaryPool.query(
    'SELECT tick FROM replication_heartbeat WHERE id = 1'
  );
  const replica = await replicaPool.query(
    'SELECT tick FROM replication_heartbeat WHERE id = 1'
  );
  if (!primary.rows[0] || !replica.rows[0]) return Infinity;
  const lag = primary.rows[0].tick - replica.rows[0].tick;
  return lag; // milliseconds in pg
}

If lag exceeds a threshold (say 100 ms) you stop routing to the replica until it catches up. This is your circuit breaker. It is coarse, but it prevents the worst-case scenario where a lagging replica serves minutes-old data during a failover or heavy batch load.

Request-scoped routing: the write tracker

The real fix is tracking writes inside a single request and routing subsequent reads to the primary for that request only. This gives you read-after-write consistency without abandoning replicas entirely.

In Express, use AsyncLocalStorage to carry a “writes seen” flag:

import { AsyncLocalStorage } from 'node:async_hooks';

const requestStorage = new AsyncLocalStorage();

export function withRequestContext(handler) {
  return (req, res, next) => {
    const store = { wroteToPrimary: false };
    requestStorage.run(store, () => handler(req, res, next));
  };
}

export function markWrite() {
  const store = requestStorage.getStore();
  if (store) store.wroteToPrimary = true;
}

export function shouldUsePrimary() {
  const store = requestStorage.getStore();
  return store ? store.wroteToPrimary : false;
}

Wrap your route handlers:

import express from 'express';
import { withRequestContext, markWrite, shouldUsePrimary } from './context.js';

const app = express();
app.use(withRequestContext);

Update your data layer to mark writes:

export async function createProject(data) {
  markWrite();
  const { rows } = await primaryPool.query(
    'INSERT INTO projects (name) VALUES ($1) RETURNING *',
    [data.name]
  );
  return rows[0];
}

And choose the pool dynamically:

export function getPool() {
  return shouldUsePrimary() ? primaryPool : replicaPool;
}

export async function getProject(id) {
  const pool = getPool();
  const { rows } = await pool.query('SELECT * FROM projects WHERE id = $1', [id]);
  return rows[0];
}

Now the flow works: POST /projects calls createProject, which sets wroteToPrimary = true. The redirect to GET /projects/42 calls getProject, sees the flag, and reads from the primary. The project is there. Every other request on the system that has not written continues to hit the replica.

This is not session stickiness. Session stickiness pins a user to one database for their entire session, which defeats the purpose of replicas if the user is write-heavy. Request-scoped routing is narrower: it lasts for one HTTP request only.

The lag fallback

Combine the write tracker with a lag gate. Even if a request has not written, you may want to avoid the replica when it is unhealthy:

let cachedLagMs = 0;
let lagCheckedAt = 0;

async function getPool() {
  const store = requestStorage.getStore();
  if (store?.wroteToPrimary) return primaryPool;

  // Cache lag for 1s to avoid hammering the heartbeat query
  if (Date.now() - lagCheckedAt > 1000) {
    cachedLagMs = await getReplicaLagMs();
    lagCheckedAt = Date.now();
  }

  if (cachedLagMs > 100) {
    return primaryPool; // fallback
  }

  return replicaPool;
}

Pick a threshold that matches your product requirements. A 100 ms lag gate is safe for most web apps. A 5 ms gate is appropriate for high-frequency trading. A 5 second gate is appropriate only if your users already expect eventual consistency.

Expanding the write tracker

Not every read-after-write pattern is inside one request. Background jobs often write, then read. For those, pass an explicit consistency hint:

export async function getProject(id, { consistent = false } = {}) {
  const pool = consistent || shouldUsePrimary() ? primaryPool : replicaPool;
  const { rows } = await pool.query('SELECT * FROM projects WHERE id = $1', [id]);
  return rows[0];
}

Use it in the job:

const project = await createProject(data);
const full = await getProject(project.id, { consistent: true });
await enqueueIndexJob(full);

Another expansion: mark writes for the whole request if the user is an admin or if the endpoint is known to be write-heavy. The mechanism is the same: set the flag early, route everything to the primary for that request.

Testing the stale-data path

A test suite that only mocks the database will never catch replication lag bugs. You need an integration test that talks to a real primary and replica. In CI, use Docker Compose:

services:
  postgres_primary:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
    volumes:
      - ./primary-init.sql:/docker-entrypoint-initdb.d/init.sql

  postgres_replica:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
    depends_on:
      - postgres_primary

Set up streaming replication in the init scripts, then write a test that deliberately introduces lag. The simplest way is to pause replication:

-- on the replica
SELECT pg_wal_replay_pause();

Then run your create-and-fetch flow. Assert that without the write tracker, the fetch returns null. Assert that with the write tracker, it returns the record from the primary. Then resume:

SELECT pg_wal_replay_resume();

If you cannot pause replication in your test environment, simulate lag by adding a pg_sleep(0.1) in the replica query path and running the fetch immediately after the write. It is not identical, but it exercises the routing logic.

Monitoring that matters

Add three metrics:

Query routing split: count of reads on primary vs replica. A healthy system routes 80–95% of reads to replicas. If that drops below 50%, either your write tracker is too aggressive or your replicas are constantly lagged.
Replication lag histogram: track the lag distribution, not just the average. P99 lag is what breaks user flows. Alert when P99 exceeds your threshold.
Fallback rate: how often the lag gate forces a read to the primary. A spike here is a leading indicator of replica trouble.

// pseudo-metric with any client
replicaLagHistogram.observe(lagMs);
readRoutingCounter.inc({ destination: shouldUsePrimary() ? 'primary' : 'replica' });

Do not alert on single-request 404s. Those are symptoms. Alert on lag and fallback rate; fix the root cause before users notice.

The full middleware

Here is the minimal Express integration in one block:

import { AsyncLocalStorage } from 'node:async_hooks';
import pg from 'pg';

const requestStorage = new AsyncLocalStorage();

const primaryPool = new pg.Pool({ connectionString: process.env.PRIMARY_URL });
const replicaPool = new pg.Pool({ connectionString: process.env.REPLICA_URL });

export function dbMiddleware(req, res, next) {
  requestStorage.run({ wroteToPrimary: false }, next);
}

export function markWrite() {
  const store = requestStorage.getStore();
  if (store) store.wroteToPrimary = true;
}

export async function query(sql, params) {
  const store = requestStorage.getStore();
  const pool = store?.wroteToPrimary ? primaryPool : replicaPool;
  return pool.query(sql, params);
}

// In routes:
// app.use(dbMiddleware);
// app.post('/projects', async (req, res) => { markWrite(); ... });
// app.get('/projects/:id', async (req, res) => { const r = await query(...); ... });

This is fewer than forty lines. The hard part is not the code. It is accepting that replication lag is not a bug you can eliminate; it is a property of the system you must design around.

When to not use replicas at all

Read replicas are not free. They add network hops, operational complexity, failover logic, and the exact class of bug this post describes. If your primary is under 50% CPU and your working set fits in RAM, you may not need replicas yet. A single large Postgres instance with good indexes and query tuning will outperform a poorly routed replica setup every time.

Add replicas when:

Read CPU on the primary is consistently above 70%.
You have long analytical queries that would stall OLTP traffic.
You need a hot standby for failover.

Do not add them because a blog post told you to scale horizontally. Horizontal scaling is a liability until you actually need it.

The takeaway

The pattern is: write to primary, track writes per request, read from primary for the rest of that request, read from replica otherwise, and fall back to primary when lag is high. That gives you almost all the performance of replicas without the stale-data bugs.

Build the write tracker before you need it. The incident that forces you to build it under pressure will happen on a Friday, during a product demo, and the CEO will be the one who sees the 404. Build it on a Tuesday instead.

A note from Yojji

Database infrastructure that actually works under load (replication, routing, failover, and the monitoring that tells you when it is lying) is the kind of backend work that looks invisible when it is done right and existential when it is not.

Yojji is an international custom software development company founded in 2016, with offices across Europe, the US, and the UK. Their teams build custom web applications and scalable backend systems with the same focus on operational correctness, query performance, and infrastructure reliability that keeps read replicas useful instead of dangerous.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

The Practical Developer

The naive split and why it fails

Measure before you route

Request-scoped routing: the write tracker

The lag fallback

Expanding the write tracker

Testing the stale-data path

Monitoring that matters

The full middleware

When to not use replicas at all

The takeaway

A note from Yojji