惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

罗磊的独立博客
T
Tenable Blog
人人都是产品经理
人人都是产品经理
IT之家
IT之家
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
小众软件
小众软件
美团技术团队
The GitHub Blog
The GitHub Blog
Y
Y Combinator Blog
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
V
Visual Studio Blog
M
Microsoft Research Blog - Microsoft Research
aimingoo的专栏
aimingoo的专栏
P
Proofpoint News Feed
T
The Blog of Author Tim Ferriss
博客园 - 聂微东
V
V2EX
Microsoft Security Blog
Microsoft Security Blog
C
CXSECURITY Database RSS Feed - CXSecurity.com
爱范儿
爱范儿
Latest news
Latest news
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
I
InfoQ
H
Help Net Security
Google DeepMind News
Google DeepMind News
P
Privacy International News Feed
U
Unit 42
Cyberwarzone
Cyberwarzone
V
Vulnerabilities – Threatpost
F
Future of Privacy Forum
雷峰网
雷峰网
Recorded Future
Recorded Future
WordPress大学
WordPress大学
P
Privacy & Cybersecurity Law Blog
博客园 - Franky
D
Darknet – Hacking Tools, Hacker News & Cyber Security
N
Netflix TechBlog - Medium
D
Docker
博客园_首页
J
Java Code Geeks
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
Blog — PlanetScale
Blog — PlanetScale
C
CERT Recently Published Vulnerability Notes
Malwarebytes
Malwarebytes
MongoDB | Blog
MongoDB | Blog
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
Cisco Talos Blog
Cisco Talos Blog
T
Threat Research - Cisco Blogs
Know Your Adversary
Know Your Adversary
GbyAI
GbyAI

DEV Community

Git Time Machine — How Version Control Can Save Your Project My Dad Got an Electricity Bill He Couldn't Understand. Google I/O 2026 Just Made That Problem Solvable. My Dad Got an Electricity Bill He Couldn't Understand. Google I/O 2026 Just Made That Problem Solvable. Reviving My Coding Project with GitHub Copilot I Tried Gemini 3.5 Flash After Google I/O 2026 - Here is What I Found :)) Zero-Cost AI in VS Code Blueprints Might Be More Important Than Frameworks AI CareCompanion - Offline Health Assistant Long-Context Models Killed RAG. Except for the 6 Cases Where They Made It Worse. I Built a Neural Network Engine in C# That Runs in Your Browser - No ONNX Runtime, No JavaScript Bridge, No Native Binaries An In-Depth Overview of the Apache Iceberg 1.11.0 Release Your Agent Just Called the Same Tool 47 Times. Here's the 20-Line Detector. How I Built a Multi-System Astrology Bot in Python (And What Meta Banned Me For) Gemma 4 Has Four Variants. Here's How to Pick the Right One Before You Write a Single Line of Code. Log Level Strategies: Balancing Observability and Cost Why WebMCP Is the Most Important Thing Google Announced at I/O 2026 (And Nobody's Talking About It) Making LLM Calls Reliable: Retry, Semaphore, Cache, and Batch Google's 2x Energy Efficiency Claim Is Real — But Here's What They're Not Measuring What's actually going on with CORS, under the hood Language-Agnostic Code Generation: The Driver Plugin Model Why We Rewrote Our Python CLI in Go (and What We Gained) I added up everything Google gives developers for free after I/O 2026. It's kind of absurd The Dawn of Smarter Apps: My Take on Google I/O 2026 AI Announcements Why AI Agents Like Hermes Need a Semantic Execution Layer for the Physical World Why We Built TestSmith: The Test Coverage Problem Nobody Talks About How to Convert Bank Statement PDFs to Excel: The Complete 2026 Guide Have You Ever Used a Website That Keeps Working After You Turn Off Your Internet? From idea to indexed: how I launched a SaaS in 60 days with Laravel + React Building a local-first AI tutor for my daughter (and 10–14 year-olds in Austrian schools) with Gemma 4 EC2 SSH Not Connecting? Here Are the 5 Things That Were Wrong (And How I Fixed Them) Best AI Tools for HVAC Contractors 2026 From Closed Internal Stack to Open-Source Ecosystem: I Finally Shipped Three Years of .NET Infrastructure Scrumpan is offlically LIVE!! Building a BMI Calculator CLI with TypeScript — Types, Functions, and Vitest From Building WordPress Websites to Node.js APIs: My Honest Full Stack Journey XiHan Snore Coach: Privacy-First On-Device MedTech Guardian powered by Gemma 4 Mobile Why AI Coding Agents Hallucinate and How to Fix It mcp-probe v1.4.0: Contract assertions for production MCP servers Google I/O 2026 Wasn't About One More Model. It Was About the Agent Stack. How I built 100+ crypto calculators in 6 languages on Astro The Dawn of Local Multi-Agent Architectures: Why Gemma 4 Changes Everything for Cloud Developers # I Told My AI to Simulate a Planet for 10,000 Years. It Built the Whole Thing Itself. 18/30 Days System Design Questions! From Hackathon Chaos to Clean CLI: Reviving My Daily Routine Analyser with GitHub Copilot Building a Home Lab with Proxmox and Terraform (for Kubernetes) PolicyAware vs Guardrails vs AI Gateways vs Model Routers: The Comparison Every AI Engineer Needs to Read Partner: An AI That Does Research While You Sleep Rugby Fundamentals as Software Concepts - Mapping the Pitch to your Code Base I Let Claude Code Run Unsupervised for 24 Hours. Here's What Happened. Why Zed Is Replacing VS Code in My AI-Augmented Workflow Build a scroll-driven WebGL hero in 30 lines Karpathy's LLM Wiki? No Code with Claude or Github Copilot! Why Platform Governance and Transparency Matter for Developers and Freelancers I built a Flutter CLI that generates Clean Architecture in seconds Using an LLM to automate a task that used to take hours by hand CyberArena – Interactive Cyber Security Simulation & Threat Analysis Platform Tile Extractor Mathematical Functions in CSS: clamp, min, max and How They Simplify Responsiveness Polyglot Persistence in Microservices: Let the Domain Choose the Database 190 Countries, Zero API Calls: Shipping Static Data in a Chrome Extension Your AI Writes Code Fast. Here’s How to Check It Before Shipping qwen2.5-coder is too slow for Claude Code on a Mac. Here's the fix. Building Automated Text-to-Video Pipelines with AI Can Gemini Become an Offline AI Tutor? Lessons from Building Educational AI OPRIX : From a simple messaging web app to a well structured and enhanced UI messaging web app Why React + TypeScript Nullability Slowly Becomes Exhausting Why AI Agents Need a Project Layer - Part 1 Stop Hand-Editing MCP Configs: A Zero-Dependency Go CLI What I Learned Working With Microsoft, SQUAD(GTCO), and Different Tech Communities 🧠 Hermes Agent Assistant — A Modular AI Agent System with Planner, Executor & Memory Spring Boot Auto-Configuration Source Code: Nail This Interview Question The Ultimate Guide to Free AI API Keys: 6 Platforms You Need to Know Why 91% of AI Agents Fail in Production (And What the 9% Do Differently) TryHackMe | Battery | WALKTHROUGH Stop Guessing Your Regex — Test It Live in the Browser I Built FreelancEye, an Open-Source Mobile PWA for Finding Clients Beyond the Hype: My Production Playbook for Docker Swarm Top AI App Builder Platforms with Integrated Backend, Hosting & Database ECS vs EKS in 2026: An Honest Comparison from Someone Who Has Run Both in Production Hardening Your Node.js App Against Supply Chain & Remote Code Execution Attacks linux commands A Practical GEO Case: How an AI System Started Recommending Our Blog Your AI Agent Works 24/7 and Earns $0. I Built the Fix. Your AI Trading Agent Will Lose All Your Money — Here's How To Stop It Google I/O 2026: What Happens When Everything Connects? Why AI writes software but doesn’t build a good product Beyond the Hype: How Google I/O 2026 Secretly Democratized Production-Ready AI Agents with Managed Sandboxes. The Killer Assumption Test: How to Spot Doomed Product Decisions Before You Ship Stop Describing Your Bugs — Just Screenshot Them # I Built an AI Website Builder and Here's What Actually Happened Cooking an AI Campaign in 5 Minutes with Google Cloud AI APIs Your PM Retrospectives Are Lying to You How I Built a Free, Self-Hosted Pipeline That Auto-Generates Faceless YouTube Shorts TypeScript 54 to 58: The Features That Actually Matter in 2026 How to Tailor Your CV to Any Job Posting in 2026 The 7-day SaaS MVP loop: ship fast, then validate with people who actually show up 95. Fine-Tuning LLMs: Make a General Model Do Your Specific Job What Is a Frontend Developer Roadmap and Why You Need One Google shipped three Gemini "Flash" models. Picking the wrong one could 6 your AI bill Building an MCP server so Claude can query my SaaS analytics directly
Read Replicas Lie About Consistency. 4 Sync Modes Behind the Lie.
Gabriel Anha · 2026-05-24 · via DEV Community

Your read replica is lying to you, and "eventual consistency" is the lie's polite name. The user clicks Pay, the API returns 200, the next page reads the order from a replica, and the row isn't there yet. Support gets the ticket. You blame "replication lag" like it's weather.

It isn't weather. It's a config choice, and most Postgres clusters are running the default.

The lie, in one paragraph

A read-after-write that returns stale data is a contract violation. The user gave you a write, you accepted it, you confirmed it, and then you served them old data on the next request. There's no philosophical defence of that. "We're eventually consistent" is what you say to the postmortem doc when you didn't pick a stronger guarantee at config time.

Postgres gives you five settings on a single knob called synchronous_commit, and they buy completely different guarantees at completely different costs. Only one of them prevents a read replica from serving stale rows. Most teams run a global default of on and have never typed the other four values into a session.

The 4 (well, 5) sync modes, ranked by guarantee

synchronous_commit has five legal values. Ranked from "fire and forget" to "wait until the replica has the row visible to queries":

  1. off: primary returns commit before fsync to its own WAL. A primary crash inside the next few hundred milliseconds loses the transaction. Don't ship this unless you really mean it.
  2. local: primary returns commit after its own WAL fsync. Replica state is irrelevant. This is the value when no synchronous_standby_names is set.
  3. on (the default) / remote_write: primary waits for the synchronous standby to receive the WAL into its kernel. Not flushed to disk. Not replayed. A replica OS crash before flush can still lose it.
  4. remote_flush: primary waits for the synchronous standby to fsync the WAL to its disk. The replica can survive a crash and still have the record. But it hasn't been replayed yet, so a SELECT on that replica won't see the row.
  5. remote_apply: primary waits for the synchronous standby to apply the WAL, actually replaying it into the database. A SELECT on that replica will see the row immediately after the COMMIT returns.

The first four are about durability. Only the last one is about read-after-write consistency on the replica. That's the whole point of this post: if you want the replica to never serve stale data, remote_apply is the only setting that gives you that.

Everything else is theatre.

What each mode guarantees vs costs

A rough latency table from a 3-node Postgres 16 cluster in one AWS region (us-east-1, gp3, c6i.large), measured with pgbench -c 16 -j 4 -T 60 -N. Treat numbers as order-of-magnitude, not benchmarks:

Mode Survives primary crash? Survives replica crash? Replica reads after-write? p99 commit latency
off no no no ~0.4 ms
local yes no no ~1.2 ms
on / remote_write yes no no ~2.0 ms
remote_flush yes yes no ~3.5 ms
remote_apply yes yes yes ~6.0 ms

The spread between off and remote_apply is roughly 15x. Across a region (say a standby in Frankfurt and a primary in Dublin, ~25 ms RTT) the same table shifts: remote_apply becomes 50-60 ms per commit. That's not a footnote. That's a product decision.

Most teams who default to on and assume "we're durable" are running with the replica can crash and lose the record setting. That's fine for a lot of workloads. It is not fine for payment intent records or permission grants.

Synchronous standbys and the quorum trick

synchronous_commit interacts with one other setting: synchronous_standby_names. Without that one set, none of the remote_* modes do anything. There's nobody to wait for, so the primary degrades silently to local.

The naive form names a single standby:

synchronous_standby_names = 'replica-fra-1'

Enter fullscreen mode Exit fullscreen mode

The problem with naming one standby is obvious. If replica-fra-1 is down or partitioned, your primary blocks on every write until it comes back. You traded read-staleness for write-availability, and you didn't get a discount.

The fix is quorum commit. Postgres 10+ supports:

synchronous_standby_names = 'ANY 2 (replica-fra-1, replica-fra-2, replica-fra-3)'

Enter fullscreen mode Exit fullscreen mode

Read this as: "wait until any 2 of these 3 standbys acknowledge." If replica-fra-1 is down, the other two ack and the commit goes through. You stay durable across two replicas, you stay available as long as two are healthy, and you pay the latency of the second-fastest standby instead of one specific one.

There's a FIRST n (...) form too (strict priority order), but ANY n is the one to reach for. It's the cluster equivalent of a Raft quorum without the consensus protocol.

Per-transaction synchronous_commit: the feature nobody uses

Most teams miss this: synchronous_commit is a session/transaction-scoped GUC. You don't have to run the cluster on one global setting. You can pick the right guarantee per write.

Run the cluster at on (cheap default, ~2 ms commits, replicas may lag a beat). Then, for the specific writes that need read-after-write consistency, escalate inside the transaction:

# checkout flow: must be visible on read replica
# before the redirect to the confirmation page
async def confirm_payment(order_id: str, intent_id: str) -> Order:
    async with db.transaction() as tx:
        # this transaction only, not the whole connection
        await tx.execute(
            "SET LOCAL synchronous_commit = 'remote_apply'"
        )

        await tx.execute(
            """
            UPDATE orders
               SET status = 'paid',
                   stripe_intent_id = $1,
                   paid_at = NOW()
             WHERE id = $2
               AND status = 'awaiting_payment'
            """,
            intent_id,
            order_id,
        )

        # the row visible on every replica before COMMIT returns
        order = await tx.fetchrow(
            "SELECT * FROM orders WHERE id = $1",
            order_id,
        )

    # safe to redirect. confirmation page reads
    # from a replica and will see status='paid'
    return Order.from_row(order)

Enter fullscreen mode Exit fullscreen mode

SET LOCAL is the keyword that matters. It scopes the change to the current transaction. The connection goes back to the cluster default after COMMIT. You don't have to remember to reset it. You don't poison the next query on the same pooled connection.

Now the rest of your traffic (view counts, click events, "user toggled dark mode") runs at cheap on. The one write where stale reads break the product runs at remote_apply and pays 6 ms instead of 2. You bought consistency where it mattered. You didn't pay for it everywhere else.

The same trick applies the other direction. If you have a hot write path that you genuinely don't care about durability for (say a last_seen_at timestamp updated on every API call), drop it to local or off for that one statement:

BEGIN;
SET LOCAL synchronous_commit = 'off';
UPDATE users SET last_seen_at = NOW() WHERE id = $1;
COMMIT;

Enter fullscreen mode Exit fullscreen mode

The cluster keeps its strong default, and this one statement gets to be fast and slightly lossy on purpose.

The failure mode you'll hit first

A synchronous primary with no healthy standby hangs every write. This is not a bug. This is the literal definition of synchronous replication: the primary refuses to acknowledge a commit until its synchronous partner agrees, and if there's no synchronous partner, there's no agreement, so there's no acknowledgement.

You see it as:

  • New connections succeed.
  • SELECT queries succeed.
  • INSERT, UPDATE, DELETE, COMMIT block forever.
  • Your APM dashboard goes red on write p99.

Operators look at the primary and say "Postgres is up, it's healthy, what's the problem." The problem is that synchronous_standby_names = 'replica-fra-1' is set, replica-fra-1 is the only listed standby, it's down, and Postgres is doing exactly what you asked. The mitigations:

  • Use the quorum form (ANY n (...)). Survives any single standby loss.
  • Set up monitoring on pg_stat_replication.sync_state so you find out before users do.
  • For non-critical workloads, accept the on default and don't promise read-after-write consistency on a replica at all.

Don't fix it by editing synchronous_standby_names to an empty string during the incident, by the way. That degrades the cluster to async behind your back, and the runbook tomorrow won't remember why.

Verifying which mode is actually active

Two queries to know. First, what does the primary think the setting is:

SHOW synchronous_commit;
SHOW synchronous_standby_names;

Enter fullscreen mode Exit fullscreen mode

Second, what's the live replication state. Which standbys exist, which are sync, which are async, where are they in WAL:

SELECT
    application_name,
    client_addr,
    state,
    sync_state,
    sync_priority,
    pg_wal_lsn_diff(sent_lsn, write_lsn)   AS write_lag_bytes,
    pg_wal_lsn_diff(sent_lsn, flush_lsn)   AS flush_lag_bytes,
    pg_wal_lsn_diff(sent_lsn, replay_lsn)  AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
ORDER BY sync_priority;

Enter fullscreen mode Exit fullscreen mode

sync_state is the column to stare at:

  • sync: a synchronous standby. Primary waits on it.
  • quorum: part of an ANY n quorum. Primary waits on enough of the group.
  • potential: listed in synchronous_standby_names but not currently chosen as sync.
  • async: purely asynchronous. Primary never waits on it.

The three lag columns tell you where each replica is: write_lag is the gap until the standby received the WAL, flush_lag until it fsynced, replay_lag until it applied. replay_lag is the one users feel when they read from a replica.

Alert on replay_lag > 1 second on any standby you serve user reads from. That's not a "warn at 30 minutes" metric. That's a contract.

A decision tree, per write

For each write your service does, ask in order:

  1. Can a stale read on a replica break a user-visible contract? Checkout confirmation, permission grant, "your account is now verified", yes. Use remote_apply, per-transaction with SET LOCAL.
  2. Must this survive a primary crash plus a replica crash? Audit log, payment intent, idempotency key, yes. At least remote_flush. Probably also remote_apply if it's followed by a replica read.
  3. Must this survive a primary crash but a lost replica is acceptable? Most application data. on is fine.
  4. Is durability negotiable for throughput? Last-seen timestamps, view counts, ephemeral session bumps. Use local or off, per-statement with SET LOCAL.

The trap is question 1. Teams skip it because the default of on sounds durable, and they conflate "durable on primary" with "visible on replica." Those are different guarantees on different machines, separated by however much WAL is in flight.

If a checkout-flow read-after-write returns stale because your replica hadn't applied yet, no amount of "the write was durable on the primary" gets you off the hook with the user staring at an empty cart. remote_apply is the answer. Use it where it matters, leave it off everywhere else, and stop blaming the weather.


If this was useful

If picking the right durability and consistency settings per write is the kind of thing you want a full chapter on (not just Postgres, but Mongo write concerns, Cassandra tunables, Redis persistence modes, and the same trade-off shape across each), that's exactly what the Database Playbook is built around. Each store gets its own chapter on consistency knobs, what they cost, and which writes deserve the expensive setting.

Database Playbook: Choosing the Right Store for Every System You Build