惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

H
Heimdal Security Blog
小众软件
小众软件
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
罗磊的独立博客
Google DeepMind News
Google DeepMind News
大猫的无限游戏
大猫的无限游戏
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
Hugging Face - Blog
Hugging Face - Blog
阮一峰的网络日志
阮一峰的网络日志
A
About on SuperTechFans
宝玉的分享
宝玉的分享
博客园 - 聂微东
月光博客
月光博客
Cyberwarzone
Cyberwarzone
Microsoft Security Blog
Microsoft Security Blog
V
Visual Studio Blog
Project Zero
Project Zero
T
Tor Project blog
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
L
LINUX DO - 最新话题
博客园 - 叶小钗
Recent Commits to openclaw:main
Recent Commits to openclaw:main
Attack and Defense Labs
Attack and Defense Labs
Spread Privacy
Spread Privacy
Forbes - Security
Forbes - Security
Simon Willison's Weblog
Simon Willison's Weblog
N
Netflix TechBlog - Medium
P
Proofpoint News Feed
Engineering at Meta
Engineering at Meta
Hacker News: Ask HN
Hacker News: Ask HN
I
InfoQ
M
MIT News - Artificial intelligence
AI
AI
博客园 - 三生石上(FineUI控件)
W
WeLiveSecurity
C
Check Point Blog
The Hacker News
The Hacker News
C
Cyber Attacks, Cyber Crime and Cyber Security
Application and Cybersecurity Blog
Application and Cybersecurity Blog
T
Tenable Blog
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
The Cloudflare Blog
Blog — PlanetScale
Blog — PlanetScale
美团技术团队
D
Darknet – Hacking Tools, Hacker News & Cyber Security
GbyAI
GbyAI
Hacker News - Newest:
Hacker News - Newest: "LLM"
腾讯CDC
K
Kaspersky official blog

Blog — PlanetScale

Keeping a Postgres queue healthy — PlanetScale Patterns for Postgres Traffic Control — PlanetScale Graceful degradation in Postgres — PlanetScale High memory usage in Postgres is good, actually — PlanetScale Stripe Projects partnership: Provision PlanetScale Postgres and MySQL databases from the Stripe CLI — PlanetScale Enhanced tagging in Postgres Query Insights — PlanetScale Behind the scenes: How Database Traffic Control works — PlanetScale Introducing Database Traffic Control — PlanetScale Scaling Postgres connections with PgBouncer — PlanetScale Drizzle joins PlanetScale — PlanetScale Video Conferencing with Postgres — PlanetScale Faster PlanetScale Postgres connections with Cloudflare Hyperdrive — PlanetScale Introducing the PlanetScale MCP server — PlanetScale Database Transactions — PlanetScale Automating our changelog with Cursor commands — PlanetScale Postgres 18 is now available — PlanetScale Using MotherDuck with PlanetScale — PlanetScale $50 PlanetScale Metal is GA for Postgres — PlanetScale AI-Powered Postgres index suggestions — PlanetScale $5 PlanetScale is live — PlanetScale Announcing Vitess 23 — PlanetScale $50 PlanetScale Metal — PlanetScale Report on our investigation of the 2025-10-20 incident in AWS us-east-1 — PlanetScale $5 PlanetScale — PlanetScale Benchmarking Postgres 17 vs 18 — PlanetScale Larger than RAM Vector Indexes for Relational Databases — PlanetScale Partnering with Cloudflare to bring you the fastest globally distributed applications — PlanetScale Processes and Threads — PlanetScale PlanetScale for Postgres is now GA — PlanetScale Postgres High Availability with CDC — PlanetScale Announcing Neki — PlanetScale Caching — PlanetScale The principles of extreme fault tolerance — PlanetScale Announcing PlanetScale for Postgres — PlanetScale Benchmarking Postgres — PlanetScale Announcing Vitess 22 — PlanetScale The Real Failure Rate of EBS — PlanetScale IO devices and latency — PlanetScale Announcing PlanetScale Metal — PlanetScale PlanetScale Metal: There’s no replacement for displacement — PlanetScale Upgrading Query Insights to Metal — PlanetScale Automating cherry-picks between OSS and private forks — PlanetScale Database Sharding — PlanetScale Anatomy of a Throttler, part 3 — PlanetScale Introducing sharding on PlanetScale with workflows — PlanetScale Announcing Vitess 21 — PlanetScale Announcing the PlanetScale vectors public beta — PlanetScale Anatomy of a Throttler, part 2 — PlanetScale Instant deploy requests — PlanetScale Anatomy of a Throttler, part 1 — PlanetScale Increase IOPS and throughput with sharding — PlanetScale Tracking index usage with Insights — PlanetScale Faster backups with sharding — PlanetScale Building data pipelines with Vitess — PlanetScale The State of Online Schema Migrations in MySQL — PlanetScale Optimizing aggregation in the Vitess query planner — PlanetScale Dealing with large tables — PlanetScale Announcing Vitess 20 — PlanetScale Self-managed Vitess vs Managed Vitess with PlanetScale — PlanetScale Achieving data consistency with the consistent lookup Vindex — PlanetScale The MySQL adaptive hash index — PlanetScale Introducing global replica credentials — PlanetScale Profiling memory usage in MySQL — PlanetScale Summer 2023: Fuzzing Vitess at PlanetScale — PlanetScale How PlanetScale makes schema changes — PlanetScale Identifying and profiling problematic MySQL queries — PlanetScale The Problem with Using a UUID Primary Key in MySQL — PlanetScale Announcing Vitess 19 — PlanetScale PlanetScale forever — PlanetScale Introducing schema recommendations — PlanetScale Amazon Aurora Pricing: The many surprising costs of running an Aurora database — PlanetScale Three common MySQL database design mistakes — PlanetScale OAuth applications are now available to everyone — PlanetScale Deprecating the Scaler plan — PlanetScale PlanetScale branching vs. Amazon Aurora blue/green deployments — PlanetScale Databases at scale — PlanetScale Considerations for building a database disaster recovery plan — PlanetScale Working with Geospatial Features in MySQL — PlanetScale PlanetScale vs Amazon Aurora replication — PlanetScale Introducing the Vantage and PlanetScale integration — PlanetScale MySQL isolation levels and how they work — PlanetScale Introducing the schemadiff command line tool — PlanetScale $ pscale ping — PlanetScale Announcing foreign key constraints support — PlanetScale The challenges of supporting foreign key constraints — PlanetScale What is HTAP? — PlanetScale Introducing Insights Anomalies — PlanetScale Webhook security: a hands-on guide — PlanetScale MySQL replication: Best practices and considerations — PlanetScale A guide to HTML email with Ruby on Rails and Tailwind CSS — PlanetScale Sharding for cost-effective database management — PlanetScale PlanetScale ranks 188th in Deloitte’s top 500 fastest-growing companies — PlanetScale Announcing the Fivetran integration — PlanetScale Introducing webhooks — PlanetScale What is MySQL replication and when should you use it? — PlanetScale Sync user data between Clerk and a PlanetScale MySQL database — PlanetScale Introducing database reports — PlanetScale Distributed caching systems and MySQL — PlanetScale What is MySQL partitioning? — PlanetScale MySQL High Availability: Connection handling and concurrency — PlanetScale
MySQL semi-sync replication: durability consistency and split brains — PlanetScale
Shlomi Noach · 2020-10-02 · via Blog — PlanetScale

Shlomi Noach |

MySQL semi-sync is a plugin mechanism on top of asynchronous replication, that can offer better durability and even consistency (term defined later). It helps in high availability solutions, but can in itself reduce availability. We look at some basics and follow up to present scenarios that require higher level intervention to ensure availability and to avoid split brains from taking place.

I recommend reading this semi-sync blog post by Jean-François Gagné (aka JFG), which illustrates the internals of the semi-sync implementation, and debunks some myths about semi-sync. We will overlap a bit with another recommended post by JFG, about high availability and recovery.

Note: in this post we adopt the term “primary” over the term “master” in the context of MySQL replication. However, at this time there is no alternative to using the actual names of some configuration and status variables that use “master” terminology, and some duality is present.

Overview

As quick recap, semi-synchronous replication is a mechanism where a commit on the primary does not apply the change onto internal table data and does not respond to the user, until the changelog is guaranteed to have been persisted (though not necessarily applied) on a preconfigured number of replicas. We limit our discussion to MySQL 5.7 or equivalent.

Specifically, the primary is configured with rpl_semi_sync_master_wait_for_slave_count, which is 1 or above, if semi-sync is to be used. An INSERT transaction, for example, will not report being committed (i.e. the user will not get “OK”) until at least that number of replicas have acknowledged receipt of that transaction’s changelog (entries in the binary log). In practice, this means at least that number of replicas have written the changelog onto their relay logs where the change is now persisted. The replicas will apply the change later on, normally as soon as they can. For the scope of this post, we assume replicas are not intentionally stopped, and we can also ignore delayed replicas.

The primary waits up to rpl_semi_sync_master_timeout, after which it falls back to asynchronous replication mode, committing and responding to the user even if not all expected replicas have acknowledged receipt. In the scope of this post, we are interested in an “infinite” timeout, which we will accept to be a very large number.

The only replicas to acknowledge receipt of the changelog are replicas where semi-sync is enabled. There could be other, non semi-sync replicas in our topology. They, too, pull the changelog from the primary, but do not acknowledge receipt. As JFG points out in his post, the semi-sync replicas aren’t necessarily the most up to date. It’s possible that a non-semi-sync replica has pulled more changelog data than some or all semi-sync replicas.

Durability

The most straightforward use case for semi-sync is to guarantee durability of the data. We only approve a commit once the transaction (in the form of a changelog) is persisted elsewhere, on a different server, or on multiple servers.

As mentioned earlier, this does not mean the changelog has been applied on those servers. Replication may be lagging and the servers may be too busy to be able to apply the changelog in a timely fashion. But, at the very least, if the primary crashes, there’s “forensic evidence” in the relay logs of a semi-sync replica that can tell us what the latest changes were.

Consistency

The term “consistency” is overloaded, and especially in distributed systems. The CAP Theorem definition of consistency is, for example, very strict: once data was written on one server, any immediate follow-up read on any other server must reflect that write (or any later write).

We will suffice with eventual consistency. This in itself an overloaded term, so let’s clarify: in the case of a primary outage, we consider our system consistent if we’re able to promote a new primary within any amount of time (obviously in practice we expect the time to be short) such that it is consistent with the previous primary’s advertised state. I write “advertised” state because the end user or applications should not be surprised by any data changes when the new primary is promoted, regardless of how transactions/replication work internally.

Split brain

A split brain is a situation where there are two servers which are simultaneously taking writes, each thinking they’re the (single) primary. In a split brain scenario the data diverges between the two servers. If those servers also have replicas, then we have two divergent replication trees. Fixing and merging those divergent trees is difficult, and normally we revert the changes on one to make it look like the other (whether by backup+restore or by some rollback method).

Reiterating JFG’s observation that while engineers normally frown upon split brains, the damage of split brain may well be within a product’s allowance.

Topologies and scenarios

As always, getting best durability as possible, best consistency as possible, best response times as possible, best split brain mitigation as possible, is costly. Product owners and engineers make reasonable tradeoffs.

Let’s consider a few configured topologies, and see what scenarios we can expect upon failure. In particular, how durable our data is, can we expect it to be consistent, and can we expect split brains.

General scenario: wait for single replica, multiple semi-sync replicas

In this common scenario our primary is configured with rpl_semi_sync_master_wait_for_slave_count=1, and there are multiple (more than one) replicas configured with semi-sync. Let’s say there are 4 semi-sync replicas, named R1, R2, R3, and R4. We nickname this setup “1-n”

In this setup, a write takes place on the primary, and before approving the write to the user/app, the primary waits for an acknowledgement from exactly one replica. It doesn’t matter which. For some writes, it will be replica R1; at some other time, R4 is the first to acknowledge.

It’s important to note that the changelog, the binary log, is sequential. A replica that acknowledges some changelog event, has necessarily received all of its prior events. This is why the primary doesn’t need to care that different acknowledgements come from different replicas. An acknowledgement means there’s a replica with full durability of committed transactions up to and including the one in question.

Should the primary fail, we have at least one replica guaranteed to have all approved commits in its relay logs. These are not necessarily applied, but if our replication cluster is normally low-lagged, we can expect those commits to be few, and we can expect the replica to catch up (apply those commits) within a few seconds or less.

A possible situation is that there may be non-semi-sync replicas even more up to date than our semi-sync replica. We can choose to use them. The fact they have more transactions than the most up-to-date semi sync replica means they have received binary log events not acknowledged by a semi-sync replica. While this may seem confusing at first, these are fine to accept, and do not contradict our promise to our users. We promised that “if the primary tells you a commit is successful, then the data is durable elsewhere”. We never said anything about the state of the data before telling the user the commit is successful. In fact, this scenario is no different than the primary actually getting a commit acknowledged by the replica, begins to communicate that to the user, then crashing halfway. The user cannot distinguish between the two cases at the time of failure (though post crash analysis may indicate which of the two was true).

And so we can choose to use the data from that even-more-up-to-date non-semi-sync replica, seed the rest of the replicas with that data, or, we can choose to throw away any server that is more-up-to-date than our most-up-to-date semi-sync replica and recycle them. It’s our choice and it’s down to operational considerations (time, capacity, load, ...).

If data is durable, does that mean it’s available?

Not necessarily. Consider this possible, even likely scenario: the primary and R1 are both in the same data center. R2, R3, R4 are in different availability zones, possibly remote, and so have higher latency than R1 communicating to the primary.

The app issues an UPDATE on the primary. R1 acknowledges the write. R2, R3, R4 do not, as yet. But the primary does not care, it commits and reports success to the app. Then the primary and R1 both get network isolated together because the DC they’re in has lost network. The loss of network cut short the delivery of the change log to R2, R3, R4. They never got it.

Should we promote either R2, R3, or R4, we lose consistency. The app expects the result of that UPDATE to be there, but neither of these servers have the data. Moreover, even human intervention is limited. Since the data is only found on primary and R1, we cannot get hold of the data. A human can always transport to the physical server to extract the data and send it over via cellular network, or even by car. It’s likely that this will take substantial time.

A predefined failover plan may choose to promote, say, R2, after all, losing some data, but cutting short on outage time. Reconciliation will have to take place afterwards.

Do we even know the status?

If the primary and one of the semi-sync replicas go offline, we have an additional problem: we can’t say for sure whether any of the remaining semi-sync replicas got the latest updates from the primary. It’s possible that they did. But it’s also possible that the single semi-sync replica that went offline, was the single one to get and acknowledge some last writes taking place on the primary.

To generalize, if the number of lost semi-sync replicas is equal to, or greater than rpl_semi_sync_master_wait_for_slave_count, then we do not know whether the remaining replicas are consistent.

Split Brain

But, arguably worse, is that we can now be in a split brain situation. Apps in the primary datacenter were still writing to the old primary when the network went down. Those writes continued to run. The old primary now receives writes never seen by R2 (the newly promoted primary). Once we do wish to reconcile, we are faced with contradicting information.

Physical locations

The geo-distribution of our servers plays a key part in how tolerant our system is for failure and what outcomes we can expect.

Colocated primary and semi-sync replicas

On a normal day, writes to the primary have low latency: acknowledgements from semi-sync replicas are quick to arrive thanks to the fast network within a datacenter.

When the primary fails, we have at least one good semi-sync replica to promote (or we can use it to seed a different server to promote). Because we promote within the same datacenter, app behavior is unlikely to be affected, other than the obvious disruption during the failover time.

If the datacenter is network-isolated, we need to make a decision: wait out the outage, incurring downtime to our services, or promote a new primary in another datacenter. We risk losing transactions (we cannot know whether we lose transactions until we are again able to connect to the original primary). We also risk a split brain scenario.

Cross-site semi-sync replication

If we only run semi-sync replicas in a different datacenter than the primary’s, we first pay with increased write latency, as each commit runs a roundtrip outside the primary’s datacenter and back. With multiple semi-sync replicas it’s the time it takes for the fastest replica to respond.

When the primary goes down, we have the data durable outside its datacenter. But we can then also compare with non semi-sync replicas in the primary’s datacenter: they may yet have all the transactions, too, in which case we can promote one of them. Again, promoting within the same datacenter tends to be less disruptive to the application. Or, we can spend some time seeding them from one of our semi-sync replicas.

Or, we can choose to promote our semi-sync replicas, switching the primary data center for the replication cluster. We then must reassign semi-sync replicas, and ensure none run from within the new primary’s datacenter.

In the event of the primary’s datacenter network isolation, we promote one of our semi-sync replicas, in a different DC. We reconfigure semi-sync replicas, before making the new primary writable. This avoids a split brain scenario, assuming all of our semi-sync replicas are available to us.

What if two or more sites get network isolated at the same time?

It largely depends on your hardware distribution.

If your servers are distributed across three sites, then the majority of your sites is now down. Chances are you were only planning for single site outage at any given time, and a two-site outage is a hard downtime for you. You may or may not have the capacity to run with just one third of your hardware. You’d need to risk consistency and split brain.

Let’s assume we run out of 5 different sites (or availability zones), and 2 are down. In our “1-n” setup, split brain is still possible, even though the majority of sites are up. That’s because our semi-sync setup is about the communication between a primary and any of its replicas.

But this is also something you can attach a number to. What is the probability of two availability zones going down at the same time? For a given vendor? Cross vendors? Is there a number you’re willing to accept? What is the probability of the two zones being in outage both over X minutes? Will you be willing to wait out X minutes to maintain durability and consistency?

In our physical world, there is no hard limitation to how many sites could go down at the same time. We’re generally willing to accept that some risks are so low that we can accept them.

Quorum

Of special interest is that in our “1-n” scenario, we have a quorum of two servers out of five or more. The primary, with a single additional replica, are able to form a quorum and to accept writes. That’s how we got to have a split brain. While R2, R3, R4 form a majority of the servers, writes took place without their agreement.

People familiar with Paxos and Raft consensus protocols may find this baffling. However, reliable minority consensus is achievable, and Sugu Sougoumarane’s Consensus Algorithms series of posts continues to describe this.