MySQL semi-sync replication: durability consistency and split brains

Shlomi Noach | October 2, 2020

MySQL semi-sync is a plugin mechanism on top of asynchronous replication, that can offer better durability and even consistency (term defined later). It helps in high availability solutions, but can in itself reduce availability. We look at some basics and follow up to present scenarios that require higher level intervention to ensure availability and to avoid split brains from taking place.

I recommend reading this semi-sync blog post by Jean-François Gagné (aka JFG), which illustrates the internals of the semi-sync implementation, and debunks some myths about semi-sync. We will overlap a bit with another recommended post by JFG, about high availability and recovery.

Note: in this post we adopt the term “primary” over the term “master” in the context of MySQL replication. However, at this time there is no alternative to using the actual names of some configuration and status variables that use “master” terminology, and some duality is present.

Overview

As quick recap, semi-synchronous replication is a mechanism where a commit on the primary does not apply the change onto internal table data and does not respond to the user, until the changelog is guaranteed to have been persisted (though not necessarily applied) on a preconfigured number of replicas. We limit our discussion to MySQL 5.7 or equivalent.

Specifically, the primary is configured with rpl_semi_sync_master_wait_for_slave_count, which is 1 or above, if semi-sync is to be used. An INSERT transaction, for example, will not report being committed (i.e. the user will not get “OK”) until at least that number of replicas have acknowledged receipt of that transaction’s changelog (entries in the binary log). In practice, this means at least that number of replicas have written the changelog onto their relay logs where the change is now persisted. The replicas will apply the change later on, normally as soon as they can. For the scope of this post, we assume replicas are not intentionally stopped, and we can also ignore delayed replicas.

The primary waits up to rpl_semi_sync_master_timeout, after which it falls back to asynchronous replication mode, committing and responding to the user even if not all expected replicas have acknowledged receipt. In the scope of this post, we are interested in an “infinite” timeout, which we will accept to be a very large number.

The only replicas to acknowledge receipt of the changelog are replicas where semi-sync is enabled. There could be other, non semi-sync replicas in our topology. They, too, pull the changelog from the primary, but do not acknowledge receipt. As JFG points out in his post, the semi-sync replicas aren’t necessarily the most up to date. It’s possible that a non-semi-sync replica has pulled more changelog data than some or all semi-sync replicas.

Durability

The most straightforward use case for semi-sync is to guarantee durability of the data. We only approve a commit once the transaction (in the form of a changelog) is persisted elsewhere, on a different server, or on multiple servers.

As mentioned earlier, this does not mean the changelog has been applied on those servers. Replication may be lagging and the servers may be too busy to be able to apply the changelog in a timely fashion. But, at the very least, if the primary crashes, there’s “forensic evidence” in the relay logs of a semi-sync replica that can tell us what the latest changes were.

Consistency

The term “consistency” is overloaded, and especially in distributed systems. The CAP Theorem definition of consistency is, for example, very strict: once data was written on one server, any immediate follow-up read on any other server must reflect that write (or any later write).

We will suffice with eventual consistency. This in itself an overloaded term, so let’s clarify: in the case of a primary outage, we consider our system consistent if we’re able to promote a new primary within any amount of time (obviously in practice we expect the time to be short) such that it is consistent with the previous primary’s advertised state. I write “advertised” state because the end user or applications should not be surprised by any data changes when the new primary is promoted, regardless of how transactions/replication work internally.

Split brain

A split brain is a situation where there are two servers which are simultaneously taking writes, each thinking they’re the (single) primary. In a split brain scenario the data diverges between the two servers. If those servers also have replicas, then we have two divergent replication trees. Fixing and merging those divergent trees is difficult, and normally we revert the changes on one to make it look like the other (whether by backup+restore or by some rollback method).

Reiterating JFG’s observation that while engineers normally frown upon split brains, the damage of split brain may well be within a product’s allowance.

Topologies and scenarios

As always, getting best durability as possible, best consistency as possible, best response times as possible, best split brain mitigation as possible, is costly. Product owners and engineers make reasonable tradeoffs.

Let’s consider a few configured topologies, and see what scenarios we can expect upon failure. In particular, how durable our data is, can we expect it to be consistent, and can we expect split brains.

General scenario: wait for single replica, multiple semi-sync replicas

In this common scenario our primary is configured with rpl_semi_sync_master_wait_for_slave_count=1, and there are multiple (more than one) replicas configured with semi-sync. Let’s say there are 4 semi-sync replicas, named R1, R2, R3, and R4. We nickname this setup “1-n”

In this setup, a write takes place on the primary, and before approving the write to the user/app, the primary waits for an acknowledgement from exactly one replica. It doesn’t matter which. For some writes, it will be replica R1; at some other time, R4 is the first to acknowledge.

It’s important to note that the changelog, the binary log, is sequential. A replica that acknowledges some changelog event, has necessarily received all of its prior events. This is why the primary doesn’t need to care that different acknowledgements come from different replicas. An acknowledgement means there’s a replica with full durability of committed transactions up to and including the one in question.

Should the primary fail, we have at least one replica guaranteed to have all approved commits in its relay logs. These are not necessarily applied, but if our replication cluster is normally low-lagged, we can expect those commits to be few, and we can expect the replica to catch up (apply those commits) within a few seconds or less.

A possible situation is that there may be non-semi-sync replicas even more up to date than our semi-sync replica. We can choose to use them. The fact they have more transactions than the most up-to-date semi sync replica means they have received binary log events not acknowledged by a semi-sync replica. While this may seem confusing at first, these are fine to accept, and do not contradict our promise to our users. We promised that “if the primary tells you a commit is successful, then the data is durable elsewhere”. We never said anything about the state of the data before telling the user the commit is successful. In fact, this scenario is no different than the primary actually getting a commit acknowledged by the replica, begins to communicate that to the user, then crashing halfway. The user cannot distinguish between the two cases at the time of failure (though post crash analysis may indicate which of the two was true).

And so we can choose to use the data from that even-more-up-to-date non-semi-sync replica, seed the rest of the replicas with that data, or, we can choose to throw away any server that is more-up-to-date than our most-up-to-date semi-sync replica and recycle them. It’s our choice and it’s down to operational considerations (time, capacity, load, ...).

If data is durable, does that mean it’s available?

Not necessarily. Consider this possible, even likely scenario: the primary and R1 are both in the same data center. R2, R3, R4 are in different availability zones, possibly remote, and so have higher latency than R1 communicating to the primary.

The app issues an UPDATE on the primary. R1 acknowledges the write. R2, R3, R4 do not, as yet. But the primary does not care, it commits and reports success to the app. Then the primary and R1 both get network isolated together because the DC they’re in has lost network. The loss of network cut short the delivery of the change log to R2, R3, R4. They never got it.

Should we promote either R2, R3, or R4, we lose consistency. The app expects the result of that UPDATE to be there, but neither of these servers have the data. Moreover, even human intervention is limited. Since the data is only found on primary and R1, we cannot get hold of the data. A human can always transport to the physical server to extract the data and send it over via cellular network, or even by car. It’s likely that this will take substantial time.

A predefined failover plan may choose to promote, say, R2, after all, losing some data, but cutting short on outage time. Reconciliation will have to take place afterwards.

Do we even know the status?

If the primary and one of the semi-sync replicas go offline, we have an additional problem: we can’t say for sure whether any of the remaining semi-sync replicas got the latest updates from the primary. It’s possible that they did. But it’s also possible that the single semi-sync replica that went offline, was the single one to get and acknowledge some last writes taking place on the primary.

To generalize, if the number of lost semi-sync replicas is equal to, or greater than rpl_semi_sync_master_wait_for_slave_count, then we do not know whether the remaining replicas are consistent.

Split Brain

But, arguably worse, is that we can now be in a split brain situation. Apps in the primary datacenter were still writing to the old primary when the network went down. Those writes continued to run. The old primary now receives writes never seen by R2 (the newly promoted primary). Once we do wish to reconcile, we are faced with contradicting information.

Physical locations

The geo-distribution of our servers plays a key part in how tolerant our system is for failure and what outcomes we can expect.

Colocated primary and semi-sync replicas

On a normal day, writes to the primary have low latency: acknowledgements from semi-sync replicas are quick to arrive thanks to the fast network within a datacenter.

When the primary fails, we have at least one good semi-sync replica to promote (or we can use it to seed a different server to promote). Because we promote within the same datacenter, app behavior is unlikely to be affected, other than the obvious disruption during the failover time.

If the datacenter is network-isolated, we need to make a decision: wait out the outage, incurring downtime to our services, or promote a new primary in another datacenter. We risk losing transactions (we cannot know whether we lose transactions until we are again able to connect to the original primary). We also risk a split brain scenario.

Cross-site semi-sync replication

If we only run semi-sync replicas in a different datacenter than the primary’s, we first pay with increased write latency, as each commit runs a roundtrip outside the primary’s datacenter and back. With multiple semi-sync replicas it’s the time it takes for the fastest replica to respond.

When the primary goes down, we have the data durable outside its datacenter. But we can then also compare with non semi-sync replicas in the primary’s datacenter: they may yet have all the transactions, too, in which case we can promote one of them. Again, promoting within the same datacenter tends to be less disruptive to the application. Or, we can spend some time seeding them from one of our semi-sync replicas.

Or, we can choose to promote our semi-sync replicas, switching the primary data center for the replication cluster. We then must reassign semi-sync replicas, and ensure none run from within the new primary’s datacenter.

In the event of the primary’s datacenter network isolation, we promote one of our semi-sync replicas, in a different DC. We reconfigure semi-sync replicas, before making the new primary writable. This avoids a split brain scenario, assuming all of our semi-sync replicas are available to us.

What if two or more sites get network isolated at the same time?

It largely depends on your hardware distribution.

If your servers are distributed across three sites, then the majority of your sites is now down. Chances are you were only planning for single site outage at any given time, and a two-site outage is a hard downtime for you. You may or may not have the capacity to run with just one third of your hardware. You’d need to risk consistency and split brain.

Let’s assume we run out of 5 different sites (or availability zones), and 2 are down. In our “1-n” setup, split brain is still possible, even though the majority of sites are up. That’s because our semi-sync setup is about the communication between a primary and any of its replicas.

But this is also something you can attach a number to. What is the probability of two availability zones going down at the same time? For a given vendor? Cross vendors? Is there a number you’re willing to accept? What is the probability of the two zones being in outage both over X minutes? Will you be willing to wait out X minutes to maintain durability and consistency?

In our physical world, there is no hard limitation to how many sites could go down at the same time. We’re generally willing to accept that some risks are so low that we can accept them.

Quorum

Of special interest is that in our “1-n” scenario, we have a quorum of two servers out of five or more. The primary, with a single additional replica, are able to form a quorum and to accept writes. That’s how we got to have a split brain. While R2, R3, R4 form a majority of the servers, writes took place without their agreement.

People familiar with Paxos and Raft consensus protocols may find this baffling. However, reliable minority consensus is achievable, and Sugu Sougoumarane’s Consensus Algorithms series of posts continues to describe this.

推荐订阅源

Blog — PlanetScale