惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

H
Heimdal Security Blog
小众软件
小众软件
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
罗磊的独立博客
Google DeepMind News
Google DeepMind News
大猫的无限游戏
大猫的无限游戏
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
Hugging Face - Blog
Hugging Face - Blog
阮一峰的网络日志
阮一峰的网络日志
A
About on SuperTechFans
宝玉的分享
宝玉的分享
博客园 - 聂微东
月光博客
月光博客
Cyberwarzone
Cyberwarzone
Microsoft Security Blog
Microsoft Security Blog
V
Visual Studio Blog
Project Zero
Project Zero
T
Tor Project blog
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
L
LINUX DO - 最新话题
博客园 - 叶小钗
Recent Commits to openclaw:main
Recent Commits to openclaw:main
Attack and Defense Labs
Attack and Defense Labs
Spread Privacy
Spread Privacy
Forbes - Security
Forbes - Security
Simon Willison's Weblog
Simon Willison's Weblog
N
Netflix TechBlog - Medium
P
Proofpoint News Feed
Engineering at Meta
Engineering at Meta
Hacker News: Ask HN
Hacker News: Ask HN
I
InfoQ
M
MIT News - Artificial intelligence
AI
AI
博客园 - 三生石上(FineUI控件)
W
WeLiveSecurity
C
Check Point Blog
The Hacker News
The Hacker News
C
Cyber Attacks, Cyber Crime and Cyber Security
Application and Cybersecurity Blog
Application and Cybersecurity Blog
T
Tenable Blog
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
The Cloudflare Blog
Blog — PlanetScale
Blog — PlanetScale
美团技术团队
D
Darknet – Hacking Tools, Hacker News & Cyber Security
GbyAI
GbyAI
Hacker News - Newest:
Hacker News - Newest: "LLM"
腾讯CDC
K
Kaspersky official blog

Blog — PlanetScale

Keeping a Postgres queue healthy — PlanetScale Patterns for Postgres Traffic Control — PlanetScale Graceful degradation in Postgres — PlanetScale High memory usage in Postgres is good, actually — PlanetScale Stripe Projects partnership: Provision PlanetScale Postgres and MySQL databases from the Stripe CLI — PlanetScale Enhanced tagging in Postgres Query Insights — PlanetScale Behind the scenes: How Database Traffic Control works — PlanetScale Introducing Database Traffic Control — PlanetScale Scaling Postgres connections with PgBouncer — PlanetScale Drizzle joins PlanetScale — PlanetScale Video Conferencing with Postgres — PlanetScale Faster PlanetScale Postgres connections with Cloudflare Hyperdrive — PlanetScale Introducing the PlanetScale MCP server — PlanetScale Database Transactions — PlanetScale Automating our changelog with Cursor commands — PlanetScale Postgres 18 is now available — PlanetScale Using MotherDuck with PlanetScale — PlanetScale $50 PlanetScale Metal is GA for Postgres — PlanetScale AI-Powered Postgres index suggestions — PlanetScale $5 PlanetScale is live — PlanetScale Announcing Vitess 23 — PlanetScale $50 PlanetScale Metal — PlanetScale Report on our investigation of the 2025-10-20 incident in AWS us-east-1 — PlanetScale $5 PlanetScale — PlanetScale Benchmarking Postgres 17 vs 18 — PlanetScale Larger than RAM Vector Indexes for Relational Databases — PlanetScale Partnering with Cloudflare to bring you the fastest globally distributed applications — PlanetScale Processes and Threads — PlanetScale PlanetScale for Postgres is now GA — PlanetScale Postgres High Availability with CDC — PlanetScale Announcing Neki — PlanetScale Caching — PlanetScale The principles of extreme fault tolerance — PlanetScale Announcing PlanetScale for Postgres — PlanetScale Benchmarking Postgres — PlanetScale Announcing Vitess 22 — PlanetScale The Real Failure Rate of EBS — PlanetScale IO devices and latency — PlanetScale Announcing PlanetScale Metal — PlanetScale PlanetScale Metal: There’s no replacement for displacement — PlanetScale Upgrading Query Insights to Metal — PlanetScale Automating cherry-picks between OSS and private forks — PlanetScale Database Sharding — PlanetScale Anatomy of a Throttler, part 3 — PlanetScale Introducing sharding on PlanetScale with workflows — PlanetScale Announcing Vitess 21 — PlanetScale Announcing the PlanetScale vectors public beta — PlanetScale Anatomy of a Throttler, part 2 — PlanetScale Instant deploy requests — PlanetScale Anatomy of a Throttler, part 1 — PlanetScale Increase IOPS and throughput with sharding — PlanetScale Tracking index usage with Insights — PlanetScale Faster backups with sharding — PlanetScale Building data pipelines with Vitess — PlanetScale The State of Online Schema Migrations in MySQL — PlanetScale Optimizing aggregation in the Vitess query planner — PlanetScale Dealing with large tables — PlanetScale Announcing Vitess 20 — PlanetScale Self-managed Vitess vs Managed Vitess with PlanetScale — PlanetScale Achieving data consistency with the consistent lookup Vindex — PlanetScale The MySQL adaptive hash index — PlanetScale Introducing global replica credentials — PlanetScale Profiling memory usage in MySQL — PlanetScale Summer 2023: Fuzzing Vitess at PlanetScale — PlanetScale How PlanetScale makes schema changes — PlanetScale Identifying and profiling problematic MySQL queries — PlanetScale The Problem with Using a UUID Primary Key in MySQL — PlanetScale Announcing Vitess 19 — PlanetScale PlanetScale forever — PlanetScale Introducing schema recommendations — PlanetScale Amazon Aurora Pricing: The many surprising costs of running an Aurora database — PlanetScale Three common MySQL database design mistakes — PlanetScale OAuth applications are now available to everyone — PlanetScale Deprecating the Scaler plan — PlanetScale PlanetScale branching vs. Amazon Aurora blue/green deployments — PlanetScale Databases at scale — PlanetScale Considerations for building a database disaster recovery plan — PlanetScale Working with Geospatial Features in MySQL — PlanetScale PlanetScale vs Amazon Aurora replication — PlanetScale Introducing the Vantage and PlanetScale integration — PlanetScale MySQL isolation levels and how they work — PlanetScale Introducing the schemadiff command line tool — PlanetScale $ pscale ping — PlanetScale Announcing foreign key constraints support — PlanetScale The challenges of supporting foreign key constraints — PlanetScale What is HTAP? — PlanetScale Introducing Insights Anomalies — PlanetScale Webhook security: a hands-on guide — PlanetScale MySQL replication: Best practices and considerations — PlanetScale A guide to HTML email with Ruby on Rails and Tailwind CSS — PlanetScale Sharding for cost-effective database management — PlanetScale PlanetScale ranks 188th in Deloitte’s top 500 fastest-growing companies — PlanetScale Announcing the Fivetran integration — PlanetScale Introducing webhooks — PlanetScale What is MySQL replication and when should you use it? — PlanetScale Sync user data between Clerk and a PlanetScale MySQL database — PlanetScale Introducing database reports — PlanetScale Distributed caching systems and MySQL — PlanetScale What is MySQL partitioning? — PlanetScale MySQL High Availability: Connection handling and concurrency — PlanetScale
Consensus algorithms at scale: Part 5 - Handling races — PlanetScale
Sugu Sougoumarane · 2022-04-28 · via Blog — PlanetScale

Sugu Sougoumarane |

If you’re still catching up, you can find links to each article in the series at the bottom of this article.

In this post, we will look at how to make revocation and establishment of leadership work if there are race conditions. We will also cover forward progress requirements.

Recap of parts 1-4

  • Durability is the main reason we want to use a consensus system.
  • Since durability is use-case dependent, we made it an abstract requirement that the consensus algorithms assume nothing about the durability requirements.
  • We started off with the original properties of a consensus system as defined by Paxos and modified it to make it usable in practical scenarios. Instead of converging on a value, we changed the system to accept a series of requests.
  • We narrowed our scope down to single leader systems.
  • We came up with a new set of rules that are agnostic of durability. The essential claim is that a system that follows these rules will be able to satisfy the requirements of a consensus system. Specifically, we excluded some requirements like majority quorum that have previously been used as core building blocks in consensus algorithms.
  • We looked at a number of practical scenarios where it is difficult to make a simplistic majority quorum approach work well. A flexible consensus system would accommodate those use cases more comfortably.
  • We conceptualized the leadership change process into two distinct concerns: Revoke and Establish. This opens up more implementation options that previous traditional algorithms did not accommodate.

Two approaches to resolving race conditions

If we had only one agent that performed leadership changes, there would be no reason to worry about races. But this is not practical because a network partition could isolate that agent and prevent it from performing the necessary actions.

Introducing more than one agent to perform leadership changes requires us to handle race conditions. There are two approaches to resolving races: either the first agent wins, or the last one wins. The determination of who is first or last can vary depending on the approach used. We will drill down on this as we analyze each option.

An approach that makes the first agent win essentially prevents later agents from succeeding. This is equivalent to obtaining a lock. We will therefore call this approach lock-based. The other one will be called the lock-free approach.

We will analyze these approaches separately. As we will see below, this difference is quite fundamental, and it is surprising that it has not been called out explicitly for consensus algorithms.

The elector

Let us make a quick detour to introduce a new term.

At YouTube, each shard had fifteen eligible primaries. Having all of them scan for failures and perform active failovers was not practical. Instead, we had one agent in each region that scanned for failures and also performed leadership elections.

In other words, it is not necessary for a candidate to elect itself as leader. A separate agent can perform all the necessary steps to promote a candidate as leader. We will call this the elector. In many situations, like in the case of YouTube, this separation may be necessary. Unsurprisingly, Vitess has also inherited this separation: VTorc is a Vitess component that acts as the elector.

It is beneficial to conceptualize the role of an elector as being distinct from that of a candidate. This does not preclude a candidate from taking on such a role. This terminology will be useful for the sections below.

Lock-based approach

Obtaining a lock simplifies the problem of changing leadership. It guarantees that the elector that succeeds at getting the lock is the only one capable of making changes to the system.

However, the lock-based approach introduces a problem with forward progress. For example, an elector may successfully obtain a lock and then crash or become partitioned out of the rest of the system. This will prevent all other electors from ever being able to repair this situation. To resolve this, lock-based systems must introduce a time component: any elector that obtains a lock must complete its task within a certain period of time, after which the lock is automatically released.

A lock-based approach generally converges faster than approaches that are lock-free. This is because the first node that attempts a leadership change is likely to have made the most progress towards completing the task. Under most circumstances, giving the first elector the chance to succeed will complete the leadership change with the least disruption.

The act of obtaining a lock requires the participation of multiple nodes. This ensures that forward progress is possible if some nodes fail or if there is a network partition. Coincidentally, this problem shares some properties of distributed consensus. For this reason, the existing nodes that are participating in the quorum could be reused to obtain the lock. In fact, this is exactly what Raft does, but it is implicit.

Locking as a separate concern

One may argue that a system like Raft does not obtain a lock even though it makes the first elector win. This is because the act of obtaining a lock is shadowed by other actions it takes. If you subtract out the other actions (revoke, establish, and propagate) in the code that performs an election, it will be evident that what is left is the act of obtaining a distributed lock.

An algorithm has the option of implementing the locking function as an explicit and separate step. Choosing to do this gives us more flexibility because we can fine-tune this step without conflicting with the concerns of revocation or establishment.

How to obtain a lock

The primary requirement of using the existing nodes to obtain a lock is that the set of nodes each elector reaches has to intersect with those of the others. A majority-based system automatically ensures this intersection. This allows for algorithms like Raft to piggy-back the locking as a side-effect of performing revocation and establishment.

For systems that do not use the majority approach, the fact that revocation has to complement establishment ensures that the electors will have to reach an intersecting set of nodes. So, this property also allows us to perform locking as a side-effect of revocation and establishment.

But this is not our only option. Any mechanism that ensures that only one elector can act will work equally well. Here are some alternatives:

  • Use a simple majority for the purpose of obtaining a lock, while using a more sophisticated approach for revocation and establishment.
  • In Vitess, we use an external system like etcd to obtain such a lock. The decision to rely on another consensus system to implement our own may seem odd. But the difference is that Vitess itself can complete a massive number of requests per second. Its usage of etcd is only for changing leadership, which is in the order of once per day or week.
  • Humans could decide to manually authorize an elector to perform a leadership change, essentially giving it a “lock”.

Proposal numbers

If you are using a lock-based approach, there is no need to use proposal numbers or leadership terms for the purpose of electing a leader. But we may still need to assign such a number to facilitate the propagation of requests. We will discuss this in a subsequent blog.

Graceful leadership changes

Once a lock is obtained, the elector is guaranteed that no one else will be changing the system while they hold the lock. A big advantage of this situation is that the current leader can be discovered, which allows us to use the graceful method of changing leadership described in the previous blog.

The clock

In lock-based systems, we have to rely on accurate clocks. The system also has to make sure that sufficient tolerances are built into timeouts to account for normal clock skews, which are typically in the milliseconds. In general, it is advisable to use “many seconds” of granularity to sequence events.

Consistent reads

Relying on locks lets us exploit the time component to give leaders a lease, thereby guaranteeing no leader change during the lease period. This lets the users perform efficient, consistent reads by accessing the leader without the need for a quorum read. The existing leader could continue to renew the lease as it completes more requests, which leads to prolonged stability.

Lock-free approach

In the case of a lock-free approach, the newest elector must win over an older one. This requires the algorithm to assign a time-based order to the electors that are racing with each other. In Paxos, these are referred to as proposal numbers. In Raft, these are known as term numbers. To facilitate reasoning, we can view these numbers as timestamps.

How does this work?

The core of the lock-free approach is that the followers that accept an establishment or revocation request must remember the timestamp of the elector that issued the request and should reject requests with older timestamps.

There are two possible ways a lock-free approach would converge:

  1. The elector with the older timestamp completes its election before the one with the newer timestamp. Following this, the one with the newer timestamp will end up revoking that leadership and establishing its own.
  2. The elector with the newer timestamps completes first. Then the one with the older timestamp will fail at its attempt, and the leadership established by the newer timestamp will prevail.

To cover the above scenarios, every elector must assume that there may be another elector with an older timestamp attempting a leadership change. It must therefore attempt to revoke leadership from all potential candidates, not just the current known leader. The completion of this process ensures that all possible leaderships (present and future) with an older timestamp are invalidated. This addresses the case where an old elector is slow at performing its actions. This also adds safety against clock skews: a new leader with an incorrect older timestamp will just fail at completing a leadership change, as if it was an older leader.

Pros and Cons

The main advantage of a lock-free approach is that it naturally supports forward progress. If an existing elector fails, a different elector can initiate a new round without knowledge of the state or age of an older elector. For this reason, there is no need to depend on timeouts.

The disadvantage of a lock-free approach is that there is no certainty of a stable leader. This is because it is possible for a leadership to end between the time you discover it and give it a request.

Consistent Reads

The absence of a stable leader makes consistent reads complicated. We essentially have to resort to quorum reads.

Recommendations

The elegance of a lock-free approach may seem tempting, but the lack of a stable leader complicates everything else. Having to reach a quorum for consistent reads is a major drawback for scaling systems.

Weighing these options, a lock-based system should be preferred for large scale consensus systems. Having a stable leader simplifies many other operational parts of the system.

In Vitess, the current leader for each cluster is published through its topology, and a large number of workflows rely on this information to perform various tasks. Any operation that does not want the leader to change just has to obtain a lock before doing its work.

Read the full Consensus Algorithms series