Consensus algorithms at scale: Part 4 - Establishment and revocation — PlanetScale
Sugu Sougoumarane · 2022-04-06 · via Blog — PlanetScale

If you’re still catching up, you can find links to each article in the series at the bottom of this article.

Leader election is the less frequently used part of the consensus process, but it is also the more complex one. We will therefore drill into this part first.

Recap of parts 1-3

  • Durability is the main reason we want to use a consensus system.
  • Since Durability is use-case dependent, we made it an abstract requirement that the consensus algorithms assume nothing about the durability requirements.
  • We started off with the original properties of a consensus system as defined by Paxos and modified them to make them usable in practical scenarios. Instead of converging on a value, we changed the system to accept a series of requests.
  • We narrowed our scope down to single leader systems.
  • We came up with a new set of rules that are agnostic of durability. The essential claim is that a system that follows these rules will be able to satisfy the requirements of a consensus system. Specifically, we excluded some requirements like majority quorum that have previously been used as core building blocks in consensus algorithms.
  • We looked at a number of practical scenarios where it is difficult to make a simplistic majority quorum approach work well. A flexible consensus system would accommodate those use cases more comfortably.

Consensus Algorithms at Scale - Part 1

Consensus Algorithms at Scale - Part 2

Consensus Algorithms at Scale - Part 3

Conflating too many concerns

Traditional algorithms like Paxos and Raft try to do too many things at once. The cleverness of those approaches is commendable. However, such implementations are too rigid, and you cannot make modifications to specific parts of the algorithm without breaking something else.

What we are going to do now is separate those concerns, and talk about how to address them individually. We can still choose to conflate them, but it should be a conscious decision.

An important revelation is that all leader-based consensus algorithms perform the following actions when electing a new leader:

  • Revoke a previously existing leadership
  • Establish a new leader

An additional constraint is that the revocation must precede the establishment. Otherwise, we will end up with more than one leader.

Majority-based consensus algorithms satisfy this constraint atomically: When a leader successfully recruits all the necessary followers, it automatically achieves the goal of revoking the previous leadership.

Because the revoke was implicitly achieved, it was never called out as a separate concern. More importantly, it was never called out as a concern that could be separated.

In other words, it is not necessary to perform the two operations as part of the same action. This separation becomes more important for consensus systems that are not majority-based.
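
One way to see why a majority rule gets the revoke for free is that any two majorities of the same node set share at least one node, so a newly recruited majority always includes a follower that the old leader depended on. The sketch below checks this by brute force for five nodes, and shows that a smaller quorum size loses the property. The node names and helper functions are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def majorities(nodes):
    """Yield every subset of `nodes` that forms a strict majority."""
    need = len(nodes) // 2 + 1
    for size in range(need, len(nodes) + 1):
        for subset in combinations(nodes, size):
            yield frozenset(subset)


def every_pair_intersects(quorums):
    """True if any two quorums in the collection share at least one node."""
    quorums = list(quorums)
    return all(a & b for a in quorums for b in quorums)


nodes = ["n1", "n2", "n3", "n4", "n5"]  # hypothetical five-node cluster

# Any two majorities overlap, so recruiting one implicitly touches the other.
all_majorities_overlap = every_pair_intersects(majorities(nodes))

# Quorums of size two can be disjoint (for example n1,n2 and n3,n4),
# so recruiting one of them revokes nothing by itself.
pairs_overlap = every_pair_intersects(
    frozenset(pair) for pair in combinations(nodes, 2)
)
```

When quorums are not guaranteed to overlap, the revoke has to be carried out as its own explicit step.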

To limit complexity, this section will start by focusing on establishment and revocation of leadership. Once we have analyzed these two actions, we will layer in the rest of the concerns, which are forward progress, race handling, and propagation of requests.

Even though the two actions can be performed separately, there exists a strong relationship between them: Leadership is established when all the parameters are in place for a leader to successfully complete requests. Any change that invalidates this condition is a revocation.

Proposal Numbers

In traditional consensus algorithms, the establishment of leadership is achieved by requesting followers to accept a specific proposal number. If a candidate manages to perform this action on a majority of the nodes, then the leadership is deemed established.

To revoke such a leadership, the new candidate pushes a different proposal number to those followers, which implicitly revokes the previous leader’s ability to propagate requests to those nodes. When the majority of the followers are reached, the revocation of the previous leadership and the establishment of the new one are simultaneously achieved.
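
The following is a deliberately simplified model of that mechanism, not an implementation of Paxos or Raft. It assumes a hypothetical `Follower` that remembers the highest proposal number it has accepted and only honors requests carrying that number. Pushing a newer number to a majority takes the majority away from the old leader and hands it to the new one in the same step.

```python
class Follower:
    """Hypothetical follower that tracks a single proposal number."""

    def __init__(self):
        self.promised = 0  # highest proposal number accepted so far

    def recruit(self, proposal):
        """Accept a candidate's proposal number if it is newer than any seen."""
        if proposal > self.promised:
            self.promised = proposal
            return True
        return False

    def accept_request(self, proposal):
        """Honor a request only if it carries the number last accepted."""
        return proposal == self.promised


def has_majority(followers, proposal):
    """True if a leader using `proposal` can still reach a majority."""
    acks = sum(f.accept_request(proposal) for f in followers)
    return acks > len(followers) // 2


followers = [Follower() for _ in range(5)]

# Leader 1 is established on every follower.
for f in followers:
    f.recruit(1)
old_ok_before = has_majority(followers, 1)

# A candidate pushes proposal number 2 to three of the five followers.
for f in followers[:3]:
    f.recruit(2)

old_ok_after = has_majority(followers, 1)  # old leader keeps only two followers
new_ok = has_majority(followers, 2)        # new leader holds three
```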

Without Proposal Numbers

The usage of proposal numbers is only one of many methods of establishing and revoking leadership. For example, in MySQL, the replication mechanism could also be used to achieve the same objectives: Pointing a semi-sync replica at a primary is an act of leadership establishment. Requesting such a replica to stop replicating or to replicate from a different source would achieve the objective of revocation.

Knowing the current leader

Depending on how we handle races, the current leader may not be known. If so, the revocation must be performed against all potential leaders. In other words, the election process must reach enough nodes to be sure that no existing leader can complete its requests. This will become clearer in the next blog post, where we will cover race conditions.
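
As a rough sketch of what "enough nodes" can mean, assume a durability rule in which a leader completes a request once `required_acks` other nodes acknowledge it, and assume that any node the election has reached stops acting as a leader and stops acknowledging. Both assumptions, and the function name, are illustrative rather than taken from any particular system. Revocation is then complete once no unreached node could still gather the acknowledgements it needs.

```python
def revocation_complete(nodes, reached, required_acks):
    """True if no node could still complete a request as leader.

    Assumes a reached node neither leads nor acknowledges anymore.
    """
    for leader in nodes:
        if leader in reached:
            continue  # already revoked directly
        helpers = [n for n in nodes if n != leader and n not in reached]
        if len(helpers) >= required_acks:
            return False  # this node could still complete requests
    return True


nodes = ["n1", "n2", "n3", "n4", "n5"]  # hypothetical cluster

# With two acks required, reaching two nodes leaves a viable trio behind.
two_reached = revocation_complete(nodes, {"n1", "n2"}, required_acks=2)

# Reaching three nodes leaves only two, which is not enough to form a trio.
three_reached = revocation_complete(nodes, {"n1", "n2", "n3"}, required_acks=2)

# With one ack required, the two remaining nodes can still serve each other.
one_ack_three = revocation_complete(nodes, {"n1", "n2", "n3"}, required_acks=1)
one_ack_four = revocation_complete(nodes, {"n1", "n2", "n3", "n4"}, required_acks=1)
```

The number of nodes that must be reached therefore depends on the durability rule, not on a fixed majority.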

Direct Leader Demotion

Now that we have identified revocation as a possible separate action, we can look at more than one way to revoke an existing leadership.

If the current leader is known, requesting that leader to step down also results in a valid revocation. This method is generally more graceful because the leader has the opportunity to complete in-flight requests and also inform clients of an imminent change in leadership.

Demoting the existing leader is meaningful only for planned changes, like a software rollout. If a leader becomes unreachable due to a crash or a network partition, we have to fall back to requesting the followers to stop accepting requests from the current leader to achieve revocation.

In Vitess, we have two operations that can perform a leadership change: PlannedReparentShard (PRS) and EmergencyReparentShard (ERS). For software rollouts, we use PRS to demote the current primary to a replica before performing the update. But we use ERS if we detect that the primary database is down or not reachable.

If a PRS is issued, the low-level vttablet component of Vitess goes into a lameduck mode where it allows in-flight transactions to complete, but rejects any new ones. At the same time, the front-end proxies (vtgate) begin to buffer such new transactions. Once PRS completes, all buffered transactions are sent to the new primary, and the system resumes without serving any errors to the application.
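
The lameduck and buffering behavior can be illustrated with a toy model. The `Tablet` and `Gate` classes below are hypothetical stand-ins and do not reflect the actual vttablet or vtgate APIs. The sketch also treats each transaction as instantaneous, so it does not model in-flight transactions draining.

```python
class Tablet:
    """Hypothetical stand-in for a primary database node."""

    def __init__(self, name):
        self.name = name
        self.lameduck = False
        self.committed = []

    def begin(self, txn):
        if self.lameduck:
            raise RuntimeError("not accepting new transactions")
        self.committed.append(txn)


class Gate:
    """Hypothetical stand-in for a front-end proxy."""

    def __init__(self, primary):
        self.primary = primary
        self.buffer = []

    def execute(self, txn):
        try:
            self.primary.begin(txn)
        except RuntimeError:
            self.buffer.append(txn)  # hold the transaction instead of failing


def start_planned_reparent(gate):
    """Revoke: the old primary stops taking new transactions."""
    gate.primary.lameduck = True


def finish_planned_reparent(gate, new_primary):
    """Establish: route to the new primary and replay what was buffered."""
    gate.primary = new_primary
    pending, gate.buffer = gate.buffer, []
    for txn in pending:
        gate.execute(txn)


old, new = Tablet("old"), Tablet("new")
gate = Gate(old)

gate.execute("t1")            # served by the old primary
start_planned_reparent(gate)
gate.execute("t2")            # buffered while the change is in progress
gate.execute("t3")
finish_planned_reparent(gate, new)
```

In this model the application never sees an error: `t1` lands on the old primary, and `t2` and `t3` are delivered to the new one once the change completes.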

Why use two approaches?

A typical cluster could be completing thousands of requests per second. In contrast, a software rollout is likely a daily event. In further contrast, a node failure may happen once a month or even less frequently.

It is important that we optimize for the common case. This means that we want leadership changes to be graceful during software rollout. Ideally, the application should see no errors during this time. The approach of demoting the current leader gives us this opportunity.

Interchangeability

Can we assume that two different algorithms are interchangeable? The answer is yes. Let us assume that a leadership is established by satisfying conditions A and B. One algorithm achieves revocation by making condition A false, and the other by making condition B false. In both cases, it is a successful revocation.

Once revocation is complete, both algorithms have to make conditions A and B true for the new leader, which will allow for subsequent rounds to use any method of revocation.
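
The argument can be written down as a tiny state model. Everything here is hypothetical: `cond_a` and `cond_b` stand for the abstract conditions A and B, and the two revoke functions stand for the two algorithms.

```python
class Leadership:
    """Leadership holds only while both conditions are true."""

    def __init__(self):
        self.cond_a = False
        self.cond_b = False

    def established(self):
        return self.cond_a and self.cond_b


def establish(leadership):
    """Both algorithms must end by making A and B true for the new leader."""
    leadership.cond_a = True
    leadership.cond_b = True


def revoke_via_a(leadership):
    leadership.cond_a = False


def revoke_via_b(leadership):
    leadership.cond_b = False


leadership = Leadership()
history = []

# Alternate freely between the two revocation methods across rounds.
for revoke in (revoke_via_a, revoke_via_b, revoke_via_a):
    establish(leadership)
    was_up = leadership.established()
    revoke(leadership)
    is_down = not leadership.established()
    history.append((was_up, is_down))
```

Because every round ends with both conditions restored, the choice of revocation method in one round places no constraint on the next.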

Other approaches

We can think of innumerable other ways to establish and revoke leadership, and they would all be valid as long as the revocation and establishment conditions are accurately satisfied. As an extreme example, cutting the network cable that connects a leader to its followers is also a valid way to revoke an existing leadership.

I know of one incident at Google where we had to dispatch a human to physically shut down a machine where a leader had gone rogue.

In the next blog post, we will discuss possible options for handling races and ensuring forward progress. At that time, we will re-evaluate these approaches.

Read the full Consensus Algorithms series