惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

H
Heimdal Security Blog
小众软件
小众软件
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
罗磊的独立博客
Google DeepMind News
Google DeepMind News
大猫的无限游戏
大猫的无限游戏
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
Hugging Face - Blog
Hugging Face - Blog
阮一峰的网络日志
阮一峰的网络日志
A
About on SuperTechFans
宝玉的分享
宝玉的分享
博客园 - 聂微东
月光博客
月光博客
Cyberwarzone
Cyberwarzone
Microsoft Security Blog
Microsoft Security Blog
V
Visual Studio Blog
Project Zero
Project Zero
T
Tor Project blog
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
L
LINUX DO - 最新话题
博客园 - 叶小钗
Recent Commits to openclaw:main
Recent Commits to openclaw:main
Attack and Defense Labs
Attack and Defense Labs
Spread Privacy
Spread Privacy
Forbes - Security
Forbes - Security
Simon Willison's Weblog
Simon Willison's Weblog
N
Netflix TechBlog - Medium
P
Proofpoint News Feed
Engineering at Meta
Engineering at Meta
Hacker News: Ask HN
Hacker News: Ask HN
I
InfoQ
M
MIT News - Artificial intelligence
AI
AI
博客园 - 三生石上(FineUI控件)
W
WeLiveSecurity
C
Check Point Blog
The Hacker News
The Hacker News
C
Cyber Attacks, Cyber Crime and Cyber Security
Application and Cybersecurity Blog
Application and Cybersecurity Blog
T
Tenable Blog
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
The Cloudflare Blog
Blog — PlanetScale
Blog — PlanetScale
美团技术团队
D
Darknet – Hacking Tools, Hacker News & Cyber Security
GbyAI
GbyAI
Hacker News - Newest:
Hacker News - Newest: "LLM"
腾讯CDC
K
Kaspersky official blog

Blog — PlanetScale

Keeping a Postgres queue healthy — PlanetScale Patterns for Postgres Traffic Control — PlanetScale Graceful degradation in Postgres — PlanetScale High memory usage in Postgres is good, actually — PlanetScale Stripe Projects partnership: Provision PlanetScale Postgres and MySQL databases from the Stripe CLI — PlanetScale Enhanced tagging in Postgres Query Insights — PlanetScale Behind the scenes: How Database Traffic Control works — PlanetScale Introducing Database Traffic Control — PlanetScale Scaling Postgres connections with PgBouncer — PlanetScale Drizzle joins PlanetScale — PlanetScale Video Conferencing with Postgres — PlanetScale Faster PlanetScale Postgres connections with Cloudflare Hyperdrive — PlanetScale Introducing the PlanetScale MCP server — PlanetScale Database Transactions — PlanetScale Automating our changelog with Cursor commands — PlanetScale Postgres 18 is now available — PlanetScale Using MotherDuck with PlanetScale — PlanetScale $50 PlanetScale Metal is GA for Postgres — PlanetScale AI-Powered Postgres index suggestions — PlanetScale $5 PlanetScale is live — PlanetScale Announcing Vitess 23 — PlanetScale $50 PlanetScale Metal — PlanetScale Report on our investigation of the 2025-10-20 incident in AWS us-east-1 — PlanetScale $5 PlanetScale — PlanetScale Benchmarking Postgres 17 vs 18 — PlanetScale Larger than RAM Vector Indexes for Relational Databases — PlanetScale Partnering with Cloudflare to bring you the fastest globally distributed applications — PlanetScale Processes and Threads — PlanetScale PlanetScale for Postgres is now GA — PlanetScale Postgres High Availability with CDC — PlanetScale Announcing Neki — PlanetScale Caching — PlanetScale The principles of extreme fault tolerance — PlanetScale Announcing PlanetScale for Postgres — PlanetScale Benchmarking Postgres — PlanetScale Announcing Vitess 22 — PlanetScale The Real Failure Rate of EBS — PlanetScale IO devices and latency — PlanetScale Announcing PlanetScale Metal — PlanetScale PlanetScale Metal: There’s no replacement for displacement — PlanetScale Upgrading Query Insights to Metal — PlanetScale Automating cherry-picks between OSS and private forks — PlanetScale Database Sharding — PlanetScale Anatomy of a Throttler, part 3 — PlanetScale Introducing sharding on PlanetScale with workflows — PlanetScale Announcing Vitess 21 — PlanetScale Announcing the PlanetScale vectors public beta — PlanetScale Anatomy of a Throttler, part 2 — PlanetScale Instant deploy requests — PlanetScale Anatomy of a Throttler, part 1 — PlanetScale Increase IOPS and throughput with sharding — PlanetScale Tracking index usage with Insights — PlanetScale Faster backups with sharding — PlanetScale Building data pipelines with Vitess — PlanetScale The State of Online Schema Migrations in MySQL — PlanetScale Optimizing aggregation in the Vitess query planner — PlanetScale Dealing with large tables — PlanetScale Announcing Vitess 20 — PlanetScale Self-managed Vitess vs Managed Vitess with PlanetScale — PlanetScale Achieving data consistency with the consistent lookup Vindex — PlanetScale The MySQL adaptive hash index — PlanetScale Introducing global replica credentials — PlanetScale Profiling memory usage in MySQL — PlanetScale Summer 2023: Fuzzing Vitess at PlanetScale — PlanetScale How PlanetScale makes schema changes — PlanetScale Identifying and profiling problematic MySQL queries — PlanetScale The Problem with Using a UUID Primary Key in MySQL — PlanetScale Announcing Vitess 19 — PlanetScale PlanetScale forever — PlanetScale Introducing schema recommendations — PlanetScale Amazon Aurora Pricing: The many surprising costs of running an Aurora database — PlanetScale Three common MySQL database design mistakes — PlanetScale OAuth applications are now available to everyone — PlanetScale Deprecating the Scaler plan — PlanetScale PlanetScale branching vs. Amazon Aurora blue/green deployments — PlanetScale Databases at scale — PlanetScale Considerations for building a database disaster recovery plan — PlanetScale Working with Geospatial Features in MySQL — PlanetScale PlanetScale vs Amazon Aurora replication — PlanetScale Introducing the Vantage and PlanetScale integration — PlanetScale MySQL isolation levels and how they work — PlanetScale Introducing the schemadiff command line tool — PlanetScale $ pscale ping — PlanetScale Announcing foreign key constraints support — PlanetScale The challenges of supporting foreign key constraints — PlanetScale What is HTAP? — PlanetScale Introducing Insights Anomalies — PlanetScale Webhook security: a hands-on guide — PlanetScale MySQL replication: Best practices and considerations — PlanetScale A guide to HTML email with Ruby on Rails and Tailwind CSS — PlanetScale Sharding for cost-effective database management — PlanetScale PlanetScale ranks 188th in Deloitte’s top 500 fastest-growing companies — PlanetScale Announcing the Fivetran integration — PlanetScale Introducing webhooks — PlanetScale What is MySQL replication and when should you use it? — PlanetScale Sync user data between Clerk and a PlanetScale MySQL database — PlanetScale Introducing database reports — PlanetScale Distributed caching systems and MySQL — PlanetScale What is MySQL partitioning? — PlanetScale MySQL High Availability: Connection handling and concurrency — PlanetScale
Consensus algorithms at scale: Part 3 - Use cases — PlanetScale
Sugu Sougoumarane · 2020-09-26 · via Blog — PlanetScale

Sugu Sougoumarane |

If you’re still catching up, you can find links to each article in the series at the bottom of this article.

Recap of parts 1-3

Here is a recap of what we covered in the last blog:

  • Durability is the main reason why we want to use a consensus system.
  • Since Durability is use-case dependent, we made it an abstract requirement requiring the consensus algorithms to assume nothing about the durability requirements.
  • We started off with the original properties of a consensus system as defined by Paxos and modified it to make it usable in practical scenarios: instead of converging on a value, we changed the system to accept a series of requests.
  • We narrowed our scope down to single leader systems.
  • We came up with a new set of rules that are agnostic of durability. The essential claim is that a system that follows these rules will be able to satisfy the requirements of a consensus system. Specifically, we excluded some requirements like majority quorum that have previously been used as core building blocks in consensus algorithms.

Consensus Use Cases

If there was no need to worry about a majority quorum, we would have the flexibility to deploy any number of nodes we require. We can designate any subset of those nodes to be eligible leaders, and we can make durability decisions without being influenced by the above two decisions. This is exactly what many users have done with Vitess. The following use cases are loosely derived from real production workloads:

  • We have a large number of replicas spread over many data centers. Of these, we have fifteen leader capable nodes spread over three data centers. We don’t expect two nodes to go down at the same time. Network partitions can happen, but only between two data centers; a data center will never be totally isolated. A data center can be taken down for planned maintenance.
  • We have four zones with one node in each zone. Any node can fail. A zone can go down without notice. A partition can happen between any two zones.
  • We have six nodes spread over three zones. Any node can fail. A zone can go down without notice. A partition can happen between any two zones.
  • We have two regions, each region has two zones. We don’t expect more than one zone to go down. A region can be taken down for maintenance, in which case we want to proactively transfer writes to the other region.

I have not seen anyone ask for a durability requirement of more than two nodes. But this may be due to difficulties dealing with corner cases that MySQL introduces due to its semi-sync behavior. On the other hand, these settings have served the users well so far. So, why become more conservative?

These configurations are all uncomfortable for a majority based consensus system. More importantly, these flexibilities will encourage users to experiment with even more creative combinations and allow them to achieve better trade-offs.

Reasoning about Flexible Consensus

The configurations in the previous section seem to be all over the place. How do we design a system that satisfies all of them, and how do we future-proof ourselves against newer requirements?

There is a way to reason about why this flexibility is possible. This is because the two cooperating algorithms (Request and Election) share a common view of the durability requirements, but can otherwise operate independently.

For example, let us consider the five node system. If a user does not expect more than one node to fail at any given time, then they would specify their durability requirement as two nodes.

The leader can use this constraint to make requests durable: as soon as the data has reached one other node, it has become durable. We can return success to the client.

On the election side, if there is a failure, we know that no more than one node could have failed. This means that four nodes will be reachable. At least one of those will have the data for all successful requests. This will allow the election process to propagate that data to other nodes and continue accepting new requests after a new leader is elected.

In other words, a single durability constraint dictates both sides of the behavior; if we can find a formal way to describe the requirements, then a request has to fulfil those requirements. On the other hand, an election needs to reach enough nodes to intersect with the same requirements.

For example, if durability is achieved with 2/5 nodes, then the election algorithm needs to reach 4/5 nodes to intersect with the durability criteria. In the case of a majority quorum, both of these are 3/5. But our generalization will work for any arbitrary property.

Worst Case Scenario

In the above five node case, if two nodes fail, the failure tolerance has been exceeded. We can only reach three nodes. If we don’t know about the state of the other two nodes, we will have to assume the worst case scenario that a durable request could have been accepted by the two unreachable nodes. This will cause the election process to stall.

If this were to happen, the system has to allow for a compromise: abandon the two nodes and move forward. Otherwise, the loss of availability may become more expensive than the potential loss of that data.

Practical Balance

A two-node durability does not always mean that the system will stall or lose data. A very specific sequence of failures have to happen:

  • Leader accepts a request
  • Leader attempts to send the request to multiple recipients
  • Only one recipient receives and acknowledges the request
  • Leader returns a success to the client
  • Both the leader and that recipient crash

This type of failure can happen if the leader and the recipient node are network partitioned from the rest of the cluster. We can mitigate this failure by requiring the ackers to live across network boundaries.

The likelihood of a replica node in one cell failing after an acknowledgment, and a master node failing in the other cell after returning success, is much lower. This failure mode is rare enough that many users treat this level of risk as acceptable.

Orders of Magnitude

The most common operation performed by a consensus system is the completion of requests. In contrast, a leader election generally happens in two cases: taking nodes down for maintenance, or upon failure.

Even in a dynamic cloud environment like Kubernetes, it would be surprising to see more than one election per day for a cluster, whereas such a system could be serving hundreds of requests per second. That amounts to many orders of magnitude in difference between a request being fulfilled and a leader election.

This means that we must do whatever it takes to fine tune the part that executes requests, whereas leader elections can be more elaborate and slower. This is the reason why we have a bias towards reducing the durability settings to the bare minimum. Expanding this number can adversely affect performance, especially the tail latency.

At YouTube, although the quorum size was big, a single ack from a replica was sufficient for a request to be deemed completed. On the other hand, the leader election process had to chase down all possible nodes that could have acknowledged the last transaction. We did consciously trade off on the number of ackers to avoid going on a total wild goose chase.

In the next blog post, we will take a short detour. Shlomi Noach will talk about how some of these approaches work with MySQL and semi-sync replication. Following this, we will continue pushing forward on the implementation details of these algorithms.

Read the full Consensus Algorithms series