Orchestrator failure detection and recovery: New Beginnings — PlanetScale
Shlomi Noach · 2020-09-19 · via Blog — PlanetScale


Orchestrator is an open source MySQL replication topology management and high availability solution. Vitess has recently integrated orchestrator as a native component of its infrastructure to achieve reliable failover, availability, and topology resolution of its clusters. This post first illustrates the core logic of orchestrator’s failure detection, and proceeds to share how the new integration adds new failure detection and recovery scenarios, making orchestrator’s operation goal-oriented.

Note: in this post we adopt the term “primary” over the term “master” in the context of MySQL replication.

Orchestrator’s holistic failure detection

Vitess and orchestrator both use MySQL’s asynchronous (async) or semi-synchronous replication. For the purposes of this post, the discussion is limited to async replication. In an async setup, we have one primary server and multiple replicas. The primary is the single writable server and the replicas are all read-only, mainly being used for read scale-out, backups, etc. While MySQL offers a multi-writable primaries setup, it is commonly discouraged, and Vitess does not support it (in fact, a multi-writer setup is considered a failure scenario as described later on).

The most critical and important failure scenario in an async topology is a primary’s outage. Either the primary server has crashed, or is network isolated: the result is that there are no writes on the cluster, and the replicas are left hanging with no server to replicate from.

Common failure detection practices

How does one diagnose that the primary server is healthy? A common practice is to see that port :3306 is open. More reliably, we can send a trivial query, such as SELECT 1 FROM DUAL. Even more reliably, we can query for actual information: a status variable, or actual data. All these techniques share a similar problem: what if the primary server doesn’t respond?

A naive conclusion is that the primary is down, kicking off a failover sequence. However, this may well be a false positive since there could be a network glitch. It is not uncommon to miss a communication packet once in a while, so database clients are commonly configured to retry a couple times upon error. The common way to reduce such false positives is to run multiple checks, successively: if the primary fails a health check, try again in, say, 5 seconds, and again, and again, up to n times. If the nth test still fails, we determine the server is indeed down.

Yet this approach introduces a few problems:

  • Exactly how many tests are enough?
  • Exactly what is a reasonable check interval?
  • What if the primary is really down? We have wasted n × interval seconds to double check, triple check, etc., when we could have failed over sooner.
  • What if the primary is really up, and the problem is with the network between the primary and our testing endpoint? That’s a false positive and we failed over for nothing.
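The retry scheme and its cost can be sketched in a few lines of Python. This is only an illustration of the common practice described above, not code from orchestrator or any monitoring product; the function name naive_primary_down and its parameters are hypothetical, and the sleep function is injectable so the example can be exercised without actually waiting.

```python
import time
from typing import Callable


def naive_primary_down(
    check: Callable[[], bool],
    n: int,
    interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if the initial check and all n retries fail.

    `check` returns True when the primary answered the health check.
    """
    if check():
        return False
    for _ in range(n):
        # Wait before each retry; on a real outage this adds up to
        # n * interval seconds of delay before a failover can start.
        sleep(interval)
        if check():
            return False
    return True
```

With n=3 and interval=5.0, a primary that is truly down is only declared down after 15 seconds of waiting, which is the cost named in the third bullet, while a single endpoint with a broken network path still ends up with a false positive.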

Consider the last bullet point. Some monitoring solutions run health checks from multiple endpoints, and require a quorum, an agreement of the majority of check endpoints that there is indeed a problem. This kind of setup must be used with care; the placement of the endpoints in different availability zones is critical to achieve sensible quorum results. Once that’s done, though, the triangulation is powerful and useful.
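As a minimal sketch of the quorum idea, assuming each check endpoint simply reports whether it believes the primary is down (the function name and the endpoint labels are made up for illustration):

```python
from typing import Dict


def quorum_says_down(endpoint_reports: Dict[str, bool]) -> bool:
    """endpoint_reports maps an endpoint name to True if that endpoint
    considers the primary down. A failure is declared only when a strict
    majority of endpoints agree."""
    down_votes = sum(1 for down in endpoint_reports.values() if down)
    return down_votes > len(endpoint_reports) / 2
```

For example, quorum_says_down({"az-a": True, "az-b": True, "az-c": False}) reports a failure, while a single dissenting endpoint out of three does not. The result is only as good as the placement of the endpoints, which is why the availability zones matter.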

Orchestrator’s approach

Orchestrator uses a different take on triangulation. It recognizes that there are more players in the field: the replicas. The replicas connect to the primary over MySQL protocol, and request the changelog so as to follow up on the primary’s footsteps. To evaluate a primary failure, orchestrator asks:

  • Am I failing to communicate with the primary? And,
  • Are all replicas failing to communicate with the primary?

If, for example, orchestrator is unable to reach the primary, but can reach the replicas, and the replicas are all happy and confident that they can read from the primary, then orchestrator concludes there’s no failure scenario. Possibly some of the replicas themselves are unreachable: maybe a network partitioning or some power failure took both primary and a few of the replicas. orchestrator can still reach a conclusion by the state of all available replicas.

It’s noteworthy that orchestrator itself runs in a highly available setup, across availability zones, where orchestrator requires quorum leadership so as to be able to run failovers in the first place, mitigating network isolation incidents. But this discussion is outside the scope of this post.

Orchestrator doesn’t rely on check intervals or a fixed number of tests. It needs a single observation to act. Behind the scenes, orchestrator relies on the replicas themselves to run retries in intervals; that’s how MySQL replication works anyhow, and orchestrator utilizes that.

This holistic approach, where orchestrator triangulates its own checks with the servers’ checks, results in a highly reliable detection method. Returning to our example, if orchestrator thinks the primary is down, and all the replicas say the primary is down, then a failover is justified: the replication cluster is effectively not receiving any writes, the data becomes stale, and that much is observable all the way to the users and client apps. The holistic approach further allows orchestrator to treat other scenarios: an intermediate replica (e.g. 2nd level replica in a chained replication tree) failure is detected in exactly the same way. It further offers granularity into the failure severity. orchestrator is able to tell that the primary is seen down, while replicas still disagree. Or that replicas think the primary is down while orchestrator can still see it.
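The triangulation described above can be sketched as a small decision function. This is a simplified illustration, not orchestrator’s actual analysis code: the Diagnosis names and the diagnose function are hypothetical, and a replica that orchestrator cannot reach is represented as None and left out of the vote.

```python
from enum import Enum
from typing import Dict, Optional


class Diagnosis(Enum):
    NO_PRIMARY_FAILURE = "no primary failure"
    PRIMARY_DOWN = "orchestrator and all reachable replicas agree: failover justified"
    SEEN_DOWN_REPLICAS_DISAGREE = "orchestrator cannot see primary, replicas still can"
    REPLICAS_REPORT_DOWN = "replicas cannot see primary, orchestrator still can"
    INCONCLUSIVE = "no primary and no replica reachable"


def diagnose(
    orchestrator_sees_primary: bool,
    replica_sees_primary: Dict[str, Optional[bool]],
) -> Diagnosis:
    # Only replicas that orchestrator can reach contribute an opinion.
    reachable = [sees for sees in replica_sees_primary.values() if sees is not None]
    all_replicas_fail = bool(reachable) and not any(reachable)

    if orchestrator_sees_primary:
        if all_replicas_fail:
            return Diagnosis.REPLICAS_REPORT_DOWN
        return Diagnosis.NO_PRIMARY_FAILURE
    if not reachable:
        return Diagnosis.INCONCLUSIVE
    if all_replicas_fail:
        return Diagnosis.PRIMARY_DOWN
    return Diagnosis.SEEN_DOWN_REPLICAS_DISAGREE
```

Note that a single observation is enough to classify the situation: the retries happen inside MySQL replication itself, so the function needs no interval or attempt counter.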

Emergency detection operations

If orchestrator can’t see the primary, but can see the replicas, and they still think the primary is up, should this be the end of the story?

Not quite. We may well have an actual primary outage, it’s just that the replicas haven’t realized it yet. If we wait long enough, they will eventually report the failure; but orchestrator wishes to reduce total outage time by resolving the situation as early as possible.

Orchestrator offers a few emergency detection operations, which are meant to speed up failure detection. Examples:

  • As in the above, orchestrator can’t see the primary. Emergently probe the replicas to check what they think. Normally each server is probed once in a few seconds, but orchestrator now chooses to probe sooner.
  • A first tier replica reports it can’t see the primary. The rest of the replicas are fine, and orchestrator can see the primary. This is still very suspicious, so orchestrator runs an emergency probe on the primary. If that fails, then we’re on to something, falling back to the first bullet.
  • orchestrator cannot reach the primary, replicas can all reach the primary, but lag on replicas is ever increasing. This may be a limbo scenario caused by either a locked primary, or a “too many connections” situation. The replicas are likely to be some of the oldest connections to the primary. New connections cannot reach the primary and to the app it seems down, but replicas are still connected. orchestrator can analyze that and emergently kick a replication restart on all replicas. This closes and reopens the TCP connections between replicas and primary. On locked primary or on “too many connections” scenarios, replicas are expected to fail reconnecting, leading to a normal detection of a primary outage.
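The three examples above can be read as a mapping from an observation to an emergency operation. The following sketch assumes that reading; the Emergency names and the emergency_operation function are hypothetical and do not correspond to orchestrator’s internal identifiers.

```python
from enum import Enum
from typing import List


class Emergency(Enum):
    NONE = "nothing to do"
    PROBE_REPLICAS = "probe replicas now instead of waiting for the next poll"
    PROBE_PRIMARY = "probe the primary now"
    RESTART_REPLICATION = "restart replication on all replicas"


def emergency_operation(
    orchestrator_sees_primary: bool,
    replicas_see_primary: List[bool],
    lag_increasing: bool = False,
) -> Emergency:
    all_replicas_see = bool(replicas_see_primary) and all(replicas_see_primary)
    if not orchestrator_sees_primary:
        if all_replicas_see and lag_increasing:
            # Limbo case: locked primary or "too many connections".
            # Reconnecting forces replicas to notice the outage.
            return Emergency.RESTART_REPLICATION
        return Emergency.PROBE_REPLICAS
    if not all(replicas_see_primary):
        # A replica complains while orchestrator still sees the primary.
        return Emergency.PROBE_PRIMARY
    return Emergency.NONE
```

Each operation only speeds up the collection of evidence; the failover decision itself still comes from the regular detection logic.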

Orchestrator and your replication clusters

An important observation is that orchestrator knows what your replication clusters actually look like, but doesn’t have the meta information about what they should look like. It doesn’t know if some standalone server should belong to this or that cluster; if the current primary server is indeed what’s advertised to your application; if you really intended to set up a multi-primary cluster. It is generic in that it allows a variety of topology layouts, as requested and used by the greater community.

Old Vitess-orchestrator integration

For the past few years, orchestrator was an external entity to Vitess. The two would collaborate over a few API calls. orchestrator did not have any Vitess awareness, and much of the integration was done through pre- and post- recovery hooks, shell scripts and API calls. This led to known situations where Vitess and orchestrator would compete over a failover, or make some operations unknown to each other, causing confusion. Clusters would end up in split state, or in co-primary state. The loss of a single event could cause cluster corruption.

Orchestrator as first class citizen in Vitess

We have recently integrated orchestrator into Vitess as an integral part of the Vitess infrastructure. This is a specialized fork of orchestrator that is Vitess-aware. In fact, the integrated orchestrator is able to run Vitess native functions, such as locking shards or fetching tablet information.

The integration makes orchestrator both cluster aware and goal driven.

Cluster-awareness

MySQL itself has no concept of a replication cluster (not to be confused with InnoDB cluster or MySQL Cluster): servers just happen to replicate from each other, and MySQL has no opinion on whether they should replicate from each other, or what’s the overall health and status of the replication tree. orchestrator can share observations and opinions on the replication tree, based on what it can see. Vitess, however, has a firm opinion on what it expects. In Vitess, each MySQL server has its own vttablet, an agent of sorts. The tablet knows the identity of the MySQL server: which schema it contains; which shard it is part of; what role it assumes (primary, replica, OLAP, ...) etc. The integrated orchestrator now gets all of the MySQL metadata directly from the Vitess topology server. It knows beyond doubt that two servers belong to the same cluster, not because they happen to be connected in a replication chain, but because the metadata provided by Vitess says so. orchestrator can now look at a standalone, detached server, and tell that it is, in fact, supposed to be part of some cluster.

Goal driven

This cluster awareness is a fundamental change in orchestrator’s approach, and allows us to make orchestrator goal-driven. orchestrator’s goal is to ensure a cluster is always in a state compatible with what Vitess expects it to be. This is accomplished by introducing new failure detection modes not possible before, and new recovery methods too opinionated otherwise. Examples:

  • orchestrator observes a standalone server. According to Vitess’ topology server, that server is a REPLICA. orchestrator diagnoses this as a “replica without a primary” and proceeds to connect it with the proper replication cluster, after validating that GTID-wise the operation is supported.
  • orchestrator observes a REPLICA that is writable. Vitess does not support that setup. orchestrator sets the replica to read-only.
  • Likewise, orchestrator sees that the primary is read-only. It switches it to be writable.
  • orchestrator detects a multi-primary setup (circular replication). Vitess strictly forbids this setup. orchestrator checks with the topology service which of the two is marked as the true primary, then makes the other(s) standard replicas. To emphasize the point, a multi-primary setup is considered to be a failure scenario.
  • Possibly the most intriguing scenario is where orchestrator sees a fully functional replication tree, with writable primary and read-only replicas, but notices that Vitess thinks the primary should be one of the replicas, and that the server that acts as the cluster’s primary should be a replica. This situation can result from a previously, prematurely terminated failover process. In this situation, orchestrator runs a graceful-takeover (or a planned-reparent, in Vitess jargon) to actually promote the correct server as the new primary, and to demote the “impersonator” primary.

Thus, Vitess has an opinion of what the cluster should look like, and orchestrator is the operator that makes it so. It is furthermore interesting to note that orchestrator’s operations will either fail or converge to the desired state.
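The goal-driven idea, comparing what is observed against what Vitess expects and deriving repairs, can be sketched as follows. This is a deliberately reduced illustration: it assumes a flat topology with one expected primary, covers only the first three examples above (standalone replica, writable replica, read-only primary), and leaves out the multi-primary and wrong-primary cases. The Observed type, the plan_repairs function, and the action strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Observed:
    replicating_from: Optional[str]  # None means the server is standalone
    read_only: bool


def plan_repairs(expected_primary: str, observed: Dict[str, Observed]) -> List[str]:
    """Return the repairs needed to match the expected (topology server) state."""
    actions: List[str] = []
    for name, state in sorted(observed.items()):
        if name == expected_primary:
            if state.read_only:
                actions.append(f"set {name} writable")
            continue
        # Every other server is expected to be a read-only replica.
        if state.replicating_from is None:
            # "Replica without a primary"; a real system would first
            # validate that the operation is supported GTID-wise.
            actions.append(f"attach {name} to {expected_primary}")
        if not state.read_only:
            actions.append(f"set {name} read-only")
    return actions
```

A cluster that already matches the expected state yields an empty plan, which is the sense in which repeated passes either fail or converge.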

But, what if a primary unexpectedly fails? What server should be promoted?

On an unexpected failure, it is orchestrator’s job to pick and promote the most suitable server, and to advertise its identity to Vitess. The new interaction ensures this is a converging process and that orchestrator and Vitess do not conflict with each other over who should be the primary. Orchestrator promotes a server based on multiple limiting factors: is the server configured such that it can be a primary, e.g. has binary logs enabled? Does its version match the other replicas? What are the general recommendations for the specific host (metadata acquired from Vitess)? But there are also general, non server-specific rules that dictate what promotions are possible. Do we strictly have to only failover within the same data center? The same region/availability zone? Or, do we strictly have to only failover outside the data center? Do we only ever failover onto a server configured as semi-sync replica? And how do we reconfigure the cluster after promotion?
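To make the two kinds of constraints concrete, here is a minimal sketch that filters promotion candidates first by server-specific checks and then by cluster-wide rules. The Candidate and Policy types, their fields, and the eligible function are hypothetical and chosen only to mirror the questions above; they are not the policy interface of orchestrator or Vitess.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    binlog_enabled: bool
    version: str
    data_center: str
    semi_sync: bool = False


@dataclass(frozen=True)
class Policy:
    expected_version: Optional[str] = None  # None: do not check versions
    same_dc_only: bool = False
    other_dc_only: bool = False
    require_semi_sync: bool = False


def eligible(candidates: List[Candidate], failed_dc: str, policy: Policy) -> List[str]:
    """Return names of candidates that pass every check, in input order."""
    names = []
    for c in candidates:
        # Server-specific limits.
        if not c.binlog_enabled:
            continue
        if policy.expected_version is not None and c.version != policy.expected_version:
            continue
        # General, non server-specific rules.
        if policy.same_dc_only and c.data_center != failed_dc:
            continue
        if policy.other_dc_only and c.data_center == failed_dc:
            continue
        if policy.require_semi_sync and not c.semi_sync:
            continue
        names.append(c.name)
    return names
```

Expressing the rules as a single policy value, rather than as scattered configuration variables, is one way to read the idea of a failover policy that is described in code.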

Previously, some of these questions were answered by configuration variables, and some by the user’s infrastructure. However, the new integration allows the user to choose a failover and recovery policy that is described in code. Orchestrator and Vitess already support three pre-configured modes, but will also allow the user to define any arbitrary (within a set of rules) policy they may choose.

More on that in a future post.