惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

H
Heimdal Security Blog
小众软件
小众软件
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
罗磊的独立博客
Google DeepMind News
Google DeepMind News
大猫的无限游戏
大猫的无限游戏
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
Hugging Face - Blog
Hugging Face - Blog
阮一峰的网络日志
阮一峰的网络日志
A
About on SuperTechFans
宝玉的分享
宝玉的分享
博客园 - 聂微东
月光博客
月光博客
Cyberwarzone
Cyberwarzone
Microsoft Security Blog
Microsoft Security Blog
V
Visual Studio Blog
Project Zero
Project Zero
T
Tor Project blog
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
L
LINUX DO - 最新话题
博客园 - 叶小钗
Recent Commits to openclaw:main
Recent Commits to openclaw:main
Attack and Defense Labs
Attack and Defense Labs
Spread Privacy
Spread Privacy
Forbes - Security
Forbes - Security
Simon Willison's Weblog
Simon Willison's Weblog
N
Netflix TechBlog - Medium
P
Proofpoint News Feed
Engineering at Meta
Engineering at Meta
Hacker News: Ask HN
Hacker News: Ask HN
I
InfoQ
M
MIT News - Artificial intelligence
AI
AI
博客园 - 三生石上(FineUI控件)
W
WeLiveSecurity
C
Check Point Blog
The Hacker News
The Hacker News
C
Cyber Attacks, Cyber Crime and Cyber Security
Application and Cybersecurity Blog
Application and Cybersecurity Blog
T
Tenable Blog
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
The Cloudflare Blog
Blog — PlanetScale
Blog — PlanetScale
美团技术团队
D
Darknet – Hacking Tools, Hacker News & Cyber Security
GbyAI
GbyAI
Hacker News - Newest:
Hacker News - Newest: "LLM"
腾讯CDC
K
Kaspersky official blog

Blog — PlanetScale

Keeping a Postgres queue healthy — PlanetScale Patterns for Postgres Traffic Control — PlanetScale Graceful degradation in Postgres — PlanetScale High memory usage in Postgres is good, actually — PlanetScale Stripe Projects partnership: Provision PlanetScale Postgres and MySQL databases from the Stripe CLI — PlanetScale Enhanced tagging in Postgres Query Insights — PlanetScale Behind the scenes: How Database Traffic Control works — PlanetScale Introducing Database Traffic Control — PlanetScale Scaling Postgres connections with PgBouncer — PlanetScale Drizzle joins PlanetScale — PlanetScale Video Conferencing with Postgres — PlanetScale Faster PlanetScale Postgres connections with Cloudflare Hyperdrive — PlanetScale Introducing the PlanetScale MCP server — PlanetScale Database Transactions — PlanetScale Automating our changelog with Cursor commands — PlanetScale Postgres 18 is now available — PlanetScale Using MotherDuck with PlanetScale — PlanetScale $50 PlanetScale Metal is GA for Postgres — PlanetScale AI-Powered Postgres index suggestions — PlanetScale $5 PlanetScale is live — PlanetScale Announcing Vitess 23 — PlanetScale $50 PlanetScale Metal — PlanetScale Report on our investigation of the 2025-10-20 incident in AWS us-east-1 — PlanetScale $5 PlanetScale — PlanetScale Benchmarking Postgres 17 vs 18 — PlanetScale Larger than RAM Vector Indexes for Relational Databases — PlanetScale Partnering with Cloudflare to bring you the fastest globally distributed applications — PlanetScale Processes and Threads — PlanetScale PlanetScale for Postgres is now GA — PlanetScale Postgres High Availability with CDC — PlanetScale Announcing Neki — PlanetScale Caching — PlanetScale The principles of extreme fault tolerance — PlanetScale Announcing PlanetScale for Postgres — PlanetScale Benchmarking Postgres — PlanetScale Announcing Vitess 22 — PlanetScale The Real Failure Rate of EBS — PlanetScale IO devices and latency — PlanetScale Announcing PlanetScale Metal — PlanetScale PlanetScale Metal: There’s no replacement for displacement — PlanetScale Upgrading Query Insights to Metal — PlanetScale Automating cherry-picks between OSS and private forks — PlanetScale Database Sharding — PlanetScale Anatomy of a Throttler, part 3 — PlanetScale Introducing sharding on PlanetScale with workflows — PlanetScale Announcing Vitess 21 — PlanetScale Announcing the PlanetScale vectors public beta — PlanetScale Anatomy of a Throttler, part 2 — PlanetScale Instant deploy requests — PlanetScale Anatomy of a Throttler, part 1 — PlanetScale Increase IOPS and throughput with sharding — PlanetScale Tracking index usage with Insights — PlanetScale Faster backups with sharding — PlanetScale Building data pipelines with Vitess — PlanetScale The State of Online Schema Migrations in MySQL — PlanetScale Optimizing aggregation in the Vitess query planner — PlanetScale Dealing with large tables — PlanetScale Announcing Vitess 20 — PlanetScale Self-managed Vitess vs Managed Vitess with PlanetScale — PlanetScale Achieving data consistency with the consistent lookup Vindex — PlanetScale The MySQL adaptive hash index — PlanetScale Introducing global replica credentials — PlanetScale Profiling memory usage in MySQL — PlanetScale Summer 2023: Fuzzing Vitess at PlanetScale — PlanetScale How PlanetScale makes schema changes — PlanetScale Identifying and profiling problematic MySQL queries — PlanetScale The Problem with Using a UUID Primary Key in MySQL — PlanetScale Announcing Vitess 19 — PlanetScale PlanetScale forever — PlanetScale Introducing schema recommendations — PlanetScale Amazon Aurora Pricing: The many surprising costs of running an Aurora database — PlanetScale Three common MySQL database design mistakes — PlanetScale OAuth applications are now available to everyone — PlanetScale Deprecating the Scaler plan — PlanetScale PlanetScale branching vs. Amazon Aurora blue/green deployments — PlanetScale Databases at scale — PlanetScale Considerations for building a database disaster recovery plan — PlanetScale Working with Geospatial Features in MySQL — PlanetScale PlanetScale vs Amazon Aurora replication — PlanetScale Introducing the Vantage and PlanetScale integration — PlanetScale MySQL isolation levels and how they work — PlanetScale Introducing the schemadiff command line tool — PlanetScale $ pscale ping — PlanetScale Announcing foreign key constraints support — PlanetScale The challenges of supporting foreign key constraints — PlanetScale What is HTAP? — PlanetScale Introducing Insights Anomalies — PlanetScale Webhook security: a hands-on guide — PlanetScale MySQL replication: Best practices and considerations — PlanetScale A guide to HTML email with Ruby on Rails and Tailwind CSS — PlanetScale Sharding for cost-effective database management — PlanetScale PlanetScale ranks 188th in Deloitte’s top 500 fastest-growing companies — PlanetScale Announcing the Fivetran integration — PlanetScale Introducing webhooks — PlanetScale What is MySQL replication and when should you use it? — PlanetScale Sync user data between Clerk and a PlanetScale MySQL database — PlanetScale Introducing database reports — PlanetScale Distributed caching systems and MySQL — PlanetScale What is MySQL partitioning? — PlanetScale MySQL High Availability: Connection handling and concurrency — PlanetScale
How we made PlanetScale’s background jobs self-healing — PlanetScale
Mike Coutermarsh · 2022-02-17 · via Blog — PlanetScale

Mike Coutermarsh |

When building PlanetScale, we knew that we would be using background jobs extensively for tasks like creating databases, branching, and deploying schema changes.

Early on, we decided we had two hard requirements for this system.

  1. If we lose all data in the queues at any time, we can recover without any loss in functionality.
  2. If a single job fails, it will be automatically re-run.

These requirements came about from our past experience with background job systems. We knew that it was "not if, but when" that we’d hit failure scenarios and need to recover from them.

Our background jobs perform critical actions for our users, such as rolling out schema changes to their production databases. This is an action that cannot fail, and if it does, we need to recover quickly.

Our stack

The PlanetScale UI (Next.js app) is backed by a Ruby on Rails API. All of our Rails background jobs are run on Sidekiq.

Much of what you’ll read here is Sidekiq specific (with code samples) but can be applied to most any background queueing system.

Scheduler jobs running on a cron

For each job we have, we set up another job whose responsibility is to schedule it to run.

This is the core design decision that allows us to be self-healing. If for any reason we lost a job, or even the entire queue, the scheduling jobs will re-queue anything that was missed.

Diagram showing triggering several BackupJobs

We’ve put this decision to the test a couple of times already this year. We were able to dump our queues entirely without impacting our user experience.

Storing state in the database

When using a background job system, a common pattern is to let jobs only be queued by a user action.

For example: In our application, a user can create a new database. This action enqueues a background job to do all the setup.

What if something goes wrong with that job? What if Redis failed and we lost everything in our queues? These were scenarios that worried us.

The solution is storing the state in our PlanetScale database. When creating a database for a user, we also create a record in our databases table immediately. This record starts with a state set to pending.

This allows us to have a scheduled job that runs once a minute and checks if any databases are in a pending state. If they are, that triggers the creation job to get enqueued again:

class ScheduleDatabaseJobs < BaseJob
  sidekiq_options queue: "background"

  def perform
    Database.pending.find_each do |database|
      DatabaseCreationJob.perform_async(database.id)
    end
  end
end

As long as the scheduler job is running, we could dump our entire queue and still recover.

Disabling schedules with feature flags

We added the ability to stop our scheduled jobs from running at any time. This has come in useful during an incident where we’ve wanted control over a specific job type.

To do this, we added middleware that checks a feature flag for each job type. If the flag is enabled, we skip running the job:

# lib/sidekiq_middleware/scheduled_jobs_flipper.rb
module SidekiqMiddleware
  class ScheduledJobsFlipper
    def call(worker_class, job, queue, redis_pool)
      # return false/nil to stop the job from going to redis
      klass = worker_class.to_s
      if BaseJob::SCHEDULED_JOBS.key?(klass) && Flipper.enabled?("disable_#{klass.underscore.to_sym}")
        return false
      end

      yield
    end
  end
end
# initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.client_middleware do |chain|
    chain.add(SidekiqMiddleware::ScheduledJobsFlipper)
  end
end

Bulk scheduling jobs

Scheduling jobs one-by-one works well when you only have a few thousand. Once our app grew, we noticed we were spending a lot of time sending individual Redis requests. Each request is very fast, but requests add up when run thousands of times.

To improve this, we started bulk scheduling jobs:

# Job is set to run every 5 minutes
class ScheduleBackupJobs < BaseJob
  sidekiq_options queue: "background"

  def perform
    # Schedule backup jobs in batches of 1,000
    BackupPolicy.needs_to_run.in_batches do |backup_policies|
      BackupJob.perform_bulk(backup_policies.pluck(:id))
    end
  end
end

Adding jitter to job scheduling

We don’t want certain types of jobs all running at once because they may overwhelm an external API. In those cases, we spread them out over a time period using perform_with_jitter. Below is a custom method we added to our job classes to handle this:

# This will run sometime in the next 30 minutes
CleanUpJob.perform_with_jitter(id, max_wait: 30.minutes)
# app/jobs/application_job.rb
MAX_WAIT = 1.minute
def self.perform_with_jitter(*args, **options)
  max_wait = options[:max_wait] || MAX_WAIT
  min_wait = options[:min_wait] || 0.seconds
  random_wait = rand(min_wait...max_wait)
  set(wait: random_wait).perform_later(*args)
end

Handling uniqueness

With the way we schedule jobs, it’s possible for us to get multiple of the same job in our queues. We handle this in a few ways.

1. Exit quickly

We store state in our database and quickly exit a job if it no longer needs to be run:

def perform(id)
  user = user.find(id)
  return unless user.pending?
  # ...
end

2. Use database locks

We avoid race conditions, such as when multiple jobs are updating the same data at once:

backup.with_lock do
  backup.restore_from_backup!
end

3. Use sidekiq unique jobs

Sidekiq Enterprise includes the ability to have unique jobs. This will stop a duplicate job from ever being enqueued:

class CheckDeploymentStatusJob < BaseJob
  sidekiq_options queue: "urgent", retry: 5, unique_for: 1.minute, unique_until: :start
  #...
end

Learn more