Public postmortem: email delays
Justin Duke · 2024-08-16

What happened?

Our asynchronous job processor, workerscheduler, which is responsible for running scheduled asynchronous jobs, was completely down for about six hours. The most significant impacts were:

  • Outbound emails couldn’t be sent
  • Scheduled (cron) jobs weren’t running

While other functionality was affected, these were by far the most critical.


Why did it happen?

At a high level, our async job scheduling works like this (using standard RQ):

  1. Job enqueueing: We serialize the method name, the arguments to pass to that method, and a timestamp, then store all of that in Redis.
  2. Job execution: To find a job to execute, we pull all potential jobs, sort them by timestamp, and start running whichever is ready.
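
The two steps above can be sketched with an in-memory dictionary standing in for Redis. This is a minimal illustration, not RQ's actual implementation: store, enqueue, and next_ready_job are hypothetical names, and JSON stands in for whatever serializer is really used. What it shows is that finding the next job means deserializing every payload, so the cost grows with the total size of all queued arguments.

```python
import json
from typing import Optional

# Stand-in for Redis (assumption for illustration): job id -> serialized payload.
store: dict = {}


def enqueue(job_id: str, method_name: str, arguments: list, run_at: float) -> None:
    """Serialize the method name, arguments, and timestamp, then store them."""
    store[job_id] = json.dumps(
        {"method_name": method_name, "arguments": arguments, "run_at": run_at}
    )


def next_ready_job(now: float) -> Optional[dict]:
    """Deserialize every stored job, sort by timestamp, return the earliest if it is due."""
    # Every payload, including its arguments, is held in memory at once here.
    jobs = [json.loads(payload) for payload in store.values()]
    jobs.sort(key=lambda job: job["run_at"])
    if jobs and jobs[0]["run_at"] <= now:
        return jobs[0]
    return None


enqueue("a", "path.to.module.send_email", ["1"], run_at=20.0)
enqueue("b", "path.to.module.send_email", ["2"], run_at=10.0)
print(next_ready_job(now=15.0)["arguments"])  # prints: ['2']
```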

The trouble arises with the arguments we pass in—especially when they're large objects.

Consider this simplified example job:

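from dataclasses import dataclass
from django_rq import job  # assumption: the decorator comes from django_rq

@dataclass
class Email:
    id: str
    subject: str
    body: str

@job('five_minutes')
def send_email(email: Email, recipients: list[str]):
    # The whole Email object is an argument, so it ends up in the job payload.
    for recipient in recipients:
        send_email_to_recipient(email, recipient)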

When enqueuing, it gets serialized like:

{
  "method_name": "path.to.module.send_email",
  "arguments": [
    {
      "class": "path.to.module.Email",
      "id": "1",
      "subject": "Hi there!",
      "body": "How are you doing?"
    },
    [
      "penelope@buttondown.com",
      "telemachus@buttondown.com"
    ]
  ]
}

While this example is simplified, real email jobs are much heavier. Our Email object has over sixty fields—including multiple versions of the email body—and some emails are massive (over 5MB in memory).

In practice, when a newsletter goes out to, say, 30,000 subscribers, recipients get batched. For example: 30,000 recipients ÷ 100 recipients per batch = 300 batches, and each batch is one job. If each serialized job embeds a 5MB email object, that's 1.5GB of data in Redis, and workerscheduler attempts to load and deserialize all of it just to check which job should run next.

This is exactly what happened: our Heroku dyno running workerscheduler has a 512MB memory cap. Trying to pull the job queue caused it to OOM and crash, then restart, and repeat indefinitely.
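
The arithmetic behind those numbers can be checked in a few lines. This is only a back-of-the-envelope sketch: queue_payload_bytes is a hypothetical helper, megabytes are treated as decimal (1MB = 1,000,000 bytes), and it counts just the embedded email objects, ignoring recipient lists and any serialization overhead.

```python
import math

MB = 1_000_000  # decimal megabytes, an assumption for this estimate
DYNO_CAP_BYTES = 512 * MB


def queue_payload_bytes(subscribers: int, batch_size: int, email_bytes: int):
    """Return (number of jobs, total bytes of email objects embedded in those jobs)."""
    jobs = math.ceil(subscribers / batch_size)
    return jobs, jobs * email_bytes


jobs, total = queue_payload_bytes(subscribers=30_000, batch_size=100, email_bytes=5 * MB)
print(jobs, total / 1_000_000_000)  # prints: 300 1.5
print(total > DYNO_CAP_BYTES)       # prints: True
```

Under these assumptions, a single large send puts roughly three times the dyno's memory cap into the queue.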


What’s the right approach?

Don’t serialize large objects! Instead, serialize only what’s needed, like the object’s ID, and rehydrate from the database inside the job:

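from dataclasses import dataclass
from django_rq import job  # assumption: the decorator comes from django_rq

@dataclass
class Email:
    id: str
    subject: str
    body: str

@job('five_minutes')
def send_email(email_id: str, recipients: list[str]):
    # Only the ID travels in the job payload; the full object is loaded here.
    email = fetch_email_from_db(email_id)
    for recipient in recipients:
        send_email_to_recipient(email, recipient)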

Now, the serialization looks like:

{
  "method_name": "path.to.module.send_email",
  "arguments": [
    "1",
    [
      "penelope@buttondown.com",
      "telemachus@buttondown.com"
    ]
  ]
}
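
To make the difference concrete, the two payload shapes can be measured side by side. This is a rough sketch: serialized_size is a hypothetical helper, JSON stands in for the real serializer, and the 5MB body is synthetic filler rather than a real email.

```python
import json


def serialized_size(arguments: list) -> int:
    """Size in bytes of a job payload carrying the given arguments."""
    payload = {"method_name": "path.to.module.send_email", "arguments": arguments}
    return len(json.dumps(payload).encode("utf-8"))


recipients = ["penelope@buttondown.com", "telemachus@buttondown.com"]
big_email = {
    "class": "path.to.module.Email",
    "id": "1",
    "subject": "Hi there!",
    "body": "x" * 5_000_000,  # synthetic stand-in for a very large email body
}

by_value = serialized_size([big_email, recipients])  # whole object embedded
by_id = serialized_size(["1", recipients])           # only the ID embedded
print(by_value, by_id)
```

The by-ID payload stays at a couple of hundred bytes at most no matter how large the email is, so the queue's memory footprint no longer depends on email size.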

This is our pattern almost everywhere—except in one place, the per-domain rate limiting logic, which unfortunately is exactly what triggered the meltdown.


How did we fix it?

The short-term solution was to clear out the excessively large jobs—using the extra memory available on my laptop:

import django_rq
from rq.job import Job

STRING_OF_JOB_TO_REMOVE = "send_email_to"
QUEUE_NAME = "five_minutes"
queue = django_rq.get_queue(QUEUE_NAME)
registry = queue.scheduled_job_registry
jids = registry.get_job_ids()
# Fetching loads every job's payload into memory, hence the laptop.
jobs = Job.fetch_many(jids, connection=django_rq.get_connection(QUEUE_NAME))

for job in jobs:
    if job is None:  # fetch_many returns None for IDs whose job data is gone
        continue
    print(job.description)
    if STRING_OF_JOB_TO_REMOVE in job.description:
        registry.remove(job.id)
        print("Removing!")

After removing the problematic jobs and re-running some processes to resume normal flow, everything was back up (with a backlog to process, but operational).


Why didn’t we catch it sooner?

Nearly all of our observability depends on cron-like scheduled jobs. When crons are down, so are our monitoring tools, leaving us blind. That’s a lesson learned.


How are we preventing a recurrence?

  • Code Fix: The problematic rate limiting path now loads emails by ID instead of serializing the entire object.
  • Observability: We now rely on Better Stack for monitoring, so our visibility doesn’t depend exclusively on our own infra. Notably, we get paged if no cron jobs run for five minutes—this would have immediately surfaced the problem.
  • Tooling: We’ve built internal tools to analyze the backlog and improve diagnostics.
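
The "page if no cron jobs run for five minutes" rule is a dead man's switch, and its core check is small. The sketch below is a generic illustration, not Better Stack's API or our actual configuration: crons_look_dead and STALE_AFTER are hypothetical names, and the last-run timestamp is assumed to be reported by the cron jobs themselves to something outside the job system.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(minutes=5)  # threshold taken from the text above


def crons_look_dead(last_cron_run: Optional[datetime], now: datetime) -> bool:
    """True if no cron job has reported in within STALE_AFTER."""
    if last_cron_run is None:
        return True  # nothing has ever reported: treat as an outage
    return now - last_cron_run > STALE_AFTER


now = datetime(2024, 8, 16, 12, 0, tzinfo=timezone.utc)
print(crons_look_dead(now - timedelta(minutes=6), now))  # prints: True
```

The important design point is where the check runs: it has to live outside the scheduler it watches, otherwise it goes silent during exactly the outage it is meant to report.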

Although it took only about 30 minutes to locate the issue, we're working to make detection (and resolution) even faster next time.