Public postmortem: email delays

What happened?

TL;DR

Our asynchronous job processor, workerscheduler, which runs our scheduled jobs, was completely down for about six hours. The most significant impacts were:

Outbound emails couldn’t be sent
Scheduled (cron) jobs weren’t running

While other functionality was affected, these were by far the most critical.

Why did it happen?

At a high level, our async job scheduling works like this (using standard RQ):

Job enqueueing: We serialize the method name, the arguments to pass to that method, and a timestamp, then store all of that in Redis.
Job execution: To find a job to execute, we pull all potential jobs, sort them by timestamp, and start running whichever is ready.
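To make the two steps concrete, here is a minimal sketch that uses an in-memory dict as a stand-in for Redis. It is not RQ's actual implementation: the names `store`, `enqueue`, `next_ready_job`, and `run_at` are hypothetical and exist only for illustration. The thing to notice is that finding the next job deserializes every stored payload, arguments included.

```python
import json
import time
from typing import Optional

# In-memory stand-in for Redis (assumption): job ID -> serialized payload.
store: dict = {}


def enqueue(job_id: str, method_name: str, arguments: list, run_at: float) -> None:
    """Serialize the method name, arguments, and timestamp, then store them."""
    store[job_id] = json.dumps(
        {"method_name": method_name, "arguments": arguments, "run_at": run_at}
    )


def next_ready_job(now: float) -> Optional[dict]:
    """Deserialize every stored job, sort by timestamp, and return the first due one."""
    # Every payload is loaded in full, including its arguments.
    jobs = [json.loads(raw) for raw in store.values()]
    jobs.sort(key=lambda job: job["run_at"])
    for job in jobs:
        if job["run_at"] <= now:
            return job
    return None


now = time.time()
enqueue("a", "path.to.module.send_email", ["1", ["penelope@buttondown.com"]], now + 300)
enqueue("b", "path.to.module.send_email", ["2", ["telemachus@buttondown.com"]], now - 5)

ready = next_ready_job(now)
print(ready["method_name"], ready["arguments"][0])  # path.to.module.send_email 2
```

The cost of `next_ready_job` grows with the total size of all stored payloads, not with the number of jobs that are actually due.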

The trouble arises with the arguments we pass in—especially when they're large objects.

Consider this simplified example job:

class Email:
    id: str
    subject: str
    body: str

@job('five_minutes')
def send_email(email: Email, recipients: list[str]):
    for recipient in recipients:
        send_email_to_recipient(email, recipient)

When enqueuing, it gets serialized like:

{
  "method_name": "path.to.module.send_email",
  "arguments": [
    {
      "class": "path.to.module.Email",
      "id": "1",
      "subject": "Hi there!",
      "body": "How are you doing?"
    },
    [
      "penelope@buttondown.com",
      "telemachus@buttondown.com"
    ]
  ]
}

While this example is simplified, real email jobs are much heavier. Our Email object has over sixty fields—including multiple versions of the email body—and some emails are massive (over 5MB in memory).

In practice, when a newsletter goes out to, say, 30,000 subscribers, recipients get batched. At 100 recipients per batch, that is 300 batches, and therefore 300 jobs. If each serialized job embeds a 5MB email object, that's 1.5GB of data in Redis, and workerscheduler attempts to load and deserialize all of it just to check which job should run next.

This is exactly what happened: our Heroku dyno running workerscheduler has a 512MB memory cap. Trying to pull the job queue caused it to run out of memory (OOM), crash, restart, and repeat the cycle indefinitely.
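As a back-of-the-envelope check, the numbers above can be worked through in a few lines. This only restates the figures from the text; it ignores the recipient lists and any deserialization overhead, so the real footprint would be somewhat higher.

```python
SUBSCRIBERS = 30_000
RECIPIENTS_PER_BATCH = 100
EMAIL_SIZE_MB = 5            # size of one embedded Email object, per the text
DYNO_MEMORY_CAP_MB = 512     # memory cap of the workerscheduler dyno

jobs = SUBSCRIBERS // RECIPIENTS_PER_BATCH   # one job per batch
queue_size_mb = jobs * EMAIL_SIZE_MB         # every job embeds its own copy of the email
overshoot = queue_size_mb / DYNO_MEMORY_CAP_MB

print(f"{jobs} jobs, {queue_size_mb} MB queued, {overshoot:.1f}x the memory cap")
# 300 jobs, 1500 MB queued, 2.9x the memory cap
```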

What’s the right approach?

Don’t serialize large objects! Instead, serialize only what’s needed, like the object’s ID, and rehydrate from the database inside the job:

class Email:
    id: str
    subject: str
    body: str

@job('five_minutes')
def send_email(email_id: str, recipients: list[str]):
    email = fetch_email_from_db(email_id)
    for recipient in recipients:
        send_email_to_recipient(email, recipient)

Now, the serialization looks like:

{
  "method_name": "path.to.module.send_email",
  "arguments": [
    "1",
    [
      "penelope@buttondown.com",
      "telemachus@buttondown.com"
    ]
  ]
}
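To get a feel for the difference, the sketch below serializes both payload shapes with `json` and compares their sizes. The roughly 5MB body is a made-up placeholder, and real jobs are serialized by RQ rather than with `json.dumps`, so treat the output as an order-of-magnitude illustration only.

```python
import json

recipients = ["penelope@buttondown.com", "telemachus@buttondown.com"]

# Hypothetical large email: a placeholder body of roughly 5 MB of text.
email = {
    "class": "path.to.module.Email",
    "id": "1",
    "subject": "Hi there!",
    "body": "x" * 5_000_000,
}

# Payload that embeds the whole object versus payload that carries only the ID.
by_value = json.dumps(
    {"method_name": "path.to.module.send_email", "arguments": [email, recipients]}
)
by_id = json.dumps(
    {"method_name": "path.to.module.send_email", "arguments": [email["id"], recipients]}
)

size_by_value = len(by_value.encode("utf-8"))
size_by_id = len(by_id.encode("utf-8"))

print(f"by value: {size_by_value:,} bytes")
print(f"by ID:    {size_by_id:,} bytes")
```

The by-ID payload stays the same size no matter how large the email is, at the cost of one database read inside each job.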

This is our pattern almost everywhere—except in one place, the per-domain rate limiting logic, which unfortunately is exactly what triggered the meltdown.

How did we fix it?

The short-term solution was to clear out the excessively large jobs—using the extra memory available on my laptop:

import django_rq
from rq.job import Job

STRING_OF_JOB_TO_REMOVE = "send_email_to"
QUEUE_NAME = "five_minutes"

queue = django_rq.get_queue(QUEUE_NAME)
registry = queue.scheduled_job_registry

# Load every scheduled job; this is the step that needs more memory than the dyno has.
jids = registry.get_job_ids()
jobs = Job.fetch_many(jids, connection=django_rq.get_connection(QUEUE_NAME))

for job in jobs:
    if job is None:  # fetch_many returns None for IDs with no stored job data
        continue
    print(job.description)
    if STRING_OF_JOB_TO_REMOVE in job.description:
        print("Removing!")
        registry.remove(job.id)

After removing the problematic jobs and re-running some processes to resume normal flow, everything was back up (with a backlog to process, but operational).

Why didn’t we catch it sooner?

Nearly all of our observability depends on cron-like scheduled jobs. When crons are down, so are our monitoring tools, leaving us blind. That’s a lesson learned.

How are we preventing a recurrence?

Code Fix: The problematic rate limiting path now loads emails by ID instead of serializing the entire object.
Observability: We now rely on Better Stack for monitoring, so our visibility doesn’t depend exclusively on our own infra. Notably, we get paged if no cron jobs run for five minutes—this would have immediately surfaced the problem.
Tooling: We’ve built internal tools to analyze the backlog and improve diagnostics.
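The paging rule in the observability item boils down to a staleness check (sometimes called a dead man's switch). Here is a minimal sketch of that logic; it is not Better Stack's API, and the names `STALE_AFTER` and `crons_are_stale` are hypothetical. For the check to be useful, it has to run somewhere that does not depend on the cron infrastructure it is watching.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=5)  # threshold taken from the text


def crons_are_stale(last_cron_run: datetime, now: datetime) -> bool:
    """Return True if no cron job has run within the threshold."""
    return now - last_cron_run > STALE_AFTER


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(crons_are_stale(now - timedelta(minutes=2), now))  # False
print(crons_are_stale(now - timedelta(minutes=6), now))  # True
```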

While it only took about 30 minutes to locate the issue, we're working to make detection (and resolution) even faster next time.
