Retries Done Right: Jitter, Budgets, and the Stampede You Did Not See Coming

The Practical Developer


The Practica · 2026-05-04 · via The Practical Developer

A downstream service hiccups for 800ms. A few seconds later your dashboards show a clean recovery. Two minutes later, the same service goes hard down, except this time it stays down for 40 minutes, takes three other services with it, and the post-mortem points the finger at a config change that did nothing wrong.

What happened is the most boring incident in distributed systems: every client retried, every retry hit the same already-overloaded service, every failure caused another retry, and the synchronized wave of req → fail → retry from a few thousand callers produced more load than the service had ever handled in a normal day. The hiccup was a hiccup. The retries were the outage.

This pattern, the retry storm, is in every “self-healing” client people add the week after their first 5xx. It is also entirely avoidable. The fix is not “fewer retries.” It is retries that understand they are part of a fleet: exponential backoff with proper jitter, a retry budget that caps the blast radius, a deadline that travels with the request, and a short list of errors that are worth retrying at all. Together those four things take maybe sixty lines of TypeScript and they are the difference between a 503 that recovers in five seconds and an outage that fills your weekend.

Mistake #1: retrying everything

The first mistake is treating “the call failed” as a single category. It is not. There are at least four classes of failure, and only one of them benefits from retrying.

// Bad: retry everything that throws.
async function call(url: string) {
  for (let i = 0; i < 3; i++) {
    try {
      return await fetch(url);
    } catch (err) {
      // ... retry?
    }
  }
}

400 Bad Request is not retryable; the input is wrong, retrying will just send the same wrong input again. 401 Unauthorized is not retryable until you refresh the token. 404 is definitive. 409 Conflict usually means a domain rule rejected the request and retrying would either no-op or repeat the conflict.

The set of errors actually worth retrying is small:

Transient network errors: ECONNRESET, ETIMEDOUT, EAI_AGAIN, socket hang up. The connection itself failed.
502, 503, 504: the service or its upstream is unavailable or overloaded.
429 Too Many Requests: but only after honoring the Retry-After header.
408 Request Timeout: server-side timeout.
Idempotent 5xx: retry only when the operation is safe to repeat (more on idempotency below).

Anything else, fail fast. This sounds aggressive until you remember: a non-retryable error retried five times is just six errors instead of one, plus longer client latency, plus five wasted requests against a service that already told you “no.”

type RetryDecision = 'retry' | 'fail';

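function classify(err: unknown, res?: Response): RetryDecision {
  // Use the explicit Response if given; otherwise fall back to a status
  // attached to the thrown error (as chargeCard does further down).
  const e = err as { status?: number; code?: string; cause?: { code?: string } } | null;
  const status = res?.status ?? e?.status;
  if (status !== undefined) {
    if (status === 429 || status === 408) return 'retry';
    if (status >= 500 && status <= 599) return 'retry';
    return 'fail';
  }
  // Node's fetch wraps the socket error in a TypeError, so check err.cause too.
  const code = e?.code ?? e?.cause?.code;
  if (code === 'ECONNRESET' || code === 'ETIMEDOUT' ||
      code === 'EAI_AGAIN' || code === 'ECONNREFUSED' ||
      code === 'UND_ERR_SOCKET') return 'retry';
  return 'fail';
}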

This function is the entire policy. Everything else is mechanics.

Mistake #2: retrying without backoff

The naive loop is the one in every tutorial:

for (let i = 0; i < 5; i++) {
  try { return await call(); }
  catch { /* try again immediately */ }
}

If the downstream is overloaded, this is a way to send five requests in 50ms instead of one in 50ms. Multiply that by ten thousand clients and the downstream service’s “recovery window” is now a tighter feedback loop than it was before.

The minimum bar is exponential backoff: double the wait between attempts.

const base = 100; // ms
for (let attempt = 0; attempt < 5; attempt++) {
  try { return await call(); }
  catch (err) {
    if (classify(err) === 'fail') throw err;
    await sleep(base * 2 ** attempt); // 100, 200, 400, 800, 1600
  }
}

This helps. It does not solve the real problem.

Mistake #3: backoff without jitter

If every client uses the same backoff schedule, every client retries at the same time. The downstream sees a smooth load curve under steady state and a series of synchronized spikes during a partial outage. The spikes are bigger than steady-state load and they line up perfectly with the moments the service is most fragile.

The fix is jitter, adding randomness to the wait. There are three flavors people argue about; only one is right for almost every case.

Full jitter: pick a random value between 0 and the current cap.

function fullJitter(attempt: number, base = 100, cap = 30_000) {
  const exp = Math.min(cap, base * 2 ** attempt);
  return Math.random() * exp;
}

Equal jitter: half deterministic, half random.

function equalJitter(attempt: number, base = 100, cap = 30_000) {
  const exp = Math.min(cap, base * 2 ** attempt);
  return exp / 2 + Math.random() * (exp / 2);
}

Decorrelated jitter: each wait is computed from the previous wait, not from the attempt number.

function decorrelatedJitter(prev: number, base = 100, cap = 30_000) {
  return Math.min(cap, base + Math.random() * (prev * 3 - base));
}
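Because decorrelated jitter feeds on its own previous output, the caller has to carry the last wait from one attempt to the next. A minimal sketch of that bookkeeping follows; it repeats the function above so the snippet runs on its own, and `decorrelatedSchedule` is a hypothetical helper name used only for illustration.

```typescript
function decorrelatedJitter(prev: number, base = 100, cap = 30_000): number {
  return Math.min(cap, base + Math.random() * (prev * 3 - base));
}

/** Hypothetical helper: the first `attempts` waits, each seeded by the one before. */
function decorrelatedSchedule(attempts: number, base = 100, cap = 30_000): number[] {
  const waits: number[] = [];
  let prev = base; // so the first wait is drawn from [base, 3 * base)
  for (let i = 0; i < attempts; i++) {
    prev = decorrelatedJitter(prev, base, cap);
    waits.push(prev);
  }
  return waits;
}

const schedule = decorrelatedSchedule(6);
console.log(schedule.map((ms) => Math.round(ms)));
```

Every wait stays between `base` and `cap`, and no wait is more than three times the one before it, so the schedule grows on average without any two clients sharing the same sequence.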

The AWS Architecture Blog post that introduced these compared them on a simulated overload scenario and the practical result is well-replicated: full jitter and decorrelated jitter both flatten the spike effectively. Decorrelated jitter has slightly better worst-case latency under heavy contention because it does not correlate to the attempt counter when many clients have been retrying for a while. For most services, full jitter is fine; for services where you expect a long tail of retrying clients (auth services, payment processors, anything with a global tail), pick decorrelated.

What you should never do is the deterministic schedule with no randomness, or “jitter” implemented as ± 10% of the deterministic value. The first creates the spike. The second narrows it without flattening it.
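To make the difference between a flattened spike and a merely narrowed one concrete, here is a rough simulation. It assumes 1,000 clients that all failed at the same instant and are all on the same attempt number; the 10 ms bucket size and the helper names are illustrative choices, not measurements from a real system.

```typescript
type Backoff = (attempt: number) => number;

const base = 100;
const exp = (attempt: number): number => base * 2 ** attempt;

const noJitter: Backoff = (attempt) => exp(attempt);
// "Jitter" as plus or minus 10% of the deterministic value.
const tenPercent: Backoff = (attempt) => exp(attempt) * (0.9 + Math.random() * 0.2);
const fullJitter: Backoff = (attempt) => Math.random() * exp(attempt);

/** Largest number of clients whose retry lands in the same `bucketMs` window. */
function peakPerBucket(backoff: Backoff, clients: number, attempt: number, bucketMs = 10): number {
  const counts: number[] = [];
  let peak = 0;
  for (let i = 0; i < clients; i++) {
    const bucket = Math.floor(backoff(attempt) / bucketMs);
    const n = (counts[bucket] ?? 0) + 1;
    counts[bucket] = n;
    if (n > peak) peak = n;
  }
  return peak;
}

const clientCount = 1000;
const attemptNo = 3; // the deterministic wait would be 800 ms
const peaks = {
  noJitter: peakPerBucket(noJitter, clientCount, attemptNo),
  tenPercent: peakPerBucket(tenPercent, clientCount, attemptNo),
  fullJitter: peakPerBucket(fullJitter, clientCount, attemptNo),
};
console.log(peaks);
```

Without jitter all 1,000 retries land in a single bucket. The ±10% variant spreads them over a 160 ms window, which still leaves dozens of clients per bucket, while full jitter spreads them over the whole 800 ms.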

Mistake #4: no retry budget

Here is the pattern that turns a small outage into a giant one. Imagine your service makes one call to payment-gateway per inbound request. Normally it succeeds. During a hiccup, every call fails and your client retries five times. Your inbound RPS is unchanged. Your outbound RPS to payment-gateway is now 6x of normal.

If payment-gateway was already at 70% of capacity, you just put it at 420% of capacity. There is no recovery window because every recovering attempt gets buried by the next wave of retries.
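The arithmetic generalizes beyond a total outage. As a sketch, assume each attempt fails independently with the same probability (a simplification; real failures are correlated). The expected number of calls per inbound request is then a short geometric sum; `retryAmplification` is a hypothetical name for that calculation.

```typescript
/**
 * Expected outbound calls per inbound request when each attempt fails
 * independently with probability `failureRate` and the client retries up to
 * `maxRetries` times: 1 + p + p^2 + ... + p^maxRetries.
 */
function retryAmplification(failureRate: number, maxRetries: number): number {
  let total = 0;
  let reach = 1; // probability that attempt k is made at all
  for (let k = 0; k <= maxRetries; k++) {
    total += reach;
    reach *= failureRate;
  }
  return total;
}

const baselineUtilization = 0.7; // the 70% figure from the example above
for (const p of [0, 0.5, 0.8, 1]) {
  const amp = retryAmplification(p, 5);
  const load = (baselineUtilization * amp * 100).toFixed(0);
  console.log(`failure rate ${p}: ${amp.toFixed(2)}x calls, ${load}% of capacity`);
}
```

A failure rate of 1 reproduces the 6x and 420% figures. Even at a failure rate of 0.8 the multiplier is already about 3.7x, which is why a partial outage is enough to start the spiral.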

A retry budget caps the additional load retries can generate. The classic implementation, used in gRPC and Finagle and several large-scale Envoy deployments, is a token bucket: every successful call deposits a token, and every retry withdraws tokens (several per retry, so retries stay a small fraction of successes). When the bucket cannot cover a retry, the retry is skipped.

class RetryBudget {
  private tokens: number;
  constructor(
    private readonly capacity = 100,
    private readonly retryRatio = 0.1, // retries can be at most 10% of successful calls
  ) {
    this.tokens = capacity;
  }

  /** Call after a successful underlying request. */
  onSuccess() {
    this.tokens = Math.min(this.capacity, this.tokens + 1);
  }

  /** Call before a retry. Returns false if budget is exhausted. */
  tryConsume(): boolean {
    const cost = 1 / this.retryRatio; // 10 tokens per retry at 10% ratio
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }
}

retryRatio = 0.1 means: across the population of calls, retries can add at most 10% extra load. Under normal operation the bucket sits near full and retries flow freely. Under partial failure, the success rate drops, the bucket drains, and retries automatically start being skipped, which is exactly the moment you want them skipped, because the downstream is already struggling.

This is the single most important pattern in this post. A team that has exponential backoff with jitter but no retry budget will still produce retry storms during a multi-second outage. A team that has only a retry budget, no backoff at all, will not.
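To see the draining behavior without a load test, the budget can be driven with a fixed pattern of outcomes. The sketch below repeats a compact version of the `RetryBudget` class so it runs on its own; `simulate` and `pattern` are hypothetical helpers, and the alternating success/failure pattern is an arbitrary stand-in for partial failure.

```typescript
class RetryBudget {
  private tokens: number;
  constructor(private readonly capacity = 100, private readonly retryRatio = 0.1) {
    this.tokens = capacity;
  }
  onSuccess(): void {
    this.tokens = Math.min(this.capacity, this.tokens + 1);
  }
  tryConsume(): boolean {
    const cost = 1 / this.retryRatio;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

/** Hypothetical helper: builds a list of outcomes, true meaning the call succeeded. */
function pattern(length: number, isSuccess: (i: number) => boolean): boolean[] {
  const out: boolean[] = [];
  for (let i = 0; i < length; i++) out.push(isSuccess(i));
  return out;
}

/** Hypothetical driver: feeds outcomes through a budget and counts retry decisions. */
function simulate(outcomes: boolean[], budget: RetryBudget) {
  let successes = 0;
  let retriesAllowed = 0;
  let retriesSkipped = 0;
  for (const ok of outcomes) {
    if (ok) {
      successes++;
      budget.onSuccess();
    } else if (budget.tryConsume()) {
      retriesAllowed++;
    } else {
      retriesSkipped++;
    }
  }
  return { successes, retriesAllowed, retriesSkipped };
}

// Total outage: 1,000 failures in a row.
const outage = simulate(pattern(1000, () => false), new RetryBudget());
// Partial failure: every second call fails.
const partial = simulate(pattern(1000, (i) => i % 2 === 0), new RetryBudget());
console.log({ outage, partial });
```

In the outage run the full bucket pays for 10 retries and the remaining 990 are skipped. In the partial run, allowed retries can never exceed 10% of the successes plus the initial bucket, no matter how many calls fail.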

Mistake #5: ignoring the request deadline

Retries cost time. If your handler has a 2-second SLO and the first attempt times out at 1.8 seconds, retrying is mathematically pointless: the client gave up, your inbound load balancer hung up, and the next retry is doing work for nobody. (See AbortController in Node.js for the wider problem.)

Every retry policy should be bounded by a deadline that propagates from the inbound request, not just by a max attempt count. The simplest version: each retry checks how much time is left before scheduling the next attempt.

async function withRetry<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  opts: { signal: AbortSignal; deadlineMs: number; budget: RetryBudget },
): Promise<T> {
  const start = Date.now();
  // Time left until the absolute deadline, measured at the moment of the call.
  const timeLeft = () => opts.deadlineMs - (Date.now() - start);
  let lastWait = 100; // seed for decorrelated jitter

  while (true) {
    if (timeLeft() <= 0 || opts.signal.aborted) {
      throw new Error('deadline exceeded');
    }
    try {
      const result = await fn(opts.signal);
      opts.budget.onSuccess();
      return result;
    } catch (err) {
      if (classify(err) === 'fail') throw err;
      if (!opts.budget.tryConsume()) throw err;

      const wait = decorrelatedJitter(lastWait);
      lastWait = wait;
      // Never sleep past the deadline. Measure the time left now, after the
      // failed attempt, not before it started.
      const sleepFor = Math.min(wait, timeLeft() - 50);
      if (sleepFor <= 0) throw err;
      await sleep(sleepFor);
    }
  }
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
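The loop above stops scheduling retries once the deadline has passed, but an attempt that is already in flight is bounded only by the caller's signal. One way to close that gap is to derive each attempt's timeout from the time remaining. The sketch below is an assumption about how that could look; `Deadline` is a hypothetical helper, and the injectable clock exists only to make it testable.

```typescript
/** Hypothetical helper: an absolute deadline with an injectable clock for tests. */
class Deadline {
  private readonly expiresAt: number;

  constructor(budgetMs: number, private readonly now: () => number = Date.now) {
    this.expiresAt = now() + budgetMs;
  }

  /** Milliseconds left before the deadline, never negative. */
  remainingMs(): number {
    return Math.max(0, this.expiresAt - this.now());
  }

  /** Timeout for the next attempt: the per-attempt cap or the time left, whichever is smaller. */
  attemptTimeoutMs(perAttemptCapMs: number): number {
    return Math.min(perAttemptCapMs, this.remainingMs());
  }
}

// Usage with a fake clock so the numbers are easy to follow.
let clock = 0;
const deadline = new Deadline(2_000, () => clock);
clock = 1_800; // the first attempt burned 1.8 s
console.log(deadline.remainingMs()); // 200
console.log(deadline.attemptTimeoutMs(1_000)); // 200
```

In a real client the value from `attemptTimeoutMs` could feed `AbortSignal.timeout(ms)`, combined with the inbound signal, so that the last attempt is cut off at the deadline rather than at its own per-call timeout.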

A few details people get wrong even when they get the rest right:

The deadline is absolute, not per-attempt. A “5 second timeout per call, 3 attempts” client can take 15 seconds in the worst case, before counting any backoff sleeps. That is not what your inbound caller expects.
Subtract a small buffer (- 50) before the deadline. Otherwise the last attempt times out the moment it starts, which is just expensive failure.
Honor Retry-After. A 429 or 503 with Retry-After: 5 is the server explicitly telling you when retrying is welcome. Use that instead of your jittered value if it is larger.

function honorRetryAfter(res: Response, fallback: number): number {
  const header = res.headers.get('Retry-After');
  if (!header) return fallback;
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return Math.max(fallback, seconds * 1000);
  const dateMs = Date.parse(header);
  if (!Number.isNaN(dateMs)) return Math.max(fallback, dateMs - Date.now());
  return fallback;
}
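The three details above interact: the jittered wait, the server's Retry-After, and the time left all compete to set the next sleep. A minimal sketch of one way to reconcile them follows. `nextDelayMs` is a hypothetical helper, and giving up when Retry-After does not fit in the remaining time is a design assumption, not something the loop above does.

```typescript
/**
 * Hypothetical helper: picks the sleep before the next attempt.
 * Returns null when no attempt can fit before the deadline.
 */
function nextDelayMs(
  jitteredMs: number,
  retryAfterMs: number | null,
  remainingMs: number,
  bufferMs = 50,
): number | null {
  const latestStart = remainingMs - bufferMs;
  if (latestStart <= 0) return null;
  if (retryAfterMs !== null && retryAfterMs > latestStart) {
    // The server asked for more time than the caller has left.
    // Retrying sooner would ignore the header, so give up instead.
    return null;
  }
  const wanted = Math.max(jitteredMs, retryAfterMs ?? 0);
  return Math.min(wanted, latestStart);
}

console.log(nextDelayMs(300, null, 4_000));  // 300
console.log(nextDelayMs(300, 2_000, 4_000)); // 2000
console.log(nextDelayMs(300, 5_000, 4_000)); // null
console.log(nextDelayMs(900, null, 500));    // 450
```

Keeping this as a pure function makes the policy easy to unit test, even though the storm behavior itself only shows up under load.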

Mistake #6: retrying non-idempotent operations

Retrying POST /payments is how you charge a customer twice. Retrying POST /send-email is how customers get five copies of the same notification. Retries are only safe when the operation is idempotent, when “do this twice” produces the same observable result as “do this once.”

Some operations are naturally idempotent (PUT /users/:id, DELETE /orders/:id). Some can be made idempotent with an idempotency key, a client-generated identifier the server uses to deduplicate. The full pattern is in the post on idempotency keys; the short version is:

const key = crypto.randomUUID();
await withRetry(
  (signal) => fetch('/payments', {
    method: 'POST',
    headers: { 'Idempotency-Key': key, 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
    signal,
  }),
  { signal: req.signal, deadlineMs: 5_000, budget: paymentBudget },
);

The same key on every retry. The server stores the key + the response, and a duplicate request returns the cached response instead of re-executing.
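On the server side, the deduplication can be sketched in a few lines. This is an in-memory illustration only: `IdempotencyStore` is a hypothetical class, a real service would need a durable store with expiry, and dropping failed attempts from the cache is an assumption about the desired behavior.

```typescript
/** Hypothetical in-memory store; a real service would use a durable table with a TTL. */
class IdempotencyStore<T> {
  private readonly results = new Map<string, Promise<T>>();

  /** Runs `handler` once per key; later calls with the same key get the stored result. */
  run(key: string, handler: () => Promise<T>): Promise<T> {
    const existing = this.results.get(key);
    if (existing) return existing;
    const result = handler();
    this.results.set(key, result);
    // Assumption: a failed attempt should not be cached, so a retry can run it again.
    result.catch(() => this.results.delete(key));
    return result;
  }
}

let charges = 0;
const store = new IdempotencyStore<{ chargeId: number }>();
const charge = async () => ({ chargeId: ++charges });

const first = store.run('key-123', charge);
const retry = store.run('key-123', charge); // same key: handler is not called again
console.log(first === retry, charges);
```

Storing the promise rather than the resolved value also covers the case where the retry arrives while the first attempt is still running.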

If an endpoint is not idempotent and you cannot add an idempotency key, set attempts = 1 and stop. A retry that double-charges is worse than the failure it is trying to avoid.

Putting it together

The full client is small:

const paymentBudget = new RetryBudget(100, 0.1);

export async function chargeCard(payload: ChargePayload, parent: AbortSignal) {
  const key = crypto.randomUUID();
  return withRetry(
    async (signal) => {
      const res = await fetch('https://payment-gateway.internal/charge', {
        method: 'POST',
        headers: {
          'Idempotency-Key': key,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify(payload),
        signal,
      });
      if (!res.ok) {
        const err = new Error(`payment ${res.status}`);
        (err as any).status = res.status;
        (err as any).response = res;
        throw err;
      }
      return res.json();
    },
    { signal: parent, deadlineMs: 4_000, budget: paymentBudget },
  );
}

Sixty lines, including the budget and the classifier. It will outperform every “smart” retry library that does not implement a budget, because under partial failure the budget is what saves the downstream.

How to test it before production tests it

You will not catch retry-storm behavior in unit tests. The only way to see it is to inject failure under load.

A short k6 scenario that runs your client against a flaky simulator is enough:

// k6 script: 30s normal load, 30s downstream fails 80% of requests
import http from 'k6/http';

export const options = {
  scenarios: {
    constant: {
      executor: 'constant-arrival-rate',
      rate: 200,
      timeUnit: '1s',
      duration: '60s',
      preAllocatedVUs: 50,
    },
  },
};

export default function () {
  // Your service that wraps the client. No sleep() needed here: the
  // arrival-rate executor already paces iterations.
  http.get('http://localhost:3000/api/charge');
}

Run it with the simulated downstream healthy and your outbound RPS to the downstream should hover around inbound RPS. Run it again with the simulated downstream returning 503 for 80% of requests and watch the outbound RPS. With a working budget it climbs by at most about 10% and plateaus. Without one, a client that retries five times settles near 3.7x inbound (1 + 0.8 + 0.8^2 + ... + 0.8^5) and heads for 6x if the downstream fails completely.

This is also the test that catches the version of the bug where someone “tightens” the retry policy by setting attempts = 8. The budget should mean attempts = 8 and attempts = 3 produce the same outbound RPS during failure. They will if everything else is wired correctly.
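The claim is easy to sanity-check with a small model before running the load test. The sketch below assumes the worst case, where every call fails and the budget never refills; `outboundCalls` is a hypothetical function, and the capacity and cost mirror the `RetryBudget` defaults used earlier.

```typescript
/** Hypothetical model: counts outbound calls when every attempt fails. */
function outboundCalls(requests: number, maxAttempts: number, useBudget: boolean): number {
  const capacity = 100;
  const costPerRetry = 10; // 1 / retryRatio with retryRatio = 0.1
  let tokens = capacity;
  let outbound = 0;
  for (let r = 0; r < requests; r++) {
    outbound++; // the first attempt is always sent
    for (let attempt = 1; attempt < maxAttempts; attempt++) {
      if (useBudget) {
        if (tokens < costPerRetry) break; // budget exhausted: skip remaining retries
        tokens -= costPerRetry;
      }
      outbound++; // the retry is sent and fails again
    }
  }
  return outbound;
}

console.log({
  budget3: outboundCalls(1000, 3, true),
  budget8: outboundCalls(1000, 8, true),
  noBudget3: outboundCalls(1000, 3, false),
  noBudget8: outboundCalls(1000, 8, false),
});
```

With the budget, both settings send 1,010 calls for 1,000 requests. Without it, the same change moves outbound traffic from 3,000 to 8,000 calls.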

Where retries fit in the bigger picture

Retries are one layer in a stack of resilience patterns. A short map:

Timeouts stop you from waiting on a hung dependency.
Retries (this post) recover from transient failures.
Circuit breakers stop calling a dependency that is clearly broken so the retries do not pile up.
Bulkheads / pool limits stop a slow dependency from exhausting your concurrency.
Hedging-style request fanout (out of scope here) shaves p99 latency at the cost of multiplying load, and so should always sit behind a budget.

Retries without a circuit breaker are the configuration that produces a retry storm. Retries with a circuit breaker but no budget produce the storm slightly later. Retries with a budget and a breaker, properly tested under failure, produce a service that recovers gracefully from the kind of 800ms hiccup that does not need to make it into a post-mortem at all.

The whole point is that none of this is exotic. It is the plumbing that turns a “self-healing” client from a marketing word into a true claim, and most teams have a 60-line gap between where they are and where this post is.

A note from Yojji

Most of the work in this post is unglamorous: deciding which errors to retry, building a budget that drains during partial failure, wiring deadlines through every layer, and load-testing the whole thing before production tests it for you. It is the difference between a service that recovers from a downstream blip and one that turns the blip into the incident.

That kind of careful, production-aware backend engineering is exactly what Yojji ships. Yojji is an international custom software development company, founded in 2016, with offices in Europe, the US, and the UK. Their teams specialize in the JavaScript stack (React, Node.js, TypeScript), cloud platforms (AWS, Azure, Google Cloud), and microservices architectures, and they run dedicated senior outstaffed teams alongside full-cycle product engagements covering discovery, design, development, QA, and DevOps.

If your team would rather hire the practice of building reliable, well-instrumented distributed services than learn it the hard way during a peak-hour retry storm, Yojji is worth a conversation.
