WebSocket Reconnection with Backoff and State Recovery in Production

The Practical Developer


The Practica · 2026-05-26 · via The Practical Developer

The live dashboard froze at 14:03. Not a crash. The WebSocket connection showed green in the browser DevTools, but the last trade timestamp was six minutes old. Three engineers stared at the chart until someone hit refresh, and the missing trades flooded in all at once. The server had restarted for a routine deployment. The client had reconnected in under a second, but the new server process had no memory of the connection and no concept of what events the client had already seen. The socket was open. The feed was dead.

This is the most common failure mode in production WebSocket systems, and it is rarely handled well. Most tutorials show you how to open a socket and listen for onmessage. They do not show you how to survive a server restart, a network blip, or a load balancer that drops idle connections after 60 seconds. Reconnecting is not enough. You need to reconnect politely (with backoff), detect silence (with heartbeats), and recover state (with event IDs). This post covers all three, with working code you can drop into a browser or a Node.js client today.

Why instant reconnection is worse than no reconnection

The naive approach looks like this:

function connect() {
  const ws = new WebSocket(url);
  ws.onclose = () => {
    // fixed 1 s delay on every close: no backoff, no ceiling, no state carried over
    setTimeout(connect, 1000);
  };
}
connect();

This code is dangerous for three reasons.

First, it lacks backoff. If the server restarts and needs 10 seconds to accept connections, 1,000 clients will hammer it every second with reconnection attempts. That turns a graceful restart into a denial-of-service event. The server never gets the breathing room to finish startup because it is buried under WebSocket handshake traffic.

Second, it has no maximum delay ceiling. A client with a flaky mobile connection will reconnect every second forever, burning battery and bandwidth.

Third, and most importantly, it throws away state. The new WebSocket instance has no memory of what the old instance received. If the server emitted three events during the 1.2-second gap between onclose and reconnection, those events are gone. The user sees stale data and has no idea anything is wrong.

The fix is a three-layer pattern: transport resilience, heartbeat detection, and state synchronization.

Exponential backoff with full jitter

When a WebSocket closes, the client should wait before reconnecting. The wait time should grow exponentially with each consecutive failure, capped at a maximum, and randomized with jitter to prevent thundering herds.

Full jitter is the safest strategy in practice. For each attempt, you pick a random duration between zero and the exponential cap. This spreads reconnections across the entire interval and eliminates synchronization across clients. It is slightly slower than some alternatives on average, but it is the most friendly to an overloaded server.

Here is the helper:

function reconnectDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
  const cap = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.floor(Math.random() * cap);
}

For the first failure, the delay is 0-1000 ms. For the second, 0-2000 ms. The fifth failure draws from 0-16 seconds, and from the sixth failure onward the cap sits at the 30-second ceiling. The randomness means 10,000 clients that disconnect at the same time spread their reconnections across the whole jitter window (up to 30 seconds once the cap is reached) instead of stampeding the server in a single millisecond.
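To make the growth of the jitter window concrete, here is a small sketch that tabulates the cap per attempt. The names backoffCapMs and fullJitterDelay are illustrative, not part of any library, and the injectable random source is an assumption added so the delay can be tested deterministically.

```typescript
// Hypothetical helper: upper bound of the jitter window for a zero-based attempt.
function backoffCapMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Full jitter: a uniform pick in [0, cap). The random source is injectable
// (an assumption for testability); production code would use Math.random.
function fullJitterDelay(
  attempt: number,
  random: () => number = Math.random,
  baseMs = 1000,
  maxMs = 30000
): number {
  return Math.floor(random() * backoffCapMs(attempt, baseMs, maxMs));
}

// attempt 0 is the first failure, attempt 5 is the sixth.
const caps = [0, 1, 2, 3, 4, 5, 6].map((attempt) => backoffCapMs(attempt));
console.log(caps); // [1000, 2000, 4000, 8000, 16000, 30000, 30000]
```

The ceiling only takes effect once the doubled value passes maxMs, which with these defaults happens at the sixth consecutive failure.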

Never use a fixed delay like setTimeout(reconnect, 5000). A fixed delay synchronizes all clients. If a server restarts at 14:00:00, every client with a 5-second fixed delay reconnects at 14:00:05 simultaneously. That is a thundering herd, and it will crash the server you just restarted.

Heartbeats: detecting half-open connections

WebSocket, like the TCP connection it runs on, is susceptible to half-open connections. A client that silently loses network access (laptop lid closed, subway tunnel, mobile tower handoff) will not trigger onclose on either side. The server thinks the client is still there. The client thinks the server is still there. Nothing flows, and neither side knows it.

The standard fix is a heartbeat ping-pong. The server sends a ping frame every N seconds. The client responds with a pong. If either side misses a pong within a timeout, the connection is declared dead and closed locally. In the browser, you cannot send ping frames manually, so you must use application-level heartbeats.

A lightweight protocol message works well:

interface HeartbeatMessage {
  type: 'ping' | 'pong';
  ts: number;
}
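One way to handle messages of this shape is sketched below. The helper names (isHeartbeat, heartbeatReply, roundTripMs) are hypothetical, and echoing the sender's ts back in the pong is an assumption made here so the sender can measure round-trip time; the article does not prescribe it.

```typescript
interface HeartbeatMessage {
  type: 'ping' | 'pong';
  ts: number;
}

// Type guard: true only for well-formed heartbeat objects.
function isHeartbeat(value: unknown): value is HeartbeatMessage {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (v.type === 'ping' || v.type === 'pong') && typeof v.ts === 'number';
}

// Given a raw text frame, return the pong to send back for a ping,
// or null when the frame is not a ping (or not JSON at all).
function heartbeatReply(raw: string): string | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (!isHeartbeat(parsed) || parsed.type !== 'ping') return null;
  // Assumption: echo the original timestamp so the sender can compute RTT.
  const pong: HeartbeatMessage = { type: 'pong', ts: parsed.ts };
  return JSON.stringify(pong);
}

// Round-trip time as seen by the side that sent the ping.
function roundTripMs(pong: HeartbeatMessage, now: number): number {
  return now - pong.ts;
}
```

Either side can call heartbeatReply on every incoming text frame and send the result when it is not null, which keeps heartbeat traffic out of the normal message handler.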

On the server (Node.js with ws):

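import { WebSocketServer, WebSocket } from 'ws';

// ws sockets have no isAlive field; extend the type so TypeScript accepts it
type LiveSocket = WebSocket & { isAlive: boolean };

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (socket) => {
  const ws = socket as LiveSocket;
  ws.isAlive = true;

  // any pong proves the peer answered the last ping
  ws.on('pong', () => {
    ws.isAlive = true;
  });

  const interval = setInterval(() => {
    if (!ws.isAlive) {
      // no pong since the previous ping: treat the peer as dead
      ws.terminate();
      clearInterval(interval);
      return;
    }
    ws.isAlive = false;
    ws.ping();
  }, 30000);

  ws.on('close', () => clearInterval(interval));
});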

This follows the heartbeat pattern from the ws documentation, with one difference: the documented example runs a single interval that walks wss.clients, while the version above keeps one timer per connection for readability. terminate() closes the socket immediately without waiting for a graceful close handshake, which is what you want for a dead peer. The 30-second interval is a balanced default for most applications. Trade desks and chat apps may want 10 seconds. Batched telemetry pipelines can tolerate 60.
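A common variant, and the shape the ws README example takes as far as I know, is one shared timer that sweeps every connected client. The sketch below isolates that sweep as a pure function over a minimal interface so it can be tested without a real server. TrackedSocket and sweep are hypothetical names, and wiring it to wss.clients is left as a comment.

```typescript
// Minimal shape the sweep needs; a ws socket with an added isAlive flag fits it.
interface TrackedSocket {
  isAlive: boolean;
  ping(): void;
  terminate(): void;
}

// One pass of the liveness check: terminate sockets that never answered the
// previous ping, then mark the rest as unconfirmed and ping them again.
// Returns how many sockets were terminated in this pass.
function sweep(clients: Iterable<TrackedSocket>): number {
  let terminated = 0;
  for (const client of clients) {
    if (!client.isAlive) {
      client.terminate();
      terminated++;
      continue;
    }
    client.isAlive = false; // set back to true by the socket's 'pong' handler
    client.ping();
  }
  return terminated;
}

// Assumed wiring with ws (not executed here):
//   const timer = setInterval(() => sweep(wss.clients as Set<TrackedSocket>), 30000);
//   wss.on('close', () => clearInterval(timer));
```

A single timer keeps the number of active timers constant as connections grow, at the cost of all pings going out in the same tick.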

State recovery: the missing piece

Backoff and heartbeats keep the transport healthy. They do not solve the data gap. When a client reconnects, the server must be able to answer one question: “What happened since event 847?”

The solution is an event ID on every message. The server assigns monotonically increasing IDs (per client or globally, depending on your consistency model). The client remembers the highest ID it has received. On reconnect, it sends that ID, and the server replays everything after it.
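On the client side, remembering the highest ID can be sketched as a small cursor that also flags duplicates and gaps. EventCursor and its verdict names are hypothetical. Gap detection here assumes IDs are contiguous (each one exactly the previous plus one) within the stream this client receives, which holds for a per-client counter but not for a global counter over filtered feeds.

```typescript
// Hypothetical verdicts for an incoming event ID.
type Verdict = 'apply' | 'duplicate' | 'gap';

class EventCursor {
  private lastEventId: number;

  constructor(startId = 0) {
    this.lastEventId = startId;
  }

  // Highest contiguous ID seen so far; this is what to send on reconnect.
  get last(): number {
    return this.lastEventId;
  }

  observe(id: number): Verdict {
    // Replayed or re-delivered event: already applied, skip it.
    if (id <= this.lastEventId) return 'duplicate';
    // Assumption: IDs step by exactly 1. Anything else means events were
    // missed, so the cursor stays put and the caller should resync.
    if (id !== this.lastEventId + 1) return 'gap';
    this.lastEventId = id;
    return 'apply';
  }
}
```

Leaving the cursor untouched on a gap means the next reconnect still asks the server for everything after the last event the client is sure it applied.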

This sounds simple, but there are two practical constraints.

Buffer size: The server cannot store infinite history. A ring buffer of the last 10,000 events or the last 5 minutes is usually enough for momentary reconnections. If a client has been offline for an hour, you fall back to a snapshot plus live diff rather than replaying 50,000 events.

Global ordering: If you need strict global ordering across all clients, you need a single sequence counter (in Redis, Postgres, or a log-backed stream). If you only need per-client ordering (e.g., a personal notification feed), a per-client counter is simpler and horizontally scalable.
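The buffer-size constraint can be sketched as a history bounded by both count and age, with an explicit signal for "too far behind, send a snapshot instead". EventHistory, StoredEvent, and the null return convention are assumptions made for illustration, and the counter here is a single in-process sequence.

```typescript
interface StoredEvent {
  id: number;
  ts: number; // milliseconds, when the event was appended
  payload: unknown;
}

class EventHistory {
  private events: StoredEvent[] = [];
  private nextId = 1;
  private maxEvents: number;
  private maxAgeMs: number;

  constructor(maxEvents = 10000, maxAgeMs = 5 * 60 * 1000) {
    this.maxEvents = maxEvents;
    this.maxAgeMs = maxAgeMs;
  }

  append(payload: unknown, now: number = Date.now()): StoredEvent {
    const event: StoredEvent = { id: this.nextId++, ts: now, payload };
    this.events.push(event);
    // Drop from the front while over the count limit or past the age limit.
    while (
      this.events.length > this.maxEvents ||
      (this.events.length > 0 && now - this.events[0].ts > this.maxAgeMs)
    ) {
      this.events.shift();
    }
    return event;
  }

  // Events after lastId, or null when the buffer cannot cover the gap
  // (the caller should then fall back to a snapshot).
  since(lastId: number): StoredEvent[] | null {
    const newestId = this.nextId - 1;
    if (lastId === newestId) return []; // client is up to date
    if (lastId > newestId) return null; // client is ahead of us: history was lost
    const oldest = this.events[0];
    if (!oldest || oldest.id > lastId + 1) return null; // gap already evicted
    return this.events.filter((e) => e.id > lastId);
  }
}
```

Returning null rather than a partial list keeps the "replay or snapshot" decision in one place on the server.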

Here is a compact TypeScript client that implements all three layers (transport backoff, heartbeat, and event recovery) and can be dropped into a React hook, a Node.js service, or a plain browser script.

The complete resilient client

interface EventMessage {
  id: number;
  type: string;
  payload: unknown;
}

interface Options {
  url: string;
  heartbeatIntervalMs?: number;
  heartbeatTimeoutMs?: number;
  reconnectBaseMs?: number;
  reconnectMaxMs?: number;
  onMessage: (msg: EventMessage) => void;
  onStatusChange?: (status: 'open' | 'closed' | 'reconnecting') => void;
}

export class ResilientWebSocket {
  private ws: WebSocket | null = null;
  private url: string;
  private lastEventId = 0;
  private reconnectAttempt = 0;
  private reconnectTimer: ReturnType<typeof setTimeout> | null = null;
  private heartbeatTimer: ReturnType<typeof setInterval> | null = null;
  private heartbeatTimeoutTimer: ReturnType<typeof setTimeout> | null = null;
  private intentionallyClosed = false;
  private onMessage: (msg: EventMessage) => void;
  private onStatusChange?: (status: 'open' | 'closed' | 'reconnecting') => void;

  private heartbeatIntervalMs: number;
  private heartbeatTimeoutMs: number;
  private reconnectBaseMs: number;
  private reconnectMaxMs: number;

  constructor(opts: Options) {
    this.url = opts.url;
    this.onMessage = opts.onMessage;
    this.onStatusChange = opts.onStatusChange;
    this.heartbeatIntervalMs = opts.heartbeatIntervalMs ?? 30000;
    this.heartbeatTimeoutMs = opts.heartbeatTimeoutMs ?? 10000;
    this.reconnectBaseMs = opts.reconnectBaseMs ?? 1000;
    this.reconnectMaxMs = opts.reconnectMaxMs ?? 30000;
  }

  connect() {
    this.intentionallyClosed = false;
    this._connect();
  }

  private _connect() {
    if (this.ws) return;

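    // append lastEventId so the server can resume the stream; building the
    // URL this way also works when this.url already has query parameters
    // (this.url must be an absolute ws:// or wss:// URL)
    const resumeUrl = new URL(this.url);
    resumeUrl.searchParams.set('lastEventId', String(this.lastEventId));
    this.ws = new WebSocket(resumeUrl.toString());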

    this.ws.onopen = () => {
      this.reconnectAttempt = 0;
      this.onStatusChange?.('open');
      this._startHeartbeat();
    };

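    this.ws.onmessage = (event) => {
      const raw = event.data.toString();

      if (raw === 'ping') {
        this.ws?.send('pong');
        return;
      }

      if (raw === 'pong') {
        // the server answered our ping: cancel the pending dead-peer timeout,
        // otherwise every heartbeat would end by closing a healthy socket
        if (this.heartbeatTimeoutTimer) {
          clearTimeout(this.heartbeatTimeoutTimer);
          this.heartbeatTimeoutTimer = null;
        }
        return;
      }

      try {
        const msg: EventMessage = JSON.parse(raw);
        if (typeof msg.id === 'number') {
          this.lastEventId = msg.id;
        }
        this.onMessage(msg);
      } catch {
        // ignore malformed messages
      }
    };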

    this.ws.onclose = () => {
      this._cleanup();
      this.onStatusChange?.('closed');
      if (!this.intentionallyClosed) {
        this._scheduleReconnect();
      }
    };

    this.ws.onerror = () => {
      // let onclose handle the reconnect logic; do not call it twice
    };
  }

  private _startHeartbeat() {
    this.heartbeatTimer = setInterval(() => {
      if (this.ws?.readyState !== WebSocket.OPEN) return;

      this.ws.send('ping');
      this.heartbeatTimeoutTimer = setTimeout(() => {
        // server did not pong in time; close and trigger reconnect
        this.ws?.close();
      }, this.heartbeatTimeoutMs);
    }, this.heartbeatIntervalMs);
  }

  private _cleanup() {
    if (this.reconnectTimer) {
      clearTimeout(this.reconnectTimer);
      this.reconnectTimer = null;
    }
    if (this.heartbeatTimer) {
      clearInterval(this.heartbeatTimer);
      this.heartbeatTimer = null;
    }
    if (this.heartbeatTimeoutTimer) {
      clearTimeout(this.heartbeatTimeoutTimer);
      this.heartbeatTimeoutTimer = null;
    }
    this.ws = null;
  }

  private _scheduleReconnect() {
    this.onStatusChange?.('reconnecting');
    const delay = this._reconnectDelay();
    this.reconnectTimer = setTimeout(() => {
      this.reconnectAttempt++;
      this._connect();
    }, delay);
  }

  private _reconnectDelay(): number {
    const cap = Math.min(
      this.reconnectMaxMs,
      this.reconnectBaseMs * Math.pow(2, this.reconnectAttempt)
    );
    return Math.floor(Math.random() * cap);
  }

  close() {
    this.intentionallyClosed = true;
    // _cleanup() sets this.ws to null, so keep a reference to close afterwards
    const ws = this.ws;
    this._cleanup();
    ws?.close();
  }
}

This class is intentionally boring. It does not use RxJS, it does not depend on React, and it does not hide state in closures that are impossible to test. It is a plain class with explicit timers that you can unit test by stubbing the global WebSocket constructor with a mock, or by running it in Node.js with the ws package standing in for the global.

The critical behavior is in _connect: every reconnection appends lastEventId to the URL. The server reads that parameter and replays buffered events after that ID before switching to live pushes.

Server-side catch-up before live push

The server needs a small adapter to handle the resume handshake. Here is a minimal example with ws and an in-memory ring buffer (in production, swap this for Redis Streams, Kafka, or a Postgres events table; LISTEN/NOTIFY on its own keeps no history to replay).

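import { WebSocketServer, WebSocket } from 'ws';

// same message shape the client expects
interface EventMessage {
  id: number;
  type: string;
  payload: unknown;
}

const MAX_HISTORY = 10000;
const eventHistory: EventMessage[] = [];
let nextId = 1;

const wss = new WebSocketServer({ port: 8080 });

// assign the next ID, record the event, and push it to every open client
function broadcast(type: string, payload: unknown) {
  const msg: EventMessage = { id: nextId++, type, payload };
  eventHistory.push(msg);
  if (eventHistory.length > MAX_HISTORY) eventHistory.shift();
  const data = JSON.stringify(msg);
  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) {
      client.send(data);
    }
  }
}

wss.on('connection', (socket, req) => {
  const ws = socket as WebSocket & { isAlive?: boolean };
  ws.isAlive = true; // liveness flag for the ping interval from the heartbeat section
  ws.on('pong', () => { ws.isAlive = true; });

  // answer the client's application-level heartbeat
  ws.on('message', (data) => {
    if (data.toString() === 'ping') ws.send('pong');
  });

  // resume from lastEventId (req.url is only a path, so supply a dummy base)
  const { searchParams } = new URL(req.url ?? '/', 'ws://localhost');
  const lastId = parseInt(searchParams.get('lastEventId') ?? '', 10) || 0;
  const missed = eventHistory.filter((e) => e.id > lastId);
  for (const msg of missed) {
    ws.send(JSON.stringify(msg));
  }

  // live events need no extra wiring: broadcast() iterates wss.clients,
  // which already includes this socket
});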

In production, replace the in-memory eventHistory with a bounded stream (for example, Redis Streams trimmed with MAXLEN, or an append-only events table in Postgres if you already run it). The key invariant is that the buffer depth must exceed your expected worst-case reconnection window. If your 99th percentile mobile disconnect lasts 8 seconds, and you emit 200 events per second, you need a buffer of at least 8 × 200 = 1,600 events. Add a 10x safety margin (16,000) and round up to a cap of 20,000.
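The sizing arithmetic is simple enough to keep next to the configuration it justifies. This is a sketch only; requiredBufferSize is a hypothetical helper, and the inputs are the example figures from the paragraph above, not measurements.

```typescript
// Hypothetical helper: events the history must hold to cover a disconnect of
// the given length at the given event rate, multiplied by a safety factor.
function requiredBufferSize(
  p99DisconnectSeconds: number,
  peakEventsPerSecond: number,
  safetyFactor = 10
): number {
  return Math.ceil(p99DisconnectSeconds * peakEventsPerSecond * safetyFactor);
}

const bareMinimum = requiredBufferSize(8, 200, 1); // 1600
const withMargin = requiredBufferSize(8, 200); // 16000
console.log({ bareMinimum, withMargin });
```

Rounding the result up to a convenient cap is a judgment call; the point is that the cap is derived from measured disconnect durations and event rates rather than picked by feel.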

Common mistakes

Using ws.send blindly without checking readyState === OPEN. After a disconnect, there is a brief window where your application code may still try to publish. Always guard sends, or queue them client-side and flush on onopen.

Letting the browser handle pings natively. The browser WebSocket API auto-responds to server ping frames with pong frames, but you cannot observe whether the server sent the ping. You cannot build a client-side dead-peer detector without application-level heartbeats.

Forgetting to reset reconnectAttempt on success. If you do not reset the counter, a client that suffered five failures an hour ago will still wait up to 30 seconds on its next disconnect. Reset to zero on every onopen.

Storing lastEventId in localStorage for long-lived sessions. It seems smart, but if the user has two tabs open, each tab advances its own lastEventId. On refresh, the tab reads the highest ID from localStorage, which may belong to the other tab, and skips events. Keep lastEventId in memory per instance, or scope localStorage keys by a tab ID.
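Returning to the first mistake in this list, one way to guard sends is a small bounded queue that holds messages while the socket is not open and flushes them on onopen. OutboundQueue and SocketLike are hypothetical names, and dropping the oldest message when the queue is full is an assumed policy; some applications would rather reject the new message.

```typescript
// Minimal shape this sketch needs; the browser WebSocket and ws both fit it.
interface SocketLike {
  readyState: number;
  send(data: string): void;
}

const OPEN = 1; // numeric value of WebSocket.OPEN

class OutboundQueue {
  private pending: string[] = [];
  private maxPending: number;

  constructor(maxPending = 100) {
    this.maxPending = maxPending;
  }

  // Send now if the socket is open, otherwise queue. Returns true when sent.
  send(socket: SocketLike | null, data: string): boolean {
    if (socket !== null && socket.readyState === OPEN) {
      socket.send(data);
      return true;
    }
    // Assumed policy: when full, drop the oldest queued message.
    if (this.pending.length >= this.maxPending) this.pending.shift();
    this.pending.push(data);
    return false;
  }

  // Call from onopen. Sends queued messages in order; returns how many went out.
  flush(socket: SocketLike): number {
    let sent = 0;
    while (this.pending.length > 0 && socket.readyState === OPEN) {
      socket.send(this.pending.shift() as string);
      sent++;
    }
    return sent;
  }
}
```

The bound matters: an unbounded queue on a client that stays offline for an hour turns a reconnect into a burst of stale messages.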

The practical takeaway

A production WebSocket client is not a new WebSocket(url) wrapped in a useEffect. It is a state machine with three concerns: transport (backoff and jitter), liveness (heartbeats and timeouts), and semantics (event IDs and catch-up). Neglect any one, and the other two become decorative.

Before your next deploy, run through this checklist:

Reconnection uses exponential backoff with jitter, not a fixed interval.
Backoff has a maximum ceiling (e.g., 30 seconds).
The client resets the attempt counter on every successful open.
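Heartbeats cover both sides (server ping frames to reap dead clients, client application-level pings to detect a dead server), each with a timeout far shorter than the OS TCP retransmission timeout, which can run to minutes.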
Every event carries a monotonic ID.
The server accepts a lastEventId parameter on connection and replays missed events before pushing live data.
The history buffer is sized for the 99th percentile disconnect duration at peak event volume.
The client does not send without checking readyState === OPEN.

Your socket will disconnect. That is not a failure. The failure is assuming it will not, and building a system that has no vocabulary for “catch me up.”

A note from Yojji

Engineering resilient real-time systems is not about writing clever binary protocols. It is about acknowledging that networks fail, servers restart, and mobile users ride trains through tunnels. The discipline of adding backoff, heartbeat timeouts, and event replay to a WebSocket client is exactly the kind of production-hardened thinking that separates a prototype from a shipping product. Yojji’s teams bring that discipline to the full-cycle applications they build, from real-time dashboards to high-throughput messaging infrastructure.

Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK. Their senior engineering teams specialize in the JavaScript ecosystem, cloud-native infrastructure on AWS, Azure, and Google Cloud, and the full cycle of product delivery from discovery through DevOps.
