The Four Timeouts Every Node.js HTTP Client Needs

The Practical Developer

The Practica · 2026-05-15 · via The Practical Developer

Your service is not down. The downstream API is not down. But every request your Node.js service makes to it hangs forever, and your own health checks eventually fail.

The connection pool is full. Sixteen sockets, all marked ESTABLISHED in netstat, all idle, all dead. Somewhere between your Kubernetes pod and the upstream load balancer, a quiet network event happened: a NAT table entry expired, a load balancer shifted, a container restarted. TCP is a reliable protocol, but it is only reliable when someone tells the truth. When a peer disappears without sending a FIN or RST, the remaining end will wait indefinitely unless you told it not to.

Node.js defaults do not tell it not to. By default, http.request() sets no connect timeout of its own, waits forever for a response, and keeps pooled sockets open without any application-level check that the peer is still there. The Node runtime is fast; the absence of timeouts is not. This post covers the four values you set once and verify with iptables so that a silent network partition becomes a fast error instead of a slow outage.

The shape of the failure

You see it first as latency, not errors. p50 stays flat; p99 climbs, then p99.9 climbs, then the latency histogram turns into a single bar at your maximum client timeout. If your callers have no maximum timeout, the histogram never settles; requests just accumulate.

Inside the process, netstat or ss shows a pool of sockets in ESTABLISHED to the upstream IP. The event loop is not blocked; the sockets are simply waiting for data that will never arrive. If you use an HTTP Agent with maxSockets: 16, the 17th request queues behind the first 16 and never runs. The downstream API is healthy, fast, and has plenty of capacity, but your process never reaches it because the front of the queue is occupied by ghosts.

This is not a bug in Node.js. It is a missing configuration. There are four timeouts, and you need all four because they guard four different phases of a connection lifecycle.

1. Connect timeout: how long the handshake may take

Before any HTTP data flows, TCP must complete its three-way handshake. In a healthy data center this takes a millisecond. Across regions, maybe twenty. If a SYN packet is black-holed (wrong security group, failed NAT, upstream instance terminated), the kernel, not Node.js, retransmits it with exponential backoff. With the Linux default of net.ipv4.tcp_syn_retries = 6, the connect attempt fails only after roughly two minutes. That is system-default territory, not application-default territory, and it is far too long for a service that should fail fast.

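Node’s native http module does not expose a connect timeout directly. You build it by starting a timer when the request is created, clearing it as soon as the socket connects, and destroying the request if the timer fires first.

import http from 'node:http';

function requestWithConnectTimeout(url, options = {}, connectMs = 5_000) {
  return new Promise((resolve, reject) => {
    const req = http.request(url, options, resolve);

    // Fires only if the TCP handshake has not completed within connectMs.
    const timer = setTimeout(() => {
      const err = new Error(`Connect timeout after ${connectMs}ms`);
      req.destroy(err);
      reject(err); // a second reject from the 'error' handler is a no-op
    }, connectMs);

    req.on('socket', (socket) => {
      // A socket reused from a keep-alive pool is already connected
      // and will never emit 'connect'.
      if (!socket.connecting) clearTimeout(timer);
      else socket.once('connect', () => clearTimeout(timer));
    });

    req.on('error', (err) => {
      clearTimeout(timer);
      reject(err);
    });

    req.end();
  });
}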

Five seconds is generous for a service-to-service call inside the same cloud region. One second is often enough. The point is not the exact number; the point is that the number exists and is bounded.

2. Response timeout: how long until the first byte

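Once the connection is established and the request is sent, how long do you wait for the server to respond? This is a time-to-first-byte limit: response headers must start arriving before the timer fires. The nearest built-in, request.setTimeout(), is a socket inactivity timer rather than a strict first-byte deadline, so the helper below uses an explicit timer that starts when the request is issued.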

function requestWithResponseTimeout(url, options = {}, responseMs = 10_000) {
  return new Promise((resolve, reject) => {
    const req = http.request(url, options, (res) => {
      clearTimeout(timer);
      resolve(res);
    });

    const timer = setTimeout(() => {
      req.destroy(new Error(`Response timeout after ${responseMs}ms`));
      reject(new Error(`Response timeout after ${responseMs}ms`));
    }, responseMs);

    req.on('error', (err) => {
      clearTimeout(timer);
      reject(err);
    });

    req.end();
  });
}

Do not conflate this with a total wall-clock deadline. A response timeout of 30 seconds is reasonable for an expensive database export endpoint. A response timeout of 30 seconds for a user-profile lookup is not. Match the timeout to the endpoint’s expected worst case, not to a global default.

If you use the built-in request timeout instead (req.setTimeout(ms), or the timeout option of http.request()), Node.js emits the 'timeout' event but does not destroy the request. You must listen for the event and destroy the request yourself. The explicit setTimeout + req.destroy() pattern above is clearer and harder to miss.
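For comparison, here is a minimal sketch of the built-in variant, assuming a plain-HTTP upstream. The helper name requestWithIdleTimeout is hypothetical. Note that this timer measures socket inactivity, so it restarts whenever bytes flow in either direction.

```javascript
import http from 'node:http';

// Hypothetical helper: relies on the built-in inactivity timer
// instead of a hand-rolled setTimeout().
function requestWithIdleTimeout(url, options = {}, idleMs = 10_000) {
  return new Promise((resolve, reject) => {
    const req = http.request(url, options, resolve);

    // Arms the socket inactivity timer once a socket is assigned.
    // When it expires, Node.js only emits 'timeout'.
    req.setTimeout(idleMs);

    req.on('timeout', () => {
      // Nothing is aborted automatically, so destroy the request here.
      // destroy(err) makes the request emit 'error' with this Error.
      req.destroy(new Error(`Socket idle for ${idleMs}ms`));
    });

    req.on('error', reject);
    req.end();
  });
}
```

Forgetting the 'timeout' listener is the common mistake: the event fires, nobody handles it, and the request keeps waiting.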

3. Socket idle timeout: how long a pooled socket may sit unused

Connection reuse is fast. Keeping a socket open for the next request avoids another TCP handshake and TLS negotiation. But a socket that has been idle for five minutes is statistically more likely to belong to a ghost than to a healthy peer.

The http.Agent controls reuse with keepAlive: true and keepAliveMsecs. The name is misleading: keepAliveMsecs is not an idle timeout. According to the Node.js documentation, it is the initial delay for TCP keep-alive packets on sockets the agent keeps open. The native agent has no option called idleTimeout. The closest built-in is the agent's timeout option, a socket inactivity timeout set when the socket is created; verify on your Node.js version that it closes idle pooled sockets. Otherwise, wrap the agent in code that tracks last-used time, or switch to a modern client.

With undici (the HTTP client that powers Node.js 18+ global fetch) the concept is explicit:

import { Agent } from 'undici';

const agent = new Agent({
  connect: {
    timeout: 5_000,           // connect timeout
    rejectUnauthorized: true,
  },
  bodyTimeout: 30_000,        // time to receive full body
  headersTimeout: 10_000,     // time to receive headers (response timeout)
  keepAliveTimeout: 30_000,   // idle socket timeout
  keepAliveMaxTimeout: 30_000,
  maxRequestsPerClient: 100,  // rotate connections periodically
});

keepAliveTimeout: 30_000 means a socket is evicted from the pool after 30 seconds of idleness. That is short enough that a stale NAT mapping or a silently-replaced load balancer target will not fool you for long, and long enough that a burst of traffic benefits from reuse.

If you are still on the native http module and cannot migrate yet, note that http.Agent has no per-socket request cap (maxRequestsPerSocket is an http.Server setting). The fallback is to bound socket lifetime yourself: create a fresh Agent with a bounded socket pool and call agent.destroy() on the old one from a background timer. It is crude but it works.
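A minimal sketch of that periodic-rotation fallback, assuming callers fetch the current agent for every request. createRotatingAgent and its get/stop methods are hypothetical names, not part of Node.js.

```javascript
import http from 'node:http';

// Hypothetical helper: swaps in a fresh Agent every rotateMs milliseconds
// so that no pooled socket outlives one rotation period by much.
function createRotatingAgent(agentOptions, rotateMs) {
  let current = new http.Agent(agentOptions);

  const timer = setInterval(() => {
    const old = current;
    current = new http.Agent(agentOptions);
    // agent.destroy() closes every socket the old agent holds,
    // including sockets that are in the middle of a request.
    old.destroy();
  }, rotateMs);
  timer.unref(); // do not keep the process alive just for rotation

  return {
    get: () => current,
    stop: () => {
      clearInterval(timer);
      current.destroy();
    },
  };
}

// Usage: http.request(url, { agent: rotating.get() }, onResponse)
const rotating = createRotatingAgent({ keepAlive: true, maxSockets: 16 }, 60_000);
```

Because destroy() also kills in-flight requests on the old agent, this only suits callers that already retry on connection errors.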

4. TCP keepalive: probing dead peers at the OS level

The first three timeouts guard the edges: connecting, waiting for a response, and retiring idle pool members. But what if a socket is mid-request when the peer dies? Or what if your response timeout is intentionally long (a large file transfer, a streaming endpoint) and you want to detect a dead peer inside that long window?

TCP keepalive sends empty probe packets after a period of silence. If the peer does not acknowledge them, the kernel declares the connection dead and closes the socket, which causes Node.js to emit an 'error' event that you can handle. Without keepalive, a socket can sit in ESTABLISHED forever, convinced the peer is alive because no one contradicted it.

Enable it on every socket the Agent opens by overriding createConnection:

import http from 'node:http';
import net from 'node:net';

class KeepaliveAgent extends http.Agent {
  createConnection(options, callback) {
    const socket = net.createConnection(options);
    socket.setKeepAlive(true, 5_000);   // probe after 5s of silence
    socket.setNoDelay(true);
    socket.on('connect', () => callback(null, socket));
    socket.on('error', callback);
    return socket;
  }
}

const agent = new KeepaliveAgent({
  keepAlive: true,
  maxSockets: 16,
  maxFreeSockets: 4,
});

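socket.setKeepAlive(true, 5000) tells the OS to start sending keepalive probes after 5 seconds of silence on the connection. How often probes repeat and how many may go unanswered depends on the runtime and the kernel. The Node.js documentation for current releases states that setKeepAlive also sets TCP_KEEPINTVL=1 and TCP_KEEPCNT=10 on the socket. Older runtimes, and sockets opened by code that never sets these options, follow the system sysctls (net.ipv4.tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes). Check which case applies to you rather than treating the socket-level value as a contract. The Linux default for tcp_keepalive_time is 7200 seconds, which is useless for failure detection, so any socket that relies on system values needs the container tuned as well.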

In containers, add a sysctl init container or tune the node if you control it:

# sysctl init container snippet
- name: sysctl
  image: busybox
  command: ['sh', '-c', 'sysctl -w net.ipv4.tcp_keepalive_time=30 net.ipv4.tcp_keepalive_intvl=5 net.ipv4.tcp_keepalive_probes=3']
  securityContext:
    privileged: true

This shortens the total detection window to roughly 30 + (3 × 5) = 45 seconds. If you do not control the node, set a tighter application-level heartbeat or rely on the response timeout instead.
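The arithmetic generalizes to any combination of the three settings. A small sketch (keepaliveDetectionSeconds is a hypothetical helper) compares the tuned values above with the stock Linux defaults of 7200 seconds idle, 75 seconds between probes, and 9 probes:

```javascript
// Worst-case time for the kernel to declare a silent peer dead:
// idle time before the first probe, plus one interval per unanswered probe.
function keepaliveDetectionSeconds({ idleSec, intervalSec, probes }) {
  return idleSec + probes * intervalSec;
}

const tuned = keepaliveDetectionSeconds({ idleSec: 30, intervalSec: 5, probes: 3 });
const linuxDefault = keepaliveDetectionSeconds({ idleSec: 7200, intervalSec: 75, probes: 9 });

console.log(tuned);        // 45
console.log(linuxDefault); // 7875 (over two hours)
```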

Putting them together: a production-ready fetch wrapper

If you use the global fetch in Node.js 18+, you are already using undici under the hood, but the global fetch does not expose timeout or keepalive options directly. Install undici as a dependency and pass its request() a custom Agent as the dispatcher:

import { Agent, request } from 'undici';

const agent = new Agent({
  connect: { timeout: 5_000, keepAlive: true },
  headersTimeout: 10_000,
  bodyTimeout: 30_000,
  keepAliveTimeout: 30_000,
});

export async function fetchWithTimeouts(url, options = {}) {
  const { statusCode, headers, body } = await request(url, {
    ...options,
    dispatcher: agent,
    // undici request options
  });

  const data = await body.json();
  return { statusCode, headers, data };
}

If you need to match the fetch API shape while keeping timeouts, wrap undici’s fetch and pass the dispatcher:

import { Agent, fetch as undiciFetch } from 'undici';

const agent = new Agent({
  connect: { timeout: 5_000 },
  headersTimeout: 10_000,
  bodyTimeout: 30_000,
  keepAliveTimeout: 30_000,
});

export function fetchWithTimeouts(url, options = {}) {
  return undiciFetch(url, {
    ...options,
    dispatcher: agent,
  });
}

For the native http/https module, compose the pieces into one helper:

import https from 'node:https';

const agent = new https.Agent({
  keepAlive: true,
  maxSockets: 16,
  maxFreeSockets: 4,
});

export function httpsRequest(url, options = {}) {
  // Separate the helper's own options from those passed to https.request().
  const { connectTimeout, responseTimeout, ...requestOptions } = options;
  const connectMs = connectTimeout ?? 5_000;
  const responseMs = responseTimeout ?? 10_000;

  return new Promise((resolve, reject) => {
    const req = https.request(url, { agent, ...requestOptions }, (res) => {
      // Headers arrived: neither timer may destroy the request any more.
      clearTimeout(connectTimer);
      clearTimeout(responseTimer);
      resolve(res);
    });

    const connectTimer = setTimeout(() => {
      req.destroy(new Error(`Connect timeout after ${connectMs}ms`));
    }, connectMs);

    const responseTimer = setTimeout(() => {
      req.destroy(new Error(`Response timeout after ${responseMs}ms`));
    }, responseMs);

    req.on('socket', (socket) => {
      socket.setKeepAlive(true, 5_000);
      socket.setNoDelay(true);
      // A socket reused from the pool is already connected and never emits
      // 'connect', so clear the connect timer immediately in that case.
      if (!socket.connecting) clearTimeout(connectTimer);
      else socket.once('connect', () => clearTimeout(connectTimer));
    });

    req.on('error', (err) => {
      clearTimeout(connectTimer);
      clearTimeout(responseTimer);
      reject(err);
    });

    req.end();
  });
}

Use this helper everywhere. Do not sprinkle ad-hoc timeouts across your codebase; the inconsistency will hide bugs.

Why one zombie socket kills throughput

Suppose your downstream API is healthy and p50 response time is 50 ms. You set maxSockets: 16 in the Agent. Under normal load, 16 concurrent requests share the pool, finish in 50 ms, and the next batch reuses or creates fresh sockets. Throughput is roughly 320 req/s.

Now a network event kills half the pooled sockets without a TCP close. Eight sockets are dead. The next eight requests grab those dead sockets, send data, and wait. The remaining eight requests grab healthy sockets and finish in 50 ms. But because there is no idle timeout or keepalive, the dead sockets are never evicted. They sit in ESTABLISHED forever.

Your effective pool size shrinks from 16 to 8. Throughput drops to 160 req/s. Load increases. The queue grows. Latency climbs from 50 ms to seconds. Eventually the queue exceeds your caller’s patience, and the failure looks like a downstream outage even though the API is fine.

The math is simple: maxSockets is a promise you make to the downstream, but if you do not bound the lifetime of each socket, the promise is not kept. The four timeouts are the enforcement mechanism.
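The throughput figures in this section follow from one line of arithmetic: each live socket completes 1000 / latency requests per second. A sketch, where poolThroughput is a hypothetical helper and the 250 req/s arrival rate is an assumed number chosen only to show the backlog growing:

```javascript
// Requests per second a pool can sustain when every live socket is busy.
function poolThroughput(liveSockets, latencyMs) {
  return liveSockets * (1000 / latencyMs);
}

const healthy = poolThroughput(16, 50);  // all 16 sockets alive
const degraded = poolThroughput(8, 50);  // 8 sockets held by dead peers

// Assumed arrival rate, for illustration only.
const arrivalRate = 250;
const backlogGrowthPerSecond = Math.max(0, arrivalRate - degraded);

console.log(healthy);                // 320
console.log(degraded);               // 160
console.log(backlogGrowthPerSecond); // 90
```

At the assumed rate the healthy pool has headroom, while the degraded pool queues 90 more requests every second until callers give up.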

Production signals

Add metrics that prove the timeouts are working, not just configured.

Track outbound request latency by host. A bimodal distribution (one peak at normal latency, another at exactly your timeout value) means requests are timing out rather than failing fast. That is a signal to shorten the timeout or to investigate the network path.

Track socket pool utilization. In undici, pool stats are available on the Agent. In the native module, sample agent.sockets (in use), agent.freeSockets (idle), and agent.requests (queued) periodically. If sockets stays pinned at maxSockets while requests keeps growing, you have a leak or a ghost pool.
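For the native module, a sampling function can be as small as the sketch below. agentPoolStats is a hypothetical name; agent.sockets, agent.freeSockets, and agent.requests are documented http.Agent properties, each an object keyed by connection name whose values are arrays.

```javascript
import http from 'node:http';

// Hypothetical helper: totals the per-origin arrays the native Agent exposes.
function agentPoolStats(agent) {
  const count = (byOrigin) =>
    Object.values(byOrigin).reduce((total, list) => total + list.length, 0);

  return {
    active: count(agent.sockets),    // sockets currently serving a request
    free: count(agent.freeSockets),  // idle sockets kept for reuse
    queued: count(agent.requests),   // requests waiting for a socket
  };
}

const sampleAgent = new http.Agent({ keepAlive: true, maxSockets: 16 });
console.log(agentPoolStats(sampleAgent)); // { active: 0, free: 0, queued: 0 }
```

Exporting these three numbers on a timer is enough to see a ghost pool: active sits at maxSockets and queued climbs.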

Track TCP retransmits and ESTABLISHED socket count per destination from the host or sidecar. A high ratio of sockets to request rate indicates churn or ghosts.

Alert on timeout ratio, not timeout count. A spike of timeouts during a deployment is expected. A steady 2% timeout rate on a stable endpoint means the network or the peer is lying, and your timeouts are the only reason you are not completely down.

The practical takeaway

When a Node.js service makes outbound HTTP calls, copy this checklist into the client initialization:

Connect timeout: bounded, typically 1-5 seconds.
Response timeout: bounded, matched to the endpoint’s realistic worst case.
Socket idle timeout: pool sockets evicted after 15-60 seconds of idleness.
TCP keepalive: enabled, with a start delay of 5-30 seconds, and OS-level probes tuned if you control the node.

Set them in one shared helper or dispatcher. Never rely on the defaults. The defaults assume a reliable network and honest peers, and production has neither.

Test the behavior with iptables in a local container:

# Blackhole the upstream IP after the connection is established
iptables -A OUTPUT -p tcp -d <upstream-ip> --dport 443 -j DROP

Watch your service. Without the four timeouts, it hangs. With them, it errors in seconds and your retry or circuit-breaker logic handles the rest. That is the difference between a blip and an outage.
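Where iptables is not available (a laptop without root, a CI runner), a TCP server that accepts connections and then stays silent approximates the response-timeout half of this test. It does not exercise connect timeouts or kernel keepalive. The sketch below uses hypothetical helper names, startSilentServer and timeToFailure.

```javascript
import http from 'node:http';
import net from 'node:net';

// Accepts TCP connections and never answers, like a peer behind a DROP rule.
function startSilentServer() {
  return new Promise((resolve) => {
    const sockets = new Set();
    const server = net.createServer((socket) => {
      sockets.add(socket);
      socket.on('close', () => sockets.delete(socket));
      socket.on('error', () => {}); // ignore resets caused by the client
    });
    server.listen(0, '127.0.0.1', () => {
      resolve({
        port: server.address().port,
        close: () => {
          for (const socket of sockets) socket.destroy();
          server.close();
        },
      });
    });
  });
}

// Resolves with the milliseconds a request needed to fail or respond.
async function timeToFailure(port, responseMs) {
  const started = Date.now();
  await new Promise((resolve) => {
    const req = http.request({ host: '127.0.0.1', port, agent: false });
    const timer = setTimeout(() => {
      req.destroy(new Error(`Response timeout after ${responseMs}ms`));
    }, responseMs);
    req.on('error', () => { clearTimeout(timer); resolve(); });
    req.on('response', (res) => { clearTimeout(timer); res.resume(); resolve(); });
    req.end();
  });
  return Date.now() - started;
}
```

Running timeToFailure against the silent server should return a value close to responseMs. Removing the timer from the helper makes the same call hang, which is the failure this post is about.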

A note from Yojji

The kind of edge-case infrastructure work this post describes (mapping TCP socket lifecycles to application-level reliability, tuning OS keepalive parameters inside containers, and verifying failure modes with iptables rather than hoping the defaults hold) is exactly the backend engineering that separates a prototype from a production service.

Yojji is an international custom software development company founded in 2016, with teams across Europe, the US, and the UK. Their engineers specialize in the JavaScript ecosystem, cloud platforms, and microservices architecture, including the network-layer and runtime-level details that keep Node.js services stable when the datacenter is not.

If you would rather have outbound HTTP reliability handled by engineers who have already debugged the 2 a.m. socket ghost hunt, Yojji is worth a conversation.
