DNS Caching in Node.js: The Silent Cause of Production Latency Spikes

The Practical Developer

The Practica · 2026-05-21 · via The Practical Developer

You triple-checked the downstream service. Its p99 is 12 ms. Your API timeout is 5 seconds. Yet every few minutes a request logs ConnectTimeoutError or ETIMEDOUT and you have no idea why. You scale pods, retry harder, blame the cloud provider. The real problem is in a layer most Node.js engineers never think about: your process is resolving the same hostname thousands of times per second, and the OS resolver is drowning.

Node.js does not cache DNS lookups. Every fetch to api.internal.example.com, every Redis connection to redis.production.local, every new Postgres connection calls getaddrinfo unless something stops it. On a busy service making hundreds of outbound requests per second, you are hammering the OS resolver and the upstream DNS server, and quite possibly saturating the libuv thread pool while you do it. This post shows how to confirm it, fix it in application code, and monitor it.

Why Node.js does not cache DNS

Node.js delegates DNS resolution to the operating system through getaddrinfo(3). That is correct for portability, but it means Node has no built-in cache for the results. If your code calls fetch("https://api.example.com/data") ten thousand times per minute, getaddrinfo can run up to ten thousand times per minute. The OS resolver may have its own cache, but it is usually small, short-lived, and shared across every process on the machine. On Kubernetes, the cluster's kube-dns or CoreDNS pods see a firehose of identical queries from every container.

Worse, dns.lookup is not async in the way you think. It runs the blocking getaddrinfo call on the libuv thread pool, which defaults to four threads (tunable with UV_THREADPOOL_SIZE) and is shared with file system, zlib, and some crypto work. Only a handful of DNS lookups can run in parallel. Extra requests queue. On a high-throughput service, that queue is a bottleneck that looks like network latency but is actually local resource exhaustion. The dns.resolve family avoids the thread pool by using the c-ares library, but it does not consult /etc/hosts, and neither API solves the fundamental problem: there is no TTL-aware application-level cache.
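
To see why a small pool turns into a queue, a back-of-envelope model helps. The sketch below assumes the libuv default of four pool threads and treats every lookup as taking the same time; the function name and the numbers are illustrative, not measurements.

```javascript
// Rough capacity model for dns.lookup: each lookup occupies one thread-pool
// thread for its whole duration, so throughput is capped at threads / latency.
function lookupCapacity({ threads = 4, avgLookupMs, demandPerSecond }) {
  const maxPerSecond = (threads * 1000) / avgLookupMs;
  return {
    maxPerSecond,
    // At or above 1.0 the queue keeps growing and callers see rising latency.
    utilization: demandPerSecond / maxPerSecond,
    saturated: demandPerSecond >= maxPerSecond,
  };
}

// Healthy resolver: 2 ms per lookup, 300 lookups/s of demand.
console.log(lookupCapacity({ avgLookupMs: 2, demandPerSecond: 300 }));
// Slow resolver: 20 ms per lookup, same demand. The ceiling is now 200/s.
console.log(lookupCapacity({ avgLookupMs: 20, demandPerSecond: 300 }));
```

The same threads also serve file system and some crypto and zlib calls, so real headroom is lower than this model suggests. Raising UV_THREADPOOL_SIZE moves the ceiling without removing it.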

The symptoms are predictable once you know what to look for:

Intermittent ConnectTimeoutError or ETIMEDOUT on perfectly healthy downstreams.
Spikes that do not correlate with request rate, CPU, or memory.
Multiple pods failing at the same time, suggesting overload of a shared DNS server.
Timeouts that disappear immediately when you switch to IP addresses.
CoreDNS logs showing thousands of identical queries for the same hostname per minute.

If any of that sounds familiar, you are probably running without DNS caching.

Measuring DNS resolution time in production

You cannot fix what you do not measure. The quickest production diagnostic is instrumenting the dns module directly:

import dns from 'node:dns';
import { performance } from 'node:perf_hooks';

const originalLookup = dns.lookup;

// In production, prefer a proper histogram in your metrics library. This is the minimal version.
function instrumentedLookup(hostname, options, callback) {
  const start = performance.now();

  if (typeof options === 'function') {
    callback = options;
    options = {};
  }

  return originalLookup(hostname, options, (err, address, family) => {
    const duration = performance.now() - start;
    // Ship this to your metrics pipeline
    console.log(JSON.stringify({
      event: 'dns_lookup_timing',
      hostname,
      durationMs: Math.round(duration * 100) / 100,
      cached: false,
      timestamp: new Date().toISOString()
    }));
    callback(err, address, family);
  });
}

dns.lookup = instrumentedLookup;

On a healthy system, DNS lookup for a warm hostname should be sub-millisecond. If you see p50 above 5 ms, or p99 above 50 ms, your resolver is overloaded. If you see durations of 100 ms or more, DNS is queuing or timing out and retrying. That is the smoking gun.
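
Those thresholds are percentiles, so the raw log lines need to be aggregated before they mean anything. A minimal sketch using nearest-rank percentiles over an in-memory window of durations; the function names are illustrative, and a real service would more likely feed a histogram in its metrics library.

```javascript
// Nearest-rank percentile over a window of recorded lookup durations (ms).
function percentile(samples, p) {
  if (samples.length === 0) return null;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Applies the thresholds from the text: p50 above 5 ms or p99 above 50 ms.
function summarizeLookups(durationsMs) {
  const p50 = percentile(durationsMs, 50);
  const p99 = percentile(durationsMs, 99);
  return { p50, p99, overloaded: p50 > 5 || p99 > 50 };
}

// 98 fast lookups plus two that queued behind a busy resolver.
const window = [...Array(98).fill(0.4), 120, 130];
console.log(summarizeLookups(window));
```

Here the median looks healthy while the p99 does not, which is the shape an intermittently overloaded resolver tends to produce.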

Do not leave the monkey-patch in production. Use it for a single deploy to confirm the diagnosis, then fix the root cause.
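
One way to honor that is to make the patch reversible, so it can be switched off without another deploy. The helper below is a hypothetical sketch: it wraps any Node-style callback method on an object you pass in and returns a function that restores the original.

```javascript
// Wraps target[method] (a Node-style callback function) with timing and
// returns a function that puts the original back. Properties attached to the
// original function are not copied to the wrapper.
function instrumentCallbackMethod(target, method, onTiming) {
  const original = target[method];

  target[method] = function instrumented(...args) {
    const callback = args[args.length - 1];
    if (typeof callback !== 'function') {
      return original.apply(this, args); // nothing to time, pass through
    }
    const start = performance.now();
    args[args.length - 1] = (...results) => {
      onTiming({ name: args[0], durationMs: performance.now() - start });
      callback(...results);
    };
    return original.apply(this, args);
  };

  return function restore() {
    target[method] = original;
  };
}
```

With the dns module imported as in the snippet above, `const restore = instrumentCallbackMethod(dns, 'lookup', (t) => console.log(JSON.stringify(t)))` installs the timing, and calling `restore()` from a timer or an admin endpoint removes it.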

The first fix: persistent connections with keep-alive

The cheapest DNS cache is to never look up the hostname again. HTTP keep-alive and connection pooling reuse the same TCP connection for many requests, which means the hostname is resolved once per connection opened rather than once per request. This is not a DNS cache per se, but it reduces the lookup rate by orders of magnitude.

If you use undici (which powers Node.js native fetch since Node 18), configure a Pool with keep-alive:

import { Pool } from 'undici';

const pool = new Pool('https://api.example.com', {
  connections: 50,
  keepAliveTimeout: 30000,
  keepAliveMaxTimeout: 60000,
});

// Use pool.request(...) or assign it to a custom fetch dispatcher

If you use node:http, node:https, or axios, set keepAlive: true on the Agent. The Agent must match the URL scheme, so an https:// URL needs an https.Agent:

import https from 'node:https';

const agent = new https.Agent({
  keepAlive: true,
  maxSockets: 50,
  maxFreeSockets: 10,
  timeout: 30000,
});

// Pass agent to every request
https.get('https://api.example.com/data', { agent }, (res) => { /* ... */ });

Keep-alive alone often drops DNS lookups from thousands per minute to dozens. If that solves the problem, great. But it only works for HTTP. Database drivers, Redis clients, gRPC channels, and direct TCP connections need their own pooling. If any of those reconnect frequently, you still have a DNS storm.

The real fix: a TTL-aware DNS cache

For services where connections churn, or where you connect to many different hosts, you need an application-level DNS cache that respects TTL. Node.js does not ship one, but you can build it in under 50 lines.

The approach: cache the resolved IP, respect a TTL (or a conservative hard-coded one if you do not want to parse DNS records), and evict entries on expiry. Use it inside a custom lookup function that you pass to your HTTP agent, database driver, or Redis client.

Here is a minimal, production-safe cache:

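import dns from 'node:dns';
import { promisify } from 'node:util';

const dnsLookup = promisify(dns.lookup);

class DnsCache {
  constructor({ defaultTtlMs = 60_000, maxEntries = 1000 } = {}) {
    this.cache = new Map();
    this.defaultTtlMs = defaultTtlMs;
    this.maxEntries = maxEntries;
  }

  // Resolves to { address, family }, or to an array of those when options.all
  // is true (net.connect requests that when autoSelectFamily is enabled).
  async lookup(hostname, options = {}) {
    const now = Date.now();
    // The answer depends on family and all, so both belong in the cache key
    const key = `${hostname}|${options.family ?? 0}|${options.all ? 'all' : 'one'}`;
    const cached = this.cache.get(key);

    if (cached && cached.expiresAt > now) {
      return cached.all ?? { address: cached.address, family: cached.family };
    }

    const result = await dnsLookup(hostname, options);

    // Evict the oldest insertion if we are at capacity. A Map iterates in
    // insertion order, so this is FIFO eviction rather than true LRU.
    if (!this.cache.has(key) && this.cache.size >= this.maxEntries) {
      const firstKey = this.cache.keys().next().value;
      this.cache.delete(firstKey);
    }

    const expiresAt = now + this.defaultTtlMs;
    this.cache.set(key, Array.isArray(result)
      ? { all: result, expiresAt }
      : { address: result.address, family: result.family, expiresAt });

    return result;
  }

  get size() {
    return this.cache.size;
  }
}

const dnsCache = new DnsCache({ defaultTtlMs: 30_000 });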

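You can pass this cache to any library that accepts a custom lookup function. For an undici Agent (node:http and node:https agents accept the same function as a top-level lookup option):

import { Agent } from 'undici';

const agent = new Agent({
  connect: {
    lookup: (hostname, options, callback) => {
      dnsCache.lookup(hostname, options).then(
        (result) => {
          // With options.all, net.connect expects a single array of { address, family }
          if (Array.isArray(result)) callback(null, result);
          else callback(null, result.address, result.family);
        },
        (err) => callback(err),
      );
    },
  },
});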

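Redis clients differ in how they expose this. Where the client forwards a custom lookup option to net.connect, the same adapter works. Check your client's documentation first, because some versions pass only host, port, and family through:

import Redis from 'ioredis';

const redis = new Redis({
  host: 'redis.production.local',
  // Assumes this client version forwards lookup to net.connect
  lookup: (hostname, options, callback) => {
    dnsCache.lookup(hostname, options).then(
      (result) => {
        if (Array.isArray(result)) callback(null, result);
        else callback(null, result.address, result.family);
      },
      (err) => callback(err),
    );
  },
});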

This cache is intentionally simple. It does not parse real DNS TTLs, but a 30-second default TTL is usually safe for internal services. For public hostnames where A records may shift, monitor cache hit rates and lower TTL accordingly.
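
A cache this simple leaves one gap worth knowing about: when an entry expires under load, every concurrent request misses at once and each starts its own lookup. A hedged sketch of coalescing those misses follows. The `coalesce` helper is hypothetical, and `resolve` stands for any async function you inject, such as a promisified dns.lookup.

```javascript
// Shares one in-flight lookup among concurrent callers for the same hostname.
// `resolve` is any async function (hostname) => result, supplied by the caller.
function coalesce(resolve) {
  const inFlight = new Map();

  return function coalesced(hostname) {
    const pending = inFlight.get(hostname);
    if (pending) return pending;

    const promise = Promise.resolve()
      .then(() => resolve(hostname))
      // Clear the slot on success or failure so the next call resolves afresh.
      .finally(() => inFlight.delete(hostname));

    inFlight.set(hostname, promise);
    return promise;
  };
}
```

Wrapping the miss path this way means a cold or freshly expired entry costs one lookup per hostname during a burst instead of one per request. Failures are shared too, so every waiting caller sees the same rejection.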

Getting TTL from the DNS layer

If you want to be more precise, use dns.resolve4 and dns.resolve6 instead of dns.lookup. They query the DNS server directly and, when called with the ttl option, return a TTL alongside each A or AAAA record. Unlike dns.lookup, they do not consult /etc/hosts, so names defined only there will not resolve this way.

import dns from 'node:dns';
import { promisify } from 'node:util';

const resolve4 = promisify(dns.resolve4);

async function resolveWithTtl(hostname) {
  // With { ttl: true }, each record is { address, ttl }, with ttl in seconds
  const records = await resolve4(hostname, { ttl: true });
  return records.map((r) => ({
    address: r.address,
    ttlMs: r.ttl * 1000,
  }));
}

Using real TTLs prevents caching a record longer than the domain owner intended. For internal .local or .svc.cluster.local hostnames, the TTL is often short (5 seconds in some Kubernetes DNS setups), which means caching must be aggressive to help. In practice, if the upstream DNS itself returns 5-second TTLs, pinning the IP for even 10-15 seconds reduces the query rate by 2-3x with minimal risk. Just make sure you handle stale entries gracefully: if a connection to the cached IP fails, evict the entry and resolve again immediately rather than waiting for the TTL to run out.
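
The clamping and eviction described above can be sketched as a small store. It takes records shaped like the resolve4 output with ttl enabled ({ address, ttl } with ttl in seconds). The class name, the 10-second floor, and the 60-second ceiling are assumptions chosen to match the numbers in the text.

```javascript
// Stores resolve4-style records and expires them by the smallest record TTL,
// clamped to a floor and a ceiling.
class TtlRecordCache {
  constructor({ minTtlMs = 10_000, maxTtlMs = 60_000 } = {}) {
    this.minTtlMs = minTtlMs;
    this.maxTtlMs = maxTtlMs;
    this.entries = new Map();
  }

  store(hostname, records, now = Date.now()) {
    if (records.length === 0) return;
    const shortestMs = Math.min(...records.map((r) => r.ttl * 1000));
    const ttlMs = Math.min(Math.max(shortestMs, this.minTtlMs), this.maxTtlMs);
    this.entries.set(hostname, {
      addresses: records.map((r) => r.address),
      expiresAt: now + ttlMs,
    });
  }

  get(hostname, now = Date.now()) {
    const entry = this.entries.get(hostname);
    if (!entry || entry.expiresAt <= now) return null;
    return entry.addresses;
  }

  // Call when a connection to a cached address fails, so the next request
  // resolves again instead of reusing a dead IP until the TTL runs out.
  invalidate(hostname) {
    this.entries.delete(hostname);
  }
}
```

The floor is what deliberately overrides a 5-second upstream TTL. The ceiling keeps a long public TTL from pinning an address for hours.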

Platform-level DNS caching

Before you write a cache in every application, check whether your platform already has one. On Linux servers:

systemd-resolved caches DNS with configurable TTLs. If it is active, getaddrinfo hits the local daemon rather than the upstream server.
dnsmasq or unbound can run as a local caching resolver in your container or on the node.
In Kubernetes, the default CoreDNS configuration enables the cache plugin with a 30-second limit. But that cache lives inside each CoreDNS pod and is shared by every workload that pod serves. Under heavy load, it can still saturate.

Application-level caching gives you isolation. A misbehaving neighbor pod cannot evict your records from the CoreDNS cache. You also get observability: you can log cache hits, misses, and TTLs without parsing CoreDNS logs.

If you run on AWS, EC2 and Fargate both cache VPC DNS internally, but the cache is small and shared. We have still seen DNS throttling on high-throughput Node.js services that rely on it alone. The application cache is the final defense.

Monitoring and alerting

Once you deploy a cache, you want to know it is working. Export these metrics:

class DnsCache {
  constructor(options) {
    this.cache = new Map();
    this.defaultTtlMs = options?.defaultTtlMs ?? 60_000;
    this.maxEntries = options?.maxEntries ?? 1000;
    this.hits = 0;
    this.misses = 0;
  }

  async lookup(hostname, options) {
    const now = Date.now();
    const cached = this.cache.get(hostname);

    if (cached && cached.expiresAt > now) {
      this.hits++;
      return { address: cached.address, family: cached.family };
    }

    this.misses++;
    const result = await dnsLookup(hostname, options);
    /* ... */
    return result;
  }

  stats() {
    const total = this.hits + this.misses;
    return {
      hits: this.hits,
      misses: this.misses,
      hitRate: total === 0 ? 0 : Math.round((this.hits / total) * 1000) / 10,
      size: this.cache.size,
    };
  }
}

Expose stats() on a /health or /metrics endpoint. Alert if:

hitRate drops below 70% after the cache has warmed up. That means TTLs are too short, or connections are churning too fast.
misses spike suddenly. That suggests a deploy cleared the cache, or a hostname is failing to resolve and retries are bypassing the cache.
DNS lookup duration from the instrumented version goes above 50 ms. If lookups are slow even with a local cache, the OS resolver is the bottleneck.
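
The first two alert conditions can be sketched as a comparison between two stats() snapshots taken one scrape interval apart. Working from deltas matters because the counters above are cumulative since process start. The function name and every threshold except the 70% hit rate are assumptions to tune for your traffic.

```javascript
// Compares two stats() snapshots and returns the alerts that fire.
function evaluateDnsAlerts(previous, current, {
  minHitRatePct = 70,
  minSamples = 100, // skip the hit-rate check until the cache has warmed up
  maxMissesPerInterval = 50,
} = {}) {
  const alerts = [];
  const hits = current.hits - previous.hits;
  const misses = current.misses - previous.misses;
  const total = hits + misses;

  if (total >= minSamples) {
    const hitRatePct = (hits / total) * 100;
    if (hitRatePct < minHitRatePct) {
      alerts.push(`hit rate ${hitRatePct.toFixed(1)}% below ${minHitRatePct}%`);
    }
  }
  if (misses > maxMissesPerInterval) {
    alerts.push(`${misses} misses in the last interval`);
  }
  return alerts;
}

const previous = { hits: 10_000, misses: 200 };
const current = { hits: 10_600, misses: 600 };
console.log(evaluateDnsAlerts(previous, current));
```

In this example the interval saw 600 hits and 400 misses, so both the hit-rate alert and the miss-spike alert fire even though the lifetime hit rate still looks healthy.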

A full example: wrapping a service client

Here is how to wire the cache into a real service-client pattern:

import { Agent, request } from 'undici';

const dnsCache = new DnsCache({ defaultTtlMs: 30_000 });

const agent = new Agent({
  connections: 50,
  connect: {
    lookup: (hostname, options, callback) => {
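      dnsCache.lookup(hostname, options).then(
        (result) => {
          // With options.all, net.connect expects a single array of { address, family }
          if (Array.isArray(result)) callback(null, result);
          else callback(null, result.address, result.family);
        },
        (err) => callback(err),
      );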
    },
  },
});

async function fetchUser(userId) {
  const { body } = await request(
    `https://users-api.internal/users/${userId}`,
    { dispatcher: agent, method: 'GET' }
  );
  return body.json();
}

The first request to users-api.internal resolves the hostname and caches the IP for 30 seconds. Every subsequent request in that window reuses the cached address. If the keep-alive pool also holds the TCP connection open, the total DNS lookup rate drops to near zero.
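
To put a number on "near zero", here is a rough upper bound for one process once the cache is warm. It is a sketch that ignores concurrent misses at the moment an entry expires, and the function name is illustrative.

```javascript
// Upper bound on DNS lookups per minute with a warm cache: at most one miss
// per hostname per TTL window, and never more than the requests actually made.
function cachedLookupsPerMinute({ requestsPerMinute, hostnames, ttlMs }) {
  const missesPerMinute = hostnames * (60_000 / ttlMs);
  return Math.min(requestsPerMinute, missesPerMinute);
}

const before = 10_000; // one lookup per request, the figure used earlier
const after = cachedLookupsPerMinute({
  requestsPerMinute: before,
  hostnames: 1,
  ttlMs: 30_000,
});
console.log({ before, after, reduction: before / after });
// → { before: 10000, after: 2, reduction: 5000 }
```

The bound scales with the number of distinct hostnames and processes, not with request volume, which is what takes pressure off the shared resolver.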

The decision tree

Situation	Fix
High-rate HTTP calls to one hostname	Keep-alive pool first, DNS cache as backup
Frequent database or Redis reconnects	Custom `lookup` with TTL cache in the client config
Node 18 using native `fetch` without dispatcher	Switch to `undici` Agent or global `fetch` dispatcher override
Multiple hostnames, short-lived connections per host	TTL-aware `dns.lookup` cache is essential
Kubernetes with very high pod density	Add application cache even if CoreDNS caching is on
Debugging unexplained connect timeouts on healthy targets	Instrument `dns.lookup` duration to confirm DNS is the cause

The takeaway

DNS is not free. On a busy Node.js service, it is a hidden tax on every outbound connection, and the default stack does nothing to amortize it. The symptoms are maddening: random timeouts on healthy downstreams, latency spikes with no CPU profile blame, and errors that vanish when you switch to raw IPs.

Measure it first: patch dns.lookup for one deploy and log the duration. If p50 is above 5 ms or p99 is above 50 ms, you have a DNS problem. Fix it with keep-alive and connection pooling to eliminate redundant lookups, then add a TTL-aware application cache for anything that still reconnects frequently. Most services can drop their DNS query rate by 100x with fewer than 80 lines of code.

Your downstream APIs are fast. Your network is fine. Make sure your process is not spending its time asking the same question over and over.

A note from Yojji

Infrastructure reliability is about fixing the layers no one talks about until they break at 2 AM. DNS caching, connection pooling, and request path instrumentation are exactly the kind of operational detail Yojji engineers build into production systems from the start.

Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK. Their team of 50+ senior engineers has completed hundreds of projects using Node.js, TypeScript, and cloud-native architecture. If your team is dealing with unexplained latency spikes or building high-throughput microservices, Yojji can help you ship infrastructure that stays fast under real load.
