API Dependency Health Checks: Why /health Is Not Enough

The Practical Developer


The Practica · 2026-05-24 · via The Practical Developer

The pager went off at 3:17 a.m. The checkout API had a 94% error rate. The pods were all Running. The CPU was at 8%. The liveness probes were green. The readiness probes were green. Every health check in the cluster said the service was fine. The truth was simpler: the Postgres connection pool had exhausted its slots because a background migration job had leaked connections. New requests could not acquire a database handle. The application threw 500s. Kubernetes saw a healthy pod and kept sending traffic.

This is the /health trap. Teams build a route that returns { status: "ok" } and call it done. Kubernetes uses it for readiness. Load balancers use it for target health. Engineers look at it and feel safe. But a process that can execute res.status(200).json({}) tells you almost nothing about whether that process can actually serve a request. The database might be down. Redis might be partitioned. The downstream payment API might be rejecting auth tokens. The queue consumer might be wedged. The health check is blind to all of it.

This post shows how to build dependency-aware health checks: ones that validate the actual resources your API needs. We will look at the three layers of dependency checking, the failure modes you need to distinguish, and the code to implement it without creating new outages. No framework changes. Just honest probes.

What a naive health check actually tells you

Here is the naive version that exists in half the production services I audit:

app.get('/health', (req, res) => {
  res.status(200).json({ status: 'ok' });
});

This confirms three things, and three things only: the event loop is not completely frozen, the process has not been OOM-killed, and the HTTP router is mounted. It does not confirm that:

A database connection can be acquired in a reasonable time.
Redis accepts a PING.
The downstream inventory API returns a non-error status.
The file system is writable (if you buffer uploads locally).
The event consumer thread is actually processing messages.

In distributed systems, the most common failures are not process crashes. They are partial failures: a dependency is slow, misconfigured, or rejecting requests. A naive health check is useless against partial failure. Worse, it is dangerous, because it tells your infrastructure that everything is fine when it is not.

The three classes of dependency failures

Not every dependency failure should mark a pod as unhealthy. If you take yourself out of rotation the moment Redis blips, you amplify a small failure into a cascading outage. You need to classify dependencies before you probe them.

Critical dependencies. If this is down, you cannot serve meaningful traffic. For a REST API that reads and writes a relational database, the database is critical. For a video processing service, the object storage backend is critical. If the dependency fails, the pod should fail its readiness probe. Traffic should route elsewhere. If nowhere is healthy, the load balancer returns 503s, which is honest.

Degraded dependencies. If this is down, you can still serve traffic, but some features are unavailable. A caching layer is the classic example. If Redis fails, the API should still serve requests from the database. If analytics telemetry drops, the API should still process checkouts. These should not fail readiness. Instead, they should be monitored, and the service should degrade gracefully.

Best-effort dependencies. These are nice to have, but failures are invisible to users. Think of a metrics push gateway or a non-blocking audit log. These should not affect health checks at all. Probe them for observability, but never let a best-effort dependency evict a pod from the load balancer.

Get this classification wrong and you build a service that falls over because its statsd agent restarted. Get it right and you isolate failures to the blast radius they deserve.
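
One way to keep the classification honest is to store it as data and derive the readiness decision from it. The sketch below is illustrative only: decideStatus, the 'best_effort' label, and the statsd entry are hypothetical names chosen for this example, not part of any library.

```javascript
// Hypothetical helper: derive a readiness decision from dependency classes.
// `deps` is a list of { name, type, healthy }, where type is
// 'critical', 'degraded', or 'best_effort'.
function decideStatus(deps) {
  const failing = (type) =>
    deps.filter((d) => d.type === type && !d.healthy).map((d) => d.name);

  const criticalDown = failing('critical');
  const degradedDown = failing('degraded');

  return {
    // Only critical failures take the pod out of rotation.
    ready: criticalDown.length === 0,
    // Degraded failures are reported but never fail readiness.
    degraded: degradedDown.length > 0,
    // Best-effort failures are surfaced for observability only.
    ignored: failing('best_effort'),
    criticalDown,
    degradedDown,
  };
}

const status = decideStatus([
  { name: 'postgres', type: 'critical', healthy: true },
  { name: 'redis', type: 'degraded', healthy: false },
  { name: 'statsd', type: 'best_effort', healthy: false },
]);
console.log(status.ready, status.degraded); // true true
```

Keeping the decision in one pure function makes it easy to unit test the case that matters most: a best-effort outage must never flip ready to false.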

Designing the dependency check

The naive next step is to call every dependency inside the /health handler. This is a mistake. A health check endpoint is queried frequently. Kubernetes probes it every 10 seconds by default. A load balancer might query it every 5 seconds from multiple zones. If you execute a full SELECT 1 on Postgres, a PING to Redis, and a GET to a downstream API on every probe, you generate a significant background load. Worse, if the probe timeout is short (1-2 seconds is common), a transient slowdown in any dependency marks the pod unhealthy even when the dependency would recover a moment later.

A better design uses three techniques.

Background polling with caching. Instead of checking dependencies inline on every HTTP request to /health, run a background task that polls each dependency every few seconds and stores the result in memory. The /health endpoint returns the cached state. This separates probe frequency from dependency check frequency, and it lets you use longer, more realistic timeouts for the actual checks.

Separate readiness and liveness. Kubernetes distinguishes these for a reason. Liveness should mean “this pod is not stuck.” Keep it cheap. Readiness should mean “this pod can serve traffic.” That is where dependency checks belong. If readiness fails, Kubernetes removes the pod from the service endpoints. The pod stays alive so you can inspect logs. If liveness fails, Kubernetes restarts the container. Never put dependency checks on liveness, or a slow database will cause a restart loop.

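Timeout discipline. Every dependency check must have a timeout. For any check that runs inline in the probe handler, that timeout must be shorter than the probe timeout: if your readiness probe timeout is 2 seconds, your Postgres check should time out in 1 second. If the check exceeds that, the dependency is effectively unreachable and the pod should not receive traffic. If you instead set the dependency timeout to 5 seconds and the probe timeout to 2 seconds, the probe gives up before the check does, so it always fails on a slow dependency, even one that would have answered within 3 seconds. With background polling the check no longer runs inside the probe, but it still needs a timeout shorter than the polling interval so that slow checks do not pile up.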
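
A minimal sketch of that discipline, assuming nothing beyond Node.js built-ins: withTimeout is a hypothetical helper that rejects once a deadline passes and also aborts a signal for any call that can honor it. The 50 ms value and the hungCheck stand-in are illustrative.

```javascript
// Hypothetical helper: reject if `work` does not settle within `ms`.
// `work` receives an AbortSignal so calls that support cancellation can stop early.
async function withTimeout(work, ms) {
  const controller = new AbortController();
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`timed out after ${ms}ms`));
    }, ms);
  });
  try {
    // Whichever settles first wins; a late result from `work` is ignored.
    return await Promise.race([work(controller.signal), deadline]);
  } finally {
    clearTimeout(timer);
  }
}

// A check that never settles, standing in for a pool with no free connections.
const hungCheck = () => new Promise(() => {});

withTimeout(hungCheck, 50).catch((err) => console.log(err.message));
// logs "timed out after 50ms"
```

Racing against a deadline matters because many client calls ignore an AbortSignal entirely; aborting alone would leave the caller waiting.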

The code: a production dependency health checker

Here is a Node.js implementation that follows the rules above. It polls critical and degraded dependencies on an interval, caches the results, and exposes separate /health/live and /health/ready endpoints.

import { EventEmitter } from 'node:events';
import pg from 'pg';
import Redis from 'ioredis';

// Configuration: classify your dependencies explicitly.
const DEPENDENCIES = [
  {
    name: 'postgres',
    type: 'critical',
    check: checkPostgres,
    intervalMs: 5_000,
    timeoutMs: 1_500,
  },
  {
    name: 'redis',
    type: 'degraded',
    check: checkRedis,
    intervalMs: 5_000,
    timeoutMs: 1_000,
  },
];

const { Pool } = pg;
const dbPool = new Pool({ connectionString: process.env.DATABASE_URL });
const redis = new Redis(process.env.REDIS_URL);

class DependencyHealthMonitor extends EventEmitter {
  constructor(deps) {
    super();
    this.deps = deps;
    this.state = new Map();
    this.timers = [];
    for (const dep of deps) {
      // Start pessimistic: a dependency counts as unhealthy until its first check succeeds.
      this.state.set(dep.name, { healthy: false, lastChecked: null, error: null });
    }
  }

  start() {
    for (const dep of this.deps) {
      this._poll(dep);
      const timer = setInterval(() => this._poll(dep), dep.intervalMs);
      this.timers.push(timer);
    }
  }

  stop() {
    for (const t of this.timers) clearInterval(t);
  }

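  async _poll(dep) {
    const controller = new AbortController();
    let timeout;
    // Enforce the deadline here. pg and ioredis calls do not accept an AbortSignal,
    // so a check that ignores the signal would otherwise hang and never update state.
    const deadline = new Promise((_, reject) => {
      timeout = setTimeout(() => {
        controller.abort();
        reject(new Error(`${dep.name} check timed out after ${dep.timeoutMs}ms`));
      }, dep.timeoutMs);
    });

    try {
      await Promise.race([dep.check({ signal: controller.signal }), deadline]);
      this._update(dep.name, true, null);
    } catch (err) {
      this._update(dep.name, false, err.message);
    } finally {
      clearTimeout(timeout);
    }
  }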

  _update(name, healthy, error) {
    const previous = this.state.get(name);
    if (previous.healthy !== healthy) {
      console.log(JSON.stringify({
        event: 'dependency_health_changed',
        dependency: name,
        healthy,
        previousHealthy: previous.healthy,
        error,
        timestamp: new Date().toISOString()
      }));
      this.emit('change', { name, healthy, error });
    }
    this.state.set(name, { healthy, lastChecked: new Date().toISOString(), error });
  }

  isReady() {
    for (const dep of this.deps) {
      if (dep.type === 'critical' && !this.state.get(dep.name).healthy) {
        return false;
      }
    }
    return true;
  }

  isDegraded() {
    for (const dep of this.deps) {
      if (dep.type === 'degraded' && !this.state.get(dep.name).healthy) {
        return true;
      }
    }
    return false;
  }

  summary() {
    const out = {};
    for (const [name, s] of this.state) {
      out[name] = s;
    }
    return out;
  }
}

// Dependency check functions
async function checkPostgres({ signal }) {
  const client = await dbPool.connect();
  try {
    // Use a lightweight query. Do not SELECT * from a large table.
    await client.query('SELECT 1');
  } finally {
    client.release();
  }
}

async function checkRedis({ signal }) {
  await redis.ping();
}

// Application wiring
const monitor = new DependencyHealthMonitor(DEPENDENCIES);
monitor.start();

monitor.on('change', ({ name, healthy }) => {
  // Integrate with your metrics pipeline here.
  // Example: dependencyHealthyGauge.set({ name }, healthy ? 1 : 0);
});

import express from 'express';
const app = express();

// Liveness: cheap. Just confirms the event loop is responsive.
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

// Readiness: checks cached dependency state.
app.get('/health/ready', (req, res) => {
  const ready = monitor.isReady();
  const degraded = monitor.isDegraded();
  // Degraded still returns 200: the pod keeps serving and the body flags the degradation.
  // Only a failed critical dependency returns 503, which removes the pod from the Service.
  const statusCode = ready ? 200 : 503;

  res.status(statusCode).json({
    status: ready ? (degraded ? 'degraded' : 'ready') : 'not_ready',
    degraded,
    dependencies: monitor.summary(),
  });
});

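A few details matter here. The checkPostgres function acquires a connection from the pool, runs SELECT 1, and releases it. It does not reuse a dedicated connection, because a dedicated connection might survive when the pool is exhausted. You want to verify that the pool can actually hand out a handle. The timeout uses AbortController, whose signal you pass down to any async call that supports it. Neither dbPool.connect() nor redis.ping() accepts a signal, so for those two checks the signal alone cancels nothing; the deadline has to be enforced around the check itself, and setting connectionTimeoutMillis on the pg pool keeps abandoned acquire attempts from queuing up. The polling interval is 5 seconds, which means the worst-case delay between a dependency failure and the cached readiness state reflecting it is 5 seconds plus the check duration. Kubernetes then adds its own probe period and failure threshold before it stops routing traffic. That is fast enough for most services without being aggressive.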
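
One gap worth noting in any cached design: if the poller itself stalls, the cache keeps reporting the last result it saw. A possible guard, offered here as an assumption rather than as part of the monitor above, is to stop trusting an entry once it is older than a few polling intervals. The isFresh and effectiveHealthy names and the three-interval threshold are hypothetical.

```javascript
// Hypothetical guard: a cached result only counts if it is recent enough.
// `entry` has the same shape as the monitor's state values:
// { healthy, lastChecked (ISO string or null), error }.
function isFresh(entry, maxAgeMs, now = Date.now()) {
  if (entry.lastChecked === null) return false; // never checked
  return now - Date.parse(entry.lastChecked) <= maxAgeMs;
}

function effectiveHealthy(entry, intervalMs, now = Date.now()) {
  // Assumption: three missed intervals means the poller is stuck.
  return entry.healthy && isFresh(entry, intervalMs * 3, now);
}

const now = Date.parse('2026-01-01T00:00:30.000Z');
const recent = { healthy: true, lastChecked: '2026-01-01T00:00:27.000Z', error: null };
const stale = { healthy: true, lastChecked: '2026-01-01T00:00:00.000Z', error: null };

console.log(effectiveHealthy(recent, 5_000, now)); // true
console.log(effectiveHealthy(stale, 5_000, now));  // false
```

A check like this could sit inside isReady() in place of the bare healthy flag, at the cost of one more threshold to tune.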

Handling the degraded state gracefully

When a degraded dependency fails, the readiness endpoint still returns 200, so Kubernetes keeps routing traffic. Your application code needs to handle the absence of that dependency without crashing or serving 500s.

For a cache, this means skipping the cache and hitting the primary store:

async function getUser(id) {
  try {
    const cached = await redis.get(`user:${id}`);
    if (cached) return JSON.parse(cached);
  } catch (err) {
    // Log at debug level. Do not throw.
    console.log(JSON.stringify({ event: 'cache_read_failed', error: err.message }));
  }

  const user = await dbPool.query('SELECT * FROM users WHERE id = $1', [id]);
  return user.rows[0];
}

For a non-blocking audit log, race the write against a short timeout and swallow the failure, so callers can fire it without waiting on the outcome. If it fails or times out, log locally and move on:

async function audit(event) {
  try {
    await Promise.race([
      auditClient.post('/events', event),
      new Promise((_, reject) =>
        setTimeout(() => reject(new Error('audit_timeout')), 500)
      ),
    ]);
  } catch (err) {
    console.log(JSON.stringify({ event: 'audit_fallback', payload: event, error: err.message }));
  }
}

The principle is: if a dependency is classified as degraded, every code path that touches it must have a fallback. If you cannot build a fallback, reclassify the dependency as critical. Do not lie to yourself about resilience.

Operational guidance: do not take yourself down

Dependency health checks are powerful, but they introduce a new failure mode: the thundering herd of health checks. If every pod in a 40-replica deployment starts probing Postgres every 5 seconds, you add 8 probes per second to the database. That is usually fine, but during a recovery event, when Postgres is already slow, those probes can make things worse.
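
The arithmetic behind that figure is simple enough to keep next to your capacity numbers. This is a small sketch; probesPerSecond is a hypothetical helper, and the 400-replica case is an illustrative extrapolation, not a measurement.

```javascript
// Background probe load on one dependency, in checks per second.
function probesPerSecond(replicas, intervalMs) {
  return replicas / (intervalMs / 1000);
}

console.log(probesPerSecond(40, 5_000));  // 8
console.log(probesPerSecond(400, 5_000)); // 80
```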

Mitigate this with three practices.

Jitter your intervals. Do not start every pod’s polling at the same millisecond. Add a random offset up to the interval on startup:

const jitter = Math.floor(Math.random() * dep.intervalMs);
setTimeout(() => {
  this._poll(dep);
  const timer = setInterval(() => this._poll(dep), dep.intervalMs);
  this.timers.push(timer);
}, jitter);
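
The fragment above keeps the interval handle but not the initial setTimeout, so a stop() issued during the jitter window would not cancel the first poll. A self-contained variant might look like the following sketch; startJittered is a hypothetical name, and the injectable random parameter exists only to make the delay testable.

```javascript
// Hypothetical scheduler: first run after a random delay in [0, intervalMs),
// then every intervalMs. Returns a stop function that cancels whichever timer is pending.
function startJittered(task, intervalMs, random = Math.random) {
  let interval = null;
  const initial = setTimeout(() => {
    task();
    interval = setInterval(task, intervalMs);
  }, Math.floor(random() * intervalMs));

  return function stop() {
    clearTimeout(initial);
    if (interval !== null) clearInterval(interval);
  };
}

// Usage: count how often the task runs before it is stopped.
let runs = 0;
const stop = startJittered(() => { runs += 1; }, 50);
setTimeout(() => {
  stop();
  console.log(`ran ${runs} times before stop`);
}, 200);
```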

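Use separate credentials or a least-privileged user for probes. If your probe runs SELECT 1, it does not need write access. In extreme cases, give health-check queries their own connection pool with a small cap, so a runaway health check cannot exhaust the application’s main pool. The trade-off is that a separate pool no longer proves the main pool can hand out a connection, so if you go this route, keep a metric on the main pool's waiting requests alongside it.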

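Watch the watcher. Monitor your health check latency as its own metric, for both the /health/ready handler and the background checks behind it. If a background check starts taking 500 ms, either your probes are too heavy or your dependencies are under stress. If the handler itself is slow even though it only reads cached state, the event loop is busy. Either way, it is a signal.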
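
Because /health/ready only reads cached state, the duration of each background check is arguably the more informative number. A minimal way to capture it, assuming only Node.js built-ins, is to wrap each check function. The timed wrapper and the record callback are hypothetical; record stands in for whatever metrics client you use.

```javascript
// Hypothetical wrapper: measure how long a dependency check takes and report it.
// `record` stands in for a metrics client (for example, a histogram's observe method).
function timed(name, check, record) {
  return async (...args) => {
    const startedAt = performance.now();
    let ok = true;
    try {
      return await check(...args);
    } catch (err) {
      ok = false;
      throw err;
    } finally {
      // Runs on success and failure, so slow failures are measured too.
      record({ name, ok, durationMs: performance.now() - startedAt });
    }
  };
}

// Usage with a simulated 30 ms check.
const samples = [];
const slowCheck = () => new Promise((resolve) => setTimeout(resolve, 30));
const timedCheck = timed('postgres', slowCheck, (sample) => samples.push(sample));

timedCheck().then(() => {
  console.log(samples[0].name, samples[0].ok, Math.round(samples[0].durationMs));
});
```

Wrapping at the check level keeps the monitor unchanged: a wrapped function can be passed as the check in a dependency's configuration entry.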

Takeaway

A /health endpoint that returns 200 OK is not a health check. It is a process heartbeat. Real health checks validate the resources your service needs to do its job, classify those resources by criticality, and distinguish between “alive but useless” and “alive but slower.” Build background polling, cache the results, wire it to readiness probes, and write fallback code for degraded dependencies. Your 3 a.m. self will thank you.

A note from Yojji

Building resilient APIs means testing failure paths as seriously as happy paths. Yojji helps teams design dependency-aware architectures and implement production-grade health checks that prevent cascading outages. If your monitoring says everything is green while users see errors, it is time to look at how you validate the layers underneath.
