Node.js Child Processes: Spawn, Errors, Orphans, and Supervision in Production

The Practical Developer

The Practica · 2026-06-03 · via The Practical Developer

The image conversion service crashed three times before anyone noticed. Not the Express server. That stayed up. The child_process.fork() that processed uploaded images silently exited when Sharp hit a corrupt JPEG. No error in the parent. No restart. The queue filled up. Users uploaded, the upload returned 200, and the resized thumbnail never appeared. By the time the monitoring caught the 4,000-image backlog, the damage was done.

Node.js child_process is one of the most commonly misused APIs in production. It looks simple: call exec(), get output, move on. But the defaults are designed for interactive shells, not long-running servers. Output buffers fill up and block the child. Exit codes go unchecked. Orphan processes accumulate when the parent crashes. Stderr is ignored until the child has already failed.

This post covers four patterns that turn child_process from a footgun into a reliable subsystem: correct spawning with backpressure, error handling that covers every failure mode, orphan prevention with process groups, and a supervisor pattern for long-lived workers.

Pattern 1: spawn, not exec

The most common mistake is using exec() to run a shell command that produces output.

import { exec } from 'node:child_process';
import { promisify } from 'node:util';

const execAsync = promisify(exec);

// Looks harmless. Is not.
const { stdout, stderr } = await execAsync('ffmpeg -i input.mp4 output.webm');

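exec() buffers stdout and stderr into strings. The default maxBuffer is 1024 * 1024 bytes (1 MiB) per stream. If your ffmpeg output (or any command output) exceeds that, Node terminates the child and rejects with ERR_CHILD_PROCESS_STDIO_MAXBUFFER, and whatever was captured is truncated, so a job that was working correctly dies partway through. This is not theoretical. It happens with image processors, video transcoders, database dumps, and anything that produces more than a megabyte of output. The related hang, where the child blocks in write() because nobody drains a full pipe, shows up when you spawn with piped stdio and never read from it.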
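A quick way to check how exec() behaves at the limit on your own Node version is to run a child that prints more than the default maxBuffer and look at how the promise settles. The sketch below is illustrative only: the helper name probeExecLimit and the 2 MiB payload are assumptions, and it uses the current Node binary as the child so no extra tools are needed.

```typescript
import { exec } from 'node:child_process';
import { promisify } from 'node:util';

const execAsync = promisify(exec);

// Hypothetical probe: runs a child that prints `bytes` characters to stdout
// and reports how exec() settled.
async function probeExecLimit(bytes: number): Promise<string> {
  const script = `process.stdout.write('x'.repeat(${bytes}))`;
  try {
    const { stdout } = await execAsync(`"${process.execPath}" -e "${script}"`);
    return `ok: ${stdout.length} bytes`;
  } catch (err) {
    // exec() attaches a `code` property to the error it rejects with
    return `failed: ${(err as NodeJS.ErrnoException).code}`;
  }
}

// 2 MiB is twice the default maxBuffer
probeExecLimit(2 * 1024 * 1024).then(console.log);
```

Passing a larger maxBuffer in the exec() options moves the limit but keeps the whole output in memory, which is why the streaming approach scales better.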

The fix is spawn() with streaming stdio:

import { spawn } from 'node:child_process';

interface SpawnResult {
  stdout: string;
  stderr: string;
  exitCode: number | null;
  signal: string | null;
}

function spawnCollect(
  command: string,
  args: string[],
  options?: { timeout?: number; maxBuffer?: number }
): Promise<SpawnResult> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, {
      stdio: ['ignore', 'pipe', 'pipe'],
      timeout: options?.timeout ?? 30_000,
    });

    const stdoutChunks: Buffer[] = [];
    const stderrChunks: Buffer[] = [];
    let stdoutSize = 0;
    let stderrSize = 0;
    const maxBuffer = options?.maxBuffer ?? 10 * 1024 * 1024; // 10 MB default

    child.stdout!.on('data', (chunk: Buffer) => {
      stdoutSize += chunk.length;
      if (stdoutSize > maxBuffer) {
        child.kill();
        reject(new Error(`stdout exceeded maxBuffer (${maxBuffer} bytes)`));
        return;
      }
      stdoutChunks.push(chunk);
    });

    child.stderr!.on('data', (chunk: Buffer) => {
      stderrSize += chunk.length;
      if (stderrSize > maxBuffer) {
        child.kill();
        reject(new Error(`stderr exceeded maxBuffer (${maxBuffer} bytes)`));
        return;
      }
      stderrChunks.push(chunk);
    });

    child.on('error', (err) => {
      reject(err);
    });

    child.on('close', (exitCode, signal) => {
      resolve({
        stdout: Buffer.concat(stdoutChunks).toString('utf-8'),
        stderr: Buffer.concat(stderrChunks).toString('utf-8'),
        exitCode,
        signal,
      });
    });
  });
}

This gives you streaming reads with backpressure (the OS pipe buffer drains as you read), explicit max-buffer enforcement that kills the child instead of hanging, and separate access to stdout and stderr.

Key differences from exec():

Output is consumed as it arrives, not held until the end. Node reads from the OS pipe in chunks, so the child never stalls on a full pipe as long as the data handlers stay attached.
You control the buffer limit. Pick a value that fits your workload. Logs: 1 MB. Video frames: 100 MB. Kill the child if it exceeds.
You get the exit code AND the signal. A child killed by SIGTERM (signal: 'SIGTERM') is different from one that exits with code 1.
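When the output is too large to hold in memory at all, one option is to skip collection and pipe the child's stdout straight to its destination. The following is a minimal sketch under that assumption: spawnToFile is a hypothetical helper, and it relies on pipeline() from node:stream/promises to apply backpressure between the pipe and the file.

```typescript
import { spawn } from 'node:child_process';
import { once } from 'node:events';
import { createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';

// Hypothetical helper: streams a command's stdout to a file and returns the
// exit code. stderr is inherited so diagnostics still reach the parent's logs.
async function spawnToFile(
  command: string,
  args: string[],
  outPath: string
): Promise<number | null> {
  const child = spawn(command, args, { stdio: ['ignore', 'pipe', 'inherit'] });

  // Await both together so a spawn failure (which makes once() reject on the
  // 'error' event) is never left as an unhandled rejection.
  const [[exitCode]] = await Promise.all([
    once(child, 'close'),
    pipeline(child.stdout!, createWriteStream(outPath)),
  ]);
  return exitCode as number | null;
}
```

Memory use stays flat regardless of output size, because the write stream slows the reads down when the disk cannot keep up.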

Use exec() only for trivial commands where you control the input and the output is known to be small (under 10 KB). For everything else, spawn() with explicit stdio handling.

Pattern 2: handle every failure mode

A child process can fail in five distinct ways, and you need to handle all of them:

1. Command not found: spawn() emits an error event with code ENOENT.
2. Permission denied: spawn() emits an error event with code EACCES.
3. Non-zero exit code: the process ran but returned a failure code.
4. Killed by signal: the OS or another process terminated it.
5. Timeout: the process ran longer than expected.

Most implementations handle only #3. Here is the complete handler:

interface ProcessResult {
  stdout: string;
  stderr: string;
  ok: boolean;
  code: number | null;
  signal: string | null;
}

async function runProcess(
  command: string,
  args: string[],
  options?: { timeout?: number }
): Promise<ProcessResult> {
  return new Promise((resolve) => {
    const child = spawn(command, args, {
      stdio: ['ignore', 'pipe', 'pipe'],
      timeout: options?.timeout ?? 30_000,
    });

    let stdout = '';
    let stderr = '';

    child.stdout!.setEncoding('utf-8');
    child.stderr!.setEncoding('utf-8');
    child.stdout!.on('data', (d) => { stdout += d; });
    child.stderr!.on('data', (d) => { stderr += d; });

    child.on('error', (err: NodeJS.ErrnoException) => {
      // ENOENT, EACCES, etc. The child never started.
      resolve({
        stdout,
        stderr: `Failed to spawn: ${err.message}`,
        ok: false,
        code: err.code === 'ENOENT' ? 127 : 126,
        signal: null,
      });
    });

    child.on('close', (code, signal) => {
      const ok = code === 0 && signal === null;
      resolve({ stdout, stderr, ok, code, signal });
    });
  });
}

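The error event fires when the child cannot start. The close event fires once the child has exited and its stdio streams have closed. Both can fire for the same child: a spawn failure such as ENOENT emits error, and close can follow. That is safe here because a Promise settles only once, so the first resolve wins. The close event alone handles normal exits, signal kills, and timeouts (Node kills with SIGTERM on timeout, which fires close with a signal).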

One gotcha: the exit event fires before close. Use close instead of exit because close guarantees all stdio streams have finished. If you use exit, you might read partial output.
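One thing the close handler cannot tell on its own is whether a SIGTERM came from the timeout or from somewhere else. A possible way to separate the five failure modes is to run your own timer and record that it fired. This is a sketch, not the original service's code: runClassified and the Outcome labels are made-up names.

```typescript
import { spawn } from 'node:child_process';

type Outcome = 'ok' | 'spawn-error' | 'exit-code' | 'signal' | 'timeout';

interface ClassifiedResult {
  outcome: Outcome;
  code: number | null;
  signal: NodeJS.Signals | null;
}

// Hypothetical helper: runs a command and labels how it ended.
function runClassified(
  command: string,
  args: string[],
  timeoutMs = 30_000
): Promise<ClassifiedResult> {
  return new Promise((resolve) => {
    const child = spawn(command, args, { stdio: 'ignore' });
    let timedOut = false;

    // Own timer instead of spawn's `timeout` option, so we know who sent the signal
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill('SIGTERM');
    }, timeoutMs);

    child.on('error', () => {
      clearTimeout(timer);
      resolve({ outcome: 'spawn-error', code: null, signal: null });
    });

    child.on('close', (code, signal) => {
      clearTimeout(timer);
      const outcome: Outcome = timedOut
        ? 'timeout'
        : signal !== null
          ? 'signal'
          : code !== 0
            ? 'exit-code'
            : 'ok';
      // No-op if the error handler already resolved the promise
      resolve({ outcome, code, signal });
    });
  });
}
```

Callers can then alert differently per outcome: a timeout usually points at the input or the workload, while a spawn error points at the deployment.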

Pattern 3: prevent orphans with process groups

When your Node.js process crashes, any child processes it spawned become orphans. The OS reparents them to init (PID 1), and they keep running. This is how production incidents compound: the parent OOMs, but the twelve ffmpeg children it spawned continue consuming CPU and memory on the same host.

The fix is to launch each child as the leader of its own process group and kill that group when the parent exits.

import { spawn, execSync } from 'node:child_process';

function spawnWithGroup(command: string, args: string[]): ReturnType<typeof spawn> {
  const child = spawn(command, args, {
    stdio: ['ignore', 'pipe', 'pipe'],
    // On POSIX, detached: true calls setsid(), which makes the child the leader
    // of a new process group whose ID equals child.pid. Without it the child
    // shares the parent's group and kill(-child.pid) has no group to target.
    // Windows has no process groups; cleanup there uses taskkill /T instead.
    detached: process.platform !== 'win32',
  });

  return child;
}

// Graceful cleanup
function killProcessGroup(child: ReturnType<typeof spawn>): void {
  if (child.pid === undefined) return;

  if (process.platform === 'win32') {
    try {
      execSync(`taskkill /PID ${child.pid} /T /F`, { stdio: 'ignore' });
    } catch {
      // Process tree may already be gone
    }
  } else {
    // Negative PID sends signal to the process group
    try {
      process.kill(-child.pid, 'SIGTERM');
    } catch {
      // Process group may already be dead
    }
  }
}

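// Use with exit handlers
function setupOrphanPrevention(child: ReturnType<typeof spawn>): void {
  // 'exit' fires on every exit path Node controls, including process.exit()
  // and uncaught exceptions. process.kill() is synchronous, so it is safe here.
  const onExit = () => {
    killProcessGroup(child);
  };
  // Adding a signal listener turns off Node's default terminate-on-signal
  // behavior, so exit explicitly; the 'exit' listener then kills the group.
  const onSignal = () => {
    process.exit(1);
  };

  process.on('exit', onExit);
  process.on('SIGTERM', onSignal);
  process.on('SIGINT', onSignal);
  process.on('SIGHUP', onSignal);

  // Remove listeners when child exits
  child.on('close', () => {
    process.removeListener('exit', onExit);
    process.removeListener('SIGTERM', onSignal);
    process.removeListener('SIGINT', onSignal);
    process.removeListener('SIGHUP', onSignal);
  });
}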

The key detail is the negative PID in process.kill(-child.pid, 'SIGTERM'). On POSIX systems, a negative PID sends the signal to every process in the process group with that ID, which only exists if the child leads its own group. If your child spawns its own children (ffmpeg does, so does make), they all get killed.
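SIGTERM is a request, and a child can ignore it. A common follow-up, sketched here as an assumption and not taken from the original fix, is to escalate to SIGKILL for the whole group after a grace period. terminateGroup and the 5-second default are illustrative, and the sketch assumes a POSIX system and a child started with detached: true so that it leads its own group.

```typescript
import { spawn, ChildProcess } from 'node:child_process';

// Hypothetical helper: SIGTERM the child's process group, then SIGKILL it if
// the child has not exited within graceMs. Resolves once the child has exited.
function terminateGroup(child: ChildProcess, graceMs = 5_000): Promise<void> {
  return new Promise((resolve) => {
    const pid = child.pid;
    // Nothing to do if the child never started or has already exited
    if (pid === undefined || child.exitCode !== null || child.signalCode !== null) {
      resolve();
      return;
    }

    const signalGroup = (signal: NodeJS.Signals) => {
      try {
        process.kill(-pid, signal); // Negative PID targets the whole group
      } catch {
        // Group is already gone
      }
    };

    const timer = setTimeout(() => signalGroup('SIGKILL'), graceMs);
    child.once('exit', () => {
      clearTimeout(timer);
      resolve();
    });
    signalGroup('SIGTERM');
  });
}

// Example: const child = spawn('ffmpeg', ['-i', 'in.mp4', 'out.webm'], { detached: true });
//          await terminateGroup(child);
```

The grace period gives well-behaved children time to flush partial output before the forced kill.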

Important caveat: This only works if the parent gets a chance to run its handlers. If the parent is killed outright (SIGKILL from the OOM killer, a segfault inside the runtime), no handler runs. For that scenario, run your Node.js process with a process supervisor (systemd, supervisord, or Docker with --init) so the supervisor or container runtime can clean up what the parent left behind. Or record the child PIDs somewhere outside the parent so a later cleanup pass can find them.

If you are on Linux and want maximum protection, set PR_SET_PDEATHSIG on the child. The kernel then sends the child a signal of your choice when the parent dies, no matter how the parent dies. Node has no spawn option for this, so one route is a small wrapper that sets it before exec'ing the real command:

import { spawn } from 'node:child_process';

// Linux only. setpriv ships with util-linux; this assumes a version that
// supports --pdeathsig. It sets the parent-death signal, then execs the command.
const command = 'ffmpeg';
const args = ['-i', 'input.mp4', 'output.webm'];

const childWithDeathSig = spawn(
  'setpriv',
  ['--pdeathsig', 'SIGKILL', '--', command, ...args],
  { stdio: ['ignore', 'pipe', 'pipe'] }
);

For a portable solution that handles crashes too, use a monitoring process (pattern 4) that watches both the parent and children, or run your service in Docker with init: true in your Compose file, which runs an init process as PID 1 that reaps orphans.

Pattern 4: the supervisor pattern for long-lived workers

When you fork() a worker process to handle CPU-bound work, you need more than just spawning it. You need supervision: restart on crash, backoff on repeated crashes, and health checks.

Here is a worker supervisor that handles all three:

import { fork, ChildProcess } from 'node:child_process';
import { EventEmitter } from 'node:events';

interface SupervisorOptions {
  modulePath: string;
  args?: string[];
  env?: Record<string, string>;
  maxRestarts?: number;
  restartDelay?: number;       // Base delay in ms
  healthCheckInterval?: number;
}

class WorkerSupervisor extends EventEmitter {
  private child: ChildProcess | null = null;
  private restartCount = 0;
  private healthCheckTimer: ReturnType<typeof setInterval> | null = null;
  private stopped = false;

  constructor(private options: SupervisorOptions) {
    super();
    this.options.maxRestarts ??= 10;
    this.options.restartDelay ??= 1000;
    this.options.healthCheckInterval ??= 15_000;
  }

  start(): void {
    this.stopped = false;
    this.spawn();
  }

  stop(): void {
    this.stopped = true;
    this.clearHealthCheck();
    if (this.child) {
      this.child.kill('SIGTERM');
      this.child = null;
    }
  }

  private spawn(): void {
    if (this.stopped) return;

    this.child = fork(this.options.modulePath, this.options.args, {
      env: { ...process.env, ...this.options.env },
      stdio: ['pipe', 'pipe', 'pipe'],
    });

    this.child.stdout?.pipe(process.stdout);
    this.child.stderr?.pipe(process.stderr);

    this.child.on('message', (msg: unknown) => {
      this.emit('message', msg);
      // Reset restart count on any message (worker is alive and working)
      this.restartCount = 0;
    });

    this.child.on('exit', (code, signal) => {
      this.child = null;
      this.clearHealthCheck();

      if (this.stopped) return;

      const unexpected = code !== 0 || signal !== null;
      if (unexpected) {
        this.restartCount++;
        this.emit('crashed', { code, signal, restartCount: this.restartCount });

        if (this.restartCount > this.options.maxRestarts!) {
          this.emit('exhausted', {
            message: `Worker crashed ${this.restartCount} times. Giving up.`,
          });
          return;
        }

        const delay = this.options.restartDelay! * Math.pow(2, this.restartCount - 1);
        this.emit('restarting', { delay, attempt: this.restartCount });
        setTimeout(() => this.spawn(), delay);
      }
    });

    this.child.on('error', (err) => {
      this.emit('error', err);
    });

    this.startHealthCheck();
  }

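  private startHealthCheck(): void {
    this.clearHealthCheck();
    this.healthCheckTimer = setInterval(() => {
      const child = this.child; // Pin the instance this ping belongs to
      if (!child || child.killed || !child.connected) return;

      const onMessage = (msg: unknown) => {
        // Ignore ordinary work messages; only a pong answers the ping
        if ((msg as { type?: string } | null)?.type !== 'pong') return;
        clearTimeout(timeout);
        child.off('message', onMessage);
        this.restartCount = 0; // Healthy response resets throttle
      };

      // If no pong arrives within the timeout, kill the worker;
      // the exit handler schedules the restart
      const timeout = setTimeout(() => {
        child.off('message', onMessage);
        if (this.child !== child) return; // Already exited or replaced
        this.emit('unresponsive');
        child.kill('SIGKILL');
      }, 5000);

      child.on('message', onMessage);
      child.send({ type: 'ping' });
    }, this.options.healthCheckInterval);
  }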

  private clearHealthCheck(): void {
    if (this.healthCheckTimer) {
      clearInterval(this.healthCheckTimer);
      this.healthCheckTimer = null;
    }
  }
}

Usage:

const supervisor = new WorkerSupervisor({
  modulePath: './image-worker.js',
  maxRestarts: 5,
  restartDelay: 500,
  healthCheckInterval: 10_000,
});

supervisor.on('crashed', ({ code, signal, restartCount }) => {
  console.error(`Worker crashed (code=${code}, signal=${signal}), restart ${restartCount}`);
});

supervisor.on('exhausted', ({ message }) => {
  console.error(message);
  // Alert PagerDuty, send to metrics, etc.
});

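supervisor.on('unresponsive', () => {
  console.warn('Worker did not respond to health check. Forcing restart.');
});

// An EventEmitter throws if 'error' is emitted with no listener attached
supervisor.on('error', (err) => {
  console.error('Worker error:', err);
});

supervisor.start();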

And the worker (image-worker.js) needs to handle the health-check protocol:

process.on('message', (msg: { type?: string }) => {
  if (msg.type === 'ping') {
    process.send!({ type: 'pong' });
  }
});
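A real worker also has to handle jobs alongside pings. The sketch below shows one way to structure that, with hypothetical message shapes (job, done, failed) and a placeholder processJob that are not part of the original post. It also exits on disconnect, which fires in a forked worker when the IPC channel to the parent closes, so the worker does not outlive a parent that died.

```typescript
// Hypothetical message shapes; only ping/pong come from the supervisor above
type Incoming = { type: 'ping' } | { type: 'job'; id: string };
type Outgoing =
  | { type: 'pong' }
  | { type: 'done'; id: string }
  | { type: 'failed'; id: string; error: string };

// Turns one incoming message into one reply. A failing job becomes a
// 'failed' reply, so a single bad input does not crash the worker.
async function handleMessage(
  msg: Incoming,
  runJob: (id: string) => Promise<void>
): Promise<Outgoing> {
  if (msg.type === 'ping') return { type: 'pong' };
  try {
    await runJob(msg.id);
    return { type: 'done', id: msg.id };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    return { type: 'failed', id: msg.id, error };
  }
}

// Placeholder for the real work (resizing an image, for example)
async function processJob(_id: string): Promise<void> {}

// process.send only exists when this file was started with fork()
if (process.send) {
  process.on('message', (msg: unknown) => {
    void handleMessage(msg as Incoming, processJob).then((reply) => {
      process.send!(reply);
    });
  });
  // The IPC channel closes when the parent goes away, however it died
  process.on('disconnect', () => process.exit(0));
}
```

A job that blocks the event loop for longer than the ping timeout will look unresponsive to the supervisor, so long synchronous work needs to be chunked or the timeout raised.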

The supervisor does three things the naive approach misses:

Exponential backoff on restarts. The delay doubles after each consecutive crash, and if the worker keeps crashing (config error, corrupted input), the supervisor stops trying after maxRestarts instead of restarting forever in a tight loop.
Health checks. A worker can be alive but stuck (infinite loop, deadlock). The supervisor sends a ping and expects a pong within 5 seconds. No response means SIGKILL.
Restart count reset on success. If the worker processes a message successfully, the counter resets. This prevents transient failures (OOM from a single large image) from accumulating into a permanent blacklist.
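To make the backoff concrete, the delay formula from the supervisor (restartDelay * 2^(attempt - 1)) can be tabulated ahead of time. The capMs parameter below is an added assumption, not something the supervisor above implements; it is shown because an uncapped schedule grows quickly.

```typescript
// Delay before restart number `attempt` (1-based), doubling each time.
// capMs is a hypothetical upper bound that the supervisor above does not have.
function backoffDelay(baseMs: number, attempt: number, capMs = Infinity): number {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}

// Full schedule for a given restart budget
function backoffSchedule(baseMs: number, maxRestarts: number, capMs = Infinity): number[] {
  return Array.from({ length: maxRestarts }, (_, i) => backoffDelay(baseMs, i + 1, capMs));
}

// Settings from the usage example: restartDelay 500, maxRestarts 5
console.log(backoffSchedule(500, 5)); // [ 500, 1000, 2000, 4000, 8000 ]

// Supervisor defaults: restartDelay 1000, maxRestarts 10
console.log(backoffDelay(1000, 10)); // 512000 (about 8.5 minutes)
```

With the defaults, the last wait is long enough that a cap, or a lower maxRestarts, is worth considering for workers that must come back quickly.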

The practical takeaway

Here is the rule of thumb for choosing a child process API:

Task	API	Why
Run a short command with tiny output (under 10 KB)	`exec()`	Convenient, but only for known-small output.
Run any command with unknown or large output	`spawn()` + streaming pipes	No buffer deadlock.
Fork a Node.js module as a worker	`fork()` + supervisor pattern	Built-in IPC, health checks, restart backoff.
Run a daemon or background service	`spawn()` with `detached: true`	Process group isolation.

And the checklist for every child_process usage in production:

Are stdout and stderr piped as streams, not buffered strings?
Is there a max buffer limit that kills the child if exceeded?
Are both the error event (spawn failure) and close event (exit) handled?
Is a non-zero exit code treated as an error?
Are orphan children cleaned up when the parent exits or crashes?
If the child is long-lived, is there a health check and restart policy?

The image conversion service I mentioned at the start? The fix was a 40-line supervisor with streaming stdio, a max buffer of 50 MB (images are big), and a health check that caught the corrupt-JPEG case within seconds. The crash loop that had filled 4,000 queue items over six hours was caught within 30 seconds on the next deployment. The code in this post is that fix, extracted and generalized.

A note from Yojji

The kind of work this post describes (handling every failure mode from ENOENT to SIGKILL, designing process supervision with backoff, and preventing orphans in production) is the unglamorous infrastructure engineering that separates services that recover from ones that compound failures. It is exactly the kind of production-aware backend craft that Yojji’s teams build into the systems they ship.

Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK. Their teams specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, Google Cloud), and full-cycle product engineering covering discovery, design, development, QA, and DevOps. If your team would rather hire the practice of building reliable, well-instrumented process architectures than learn it the hard way during a silent queue buildup, Yojji is worth a conversation.
