Structured Logging With Pino: The 60 Lines That Make Your 3 a.m. Debugging Possible

The Practical Developer


The Practica · 2026-05-08 · via The Practical Developer

A customer emails you at 03:14 saying “I clicked Pay, got a 500, and was charged anyway.” You open the production logs. You see this:

[2026-05-21 03:14:07] processing request
[2026-05-21 03:14:07] db connected
[2026-05-21 03:14:07] error: cannot read property of undefined
[2026-05-21 03:14:07] processing request
[2026-05-21 03:14:07] db connected

Whose request was that? Which user? Which charge? Was the second request a retry of the first? The string-formatted, timestamped wall of text gives you nothing to grep against. You spend an hour cross-referencing IDs across three log streams and finally find the bug, ten minutes before standup.

The fix is structured logging: JSON logs, one event per line, with a correlation ID that ties every log line for a single request together. Sixty lines of Node.js, one library, and a habit change. Here is exactly what that looks like, why each piece matters, and the five queries that turn it into a debugging superpower.

What “structured” actually means

A structured log entry is a JSON object, not a sentence. Compare:

[2026-05-21 03:14:07] user 4821 charge failed: card declined (amount: 1500)

…to:

{"level":"error","time":1716258847123,"req_id":"01HXYZ...","user_id":4821,"charge_id":"ch_abc","amount_cents":1500,"event":"charge_failed","reason":"card_declined","msg":"charge failed"}

The first is a story you have to read. The second is a record you can query. In a log aggregator, the JSON version lets you write event:charge_failed AND reason:card_declined and get every relevant line back in 200ms. The string version requires either a regex or your eyeballs.
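To make the difference concrete, here is a minimal sketch of that same query run against raw JSON lines using only the Node.js standard library. The sample lines and the `matching` helper are illustrative assumptions, not any aggregator's API.

```typescript
// Sample log lines (hypothetical data in the shape shown above).
const lines: string[] = [
  '{"level":"info","req_id":"a1","event":"charge_attempt","user_id":4821}',
  '{"level":"error","req_id":"a1","event":"charge_failed","reason":"card_declined","user_id":4821}',
  '{"level":"error","req_id":"b2","event":"charge_failed","reason":"gateway_timeout","user_id":77}',
];

type LogRecord = Record<string, unknown>;

// Keep every record whose fields equal all key/value pairs in `where`.
function matching(raw: string[], where: LogRecord): LogRecord[] {
  return raw
    .map((line) => JSON.parse(line) as LogRecord)
    .filter((rec) => Object.entries(where).every(([k, v]) => rec[k] === v));
}

const declined = matching(lines, { event: 'charge_failed', reason: 'card_declined' });
console.log(declined.map((r) => r.req_id)); // only the a1 request matches
```

Doing the same against the string-formatted version would mean writing and maintaining a regex per message shape.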

Every observability tool worth using (Datadog, Logtail, Grafana Loki, ClickHouse, even jq against a file) works dramatically better against JSON. The investment to switch is small. The investment to not switch compounds with every incident.

Why Pino over Winston, Bunyan, or `console.log`

Three reasons. First, Pino is roughly 5× faster than Winston for the same output, because the main thread does little more than write JSON synchronously to a stream; anything fancy (rotation, transports, pretty-printing) is offloaded to a worker thread or a separate process. Second, its API is smaller; there are basically four methods to learn (info, warn, error, child) and one config object. Third, the Pino ecosystem assumes structured logs from day one; child loggers, redaction, and serializers all just work without ceremony.

You can argue Winston has more transports. You can argue Bunyan has the cleanest CLI. In production, the thing that matters is “did the log line get written before the process died, and is it parseable.” Pino wins both, and the maintainer (Matteo Collina, who also maintains parts of Node core) has been steadily de-risking it for years.

Three files: a logger module, a request-context module that carries a correlation ID through the whole request, and the Express wiring. Together, exactly enough to make every log line traceable.

// logger.ts
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  // Pino's default time is epoch ms. Keep it. Human-readable timestamps
  // are an aggregator concern; epoch is precise and timezone-free.
  base: {
    service: process.env.SERVICE_NAME ?? 'api',
    env: process.env.NODE_ENV ?? 'development',
    version: process.env.GIT_SHA ?? 'dev',
  },
  // Redact common secret fields before they ever hit the JSON stream.
  redact: {
    paths: [
      'req.headers.authorization',
      'req.headers.cookie',
      '*.password',
      '*.token',
      '*.card_number',
    ],
    censor: '[REDACTED]',
  },
  // Production: raw JSON to stdout. Local: pretty-print via pino-pretty.
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty', options: { colorize: true, singleLine: true } }
    : undefined,
});

// request-context.ts
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';
import type { Logger } from 'pino';
import { logger as rootLogger } from './logger';

type Context = { reqId: string; logger: Logger };

const storage = new AsyncLocalStorage<Context>();

export function runWithContext<T>(ctx: Context, fn: () => T): T {
  return storage.run(ctx, fn);
}

export function getLogger(): Logger {
  // Falls back to the root logger outside a request, useful for boot logs.
  // A static import is safe here: logger.ts does not import this module,
  // so there is no circular dependency to work around with require().
  return storage.getStore()?.logger ?? rootLogger;
}

export function getReqId(): string | undefined {
  return storage.getStore()?.reqId;
}

export function newReqId(): string {
  return randomUUID();
}

// middleware.ts
import type { Request, Response, NextFunction } from 'express';
import { logger } from './logger';
import { runWithContext, newReqId } from './request-context';

export function requestLogger(req: Request, res: Response, next: NextFunction) {
  // Honor an upstream correlation id (load balancer, gateway, calling service).
  const reqId = (req.header('x-request-id') as string) || newReqId();
  res.setHeader('x-request-id', reqId);

  // A child logger inherits the root config and adds reqId/route/method
  // to every line written through it. This is the line that ties the
  // entire request together.
  const child = logger.child({
    req_id: reqId,
    method: req.method,
    route: req.path,
  });

  const start = process.hrtime.bigint();

  res.on('finish', () => {
    const durMs = Number(process.hrtime.bigint() - start) / 1e6;
    child.info(
      // The event field gives dashboards a stable value to filter on.
      { event: 'request_completed', status: res.statusCode, duration_ms: Math.round(durMs) },
      'request completed',
    );
  });

  runWithContext({ reqId, logger: child }, () => next());
}

Wire it once at the top of the chain:

import express from 'express';
import { requestLogger } from './middleware';
import { getLogger } from './request-context';
// Hypothetical module: createCharge stands in for your payment logic.
import { createCharge } from './charges';

const app = express();
app.use(express.json());
app.use(requestLogger);

app.post('/api/charges', async (req, res) => {
  const log = getLogger(); // child logger, already has req_id baked in
  const { user_id, amount } = req.body;
  log.info({ event: 'charge_attempt', user_id, amount }, 'charge attempt');

  try {
    const charge = await createCharge(req.body);
    log.info({ event: 'charge_succeeded', charge_id: charge.id }, 'charge succeeded');
    res.json(charge);
  } catch (err) {
    // Pass the error under the err key so Pino's serializer captures the stack.
    log.error({ event: 'charge_failed', err, user_id }, 'charge failed');
    res.status(500).json({ error: 'internal_error' });
  }
});

That is the whole change. Three small modules, one middleware, and a habit of calling getLogger() inside every handler instead of importing the root logger.

Why each piece is there

A few of these decisions are easy to skip and will quietly bite you later.

AsyncLocalStorage instead of passing logger as a parameter. You could thread the logger through every function call, and some teams do. In practice, that breaks the moment a downstream library calls back into your code (a Postgres callback, an event emitter, a setTimeout). AsyncLocalStorage follows the async chain automatically, even into setImmediate, Promise.then, and most well-behaved libraries. The cost is roughly 2% CPU at high request rates, which is far cheaper than the alternative (logs without correlation IDs, or every function having a log parameter).
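A small sketch of that propagation, using only `node:async_hooks`. The names `als`, `currentReqId`, and `handler` are made up for illustration; the point is that the nested helper never receives the ID as an argument.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

type Ctx = { reqId: string };
const als = new AsyncLocalStorage<Ctx>();

// A nested helper that is never handed the request ID explicitly.
function currentReqId(): string | undefined {
  return als.getStore()?.reqId;
}

function handler(): string | undefined {
  // The timer callback runs later, yet still sees the same store.
  setTimeout(() => console.log('in timer:', currentReqId()), 0);
  return currentReqId();
}

const inside = als.run({ reqId: 'req-123' }, handler);
const outside = currentReqId(); // no active store outside run()

console.log({ inside, outside });
```

Outside `run()` the store is undefined, which is why a fallback to the root logger is useful for boot-time logs.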

req_id as a UUID, not a counter. A monotonic counter is tempting because it is small. It also collides the moment you scale to multiple pods. UUIDs are 36 characters and cost nothing meaningful in storage. randomUUID() from node:crypto is fast enough that you will not measure it.
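The collision is easy to reproduce in miniature. In this sketch the two "pods" are simply two independent counters in one process, which is an assumption made to keep the example self-contained.

```typescript
import { randomUUID } from 'node:crypto';

// Each pod keeps its own in-memory counter, so both start from 1.
function makeCounter(): () => string {
  let n = 0;
  return () => String(++n);
}

const podA = makeCounter();
const podB = makeCounter();

const counterIds = [podA(), podA(), podB(), podB()];
const uuidIds = [randomUUID(), randomUUID(), randomUUID(), randomUUID()];

// Fewer distinct values than requests means two requests share one ID.
const distinctCounters = new Set(counterIds).size;
const distinctUuids = new Set(uuidIds).size;

console.log({ distinctCounters, distinctUuids, uuidLength: uuidIds[0].length });
```

Four requests yield only two distinct counter IDs, while the UUIDs stay unique without any coordination between pods.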

Honoring an upstream x-request-id header. When a request arrives from your load balancer, API gateway, or another internal service, that caller almost certainly already generated a correlation ID. Use it. The whole point of distributed tracing is that one ID flows from the edge all the way to the database. Generating a fresh one in your service breaks the chain and forces you to JOIN logs across services on timestamps and user IDs, which is exactly the work you are trying to eliminate.
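One way to sketch this, with a guard the middleware shown earlier does not have: the `SAFE_ID` pattern and both helper names are assumptions added here, on the reasoning that a header from outside your network is untrusted input. Adjust or drop the check to match what your gateway guarantees.

```typescript
import { randomUUID } from 'node:crypto';

// Assumption: IDs are 8 to 128 characters of letters, digits, '-' or '_'.
const SAFE_ID = /^[A-Za-z0-9_-]{8,128}$/;

// Reuse the caller's ID when it looks sane; otherwise mint a fresh one.
function resolveReqId(incoming: string | undefined): string {
  return incoming !== undefined && SAFE_ID.test(incoming) ? incoming : randomUUID();
}

// Headers to attach to outbound calls so the next service sees the same ID.
function downstreamHeaders(reqId: string): Record<string, string> {
  return { 'x-request-id': reqId };
}

// Both input strings below are hypothetical examples.
const kept = resolveReqId('01HXYZ7K2P-example');
const replaced = resolveReqId('bad id\nwith newline');
console.log(kept, replaced.length, downstreamHeaders(kept));
```

Forwarding the same header on every outbound call is what keeps the chain intact past your own service.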

Setting x-request-id on the response. Now your error responses can include the ID, your frontend can surface it to users (“Error: contact support with reference 01HXYZ…”) and your support team can paste a single string into the log search bar. Tickets that used to take twenty minutes to localize take ten seconds.

Redaction, not “remember to not log secrets.” People log auth headers by accident, usually by serializing the entire request object during a hard-to-reproduce bug. Pino’s redact option is applied at the serialization layer; it does not depend on every developer remembering. Add password, token, card_number, ssn, and anything else that should not be in your log retention. One caveat: the wildcard syntax (*.password) matches exactly one level of nesting, not any depth, so a top-level password field or a more deeply nested one needs its own path in the list.

A service, env, and version on every line. If you operate more than one service, you will eventually pipe all their logs into one stream. Without those three fields, you cannot tell which one emitted the line. Add them once in base and they show up on every log entry for free.

event as a snake_case constant. This is a convention, not a code change, but it pays back forever. Pick a vocabulary (charge_attempt, charge_succeeded, charge_failed, webhook_received, db_pool_exhausted) and use those exact strings as the event field. Now you can write dashboards on top of them (“count of charge_failed per minute, grouped by reason”) instead of grepping log message strings that someone will reword next quarter.
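If the codebase is TypeScript, the vocabulary can be enforced by the compiler rather than by review. This is a sketch; `EVENTS`, `EventName`, and `eventFields` are hypothetical names, and the list simply mirrors the examples above.

```typescript
// Hypothetical vocabulary; the names mirror the examples in the text.
const EVENTS = [
  'charge_attempt',
  'charge_succeeded',
  'charge_failed',
  'webhook_received',
  'db_pool_exhausted',
] as const;

type EventName = (typeof EVENTS)[number];

// Build the structured fields for a log call; a misspelled event fails to compile.
function eventFields<T extends object>(event: EventName, fields: T): T & { event: EventName } {
  return { ...fields, event };
}

const fields = eventFields('charge_failed', { reason: 'card_declined', amount_cents: 1500 });
console.log(JSON.stringify(fields));
```

The result is passed as the first argument of a log call, for example `getLogger().error(fields, 'charge failed')`.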

Once the fields are consistent, the queries are stupidly powerful. These work in any decent log aggregator (Datadog, Logtail, Grafana, even jq against a flat file).

1. The full story of one request. A customer pastes the x-request-id from the error page they screenshotted:

req_id:01HXYZ7K2P...

Every log line for that request (middleware, handler, database, downstream service) comes back ordered. No JOINs.

2. All errors in the last hour, grouped by event.

level:error AND time:[now-1h TO now] | stats count by event

You instantly see whether the spike is concentrated in one code path (a deploy regression, usually) or spread evenly across the service (an upstream outage, usually).
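For a sense of what the aggregator does with that query, here is a sketch of the same grouping over already-parsed entries. The `Entry` shape uses a string level to match the JSON example earlier, and the sample data is made up.

```typescript
type Entry = { level: string; time: number; event: string };

// Count error-level entries per event within [since, until].
function errorsByEvent(entries: Entry[], since: number, until: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of entries) {
    if (e.level !== 'error' || e.time < since || e.time > until) continue;
    counts.set(e.event, (counts.get(e.event) ?? 0) + 1);
  }
  return counts;
}

// Hypothetical entries; `time` is epoch milliseconds.
const now = 1_000_000_000;
const hourAgo = now - 60 * 60 * 1000;
const sample: Entry[] = [
  { level: 'error', time: now - 1_000, event: 'charge_failed' },
  { level: 'error', time: now - 2_000, event: 'charge_failed' },
  { level: 'error', time: now - 3_000, event: 'db_pool_exhausted' },
  { level: 'info', time: now - 4_000, event: 'charge_succeeded' },
  { level: 'error', time: hourAgo - 1, event: 'charge_failed' }, // too old
];

const errorCounts = errorsByEvent(sample, hourAgo, now);
console.log(Object.fromEntries(errorCounts));
```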

3. P99 duration per route.

event:request_completed | percentile(duration_ms, 99) by route

This is your own slow-endpoint dashboard, free, with a one-line config in any aggregator. No APM agent required.
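The arithmetic behind that query is short enough to sketch. This version uses the nearest-rank definition of a percentile, which is one common choice; aggregators may interpolate or approximate instead, so treat the exact numbers as illustrative.

```typescript
type Completed = { route: string; duration_ms: number };

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

function p99ByRoute(entries: Completed[]): Map<string, number> {
  const byRoute = new Map<string, number[]>();
  for (const e of entries) {
    const list = byRoute.get(e.route) ?? [];
    list.push(e.duration_ms);
    byRoute.set(e.route, list);
  }
  const result = new Map<string, number>();
  byRoute.forEach((durations, route) => result.set(route, percentile(durations, 99)));
  return result;
}

// Hypothetical data: 100 requests taking 1..100 ms on one route, 2 on another.
const completed: Completed[] = [];
for (let ms = 1; ms <= 100; ms++) completed.push({ route: '/api/charges', duration_ms: ms });
completed.push({ route: '/health', duration_ms: 3 }, { route: '/health', duration_ms: 5 });

const p99 = p99ByRoute(completed);
console.log(Object.fromEntries(p99));
```

With only two samples, the `/health` figure is just the slower of the two, a reminder that percentiles on low-traffic routes say very little.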

4. The “weird user” filter. Every customer support ticket about a charging bug should resolve to:

user_id:4821 AND time:[2026-05-21 TO 2026-05-22] | sort by time

The reason this matters: when a single user reports a bug, half the time it is actually two bugs hiding behind the same complaint. The full timeline tells you how many.

5. Pre-incident baseline. Run weekly and stash the result:

level:warn OR level:error | stats count by event, service

If db_pool_exhausted rises from 4/week to 800/week, that is your incident, and you have caught it three days before it page-storms on-call.
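Comparing this week against the stashed result can be automated in a few lines. The tenfold threshold and the `regressions` helper are assumptions for the sketch; pick a factor that suits how noisy your events are.

```typescript
type Baseline = Record<string, number>;

// Flag events whose weekly count grew by at least `factor` over the stored baseline.
function regressions(previous: Baseline, current: Baseline, factor: number): string[] {
  return Object.keys(current).filter((event) => {
    const before = previous[event] ?? 0;
    // Treat a brand-new event as a regression as soon as it appears.
    return before === 0 ? current[event] > 0 : current[event] / before >= factor;
  });
}

// Hypothetical weekly counts per event.
const lastWeek: Baseline = { db_pool_exhausted: 4, charge_failed: 120 };
const thisWeek: Baseline = { db_pool_exhausted: 800, charge_failed: 131 };

const flagged = regressions(lastWeek, thisWeek, 10);
console.log(flagged);
```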

Common mistakes and the fixes

A few things go wrong in practice that the toy version does not show.

Logging objects with circular references. Pino handles most of these, but a Mongoose document or a Bull queue job can have hidden circular structures that produce [Circular] placeholders and a lot of irrelevant internal state. Fix it once with a custom serializer (serializers: { user: (u) => ({ id: u.id, email: u.email }) }) and the field becomes both safe and consistent.

Log shipping that drops lines under load. If you pipe stdout to a sidecar, the sidecar’s buffer can overflow during a traffic spike. Pino writes synchronously by default, which protects ordering but blocks the event loop on a slow stdout. Under bursty load, switch to asynchronous writes with pino.destination({ sync: false }), move the write to a worker thread with pino.transport({ target: 'pino/file', options: { destination: '/proc/1/fd/1' } }), or ship with a backpressure-aware agent such as Vector or Fluent Bit. The defaults are fine until you measure dropped lines; the moment you measure them, fix it.

Pretty-printing in production. A common mistake: developers love pino-pretty so much they leave it on in production. It uses ~3× more CPU than raw JSON and prevents your aggregator from parsing fields. Keep transport conditional on NODE_ENV, never on a feature flag.

Logging at info for things that should be debug. The temptation to log every database query at info is strong. Resist it. info is for events that matter to a human reading the timeline of a request: “request started,” “charge succeeded,” “external API responded.” debug is for noise you turn on per-environment. If info volume grows faster than you can afford to store, the team eventually turns logging down or off entirely, and having no logs is worse than having verbose ones.

Logging the same event in multiple places. Pick one place per event. If you log charge_succeeded in the handler and again in the database hook and again in a downstream emitter, your dashboards get triple-counted and your ratio queries lie. The convention “one event field per logical event, emitted once” pays back forever.

Forgetting to log the err field correctly. Pino has a built-in serializer for objects under the err key that captures the error’s type, message, stack, and any custom fields. Always pass errors as log.error({ err, ...context }, 'message'), never as log.error('failed: ' + err.message). The first version gives you a structured error object you can query with err.code:ECONNRESET. The second is back to grepping strings.
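The difference shows up as soon as you try to filter on an error code. In this sketch `serializeError` is a hand-written stand-in, not Pino's implementation; it only approximates the idea of keeping an error's standard fields plus its custom ones.

```typescript
// Stand-in serializer (an assumption for illustration, not Pino's code):
// copy the standard fields plus any own enumerable extras such as `code`.
function serializeError(e: Error): Record<string, unknown> {
  return { ...e, type: e.name, message: e.message, stack: e.stack };
}

// Hypothetical failure with a Node-style error code attached.
const failure = Object.assign(new Error('socket hang up'), { code: 'ECONNRESET' });

const structured = { event: 'charge_failed', err: serializeError(failure) };
const flattened = { msg: 'failed: ' + failure.message };

// The structured record can be filtered on err.code; the string cannot.
console.log(structured.err.code, 'code' in flattened);
```

Once the message is concatenated into a string, the code and the stack are gone for good.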

The metric that proves it shipped

The number to track is “median time from incident report to first useful log query.” Before structured logging, that number is between fifteen minutes and two hours, depending on how big the haystack is and how good your search skills are. After structured logging plus correlation IDs, it should be under sixty seconds: paste the req_id from the error page, sort by time, read the timeline.

If you cannot measure that directly, a decent proxy is “percentage of incidents where the postmortem cites a specific log line versus general log behavior.” Before, most postmortems say “we saw elevated 5xx,” which is vague. After, they say “at 03:14:07 the request 01HXYZ... failed in charge_failed with reason card_declined_after_capture, which then caused the duplicate write.” That level of specificity is what makes a postmortem actionable.

The takeaway

Most teams put off structured logging because it feels like a refactor. It isn’t. It is one new dependency, three small modules, and a one-week period of replacing console.log with getLogger() whenever you touch a file. After that, every incident gets faster to debug, every dashboard gets easier to build, and the next time a customer pastes you a request ID, you find the bug before they finish typing the follow-up email.

The version of this you build under pressure, the night of an outage, is always worse than the version you build on a quiet Tuesday. Build it on the Tuesday.

A note from Yojji

The kind of plumbing this post describes (observability that holds up under real load, request-scoped context that survives the messy parts of an async stack, dashboards that surface the right number on the right day) is the unglamorous middle of every reliable backend. It is also the kind of work Yojji has been shipping since 2016.

Yojji is an international custom software development company with offices across Europe, the US, and the UK. Their teams specialize in the JavaScript stack (React, Node.js, TypeScript), cloud platforms (AWS, Azure, Google Cloud), and microservices architecture. They run dedicated, senior outstaffed teams for long-running engagements, plus full-cycle product work covering discovery, design, development, QA, and DevOps.

If your team would rather hire the practice of building observable, debuggable services than learn it the hard way after a 3 a.m. page, Yojji is worth a conversation.
