The Five-Stage Rollout: How To Ship A Risky Change Without Holding Your Breath

The Practical Developer

The Practica · 2024-05-24 · via The Practical Developer

The team merges a refactor of the payment flow. CI passes. The deploy pipeline runs. Five minutes later, every user trying to check out hits a 500 error. Ten minutes after that, the team has rolled back, but $20k of orders are lost and customer support is dealing with refund requests. The bug only manifests under real production traffic; load testing didn’t catch it because the failure is triggered by a specific edge case.

The fix is not “test more carefully.” Some bugs only surface in production, with real users. The fix is to limit the blast radius of the inevitable bad deploy by rolling out gradually: 1% of users first, then 10%, then 50%, then 100%. A bug at 1% is a small fire. A bug at 100% is a postmortem.

This post covers the five-stage rollout pattern, the metrics that gate each stage, the tooling, and the rollback procedure that has to actually work.

The five stages

Stage 0: Internal users only        (5-10 employees, typically 0-1 days)
Stage 1: 1% of production traffic   (1-2 days)
Stage 2: 10% of production traffic  (1-2 days)
Stage 3: 50% of production traffic  (1-2 days)
Stage 4: 100% of production traffic (the feature is launched)
Cleanup: Remove the flag and the old code path

The total time from first deploy to 100% is typically 4-7 days for a non-critical feature, longer for high-risk changes (payments, auth).

The mechanism is a feature flag. Code path A is the old behavior, code path B is the new. The flag percentage controls how many users get B.

if (await flags.isEnabled('new-checkout-flow', { user })) {
  return renderNewCheckout();
}
return renderOldCheckout();

Both code paths are in production simultaneously. The flag is what selects which one runs.

What gates each stage

Don’t promote stages on a timer alone. Gate them on observed metrics:

Stage 0 → 1: Internal usage shows no errors for 24 hours. The team has manually tested the happy path and a few edge cases.

Stage 1 → 2: At 1% for 24-48 hours:

Error rate for the new path is not higher than the old path.
p95 latency for the new path is comparable.
No customer support tickets attributed to the change.

Stage 2 → 3: Same checks at 10%. Plus: any rare failure modes that need volume to surface have had time to appear.

Stage 3 → 4: Same at 50%. By this point, you have high confidence the change is good.

Cleanup: After 100% holds for 1 week, remove the old code and the flag (see the feature-flags post).

The metrics worth gating on

A small set covers most use cases:

Error rate of the affected endpoints.
p95 / p99 latency of the affected endpoints.
Conversion rate for revenue-relevant flows (checkout, signup).
Background-job success rate if the change touches async processing.
Customer support ticket volume in relevant categories.

Compare each metric between users in the new variant and users in the old variant, not against an absolute number. A 10% error rate is bad in any variant; the question the gate answers is “did this change make it worse?”
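The per-variant comparison can be written down as a small gate function. What follows is a sketch under assumptions, not a prescribed implementation: the `VariantStats` shape, the threshold names, and `evaluateGate` itself are hypothetical, and the numbers would come from whatever metrics store you already slice by flag variant.

```typescript
// Hypothetical per-variant stats, e.g. from a metrics query grouped by flag variant.
interface VariantStats {
  requests: number;
  errors: number;
  p95LatencyMs: number;
}

// Assumed thresholds; tune them per endpoint.
interface GateThresholds {
  minRequests: number;       // refuse to judge the new path on too little traffic
  maxErrorRateDelta: number; // allowed absolute increase in error rate (0.001 = 0.1 points)
  maxLatencyRatio: number;   // allowed new/old p95 ratio (1.1 = 10% slower)
}

interface GateResult {
  promote: boolean;
  reasons: string[]; // why promotion was blocked; empty when promote is true
}

function errorRate(s: VariantStats): number {
  return s.requests === 0 ? 0 : s.errors / s.requests;
}

function evaluateGate(
  oldPath: VariantStats,
  newPath: VariantStats,
  t: GateThresholds,
): GateResult {
  const reasons: string[] = [];

  if (newPath.requests < t.minRequests) {
    reasons.push(`new path has only ${newPath.requests} requests`);
  }

  // Compare new against old, not against an absolute number.
  const delta = errorRate(newPath) - errorRate(oldPath);
  if (delta > t.maxErrorRateDelta) {
    reasons.push(`error rate is ${(delta * 100).toFixed(2)} points higher on the new path`);
  }

  if (
    oldPath.p95LatencyMs > 0 &&
    newPath.p95LatencyMs / oldPath.p95LatencyMs > t.maxLatencyRatio
  ) {
    reasons.push("p95 latency regressed beyond the allowed ratio");
  }

  return { promote: reasons.length === 0, reasons };
}
```

The minimum-request check matters most at 1%, where a quiet hour can make a broken path look clean simply because nobody exercised it.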

How to bucket users

Three options:

1. Random by user ID. Stable bucketing: a user who got the new flow on Monday gets it on Tuesday too. The right default. Use a hash of (user_id, flag_name).

2. By cohort. Internal employees first, then beta opt-ins, then specific customer segments, then everyone.

3. By geographic region. Deploy to one region first to limit blast radius further.

For most rollouts, (1) is sufficient. (2) and (3) layer on top for extra-risky changes.
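Options (1) and (2) combine naturally: check cohort membership first, then fall back to the hash bucket. The sketch below assumes an FNV-1a-style hash over the string's UTF-16 code units, chosen only because it fits in a few lines; any stable hash would do. The `User` and `RolloutRule` shapes and the function names are hypothetical.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units.
// An assumption for this sketch; swap in any stable hash you prefer.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Stable bucket in [0, 100). Including the flag name means a user who lands
// in the first 1% for one flag is not automatically in the first 1% for all flags.
function bucketFor(userId: string, flagName: string): number {
  return fnv1a(`${flagName}:${userId}`) % 100;
}

// Hypothetical shapes for illustration.
interface User {
  id: string;
  cohorts: string[]; // e.g. ["employee", "beta"]
}

interface RolloutRule {
  percentage: number;     // 0..100
  allowCohorts: string[]; // cohorts that always get the new path
}

function inRollout(flagName: string, user: User, rule: RolloutRule): boolean {
  // Cohort layer: internal users or beta opt-ins are in regardless of percentage.
  if (user.cohorts.some((c) => rule.allowCohorts.includes(c))) return true;
  // Percentage layer: same user, same flag, same answer on every request.
  return bucketFor(user.id, flagName) < rule.percentage;
}
```

Because the comparison is `bucket < percentage`, raising the percentage only ever adds users: everyone who had the new flow at 10% still has it at 50%.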

The rollback procedure

A rollout is only as good as its rollback. The procedure must be:

Fast. Sub-minute. A bad deploy at 50% needs to be rolled back in seconds, not “let me find the playbook.”
Reversible. Old code path still in production. Flipping the flag back to 0% reverts behavior.
Tested. Verify the rollback works before you need it. Flip the flag back to 0% in staging; confirm old behavior returns.

The single command:

flagsctl set new-checkout-flow --percentage 0

Or click a button in LaunchDarkly. Either way, sub-minute. Document the runbook, including who has access.

For most teams, buying a feature-flag service is the right call:

LaunchDarkly: most mature, most expensive.
Statsig: strong experimentation focus.
Unleash: open-source, self-hostable.
PostHog: flags + analytics + session replay.
GrowthBook: open-source, focused on experiments.

For very small teams, a Postgres-backed flag table works:

CREATE TABLE feature_flags (
  name        text PRIMARY KEY,
  percentage  int NOT NULL DEFAULT 0 CHECK (percentage BETWEEN 0 AND 100),
  updated_at  timestamptz NOT NULL DEFAULT now()
);

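// Assumes `pool` is a shared pg Pool and `murmurHash` is any stable string
// hash that returns a non-negative integer; both are defined elsewhere.
async function isEnabled(name: string, user: { id: string }): Promise<boolean> {
  const { rows } = await pool.query(
    'SELECT percentage FROM feature_flags WHERE name = $1', [name]);
  if (!rows[0]) return false; // unknown flag: fall back to the old path
  // Hash (user_id, flag_name) so each user gets a stable bucket in [0, 100).
  const bucket = murmurHash(`${user.id}:${name}`) % 100;
  return bucket < rows[0].percentage;
}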

Add caching (30s TTL) and you have a working flag system in 30 lines. For under ~20 flags, this is fine. Past that, buy.
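The caching layer can be a small wrapper around the query. This is a minimal sketch, assuming an in-process `Map` per server; `cachedLoader`, the `Loader` type, and the injectable `now` clock are hypothetical names introduced here, and the loader would wrap the `SELECT percentage` query.

```typescript
// Loads the rollout percentage for a flag, or undefined if the flag is unknown.
type Loader = (name: string) => Promise<number | undefined>;

interface CacheEntry {
  value: number | undefined;
  expiresAt: number; // epoch milliseconds
}

// Wraps a loader with a per-process TTL cache. `now` is injectable so the
// expiry logic can be tested without waiting 30 seconds.
function cachedLoader(
  load: Loader,
  ttlMs: number = 30_000,
  now: () => number = Date.now,
): Loader {
  const cache = new Map<string, CacheEntry>();

  return async (name: string) => {
    const hit = cache.get(name);
    if (hit && hit.expiresAt > now()) return hit.value;

    // Miss or expired: reload and remember the answer, including "unknown flag".
    // Concurrent misses each call the loader; that is tolerable at this scale.
    const value = await load(name);
    cache.set(name, { value, expiresAt: now() + ttlMs });
    return value;
  };
}
```

The trade-off to keep in mind: with a 30-second TTL, a rollback to 0% can take up to 30 seconds to reach every process. That still fits a sub-minute rollback target, but a much longer TTL would not.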

Common pitfalls

1. Sticky bucketing not implemented. A user sees the old flow on one request and the new flow on the next. The UX is confusing at best and broken at worst. Always bucket deterministically, by a hash of (user_id, flag_name).

2. The new and old paths share state in incompatible ways. A new flow writes data the old flow can’t read. When you roll back, users on the new flow are stranded. Design changes so old and new are mutually compatible.

3. Comparing metrics globally instead of per-variant. Total error rate is at 0.5%; you don’t notice that error rate for new-flow users is 5%. Always slice metrics by variant.

4. Skipping stages under deadline pressure. “We have to ship by Friday, let’s go straight to 50%.” That is exactly the situation that produces the postmortem. Stages exist to prevent the rare bad outcome; the rare bad outcome is exactly when you’d be tempted to skip them.

5. Forgetting to clean up. The flag is at 100% for 6 months and the old code is still in the codebase. Set a calendar reminder to delete the flag two weeks after 100%.
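The arithmetic behind pitfall 3 is worth seeing once. The sketch below assumes Stage 1 traffic (1% on the new flow) and an old-flow error rate of 0.45%; both figures are illustrative assumptions chosen to reproduce the 0.5% total and 5% new-flow rates mentioned above.

```typescript
interface VariantTraffic {
  share: number;     // fraction of total traffic, 0..1
  errorRate: number; // fraction of that variant's requests that fail, 0..1
}

// The global error rate is the traffic-weighted average of the variants.
function blendedErrorRate(variants: VariantTraffic[]): number {
  return variants.reduce((sum, v) => sum + v.share * v.errorRate, 0);
}

const total = blendedErrorRate([
  { share: 0.99, errorRate: 0.0045 }, // old flow (assumed baseline)
  { share: 0.01, errorRate: 0.05 },   // new flow, failing more than ten times as often
]);

// 0.99 * 0.0045 + 0.01 * 0.05 = 0.004455 + 0.0005 = 0.004955
console.log(`${(total * 100).toFixed(2)}%`);
```

The new flow fails more than ten times as often as the old one, yet the global number moves from 0.45% to roughly 0.5%, well inside normal noise. Only the per-variant slice shows the problem.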

A different shape: dark launches

For very risky changes (database migrations, major refactors), “dark launch” first:

Run the new code path in production but discard its output.
Compare the new path’s behavior to the old path’s behavior in real time.
Only after they agree consistently, switch to actually using the new path.

For example:

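async function chargeCustomer(orderId: string) {
  // The old flow stays authoritative: its result is what the caller gets.
  const oldResult = await chargeCustomerOldFlow(orderId);

  if (await flags.isEnabled('dark-launch-new-charge')) {
    try {
      // Dry run: the new flow must not perform a real charge.
      const newResult = await chargeCustomerNewFlowDryRun(orderId);
      logComparison({ orderId, old: oldResult, new: newResult });
    } catch (err) {
      // A failure in the new path is recorded, never surfaced to the user.
      logDarkLaunchError({ orderId, err });
    }
  }

  return oldResult;
}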

Production behavior is unchanged. New code is exercising real data. Discrepancies surface without affecting users.
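What `logComparison` does with the two results is left open above. One possible shape, sketched here under assumptions, is a field-by-field diff plus a running mismatch rate, so that “they agree consistently” becomes a number you can watch. The `ChargeResult` fields and the `ComparisonTally` class are hypothetical; use whatever your charge functions actually return.

```typescript
// Hypothetical result shape for illustration.
interface ChargeResult {
  status: "succeeded" | "declined" | "error";
  amountCents: number;
  currency: string;
}

const COMPARED_FIELDS: (keyof ChargeResult)[] = ["status", "amountCents", "currency"];

// Names of the fields on which the two paths disagree.
function diffResults(oldResult: ChargeResult, newResult: ChargeResult): string[] {
  return COMPARED_FIELDS.filter((field) => oldResult[field] !== newResult[field]);
}

// Keeps a running count so the mismatch rate can be reported as a metric.
class ComparisonTally {
  private total = 0;
  private mismatched = 0;

  record(orderId: string, oldResult: ChargeResult, newResult: ChargeResult): string[] {
    const diff = diffResults(oldResult, newResult);
    this.total += 1;
    if (diff.length > 0) {
      this.mismatched += 1;
      console.warn(`dark launch mismatch for order ${orderId}: ${diff.join(", ")}`);
    }
    return diff;
  }

  mismatchRate(): number {
    return this.total === 0 ? 0 : this.mismatched / this.total;
  }
}
```

Logging which fields differ, rather than a bare “mismatch”, tends to make the discrepancies much faster to triage.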

Beyond five stages

For the most critical paths (payment, auth), more granular stages help:

Internal → 0.1% → 1% → 5% → 10% → 25% → 50% → 100%

Eight stages, each held for 24-48 hours before promotion. A change that passes all of them is genuinely battle-tested.

Conversely, for trivial changes (a copy update, a bug fix in non-critical code), three stages may be enough:

Internal → 50% → 100%

Match the rollout granularity to the risk.
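Writing the schedules down as data keeps “which stage comes next” out of people's heads. A minimal sketch, assuming three risk tiers; the `Risk` names, the `PLANS` table, and `nextPercentage` are hypothetical, with 0 standing for the internal-only stage.

```typescript
type Risk = "trivial" | "standard" | "critical";

// Percentages per stage, taken from the schedules above.
// 0 means "internal users only"; internal access is handled by cohort, not percentage.
const PLANS: Record<Risk, number[]> = {
  trivial: [0, 50, 100],
  standard: [0, 1, 10, 50, 100],
  critical: [0, 0.1, 1, 5, 10, 25, 50, 100],
};

// Returns the next stage's percentage, or null once the rollout is at 100%.
function nextPercentage(risk: Risk, current: number): number | null {
  const plan = PLANS[risk];
  const index = plan.indexOf(current);
  if (index === -1) {
    throw new Error(`${current}% is not a stage in the ${risk} plan`);
  }
  return index === plan.length - 1 ? null : plan[index + 1];
}
```

Throwing on an off-plan percentage is deliberate: it makes a skipped stage an explicit decision instead of a quiet edit. One detail to check if you use the critical plan: a 0.1% stage needs finer bucketing than a hash taken modulo 100, which can only express whole percentages.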

The takeaway

A staged rollout converts inevitable bugs from disasters into observations. 1%, 10%, 50%, 100%, each gated on metrics, with a fast rollback. Bucket users stably. Compare per-variant metrics, not totals. Clean up flags after they’ve been at 100%.

The team that adopts this finds that “we shipped a bad deploy” stops meaning “all customers were affected” and starts meaning “1% of customers saw a transient bug for two hours.” That is the difference between a good engineering culture and a fire-fighting one.

A note from Yojji

The kind of release-engineering discipline that turns “we hope this works” into “we measured at each stage and it worked” (staged rollouts, rollback procedures, per-variant metrics) is the kind of long-haul engineering practice Yojji’s teams build into the products they ship for clients.

Yojji is an international custom software development company founded in 2016, with teams across Europe, the US, and the UK. They specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, GCP), and full-cycle product engineering, including the rollout and deployment practices that decide whether shipping is risky or routine.
