Idempotency Keys in 30 Lines: Stop Your Webhook From Charging Customers Twice

The Practical Developer


The Practica · 2026-05-06 · via The Practical Developer

A Stripe webhook fires. Your server processes it, charges the customer, returns a 200. Stripe never gets the 200 because of a 50ms TCP blip. Stripe retries. You charge the customer again.

That is the entire story behind a class of bug that lives in nearly every Node.js codebase that talks to a queue, a webhook provider, or any system with at-least-once delivery. The fix is not “be more careful.” The fix is idempotency keys, and it is roughly 30 lines of middleware plus one Postgres table. Here is exactly what the code looks like, why each line is there, and how to convince yourself the fix actually works.

What “idempotent” really means here

An endpoint is idempotent when calling it N times with the same input produces the same effect as calling it once. GET is idempotent by definition, because it is not supposed to change anything. POST /charges is not, unless you make it that way.

The trick is that “the same input” is doing a lot of work in that sentence. Two webhook deliveries with identical bodies are the same input. Two browser clicks of “Pay” two seconds apart that happen to send identical bodies are also the same input, and you almost certainly do not want to treat them as one charge.

So the contract has to be: the caller picks a key, and that key tells you “this is the same logical operation as the one I sent before.” For webhooks the provider supplies it (the event id in the payload for Stripe, because the Stripe-Signature header carries only a timestamp and signatures and is regenerated on every attempt; svix-id for Svix; X-GitHub-Delivery for GitHub). For your own clients, generate a UUID per user action and send it as Idempotency-Key. Stripe’s own API works exactly this way for a reason.
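As a rough sketch of that key-selection rule, the helper below pulls a stable delivery key out of an incoming webhook. The function name deliveryKey and the Provider type are hypothetical. It assumes header names arrive lowercased (as Node.js delivers them) and that a Stripe event carries its ID in the id field of the parsed payload.

```typescript
type Provider = 'stripe' | 'svix' | 'github';
type Headers = Record<string, string | undefined>;

// Hypothetical helper: returns the value that stays the same across retries
// of one delivery, or undefined if the request does not carry one.
export function deliveryKey(
  provider: Provider,
  headers: Headers,
  body: unknown,
): string | undefined {
  switch (provider) {
    case 'stripe': {
      // Assumption: the parsed event payload carries its ID in `id` (e.g. "evt_...").
      const id = (body as { id?: unknown } | null)?.id;
      return typeof id === 'string' ? id : undefined;
    }
    case 'svix':
      return headers['svix-id'];
    case 'github':
      return headers['x-github-delivery'];
  }
}
```

A request with no usable key returns undefined, which the caller can treat as "do not deduplicate" or reject outright, depending on how strict the endpoint should be.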

The shape of the bug

Before the fix, here is the kind of log you find when a customer support ticket says “I was charged twice”:

12:04:17.812  POST /webhooks/stripe  evt_1NxYz7  charge.succeeded  -> 200 (1840ms)
12:04:19.701  POST /webhooks/stripe  evt_1NxYz7  charge.succeeded  -> 200 (1612ms)

Same event ID. Two successful processings. Two rows in your payouts table. Two emails sent. One refund to issue.

Stripe never saw the first response (a dropped connection, or a reply slow enough to hit its timeout), so it retried. Your handler had already committed the side effects when the retry arrived. Without an idempotency check, the retry happily re-runs everything.

The 30 lines

The middleware below does four things in order: hash the request, look up the key in Postgres with SELECT ... FOR UPDATE, return the cached response if we have already processed this key, or insert a placeholder row and run the handler. Every step matters.

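// idempotency.ts
import { createHash } from 'node:crypto';
import type { Request, Response, NextFunction } from 'express';
import { pool } from './db';

const TTL_HOURS = 24;
const UNIQUE_VIOLATION = '23505'; // Postgres SQLSTATE for a duplicate primary key

export function idempotency(headerName = 'idempotency-key') {
  return async (req: Request, res: Response, next: NextFunction) => {
    const key = req.header(headerName);
    if (!key) return next(); // opt-in; only enforce when client supplies a key

    // Fingerprint of method + path + parsed body; detects a key reused with a different payload.
    const fingerprint = createHash('sha256')
      .update(req.method + req.path + JSON.stringify(req.body))
      .digest('hex');

    const client = await pool.connect();
    try {
      await client.query('BEGIN');

      // Locks the row if it already exists. A brand-new key matches nothing and locks nothing.
      const existing = await client.query(
        `SELECT status, response_code, response_body, request_fingerprint
           FROM idempotency_keys
          WHERE key = $1
          FOR UPDATE`,
        [key],
      );

      if (existing.rowCount) {
        const row = existing.rows[0];
        await client.query('COMMIT');

        if (row.request_fingerprint !== fingerprint) {
          return res.status(422).json({ error: 'idempotency_key_reuse' });
        }
        if (row.status === 'in_progress') {
          return res.status(409).json({ error: 'request_in_progress' });
        }
        return res.status(row.response_code).json(row.response_body);
      }

      // Claim the key. If a concurrent request claimed it first, the primary key
      // makes this INSERT fail with a unique violation, which is handled below.
      await client.query(
        `INSERT INTO idempotency_keys (key, request_fingerprint, status, expires_at)
         VALUES ($1, $2, 'in_progress', now() + interval '${TTL_HOURS} hours')`,
        [key, fingerprint],
      );
      await client.query('COMMIT');
    } catch (e) {
      await client.query('ROLLBACK').catch(() => {});
      if ((e as { code?: string }).code === UNIQUE_VIOLATION) {
        // Lost the race for a new key: another request is already running the handler.
        return res.status(409).json({ error: 'request_in_progress' });
      }
      return next(e);
    } finally {
      client.release(); // runs on every path, including the early returns above
    }

    // Capture the handler's JSON response and persist it. The UPDATE is not awaited,
    // so a retry that lands before it commits still sees in_progress and gets a 409.
    // Only res.json is captured; a handler that replies via res.send or res.end
    // leaves its row in_progress.
    const originalJson = res.json.bind(res);
    res.json = (body: unknown) => {
      pool.query(
        `UPDATE idempotency_keys
            SET status = 'completed',
                response_code = $2,
                response_body = $3
          WHERE key = $1`,
        // Stringify explicitly: pg would otherwise send a JS array as a Postgres array, not jsonb.
        [key, res.statusCode, JSON.stringify(body)],
      ).catch((err) => console.error('[idempotency] persist failed', err));
      return originalJson(body);
    };

    next();
  };
}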

Wire it in front of the handlers that mutate things:

import express from 'express';
import { idempotency } from './idempotency';

const app = express();
app.use(express.json());

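// Use the provider's event id as the key. Stripe-Signature is a timestamp plus
// signatures and changes on every retry, so it cannot serve as the key.
app.post('/webhooks/stripe',
  (req, _res, next) => {
    if (typeof req.body?.id === 'string') req.headers['idempotency-key'] = req.body.id;
    next();
  },
  idempotency('idempotency-key'),
  stripeWebhookHandler,
);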

// For first-party clients, expect them to send Idempotency-Key.
app.post('/api/charges',
  idempotency('idempotency-key'),
  createChargeHandler,
);

And the schema, because the table design is doing real work:

CREATE TABLE idempotency_keys (
  key                 text PRIMARY KEY,
  request_fingerprint text NOT NULL,
  status              text NOT NULL CHECK (status IN ('in_progress', 'completed')),
  response_code       int,
  response_body       jsonb,
  created_at          timestamptz NOT NULL DEFAULT now(),
  expires_at          timestamptz NOT NULL
);

CREATE INDEX idempotency_keys_expires_at_idx ON idempotency_keys (expires_at);

That is the whole change. Schema, middleware, two route registrations.

Why each piece is there

A few of these decisions look optional but are not.

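SELECT ... FOR UPDATE inside a transaction, backed by the primary key. Two simultaneous deliveries arrive at the same millisecond. FOR UPDATE can only lock rows that already exist, so for a brand-new key both SELECTs return zero rows and both transactions go on to INSERT. The primary key on key is what serializes them: the second INSERT blocks until the first transaction commits, then fails with a unique violation (SQLSTATE 23505). That error has to be answered as a 409, not a 500, and because the handler only runs after a successful INSERT, exactly one request reaches it. Once the row exists, FOR UPDATE does its part: a later request waits for any in-flight writer of that row and then reads a committed status. Together, the constraint and the lock are what turn an at-least-once stream into an effectively-once handler.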
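To make the race concrete, here is a small in-memory model of it. It is not Postgres: FakeTable, lookup, and insert are made-up stand-ins, with insert throwing on a duplicate key the way a primary key would. Every simulated delivery sees "no row" and then tries to claim the key; only the winner of the insert runs the handler.

```typescript
// In-memory stand-in for the idempotency_keys table (an illustration, not Postgres).
class FakeTable {
  private rows = new Map<string, 'in_progress' | 'completed'>();

  // Async on purpose: the await lets other "requests" interleave, like a network round trip.
  async lookup(key: string): Promise<string | undefined> {
    await Promise.resolve();
    return this.rows.get(key);
  }

  // Behaves like an INSERT against a primary key: a second insert of the same key throws.
  async insert(key: string): Promise<void> {
    await Promise.resolve();
    if (this.rows.has(key)) throw new Error('unique_violation');
    this.rows.set(key, 'in_progress');
  }
}

export async function simulate(
  deliveries: number,
): Promise<{ handlerRuns: number; rejected: number }> {
  const table = new FakeTable();
  let handlerRuns = 0;
  let rejected = 0;

  const deliver = async (key: string): Promise<void> => {
    const existing = await table.lookup(key);
    if (existing) {
      rejected++; // row already there: would be a 409 or a cached replay
      return;
    }
    try {
      await table.insert(key); // only one delivery can win this claim
    } catch {
      rejected++; // lost the race: answer 409 instead of running the handler
      return;
    }
    handlerRuns++; // the side effect happens once
  };

  await Promise.all(Array.from({ length: deliveries }, () => deliver('evt_1')));
  return { handlerRuns, rejected };
}
```

If the duplicate check inside insert is removed, every delivery runs the handler, which is the double-charge scenario in miniature.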

Storing a request_fingerprint. Idempotency keys can be reused incorrectly, most often by clients that retry after editing the payload. Your contract should be: same key, same body, or it is a 422. Without the fingerprint check, a client could send an updated charge amount under the same key and silently get the old response. The Stripe API does this check too; it rejects the request with an idempotency_error when the parameters differ from the original request.

The in_progress status. A retry that arrives while the original is still running is the messiest case. Returning 409 lets the client back off and retry once the original is done, by which point the row will say completed and the cached response will be served. The alternative (waiting on the lock until the handler finishes) ties up a database connection per concurrent retry; under burst conditions, that exhausts your pool.
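The fingerprint and status rules can be condensed into one pure function, which is also easier to unit-test than the middleware. This is only a sketch: the StoredRow and Decision types and the name decide are illustrative, mirroring the columns of the table above.

```typescript
interface StoredRow {
  status: 'in_progress' | 'completed';
  request_fingerprint: string;
  response_code: number | null;
  response_body: unknown;
}

type Decision =
  | { action: 'run_handler' }
  | { action: 'reject'; status: 422 | 409; error: string }
  | { action: 'replay'; status: number; body: unknown };

// Maps "what is stored for this key" to "what to do with this request".
export function decide(row: StoredRow | undefined, fingerprint: string): Decision {
  if (!row) return { action: 'run_handler' }; // first time this key is seen
  if (row.request_fingerprint !== fingerprint) {
    // Same key, different payload: refuse rather than replay a stale response.
    return { action: 'reject', status: 422, error: 'idempotency_key_reuse' };
  }
  if (row.status === 'in_progress') {
    // The original is still running: tell the caller to back off and retry.
    return { action: 'reject', status: 409, error: 'request_in_progress' };
  }
  // Completed: serve the stored response (falling back to 200 if no code was stored).
  return { action: 'replay', status: row.response_code ?? 200, body: row.response_body };
}
```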

Wrapping res.json. The handler runs outside the transaction that locks the row, on purpose. You do not want a long-running handler holding a Postgres lock for its full duration. Instead, persist the result after the handler completes. The trade-off: if the process dies between the handler’s side effects and the UPDATE, the row stays in_progress, and every retry gets a 409 until the row expires or someone clears it by hand. That is a much smaller blast radius than holding the lock the whole time, and it is the trade-off Stripe makes too.

expires_at and a periodic cleanup job. Webhook providers retry for hours, sometimes days. Stripe retries for up to three days; an SQS message can keep coming back until its retention period runs out, which can be configured up to 14 days. A 24-hour TTL is a reasonable default for short retry windows, but treat it as a floor: set it to at least the longest retry window you depend on. Pair it with a DELETE FROM idempotency_keys WHERE expires_at < now() job that runs hourly. If you need longer (PCI-style audit trail), bump the interval; the index keeps the cleanup cheap.

How to test it (the hammer)

A unit test that says “calling the function twice returns the same value” misses the entire point. The bug only happens under concurrency, so the test has to actually fire concurrent requests.

import { test, expect } from 'vitest';
import request from 'supertest';
import { app } from './app';
import { pool } from './db';

test('parallel duplicate webhooks produce one charge', async () => {
  const key = 'evt_test_' + Date.now();
  // The event id doubles as the idempotency key, as it would for a real delivery.
  const body = { id: key, type: 'charge.succeeded', amount: 1500 };

  const sendOnce = () =>
    request(app).post('/webhooks/stripe')
      .set('idempotency-key', key)
      .send(body);

  // Hammer: 25 simultaneous deliveries of the same event.
  const responses = await Promise.all(Array.from({ length: 25 }, sendOnce));

  const succeeded = responses.filter((r) => r.status === 200).length;
  const inProgress = responses.filter((r) => r.status === 409).length;

  expect(succeeded).toBeGreaterThanOrEqual(1); // the winner completed
  expect(succeeded + inProgress).toBe(25);     // everything else was a replay or a 409

  // The key bit: charges_table got exactly one row.
  const { rows } = await pool.query(
    `SELECT count(*)::int AS n FROM charges WHERE event_id = $1`,
    [key],
  );
  expect(rows[0].n).toBe(1);
});

Run it ten times in a row. If even one run produces two charges, you have a race. Without the unique constraint on key, nothing serializes the first insert, and you will typically see a duplicate within a handful of iterations.

For a heavier test, point vegeta at a staging instance:

echo "POST https://staging.example.com/webhooks/stripe
Idempotency-Key: evt_load_test_001
Content-Type: application/json
@./payload.json" \
  | vegeta attack -rate=200 -duration=10s \
  | vegeta report

You should see 200s and 409s only, never two 200s with side effects, never a 5xx. Run a SELECT count(*) against the resulting rows in your charges table; it should be 1.

What still bites you

A few footguns that do not show up in toy examples.

Multi-region or multi-database setups. The lock and the unique constraint only work inside a single Postgres cluster. If you run two regional clusters with eventual replication, simultaneous retries to two regions will both see “no row” and both proceed. The fix is either a single global table for idempotency keys (with the latency cost) or routing the same key to the same region via consistent hashing in your edge layer.
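One way to sketch the routing option is rendezvous (highest-random-weight) hashing, a simple relative of ring-based consistent hashing. It is not necessarily what an edge layer would ship with. The function name regionFor and the region names in the test are placeholders.

```typescript
import { createHash } from 'node:crypto';

// Rendezvous hashing: score every region for this key and pick the highest score.
// The same key always lands on the same region, and removing a region only
// moves the keys that were assigned to it.
export function regionFor(key: string, regions: readonly string[]): string {
  if (regions.length === 0) throw new Error('no regions configured');
  let best = '';
  let bestScore = -1;
  for (const region of regions) {
    // First 4 bytes of SHA-256("region:key") read as an unsigned 32-bit score.
    const score = createHash('sha256')
      .update(`${region}:${key}`)
      .digest()
      .readUInt32BE(0);
    if (score > bestScore) {
      best = region;
      bestScore = score;
    }
  }
  return best;
}
```

The edge layer would call this with the idempotency key before forwarding, so every retry of one operation reaches the cluster that holds its row.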

Handlers that have non-database side effects. If your handler sends an email and writes to Postgres, the idempotency table cannot make the two atomic. A crash after the email but before the write leaves the operation half done, and whichever retry eventually re-runs the handler sends the email again. Either make the email sender itself idempotent (pass the provider a deduplication key if its API supports one), or write the side effects to an outbox table in the same transaction as the data and let a worker deliver them.

The 24-hour window vs. retry horizons. Stripe retries for up to 72 hours. AWS EventBridge for up to 24. If you set the TTL too low, a very late retry sees no row and is processed as new. Match the TTL to the longest provider you talk to, plus a margin. The cleanup job means you pay nothing for the longer window beyond a small storage cost.
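The "longest provider plus a margin" rule is small enough to write down. In this sketch the two horizons are the figures quoted above; the 12-hour default margin and the names RETRY_HORIZON_HOURS and ttlHours are arbitrary choices for illustration.

```typescript
// Retry horizons in hours, taken from the figures quoted above.
const RETRY_HORIZON_HOURS: Record<string, number> = {
  stripe: 72,
  eventbridge: 24,
};

// TTL = longest retry horizon among the providers in use, plus a safety margin.
export function ttlHours(providers: string[], marginHours = 12): number {
  if (providers.length === 0) throw new Error('at least one provider is required');
  const horizons = providers.map((p) => {
    const hours = RETRY_HORIZON_HOURS[p];
    if (hours === undefined) throw new Error(`unknown provider: ${p}`);
    return hours;
  });
  return Math.max(...horizons) + marginHours;
}
```

Failing loudly on an unknown provider is deliberate: a silently defaulted TTL is how a late retry ends up processed as new.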

Body parsing inconsistencies. The fingerprint hashes JSON.stringify(req.body), which is sensitive to key ordering. Postgres-stored payloads or any pre-processing middleware that re-serializes the body can shift keys around and break fingerprint matching. If your stack does this, hash the raw request bytes instead of the parsed object: capture the buffer with the verify option of express.json() and fingerprint that.
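Where the raw bytes are not available, a different technique gets most of the way there: canonicalize the parsed body before hashing so that key order stops mattering. This sketch swaps in sorted-key serialization; canonicalize and stableFingerprint are hypothetical names. Unlike a raw-byte hash, it treats two differently formatted but equivalent JSON documents as the same request.

```typescript
import { createHash } from 'node:crypto';

// Recursively rebuild the value with object keys in sorted order so that
// JSON.stringify produces the same text regardless of insertion order.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const k of Object.keys(source).sort()) {
      sorted[k] = canonicalize(source[k]);
    }
    return sorted;
  }
  return value; // strings, numbers, booleans, null
}

// SHA-256 over method, path, and the canonical JSON text of the body.
export function stableFingerprint(method: string, path: string, body: unknown): string {
  return createHash('sha256')
    .update(`${method} ${path} ${JSON.stringify(canonicalize(body))}`)
    .digest('hex');
}
```

Array order is preserved on purpose, since reordering array elements usually does change the meaning of a payload.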

Bursty providers. Some webhook providers will redeliver a stuck event hundreds of times. If your handler is slow, every delivery sits in the in_progress window and replies 409. The provider keeps retrying. The fix is to make the handler fast (under a second is the right target) or to acknowledge the webhook immediately and process it on a queue, where the queue worker handles its own idempotency.
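The acknowledge-then-queue shape looks roughly like this. DedupingQueue is a hypothetical in-memory stand-in: a real setup would use a durable queue and keep the seen IDs in a table such as idempotency_keys, because an in-process Set does not survive a restart.

```typescript
interface WebhookEvent {
  id: string;
  type: string;
}

// In-memory stand-in for a durable queue plus a seen-ID store (illustration only).
export class DedupingQueue {
  private seen = new Set<string>();
  private pending: WebhookEvent[] = [];

  // Called from the HTTP handler: cheap, so the 2xx can go out right away.
  // Returns false when this event ID was already accepted.
  accept(event: WebhookEvent): boolean {
    if (this.seen.has(event.id)) return false;
    this.seen.add(event.id);
    this.pending.push(event);
    return true;
  }

  // Called by the worker: runs the slow handler once per accepted event
  // and returns how many events it processed.
  drain(handler: (event: WebhookEvent) => void): number {
    let processed = 0;
    let event: WebhookEvent | undefined;
    while ((event = this.pending.shift()) !== undefined) {
      handler(event);
      processed++;
    }
    return processed;
  }
}
```

The HTTP route replies 200 whether accept returned true or false, so a redelivery storm costs one set lookup per request instead of one slow handler run.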

The metric that proves it shipped

The most useful chart is “duplicate side-effect count per day”: a SELECT event_id, count(*) FROM charges GROUP BY event_id HAVING count(*) > 1, queried daily. Before the fix, this is a non-zero number with a long tail of two or three duplicates a week. After the fix, it is zero, and any non-zero day is a real incident worth paging on.

If you cannot run that query because you do not have an event_id column on your side-effect tables, add one first. Without it, you have no way to tell whether duplicates are happening at all. They show up as “weird customer complaint about being charged twice” with no way to back-trace.

The takeaway

Idempotency is one of those things every team agrees they should do and almost no team actually implements until a customer is double-charged. The reason is that the toy version (a Set of seen IDs) does not survive process restarts, and the next version (a bare INSERT ... ON CONFLICT DO NOTHING) stops the duplicate but cannot tell a retry whether the original finished or what it returned. The third version is the one above, and by the time you get there, you have already shipped the bug.

Skip the first two iterations. The 30-line version with a primary-key claim, a status column, and a fingerprint is the version you want from day one. It costs a single table, one middleware, and an afternoon. The next time a customer’s card glitches and your load balancer returns a 502 mid-charge, you will sleep through the retry storm.

A note from Yojji

Reliability work like this (the unglamorous middleware that decides whether your billing is correct, your queue is honest, and your on-call goes a week without paging) is the kind of thing Yojji has been shipping since 2016.

Yojji is an international custom software development company with offices across Europe, the US, and the UK. Their teams specialize in the JavaScript stack (React, Node.js, TypeScript), cloud platforms (AWS, Azure, Google Cloud), and microservices architecture. They run dedicated, senior outstaffed teams for long-running engagements, plus full-cycle product work covering discovery, design, development, QA, and DevOps.

If your team would rather hire the practice of building safe-to-retry systems than learn it the hard way after a refund spreadsheet, Yojji is worth a conversation.
