Reliable Cron Jobs: The Pattern That Stops Double Runs, Missed Executions, And The 2 AM Page

The Practical Developer


The Practica · 2026-05-21 · via The Practical Developer

The daily report job is supposed to run at 06:00. It emails a summary of yesterday’s transactions to the finance team. Last Tuesday it sent two copies, 40 minutes apart. On Wednesday it sent nothing at all. The ops channel got a ping at 08:15 from someone asking if the system was broken. It was not broken. The cron daemon was running. The Node process had started the job, hung on a slow database query, and the next day’s scheduled run fired while the first was still stuck. That collision corrupted the state. The missed Wednesday run? A deploy restarted the container at 05:59 and the cron scheduler lost the tick. The double Tuesday run? No overlap guard meant the second process started in parallel, and both jobs read the same unmarked rows, computed the same totals, and sent the same emails.

Every team that runs cron jobs in production has a version of this story. The fix is not a bigger cron scheduler. It is a tiny set of rules: prevent overlap, detect missed runs, make the work idempotent, and record what happened. This post shows the Postgres-backed pattern that implements all four in about 80 lines of TypeScript. It works with node-cron, bree, node-schedule, or a simple setInterval. The scheduler is interchangeable. The state table is what matters.

What a cron job actually needs

A production cron job is not a shell script that emails output to root. It is a distributed task with four failure modes:

Overlap. The previous run is still going when the next tick fires. Without a lock, both runs execute concurrently and corrupt or duplicate work.
Missed execution. The server is down, deploying, or paused during the scheduled window. The tick is lost forever unless something notices.
Silent failure. The job throws, logs to stdout, and nobody reads the log until a downstream human complains.
Non-idempotent side effects. Even with a perfect lock, if the process crashes after the work is done but before the lock is released, a retry or recovery run may repeat the side effects.

The standard solutions people reach for (Redis locks, external job queues, Kubernetes CronJobs with suspend logic) are fine, but most teams already have Postgres. A small state table in the same database gives you overlap detection, execution history, and failure recovery without adding new infrastructure.

The state table

Create one table. It is the source of truth for every scheduled task.

CREATE TABLE cron_jobs (
  name TEXT PRIMARY KEY,
  schedule_interval INTERVAL NOT NULL,
  last_run_at TIMESTAMPTZ,
  last_duration_ms INT,
  last_status TEXT CHECK (last_status IN ('started', 'success', 'failure')),
  locked_until TIMESTAMPTZ,
  next_expected_at TIMESTAMPTZ,
  failure_count INT NOT NULL DEFAULT 0
);

INSERT INTO cron_jobs (name, schedule_interval)
VALUES ('daily-finance-report', '1 day');

The columns:

name: the job identifier.
schedule_interval: how often it should run, as a Postgres interval. 1 day, 1 hour, 15 minutes.
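last_run_at: when the last execution started.
last_duration_ms: how long the last execution’s work took, in milliseconds.
last_status: started while a run is in flight, then success or failure once it finishes.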
locked_until: a coarse time boundary during which the job is considered “in flight.” If a process dies, the lock expires and another runner can pick it up.
next_expected_at: the scheduled time of the next run. Used to detect missed executions.
failure_count: consecutive failures, so you can alert before the human complaint arrives.
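For application code, it can help to mirror the row as a TypeScript type. The sketch below is only one way to model it: the CronJobRow and newCronJobRow names are hypothetical, and the field types assume node-postgres's default parsing (TIMESTAMPTZ as Date, INT as number, NULL as null). schedule_interval is left out because its JavaScript representation depends on how the driver parses intervals.

```typescript
// Hypothetical TypeScript mirror of one cron_jobs row (schedule_interval omitted).
type CronJobStatus = 'started' | 'success' | 'failure';

interface CronJobRow {
  name: string;
  last_run_at: Date | null;
  last_duration_ms: number | null;
  last_status: CronJobStatus | null;
  locked_until: Date | null;
  next_expected_at: Date | null;
  failure_count: number;
}

// What a row looks like right after the INSERT above: nothing has run yet,
// so every nullable column is null and failure_count takes its default of 0.
function newCronJobRow(name: string): CronJobRow {
  return {
    name: name,
    last_run_at: null,
    last_duration_ms: null,
    last_status: null,
    locked_until: null,
    next_expected_at: null,
    failure_count: 0,
  };
}
```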

Overlap prevention with advisory locks

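Postgres advisory locks are one way to do this. They are lightweight, can be scoped to a job by hashing its name into the lock key, and are released automatically when the session or transaction that holds them ends. The worker below takes a slightly different route so that the lock can outlive the short claiming transaction: it serializes runners with a row lock (SELECT ... FOR UPDATE) on the job’s cron_jobs row, then records a time-bound lease in locked_until.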

The worker loop looks like this:

import pg from 'pg';
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });

export async function runCronJob(
  name: string,
  work: () => Promise<void>,
  maxDurationMs: number = 300_000,
) {
  // Phase 1: claim the job inside a short transaction on a dedicated client.
  const client = await pool.connect();
  try {
    await client.query('BEGIN');

    // 1. Lock the job row so all concurrent runners serialize on it.
    const { rows } = await client.query(
      `SELECT locked_until, last_run_at, next_expected_at, last_status
       FROM cron_jobs
       WHERE name = $1
       FOR UPDATE`,
      [name],
    );

    if (rows.length === 0) {
      throw new Error(`Unknown cron job: ${name}`);
    }

    const job = rows[0];
    const now = new Date();

    // 2. Overlap check: is another runner still holding the lock?
    if (job.locked_until && job.locked_until > now) {
      await client.query('COMMIT');
      console.log(`[${name}] skipped: still locked until ${job.locked_until.toISOString()}`);
      return; // the finally block below still releases the client
    }

    // 3. Missed-execution detection. If next_expected_at passed without a success,
    //    we still run now, but we log the gap for alerting.
    if (job.next_expected_at && job.next_expected_at < now && job.last_status !== 'success') {
      console.warn(`[${name}] missed expected run at ${job.next_expected_at.toISOString()}`);
    }

    // 4. Set the lock and mark started. The explicit casts keep Postgres from
    //    having to infer the type of $2 from the interval arithmetic.
    const lockedUntil = new Date(now.getTime() + maxDurationMs);
    await client.query(
      `UPDATE cron_jobs
       SET locked_until = $1,
           last_run_at = $2::timestamptz,
           last_status = 'started',
           next_expected_at = $2::timestamptz + schedule_interval
       WHERE name = $3`,
      [lockedUntil, now, name],
    );

    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK').catch(() => {});
    throw err;
  } finally {
    // Release exactly once on every path (skip, error, or successful claim).
    // The work itself does not need this client or its transaction.
    client.release();
  }

  // 5. Run the actual work, outside any transaction.
  const start = Date.now();
  let status: 'success' | 'failure' = 'success';
  try {
    await work();
  } catch (err) {
    status = 'failure';
    console.error(`[${name}] failed:`, err);
    throw err; // rethrow so your outer error tracker (Sentry, etc.) catches it
  } finally {
    // 6. Record the outcome, releasing the lock.
    const duration = Date.now() - start;
    await pool.query(
      `UPDATE cron_jobs
       SET last_status = $1::text,
           last_duration_ms = $2,
           locked_until = NULL,
           failure_count = CASE WHEN $1::text = 'success' THEN 0 ELSE failure_count + 1 END
       WHERE name = $3`,
      [status, duration, name],
    );
  }
}

A few things to notice:

FOR UPDATE on the cron_jobs row means two concurrent runners serialize. One wins, the other sees locked_until in the future and skips.
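The lock is time-bound (maxDurationMs). If the process dies, locked_until eventually passes and the next runner picks the job up automatically. No orphan locks. The flip side is that a run which outlives maxDurationMs no longer blocks the next one, so set it comfortably above the job’s longest expected duration.
The next_expected_at is computed from schedule_interval at the moment we mark started. That means if the job starts at 06:03 instead of 06:00 (because the server was restarting), the next expectation is 06:03 tomorrow. A late start shifts the next expectation by the same amount, but the shift does not compound: the scheduler still fires at 06:00 the following day, and that run resets the expectation.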
The actual work() runs outside the transaction. The transaction only mutates the small state row. If work() takes five minutes, you are not holding a transaction open for five minutes.
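The lease rule in the notes above can be checked without a database. The following is a minimal in-memory sketch of the same decision, not the production code path: LeaseState, tryClaim, and release are hypothetical names, times are epoch milliseconds, and a single process stands in for the row lock that serializes real runners.

```typescript
// In-memory stand-in for the locked_until column of one cron_jobs row.
interface LeaseState {
  lockedUntil: number | null; // epoch milliseconds, or null when unlocked
}

// Mirrors steps 2 and 4 of runCronJob: skip if a lease is still in force,
// otherwise take a new lease that expires after maxDurationMs.
function tryClaim(state: LeaseState, nowMs: number, maxDurationMs: number): boolean {
  if (state.lockedUntil !== null && state.lockedUntil > nowMs) {
    return false; // another runner is considered in flight
  }
  state.lockedUntil = nowMs + maxDurationMs;
  return true;
}

// Mirrors step 6: a finished run clears the lease.
function release(state: LeaseState): void {
  state.lockedUntil = null;
}

// Walk-through with a 5-minute lease.
const lease: LeaseState = { lockedUntil: null };
const firstClaim = tryClaim(lease, 0, 300_000);        // runner A claims the job
const secondClaim = tryClaim(lease, 1_000, 300_000);   // runner B arrives 1 s later and skips
// Runner A crashes here and never calls release().
const recoveryClaim = tryClaim(lease, 300_001, 300_000); // lease has expired, runner C claims
```

In this walk-through firstClaim and recoveryClaim are true and secondClaim is false. The same function also shows the limitation of a lease: if runner A were merely slow rather than dead, runner C would still be allowed in once the lease expired.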

Wiring it to a scheduler

Here is the integration with node-cron. Any scheduler works; the function above is the guard.

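import cron from 'node-cron';
// runCronJob and pool come from the module shown above. getYesterdayRange and
// sendFinanceEmail are application helpers that are not shown in this post.

cron.schedule('0 6 * * *', async () => {
  await runCronJob('daily-finance-report', async () => {
    const yesterday = getYesterdayRange();
    // pool.query resolves to a result object; the data lives in its rows array.
    // Only unreported rows are selected, so a repeat run finds nothing new.
    const { rows } = await pool.query(
      `SELECT sum(amount) AS total, count(*) AS cnt
       FROM transactions
       WHERE created_at BETWEEN $1 AND $2
         AND reported = false`,
      [yesterday.start, yesterday.end],
    );
    await sendFinanceEmail(rows[0]);
    await pool.query(
      `UPDATE transactions SET reported = true
       WHERE created_at BETWEEN $1 AND $2
         AND reported = false`,
      [yesterday.start, yesterday.end],
    );
  }, 300_000);
});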

The scheduler fires the callback every day at 06:00. The callback may fire while a previous instance is still running, but runCronJob handles that. The work() function inside is ordinary TypeScript. It does not need to know about locking.
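The post does not show getYesterdayRange, so here is one possible shape for it, offered as an assumption rather than the author's helper. This sketch treats "yesterday" as the previous calendar day in UTC and returns the start of that day and the start of today; the DateRange type and the optional now parameter (handy for tests) are hypothetical.

```typescript
interface DateRange {
  start: Date; // inclusive: 00:00:00.000 UTC yesterday
  end: Date;   // exclusive: 00:00:00.000 UTC today
}

// Hypothetical helper: the previous calendar day in UTC.
// UTC has no daylight-saving shifts, so a fixed 24-hour step is safe here.
function getYesterdayRange(now: Date = new Date()): DateRange {
  const todayStartMs = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
  const dayMs = 24 * 60 * 60 * 1000;
  return {
    start: new Date(todayStartMs - dayMs),
    end: new Date(todayStartMs),
  };
}
```

Note that BETWEEN is inclusive on both ends. With a range like this one, a filter of the form created_at >= $1 AND created_at < $2 avoids counting a transaction stamped exactly at midnight in two consecutive reports. If the finance team works in a local time zone, the day boundaries would need to be computed in that zone instead.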

Testing overlap before production proves it

The worst time to discover your overlap guard is broken is when two containers both pick up the job after a deploy. Test it locally with two Node processes pointing at the same database.

Open two terminals. In both, run a small script that calls runCronJob with a 10-second sleep inside work(). The first process should acquire the lock and sleep. The second should log skipped: still locked and exit immediately. After 10 seconds, the first process should release the lock and update last_status to success.

If both processes sleep, your FOR UPDATE is not working. Check that both use the same database and that the table row exists. If the second process never runs even after the first finishes, your locked_until is not being cleared. Check the finally block in runCronJob.

This test takes 30 seconds and saves you the Tuesday double-email incident.

Detecting missed executions

The next_expected_at column exists for one reason: alerting. A separate monitoring query, run every few minutes by your health checker or Prometheus exporter, detects jobs that are overdue:

SELECT name,
       next_expected_at,
       now() - next_expected_at AS overdue_by
FROM cron_jobs
WHERE next_expected_at < now() - interval '5 minutes'
  AND (locked_until IS NULL OR locked_until < now());

If next_expected_at was 06:00 and it is now 06:10, and the lock is not held, the job missed its window. Alert on this. The five-minute grace period avoids noise from clock skew and short deploy restarts.
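If the health checker already holds the rows in memory, the same rule can be expressed and unit-tested in TypeScript. This is a sketch that mirrors the SQL above under stated assumptions: MonitoredJob, OverdueJob, and findOverdue are hypothetical names, and the default grace period is the same five minutes.

```typescript
interface MonitoredJob {
  name: string;
  nextExpectedAt: Date | null;
  lockedUntil: Date | null;
}

interface OverdueJob {
  name: string;
  overdueByMs: number;
}

// A job is overdue when its expected time is more than graceMs in the past
// and no unexpired lock says a run is currently in flight.
function findOverdue(
  jobs: MonitoredJob[],
  now: Date,
  graceMs: number = 5 * 60 * 1000,
): OverdueJob[] {
  const nowMs = now.getTime();
  const overdue: OverdueJob[] = [];
  for (let i = 0; i < jobs.length; i++) {
    const job = jobs[i];
    if (job.nextExpectedAt === null) continue; // never scheduled yet, as in SQL
    const expectedMs = job.nextExpectedAt.getTime();
    const lockHeld = job.lockedUntil !== null && job.lockedUntil.getTime() >= nowMs;
    if (expectedMs < nowMs - graceMs && !lockHeld) {
      overdue.push({ name: job.name, overdueByMs: nowMs - expectedMs });
    }
  }
  return overdue;
}
```

Keeping the rule in a pure function like this makes the grace period and the lock exemption easy to cover with tests before wiring it to an alert.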

You can also alert on failure_count:

SELECT name, failure_count
FROM cron_jobs
WHERE failure_count >= 2;

Two consecutive failures mean something is structurally wrong, not just a transient blip.

Making the work idempotent

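The lock prevents double runs under normal conditions. But what if the job succeeds, writes the reported = true flag, crashes before the lock is released, and a recovery run picks it up? As long as the report query only selects rows where reported = false, the next runner finds nothing left to report and produces an empty report. That is fine. The email might be empty, which is annoying but not harmful. The less benign case is a crash after the email is sent but before the rows are marked: the recovery run sends the same report again.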

If your side effect is not naturally idempotent (sending a webhook, charging a fee, creating an invoice), build idempotency into the work itself. An idempotency key table, a processed_at timestamp on the target rows, or a deduplication hash of the inputs. The full pattern is in the post on idempotency keys. The short version: every side effect should be safe to repeat, because in distributed systems the repeat will happen eventually.
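As a rough illustration of the idempotency-key idea, the sketch below derives a deterministic key from the job name and its input range and refuses to repeat a side effect for a key it has already seen. Everything here is hypothetical: idempotencyKey and runOnce are made-up names, and the in-memory object stands in for what would normally be a table with a unique constraint on the key.

```typescript
// In-memory stand-in for a table with a UNIQUE constraint on the key column.
const processedKeys: { [key: string]: true } = {};

// The same job over the same input range always yields the same key.
function idempotencyKey(jobName: string, start: Date, end: Date): string {
  return jobName + ':' + start.toISOString() + ':' + end.toISOString();
}

// Runs sideEffect only if this key has not been recorded yet.
// Returns true when the side effect ran, false when it was skipped as a repeat.
function runOnce(key: string, sideEffect: () => void): boolean {
  if (Object.prototype.hasOwnProperty.call(processedKeys, key)) {
    return false;
  }
  sideEffect();
  processedKeys[key] = true; // recorded only after the side effect completed
  return true;
}
```

Recording the key after the side effect means a crash between the two steps leads to a repeat, while recording it first means a crash can lose the side effect. With a real database, writing the key in the same transaction as the effect closes that gap for database writes; for external calls such as webhooks, the usual approach is to send the key along so the receiver can deduplicate.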

The dashboard you need

With the cron_jobs table, a one-page dashboard is trivial:

SELECT name,
       last_run_at,
       last_duration_ms,
       last_status,
       next_expected_at,
       failure_count
FROM cron_jobs
ORDER BY name;

You can ship this as an admin API route in 10 lines. It tells you which jobs are healthy, which are failing, which are overdue, and how long they take. Compare this to crontab -l on a server you cannot SSH into, where the only visibility is “did it email root?”

When this pattern is not enough

This design prevents concurrent runs of a job, even when several processes share the database, but it does not distribute work, retry failed runs, or queue anything. If you have multiple servers and need retries, dead-letter queues, or horizontal scaling across the fleet, use a real job queue: pg-boss, BullMQ, or a SaaS option. Those give you that machinery out of the box.

Also, if your job needs sub-minute precision, cron is the wrong tool. Use a streaming consumer, a webhook handler, or an event-driven architecture.

The migration path from a naked cron script to this pattern is straightforward. Add the cron_jobs table, insert a row for your task, wrap the existing work function in runCronJob, and keep the scheduler unchanged. You do not need to rewrite the work. You only need to wrap it. Most migrations take 20 minutes and a single deploy.

The Postgres-backed cron guard is the 80% solution: it takes an existing cron script and makes it safe, observable, and debuggable without adding Redis, a message broker, or a new service.

The takeaway

Do not trust node-cron alone. It schedules fine, but it knows nothing about whether the previous run finished, whether the server was down during the window, or whether the work is safe to repeat. Add one small table, one locking function, and two alerting queries. The result is scheduled tasks that skip overlapping runs safely, announce their own missed runs, and expose their health in a query any developer can read.

The next time someone proposes “let’s just add a cron job,” ask how it will handle overlap, how you will know if it misses a run, and what happens if it runs twice. If the answer is “it probably won’t,” show them the 80-line pattern above.

A note from Yojji

The difference between a cron job that silently fails and one that self-reports its health is the same difference that separates prototype infrastructure from production infrastructure. Yojji’s teams build that kind of operational rigor into every engagement.

Yojji is an international custom software development company founded in 2016, with teams across Europe, the US, and the UK. They specialize in the JavaScript ecosystem, cloud platforms, and full-cycle product engineering, including the background-task reliability that keeps daily reports and data pipelines running without the 2 AM page.
