
Building TaskForge: Translating Enterprise Chaos into an Open-Source Scheduler
Sabir · 2026-05-26 · via DEV Community

If you build enterprise software, you know the pain: you spend months solving complex architectural challenges, navigating network partitions, and building highly resilient systems, and you can never show it to anyone because it is locked behind corporate NDAs.

At my day job, I worked heavily with a distributed job scheduler backed by Cassandra. Navigating those massive asynchronous workflows, database bottlenecks, and unpredictable worker crashes taught me invaluable lessons about distributed systems. I was incredibly proud of the architectural patterns I had mastered, but when I went to update my portfolio, I realized I had zero public proof of my backend engineering depth.

So, I decided to build a brand new, open-source project to demonstrate those concepts.

The result is TaskForge. This isn’t a clone of my previous work. It is a fresh implementation inspired by the failure modes I had learned to handle. It gave me something rare: a completely free playground. Instead of navigating rigid legacy architectures and corporate red tape, I had a blank canvas. I took the opportunity to experiment with different DevOps constraints, build out a strict monorepo, and completely change the database engine.

[Image: TaskForge Console]

Here is a look under the hood of how I built it, the intentional tradeoffs I made, and the boss fights I encountered along the way.

[Image: TaskForge Architecture]

The Core Engine: A Strict State Machine

At the heart of TaskForge is a highly structured lifecycle. Every job in the system moves through a strict state machine inside PostgreSQL:

  • PENDING: The job is scheduled in the DB and waiting for its time to run.
  • PROCESSING: The Scheduler has reserved the job and published it to RabbitMQ.
  • RUNNING: A Worker has claimed the job from the queue and is executing the business logic.
  • COMPLETED / FAILED: The terminal states (success, or max attempts exhausted).
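The lifecycle above can be sketched as a small transition table. This is an illustrative sketch, not TaskForge's actual code: `TRANSITIONS` and `canTransition` are hypothetical names, and the backward edges to PENDING are inferred from the retry and stale-lease recovery behavior described later in this article.

```typescript
// The five status values come from the lifecycle above.
type JobStatus = "PENDING" | "PROCESSING" | "RUNNING" | "COMPLETED" | "FAILED";

// Hypothetical transition table; the edges back to PENDING are assumptions
// based on the retry path and the stale lease sweeper described in the article.
const TRANSITIONS: Record<JobStatus, readonly JobStatus[]> = {
  PENDING: ["PROCESSING"],                      // scheduler reserves and publishes
  PROCESSING: ["RUNNING", "PENDING"],           // worker claim, or stale-lease recovery
  RUNNING: ["COMPLETED", "FAILED", "PENDING"],  // success, attempts exhausted, or retry
  COMPLETED: [],                                // terminal
  FAILED: [],                                   // terminal
};

// Returns true when the lifecycle allows moving from `from` to `to`.
function canTransition(from: JobStatus, to: JobStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```

A table like this is mostly useful as documentation and in tests; in the database itself the same rules are enforced by the `WHERE status = ...` guards on each update.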

The Great Pivot & The Atomic Claim

The biggest architectural shift I made for this new implementation was moving away from Cassandra. Cassandra is a beast for decentralized, high-throughput systems, but trying to achieve a global, atomic lock on a specific job without creating a massive bottleneck in a masterless database is a headache.

For TaskForge, I pivoted to PostgreSQL. I wanted to utilize the ACID guarantees of a relational database to handle race conditions safely.

When the background Scheduler sweeps the database for due jobs, it uses SELECT ... FOR UPDATE SKIP LOCKED. This is a critical pattern: it allows multiple scheduler instances to safely sweep for PENDING jobs in parallel without blocking each other.
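As a rough sketch of that sweep, the select and the status change can be folded into a single statement. The column names `status`, `run_at`, and `locked_at` appear elsewhere in this article; the helper name `buildSweepQuery`, the batch size parameter, and the choice to stamp `locked_at` at reservation time are assumptions, and the real scheduler may well split this into separate steps.

```typescript
// Shape accepted by node-postgres: pool.query({ text, values }).
type SqlQuery = { text: string; values: unknown[] };

// Hypothetical helper: reserve up to `batchSize` due jobs in one statement.
function buildSweepQuery(batchSize: number): SqlQuery {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  return {
    text: `UPDATE jobs
              SET status = 'PROCESSING',
                  locked_at = NOW()            -- assumption: stamped for later lease checks
            WHERE id IN (
                    SELECT id
                      FROM jobs
                     WHERE status = 'PENDING'
                       AND run_at <= NOW()
                     ORDER BY run_at
                     LIMIT $1
                       FOR UPDATE SKIP LOCKED  -- rows locked by another scheduler are skipped
                  )
        RETURNING id`,
    values: [batchSize],
  };
}
```

Two scheduler instances running this at the same moment would each lock a disjoint set of rows: whichever rows the first one has locked are simply invisible to the second one's subquery until the first transaction ends.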

Once the scheduler pushes the job to RabbitMQ, a Node.js Worker picks it up. Because delivery is at-least-once, several workers can receive duplicate copies of the same message, and we have to guarantee that only one of them claims the job. Each worker therefore executes a targeted conditional update:

// The Atomic Worker Claim
const { rows } = await pool.query(
  `UPDATE jobs
   SET status = 'RUNNING',
       attempts = attempts + 1,
       locked_at = NOW(),
       locked_by = $2
   WHERE id = $1
     AND status = 'PROCESSING'
     AND run_at <= NOW()
   RETURNING *`,
  [jobId, WORKER_ID]
);
const job = rows[0];

if (!job) {
  // Another worker already claimed it, or it was cancelled.
  channel.ack(msg);
  return;
}


This query returns the job only to the worker that won the conditional update, which prevents two workers from claiming the same job row. It does not make execution exactly-once: because RabbitMQ delivery is at-least-once, a job can still run again after a retry or a redelivery, so external side effects still require idempotency.
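To see why only one worker wins, here is a small in-memory stand-in for the conditional update. It is a simulation under simplifying assumptions, not the real query path: `claim`, `Row`, and the worker IDs are hypothetical, and a `Map` replaces the `jobs` table. In PostgreSQL the same effect comes from row-level locking: concurrent updates to one row are serialized, and under the default READ COMMITTED isolation a waiting update re-checks its `WHERE` clause once the first one commits.

```typescript
type Row = {
  id: string;
  status: "PROCESSING" | "RUNNING";
  attempts: number;
  lockedBy: string | null;
};

// Stands in for: UPDATE jobs SET ... WHERE id = $1 AND status = 'PROCESSING' RETURNING *
function claim(table: Map<string, Row>, jobId: string, workerId: string): Row | undefined {
  const row = table.get(jobId);
  // The WHERE clause did not match: no row is returned to this caller.
  if (!row || row.status !== "PROCESSING") return undefined;
  row.status = "RUNNING";
  row.attempts += 1;
  row.lockedBy = workerId;
  return { ...row };
}

const table = new Map<string, Row>([
  ["job-1", { id: "job-1", status: "PROCESSING", attempts: 0, lockedBy: null }],
]);

// Five workers receive the same duplicate message and all try to claim it.
const winners = ["w1", "w2", "w3", "w4", "w5"].filter(
  (workerId) => claim(table, "job-1", workerId) !== undefined
);
console.log(winners); // only the first caller gets the row back
```

The four losing workers see an empty result, which corresponds to the `if (!job)` branch in the excerpt above: they acknowledge the message and move on.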

The Message Broker: Gaps and Guarantees

To ensure the workers don’t get overwhelmed, RabbitMQ is configured defensively. I utilized prefetch(1) to ensure workers only ingest what they can immediately process, enabled publisher confirms, and routed poisoned messages to a Dead-Letter Queue (DLQ).
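A minimal sketch of that setup follows. The `ChannelLike` interface mirrors a subset of the amqplib promise-API channel methods so the example stays self-contained; the queue and exchange names (`jobs`, `jobs.dlx`, `jobs.dlq`) are hypothetical, since the article does not list the real ones.

```typescript
// Subset of the amqplib channel methods used here (promise API).
interface ChannelLike {
  assertExchange(exchange: string, type: string, options?: { durable?: boolean }): Promise<unknown>;
  assertQueue(
    queue: string,
    options?: { durable?: boolean; deadLetterExchange?: string; deadLetterRoutingKey?: string }
  ): Promise<unknown>;
  bindQueue(queue: string, source: string, pattern: string): Promise<unknown>;
  prefetch(count: number): Promise<unknown>;
}

// Hypothetical names, chosen only for this sketch.
const JOBS_QUEUE = "jobs";
const DLX = "jobs.dlx";
const DLQ = "jobs.dlq";

async function setupTopology(ch: ChannelLike): Promise<void> {
  // Dead-letter exchange and queue for poisoned messages.
  await ch.assertExchange(DLX, "direct", { durable: true });
  await ch.assertQueue(DLQ, { durable: true });
  await ch.bindQueue(DLQ, DLX, DLQ);

  // Messages rejected without requeue on the main queue are routed to the DLX.
  await ch.assertQueue(JOBS_QUEUE, {
    durable: true,
    deadLetterExchange: DLX,
    deadLetterRoutingKey: DLQ,
  });

  // Deliver at most one unacknowledged message to this consumer at a time.
  await ch.prefetch(1);
}
```

With amqplib, the publishing side would typically obtain its channel from `connection.createConfirmChannel()` to get publisher confirms, and a worker would dead-letter a poisoned message with `channel.nack(msg, false, false)`, where the final `false` disables requeueing.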

One of the most interesting challenges here is the DB-before-RabbitMQ gap.

In TaskForge, the Scheduler updates a job to PROCESSING in Postgres before publishing it to RabbitMQ. If the publish fails due to a network blip, the job is stuck in PROCESSING but is not in the queue. This is a deliberate at-least-once tradeoff. The system relies on a stale lease recovery sweeper that notices jobs stuck in PROCESSING for too long and kicks them back to PENDING.
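A recovery sweep of this kind could look roughly like the query below. This is a sketch under assumptions: it presumes the scheduler stamps `locked_at` when it moves a job to PROCESSING, and the helper name `buildStaleLeaseQuery` and the `leaseSeconds` threshold are hypothetical. The article does not show the sweeper's actual SQL.

```typescript
// Shape accepted by node-postgres: pool.query({ text, values }).
type SqlQuery = { text: string; values: unknown[] };

// Hypothetical helper: return jobs stuck in PROCESSING for longer than the lease to PENDING.
function buildStaleLeaseQuery(leaseSeconds: number): SqlQuery {
  if (!(leaseSeconds > 0)) {
    throw new RangeError("leaseSeconds must be greater than zero");
  }
  return {
    text: `UPDATE jobs
              SET status = 'PENDING',
                  locked_at = NULL,
                  locked_by = NULL
            WHERE status = 'PROCESSING'
              AND locked_at < NOW() - ($1 * INTERVAL '1 second')  -- assumption: locked_at set at reservation
        RETURNING id`,
    values: [leaseSeconds],
  };
}
```

The lease needs to be comfortably longer than the normal publish-to-claim latency. If it is too short, the sweeper re-queues jobs that are merely waiting in RabbitMQ, which produces extra duplicate messages for the atomic claim to reject.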

Engineering for Chaos

The “happy path” in distributed systems is boring. Real engineering happens when things break.

If a TaskForge worker fails to process a job due to a failing third-party API, it catches the error and calculates an exponential backoff delay. Notice that because we already incremented the attempts during the atomic claim above, the retry path simply reads the current attempts and kicks the job back to the penalty box:


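// The Penalty Box (Exponential Backoff)
// attempts was already incremented by the atomic claim, so the first retry
// waits 2^1 * 5 = 10 seconds, the second 2^2 * 5 = 20 seconds, and so on.
const currentAttempts = jobState.attempts;
const delaySeconds = Math.pow(2, currentAttempts) * 5;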
const retryResult = await pool.query(
  `UPDATE jobs
   SET status = 'PENDING',
       run_at = NOW() + ($1 * INTERVAL '1 second'),
       locked_at = NULL,
       locked_by = NULL
   WHERE id = $2
     AND locked_by = $3`,
  [delaySeconds, jobId, WORKER_ID]
);

if (retryResult.rowCount === 0) {
  channel.ack(msg);
  return;
}
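
To make the delay formula concrete, here it is as a standalone function. The formula `2^attempts * 5` is taken from the excerpt above; the function name `backoffSeconds` and the optional `maxSeconds` cap are hypothetical additions for illustration, and nothing in the article says TaskForge caps the delay.

```typescript
// Delay before the next retry, given how many attempts have already been made.
// maxSeconds is a hypothetical cap, not part of the original formula.
function backoffSeconds(attempts: number, baseSeconds = 5, maxSeconds = Infinity): number {
  return Math.min(maxSeconds, Math.pow(2, attempts) * baseSeconds);
}

// attempts: 1   2   3   4   5
// delay:    10  20  40  80  160 (seconds)
const schedule = [1, 2, 3, 4, 5].map((attempts) => backoffSeconds(attempts));
console.log(schedule);
```

Without a cap the delay doubles indefinitely, so the tenth attempt would wait 5120 seconds (about 85 minutes). Whether that is acceptable depends on the configured maximum number of attempts.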


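But what if the worker container is told to stop mid-process, for example during a deploy or a scale-down? To prevent a “Zombie Worker” scenario, I intercept the operating system’s shutdown signals. Graceful shutdown stops new consumption and gives active jobs time to finish. A hard crash or sudden power loss cannot be intercepted; those cases fall back to RabbitMQ redelivery and the stale lease recovery sweeper.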


// Intercepting the Reaper (Graceful Shutdown)
const shutdownConsumer = async (signal: string) => {
  isShuttingDown = true;
  if (rabbitChannel && consumerTag) {
    await rabbitChannel.cancel(consumerTag);
  }

  const deadline = Date.now() + SHUTDOWN_TIMEOUT_MS;
  while (activeJobs > 0 && Date.now() < deadline) {
    await sleep(500);
  }

  if (rabbitChannel) await rabbitChannel.close();
  if (rabbitConnection) await rabbitConnection.close();
  await pool.end();

  process.exit(activeJobs === 0 ? 0 : 1);
};


If the timeout hits before jobs finish, closing RabbitMQ causes those unacknowledged messages to be safely redelivered to another worker, while the stale lease recovery sweeper later reconciles the abandoned DB locks.
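The shutdown excerpt calls a `sleep` helper that is not shown. A minimal sketch of the drain loop with that helper spelled out is below; `drain` and its injectable `now` and `sleep` parameters are hypothetical, added so the loop can be exercised without waiting on a real clock.

```typescript
// Promise-based sleep, assumed to be similar to the helper used in the excerpt.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Polls until no jobs are active or the deadline passes.
// Resolves to true when everything drained in time, false on timeout.
async function drain(
  getActiveJobs: () => number,
  timeoutMs: number,
  pollMs = 500,
  now: () => number = Date.now,
  wait: (ms: number) => Promise<void> = sleep
): Promise<boolean> {
  const deadline = now() + timeoutMs;
  while (getActiveJobs() > 0 && now() < deadline) {
    await wait(pollMs);
  }
  return getActiveJobs() === 0;
}

// In a worker this would typically be wired to the usual container stop signals:
//   process.once("SIGTERM", () => void shutdownConsumer("SIGTERM"));
//   process.once("SIGINT", () => void shutdownConsumer("SIGINT"));
```

The boolean result maps onto the exit code in the excerpt: `0` when all active jobs finished, `1` when the timeout cut them off and recovery is left to redelivery and the sweeper.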

Proving It: How I Tested Failure

You can’t claim a system is built around production failure modes without proving it. The strongest part of this project isn’t the code; it is the test suite.

I wrote integration tests using Vitest and Testcontainers to programmatically spin up real Postgres and RabbitMQ instances. I specifically wrote tests to break the system: simulating duplicate message deliveries, testing stale lock recovery, simulating child worker crashes, forcing RabbitMQ disconnects, and verifying the exponential backoff math.

The DevOps Reality Check

I set a stubborn goal for this project: I wanted the entire stack deployed to the cloud for free.

This introduced some serious constraints. I deployed the Next.js UI to Vercel and the Postgres DB to Neon’s serverless platform. But the backend Workers and RabbitMQ require long-lived TCP connections, so serverless wouldn’t cut it. I had to provision a free-tier AWS EC2 t3.micro instance.

A t3.micro gives you exactly 1GiB of RAM. I had to completely gut my local docker-compose.yml for production. Because Postgres was now hosted remotely on Neon, I ripped the DB container out of the compose file, injected the Neon connection strings via environment variables, and configured a 2GiB Linux swap file on the instance’s SSD-backed volume. Without that swap file, the kernel’s OOM killer would likely have terminated RabbitMQ as soon as memory spiked.

Intentional Tradeoffs

Any system architecture is a series of compromises. It is important to be transparent about what was left on the table:

  1. At-Least-Once Delivery: The system favors at-least-once over exactly-once execution. Next steps for a true production environment would include implementing a robust Outbox Pattern to close the DB-to-Queue publish gap.

  2. Demo Friction: The public observer UI uses simple rate-limiting rather than robust authentication. This was an intentional choice to make the project instantly accessible for anyone wanting to test the system under load.

  3. Observability: Currently, the system relies on dashboard-grade health checks and audit logs rather than full Prometheus/Grafana metric scraping.
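
For the first tradeoff, an Outbox Pattern would record the intent to publish in the same database transaction that reserves the job, so a failed publish can no longer leave a job in PROCESSING with no message behind it. The sketch below is one possible shape, not TaskForge code: the `outbox` table, the `reserveWithOutbox` function, and the `TxClient` interface (a subset of a node-postgres client) are all hypothetical.

```typescript
// Subset of a node-postgres client. A transaction must run on a single
// client checked out from the pool, not on pool.query().
interface TxClient {
  query(text: string, values?: unknown[]): Promise<{ rowCount: number | null }>;
}

// Reserves a job and records a pending publish atomically.
// Resolves to false when the job was not in PENDING (already reserved or cancelled).
async function reserveWithOutbox(client: TxClient, jobId: string): Promise<boolean> {
  await client.query("BEGIN");
  try {
    const updated = await client.query(
      `UPDATE jobs SET status = 'PROCESSING' WHERE id = $1 AND status = 'PENDING'`,
      [jobId]
    );
    if (updated.rowCount !== 1) {
      await client.query("ROLLBACK");
      return false;
    }
    // Hypothetical outbox table: one row per message that still has to be published.
    await client.query(`INSERT INTO outbox (job_id) VALUES ($1)`, [jobId]);
    await client.query("COMMIT");
    return true;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

A separate relay process would then read `outbox` rows, publish them to RabbitMQ, and delete each row only after the broker confirms it. Delivery stays at-least-once, but the DB-to-queue gap is closed without relying on a timeout.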

A Note on AI in the Workflow

My focus throughout this project was on architectural decision-making, distributed systems logic, and edge case handling. The backend logic, locking mechanisms, and state machine were crafted line by line, with every argument and condition placed with deliberate precision.

For the frontend, I used AI coding agents to scaffold React/Tailwind boilerplate, style the observer UI, and handle the visual polish. The goal was simply to get a clean, functional front door in place quickly. Delegating routine frontend styling freed up time to focus on the parts of the system that required more careful thought.

Final Thoughts

Building a greenfield project inspired by the battle scars of your day job is an incredibly rewarding experience. TaskForge isn’t just a portfolio piece; it is a failure-tested blueprint of how I approach system architecture when the training wheels come off.

You can watch the system handle load in real-time on the Live Console, or dive into the architecture in the GitHub repo.