The Saga Pattern in Node.js: How to Roll Back Distributed Transactions Across Microservices
Abigail Rennemeyer · 2026-06-13 · via freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More

Building reliable workflows across multiple microservices is challenging. In a monolith, a database transaction can ensure that multiple operations either succeed or fail together. But once data is spread across different services and databases, that guarantee disappears.

This is where the Saga Pattern comes in. Instead of using distributed transactions, a saga coordinates a sequence of local transactions and runs compensation actions when something goes wrong.

In this article, we'll build an orchestrated Saga Pattern using NestJS, gRPC, PostgreSQL, and Sequelize. You'll learn how to coordinate work across services, implement compensation-based rollbacks, handle idempotency, and track workflow progress in a production-style microservice architecture.

Table of Contents

  • Prerequisites

  • 1. Introduction

  • 2. The Problem in One Picture

  • 3. Why You Need a Saga

  • 4. Choreography vs Orchestration

    • Choreography

    • Orchestration

  • 5. The Example Project

  • 6. Architecture

  • 7. The Saga Flow, Step by Step

  • 8. The State Machine

  • 9. Implementing the Orchestrator

    • Creating the Saga Record

    • The Main Loop

    • A Single Step in Detail

    • Habits Worth Copying

  • 10. Implementing the Participant

  • 11. Rollback (Compensation)

    • On the Orchestrator Side

    • On the Participant Side

    • Rules of a Good Compensation

    • What Happens if the Compensation Itself Fails?

  • 12. Tracking, Idempotency and Observability

    • Orchestrator Side — agency_onboarding_sagas

    • Participant Side — agency_provision_records

    • Observability for Free

  • 13. Testing a Saga

  • 14. When NOT to Use a Saga

  • 15. Trade-offs and Lessons Learned

  • 16. Conclusion

Prerequisites

This article assumes you're already familiar with some backend development concepts. You don't need prior experience with the Saga Pattern, but you should be comfortable with:

  • JavaScript, TypeScript, Node.js

  • NestJS fundamentals (controllers, services, dependency injection)

  • Basic PostgreSQL concepts

  • Database transactions

  • Docker (recommended for local development)

  • Microservice architecture basics

  • gRPC fundamentals (helpful but not required)

If you've already built a few backend services with NestJS and PostgreSQL, you'll have everything you need to follow this guide.

1. Introduction

A saga is a sequence of local transactions across multiple services. Each step commits its own database transaction. If a later step fails, the saga runs compensating transactions to semantically undo the work already committed.

The pattern was first described by Hector Garcia-Molina and Kenneth Salem in 1987 for long-lived database transactions. It was rediscovered a decade ago when companies started splitting monoliths into microservices and realised that the database transaction — the single most powerful tool in a backend developer's belt — stops working at the service boundary.

This article walks through an orchestrated saga in Node.js (NestJS + gRPC) for onboarding an agency, where two services must agree on a single business outcome:

  • agency-service — owns the agency record.

  • auth-service — owns the organization, user and role.

If either side fails, the system must end up as if nothing ever happened. No half-created users, orphan organizations, or 3am Slack threads.

2. The Problem in One Picture

Here's the bug a saga is built to prevent:

Step 1: auth-service     ✅ creates Organization #42
Step 2: auth-service     ✅ creates User #99
Step 3: agency-service   ❌ fails (DB down, validation, network blip…)

Result without a saga:
   Organization #42 and User #99 still exist.
   There is no Agency row.
   The user can log in but has nothing to manage.
   Support gets a ticket. Engineer writes a one-off SQL cleanup.
   Repeat every week.

The saga's job is to detect that step 3 failed and explicitly delete Organization #42 and User #99, so the system is consistent again — even though those rows live in a different service's database.

3. Why You Need a Saga

In a monolith, you wrap everything in one DB transaction and let the database handle atomicity:

await sequelize.transaction(async (tx) => {
  await Organization.create({...}, { transaction: tx });
  await User.create({...}, { transaction: tx });
  await Agency.create({...}, { transaction: tx });
});

In microservices, each service has its own database. You can't wrap two services in one ACID transaction. The classic alternatives all have problems:

| Option | Trade-off |
| --- | --- |
| Two-Phase Commit (2PC) | Locks rows across services, coordinator is a single point of failure, and doesn't scale. Most modern databases don't support it well across HTTP/gRPC. |
| "Just hope it works" | Leaves orphan users / billing rows when half the flow fails. Real data corruption, and the longer the system runs, the more orphans accumulate. |
| Manual cleanup scripts | Works for a week. Bugs hide for months. New engineers don't know they exist. |
| Eventual consistency without compensation | Fine for some domains (analytics) but completely wrong for billing, identity, or anything with money. |
| Saga pattern | Each service commits locally. The orchestrator owns the workflow and runs explicit compensation on failure. It's auditable, restartable, and reasonable. |

The saga gives you eventual consistency with a clear, auditable rollback path — without distributed locks.
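
To make the mechanic concrete, here is a minimal in-memory sketch of the idea. It is not the implementation built later in this article: `SagaStep` and `runSaga` are hypothetical names, and each `run` / `compensate` pair simply stands in for a local transaction and its semantic undo.

```typescript
// Hypothetical sketch: each step pairs a local action with its compensation.
interface SagaStep {
  name: string;
  run: () => Promise<void>;        // stands in for a committed local transaction
  compensate: () => Promise<void>; // semantic undo of that transaction
}

interface SagaResult {
  ok: boolean;
  failedStep?: string;
  compensated: string[]; // names of the steps that were undone, in undo order
}

async function runSaga(steps: SagaStep[]): Promise<SagaResult> {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.run();
      done.push(step);
    } catch {
      // Undo only the steps that actually committed, newest first.
      // A throwing compensation propagates to the caller in this sketch.
      const compensated: string[] = [];
      for (const prev of [...done].reverse()) {
        await prev.compensate();
        compensated.push(prev.name);
      }
      return { ok: false, failedStep: step.name, compensated };
    }
  }
  return { ok: true, compensated: [] };
}
```

Note what is missing compared with a production saga: nothing here is persisted, so a crash between two steps loses all knowledge of what to undo. The saga table introduced later in the article exists to close that gap.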

4. Choreography vs Orchestration

There are two ways to implement a saga:

Choreography

With Choreography, services emit events and other services subscribe and react.

auth-service → emits "UserCreated"
agency-service → listens, creates agency, emits "AgencyCreated"
billing-service → listens, creates subscription…

It's simple at first, but brittle later. The workflow is scattered across N codebases. Nobody owns it. Debugging means tracing events across logs. Adding a step means changing several services.

Orchestration

With Orchestration, one service is the conductor. It calls the others in order.

orchestrator:
   1. authClient.provisionAccount(...)
   2. agencyRepo.create(...)
   3. authClient.sendWelcomeEmail(...)

There's slightly more coupling here (the orchestrator imports clients), but the entire workflow lives in one file. Onboarding new engineers becomes a one-hour task. Adding a step is a single PR.

Pick orchestration unless you have a strong reason not to. This article — and the reference implementation — uses orchestration.
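
The difference is easier to see side by side. The sketch below is an illustration only, assuming a hypothetical in-memory `EventBus` in place of a real message broker. The service and event names follow the choreography example above, and everything else is made up for the comparison.

```typescript
// Hypothetical in-memory event bus; a real system would use a message broker.
type Payload = { userId: number };
type Handler = (payload: Payload) => void;

class EventBus {
  private handlers = new Map<string, Handler[]>();

  on(event: string, handler: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  emit(event: string, payload: Payload): void {
    for (const handler of this.handlers.get(event) ?? []) handler(payload);
  }
}

// Choreography: the workflow exists only as a chain of subscriptions,
// which in a real system would live in different codebases.
function wireChoreography(bus: EventBus, log: string[]): void {
  bus.on("UserCreated", (p) => {
    log.push("agency-service: create agency");
    bus.emit("AgencyCreated", p);
  });
  bus.on("AgencyCreated", () => {
    log.push("billing-service: create subscription");
  });
}

// Orchestration: the same workflow, readable top to bottom in one function.
function orchestrate(log: string[]): void {
  log.push("auth-service: provision account");
  log.push("agency-service: create agency");
  log.push("billing-service: create subscription");
}
```

In the choreographed version, nothing in the code states what the full sequence is. You only discover it by following the events from handler to handler.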

5. The Example Project

Our goal here is to create an Agency in the system. This is the moment a new B2B customer signs up.

It requires two services to agree on a single outcome:

auth-service must create:

  • an Organization row (the tenant)

  • a User row (the agency admin who will log in)

  • a UserRole row linking the user to the AGENCY_ADMIN role

agency-service must create:

  • an Agency row containing business details (size, registration number, website, branches…), linked to the user/organization above

These rows have foreign-key relationships within a service, but not across services — Postgres can't enforce that the user in auth's DB matches the authUserId in agency's DB. The application has to do it.

auth-service DB                    agency-service DB
─────────────────                  ─────────────────
organizations  ◄────────┐
   │                    │
   │ (1:1)              │   foreign reference (no FK)
   ▼                    │           agencies
users  ──────► user_roles                     ─ authUserId
                                              └ authOrganizationId

If step 2 fails after step 1 succeeded, we end up with a user who can authenticate but has no agency, which is the exact bug from Section 2. That's what the saga prevents.
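
Because no foreign key spans the two databases, a cross-service integrity check has to be written by hand. The following is a small sketch of such an audit, assuming you can export ID pairs from both services. `findOrphanedAuthUsers` and the two interfaces are hypothetical helpers. Only the `authUserId` and `authOrganizationId` field names come from the diagram above.

```typescript
// Hypothetical shapes: ID pairs exported from each service's database.
interface AuthUserRef {
  userId: number;
  organizationId: number;
}

interface AgencyRef {
  agencyId: number;
  authUserId: number;
  authOrganizationId: number;
}

// Returns auth users that no agency row points at, i.e. users who can
// log in but have nothing to manage.
function findOrphanedAuthUsers(users: AuthUserRef[], agencies: AgencyRef[]): AuthUserRef[] {
  const linked = new Set(
    agencies.map((a) => `${a.authUserId}:${a.authOrganizationId}`),
  );
  return users.filter((u) => !linked.has(`${u.userId}:${u.organizationId}`));
}
```

A check like this is a safety net for finding existing damage. It does not replace the saga, which is what stops new orphans from being created.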

6. Architecture

                     ┌───────────────────────────────┐
                     │        API Gateway            │
                     └──────────────┬────────────────┘
                                    │ HTTP
                                    ▼
   ┌──────────────────────────────────────────────────┐
   │              agency-service                      │
   │   ┌─────────────────────────────────────────┐    │
   │   │   AgencyOnboardingOrchestrator (SAGA)   │    │
   │   └───────────────┬─────────────────────────┘    │
   │                   │ writes state                 │
   │                   ▼                              │
   │      agency_onboarding_sagas  (Postgres)         │
   └───────────────┬─────────────────┬────────────────┘
                   │ gRPC            │ gRPC
       provisionAgencyAccount   compensateAgencyAccount
                   │                 │
                   ▼                 ▼
   ┌──────────────────────────────────────────────────┐
   │              auth-service                        │
   │   AgencyProvisioningService  (Participant)       │
   │                                                  │
   │   organizations · users · user_roles             │
   │   agency_provision_records  ← idempotency log    │
   └──────────────────────────────────────────────────┘

Three components do all the work:

  1. AgencyOnboardingOrchestrator in agency-service — drives the workflow.

  2. agency_onboarding_sagas table in agency-service — the durable log of the saga's progress.

  3. AgencyProvisioningService in auth-service — exposes a do operation (provisionAgencyAccount) and an undo operation (compensateAgencyAccount). It's backed by its own agency_provision_records idempotency table.

The orchestrator never reaches into the auth database directly. The boundary is enforced by gRPC.

7. The Saga Flow, Step by Step

The sequence diagram below shows the complete lifecycle of the onboarding saga. The workflow begins when a client sends a request to create a new agency. The orchestrator first creates a saga record in its database and marks it as STARTED, so a durable record of the workflow exists before any business action takes place.

The orchestrator then asks auth-service to provision the organization, user, and role. Once that succeeds, it creates the agency record in its own database.

If every step succeeds, the saga reaches the COMPLETED state. If the agency creation fails after the auth resources have already been created, the orchestrator triggers a compensation step that instructs auth-service to remove everything it previously provisioned.

The key idea is that each service commits its own local transaction, while the saga coordinates the overall business workflow and ensures the system can return to a consistent state when failures occur.

sequenceDiagram
    autonumber
    participant C as Client
    participant AS as agency-service<br/>Orchestrator
    participant DB1 as saga store
    participant AU as auth-service
    participant DB2 as auth DB

    C->>AS: POST /agencies
    AS->>DB1: INSERT saga (STARTED, payload)
    AS->>AU: provisionAgencyAccount(sagaId, …)
    AU->>DB2: BEGIN TX
    AU->>DB2: create org + user + role + provision_record
    AU->>DB2: COMMIT
    AU-->>AS: { userId, organizationId, roleId }
    AS->>DB1: UPDATE saga (AUTH_PROVISIONED)
    AS->>AS: create Agency row
    alt Agency row OK
        AS->>DB1: UPDATE saga (AGENCY_CREATED → COMPLETED)
        AS->>AU: sendAgencyWelcomeEmail (non-critical)
        AS-->>C: 200 OK + sagaId
    else Agency row fails
        AS->>DB1: UPDATE saga (COMPENSATING)
        AS->>AU: compensateAgencyAccount(sagaId)
        AU->>DB2: BEGIN TX
        AU->>DB2: delete role + token + user + org + record
        AU->>DB2: COMMIT
        AS->>DB1: UPDATE saga (COMPENSATED → FAILED)
        AS-->>C: 5xx + error code
    end

Read this once top to bottom and you'll understand the entire onboarding workflow. That's the value of orchestration — the sequence diagram is the architecture.

8. The State Machine

Every transition is written to agency_onboarding_sagas before the next step runs. That is what makes the saga observable and recoverable.

export enum AgencyOnboardingSagaStatus {
  STARTED            = 'STARTED',            // Row exists, no side effects yet
  AUTH_PROVISIONED   = 'AUTH_PROVISIONED',   // Auth side committed
  AGENCY_CREATED     = 'AGENCY_CREATED',     // Agency row committed
  COMPLETED          = 'COMPLETED',          // Happy-path terminal state
  COMPENSATING       = 'COMPENSATING',       // Rollback in progress
  COMPENSATED        = 'COMPENSATED',        // Rollback finished
  FAILED             = 'FAILED',             // Terminal failure (with or without compensation)
}

Why so many states? Because "what went wrong here?" is a question someone will ask at 2am. A saga that only stores success | failure is useless for forensics.

                ┌── auth fails ──────────► FAILED  (nothing to compensate)
                │
STARTED ──► AUTH_PROVISIONED ──► AGENCY_CREATED ──► COMPLETED  (happy path)
                                       │
                       agency fails ───┘
                                       ▼
                                COMPENSATING
                                       │
                                       ▼
                                COMPENSATED ──► FAILED  (consistent again)

The “point of no return” is AUTH_PROVISIONED. Before it, we can fail fast — there's nothing to undo. After it, every failure path must go through compensation.
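
One way to keep the code honest about this diagram is to encode the arrows as data and reject anything else. This is a sketch under assumptions: `ALLOWED_TRANSITIONS`, `canTransition`, and `assertTransition` are hypothetical helpers, and the table mirrors only the arrows drawn above.

```typescript
type SagaStatus =
  | "STARTED"
  | "AUTH_PROVISIONED"
  | "AGENCY_CREATED"
  | "COMPLETED"
  | "COMPENSATING"
  | "COMPENSATED"
  | "FAILED";

// One entry per state, listing the states the diagram allows it to move to.
const ALLOWED_TRANSITIONS: Record<SagaStatus, readonly SagaStatus[]> = {
  STARTED: ["AUTH_PROVISIONED", "FAILED"], // auth fails: nothing to compensate
  AUTH_PROVISIONED: ["AGENCY_CREATED", "COMPENSATING"],
  AGENCY_CREATED: ["COMPLETED"],
  COMPLETED: [], // terminal
  COMPENSATING: ["COMPENSATED"],
  COMPENSATED: ["FAILED"],
  FAILED: [], // terminal
};

function canTransition(from: SagaStatus, to: SagaStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

function assertTransition(from: SagaStatus, to: SagaStatus): void {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal saga transition: ${from} -> ${to}`);
  }
}
```

Calling `assertTransition` inside the repository's update method would turn an accidental jump, such as STARTED straight to COMPENSATING, into a loud error instead of a confusing row.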

9. Implementing the Orchestrator

The orchestrator is the only place that knows the workflow. Each step is a private method, and each step persists its result before returning.

Creating the Saga Record

// agency-onboarding.saga.repository.ts
async createSaga(payload: CreateAgencyOrchestrationInput) {
  return this.sagaModel.create({
    sagaId: randomUUID(),                          // correlation id for everything
    status: AgencyOnboardingSagaStatus.STARTED,
    currentStep: 'STARTED',
    payload,                                       // full input snapshot for replay
  });
}

The sagaId is a UUID generated once and propagated to every downstream call. It's the single identifier that ties the saga log on the orchestrator side to the provision record on the participant side.

The Main Loop

// agency-onboarding.orchestrator.ts (trimmed for the article)
async execute(input: CreateAgencyOrchestrationInput) {
  const saga = await this.sagaRepository.createSaga(input); // STARTED

  try {
    // Step 1 — auth-service work
    const authStep = await this.provisionAuth(saga, input);
    if (!authStep.ok) {
      await this.markFailed(saga, authStep.failure); // nothing to compensate
      return authStep.failure;
    }

    // Step 2 — agency-service work
    let activeSaga = authStep.saga; // status: AUTH_PROVISIONED
    try {
      activeSaga = await this.createAgencyRow(activeSaga, input, authStep.authIds);
    } catch (err) {
      // The expensive case: undo what auth-service did
      await this.compensateAuth(activeSaga, 'SAGA_FAILED');
      const failure = mapSagaFailure(err.message, 'SAGA_FAILED', 'CREATE_AGENCY');
      await this.markFailed(activeSaga, failure);
      return failure;
    }

    // Step 3 — mark done and run non-critical side effects
    activeSaga = await this.sagaRepository.updateSaga(activeSaga, {
      status: AgencyOnboardingSagaStatus.COMPLETED,
    });
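    // Best-effort: a failed email must not reach the catch-all below,
    // which would compensate an already COMPLETED saga.
    try {
      await this.sendWelcomeEmail(input, activeSaga);
    } catch (emailError) {
      this.logger.warn(`[${activeSaga.sagaId}] Welcome email failed`);
    }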

    return mapSagaSuccess(activeSaga, await this.agencyModel.findByPk(activeSaga.agencyId!));
  } catch (error) {
    // Defensive catch-all (lost DB connection, unexpected throw)
    await this.compensateAuth(saga, 'SAGA_FAILED');
    const failure = mapSagaFailure(error.message, 'SAGA_FAILED', 'SAGA');
    await this.markFailed(saga, failure);
    return failure;
  }
}

A Single Step in Detail

private async provisionAuth(saga: AgencyOnboardingSaga, input: ...) {
  this.logger.log(`[${saga.sagaId}] PROVISION_AUTH`);

  const auth = await firstValueFrom(
    this.authClient.provisionAgencyAccount({
      sagaId: saga.sagaId,                  // <-- correlation
      organizationName: input.agencyName.trim(),
      email: input.email.trim().toLowerCase(),
      // …
    }),
  );

  if (!auth.status || !auth.data) {
    return { ok: false, failure: mapAuthProvisionFailure(auth) };
  }

  // Persist the IDs we will need if we have to compensate later
  const updated = await this.sagaRepository.updateSaga(saga, {
    authOrganizationId: Number(auth.data.organizationId),
    authUserId: Number(auth.data.userId),
    authUserRoleId: Number(auth.data.userRoleId),
    status: AgencyOnboardingSagaStatus.AUTH_PROVISIONED,
  });

  return { ok: true, saga: updated, authIds: auth.data };
}

The line that does most of the work is the updateSaga call. It stores the foreign IDs returned by auth-service on the saga row, so even if the orchestrator process crashes and restarts, a recovery job can read that row and still know what to compensate.
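
The article doesn't show the recovery job itself, so here is a rough sketch of the decision it would have to make for each stale row. Everything in it is an assumption: `planRecovery`, `buildCompensationRequest`, the `StoredSaga` shape, and the policy are hypothetical, and your own rules may differ.

```typescript
type Status =
  | "STARTED" | "AUTH_PROVISIONED" | "AGENCY_CREATED" | "COMPLETED"
  | "COMPENSATING" | "COMPENSATED" | "FAILED";

// Hypothetical subset of the persisted saga row.
interface StoredSaga {
  sagaId: string;
  status: Status;
  authUserId: number | null;
  authOrganizationId: number | null;
  updatedAt: Date;
}

type RecoveryAction = "NONE" | "RESUME_FORWARD" | "COMPENSATE" | "MARK_FAILED";

// One possible policy for a saga found by a periodic recovery job.
function planRecovery(saga: StoredSaga, now: Date, staleAfterMs: number): RecoveryAction {
  if (saga.status === "COMPLETED" || saga.status === "FAILED") return "NONE";
  if (now.getTime() - saga.updatedAt.getTime() < staleAfterMs) return "NONE"; // may still be running
  switch (saga.status) {
    case "AGENCY_CREATED":
      return "RESUME_FORWARD"; // both sides committed, only the final update is missing
    case "COMPENSATED":
      return "MARK_FAILED"; // rollback finished, only the terminal state is missing
    default:
      return "COMPENSATE"; // safe to repeat because the participant is idempotent
  }
}

interface CompensationRequest {
  sagaId: string;
  userId: number | undefined;
  organizationId: number | undefined;
}

// The persisted IDs are what let a fresh process rebuild the undo request.
function buildCompensationRequest(saga: StoredSaga): CompensationRequest {
  return {
    sagaId: saga.sagaId,
    userId: saga.authUserId ?? undefined,
    organizationId: saga.authOrganizationId ?? undefined,
  };
}
```

Note that a stale STARTED row is treated as COMPENSATE here: the orchestrator may have crashed after auth-service committed but before the IDs were saved, and in that case the sagaId alone is enough for the participant to find its own record.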

Habits Worth Copying

  • Persist after every successful step, including the IDs you'll need to undo it.

  • Distinguish critical vs non-critical steps. Welcome emails, audit logs and analytics events are not worth rolling a saga back for. They're best-effort.

  • One log line per transition, prefixed with [${sagaId}]. Grep is your debugger.
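
The second habit is easy to get wrong: a non-critical step that throws inside the saga's try block can accidentally trigger a rollback. A small wrapper makes the intent explicit. `bestEffort` below is a hypothetical helper, not part of the reference implementation, and the `warn` callback stands in for whatever logger you use.

```typescript
// Runs a non-critical task and swallows its failure after reporting it.
// Returns true when the task succeeded and false when it was ignored.
async function bestEffort(
  label: string,
  task: () => Promise<void>,
  warn: (message: string) => void,
): Promise<boolean> {
  try {
    await task();
    return true;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    warn(`${label} failed (ignored): ${reason}`);
    return false;
  }
}
```

A call such as `await bestEffort("welcome-email", () => sendEmail(), (m) => logger.warn(m))` can then sit inside the saga without any risk of its failure reaching the compensation path (`sendEmail` and `logger` are placeholders here).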

10. Implementing the Participant

The participant (auth-service) wraps all of its own work in a local DB transaction. Inside that boundary it's still ACID — the saga only handles the cross-service problem.

// agency-provisioning.service.ts (trimmed)
async provisionAgencyAccount(req: ProvisionAgencyAccountInput) {

  // 1. Idempotency — return the previous result if this sagaId already provisioned.
  const existing = await this.provisionRecordModel.findOne({
    where: { sagaId: req.sagaId },
  });
  if (existing) {
    return serviceSuccess('Agency admin already onboarded', {
      userId: Number(existing.userId),
      organizationId: Number(existing.organizationId),
      userRoleId: Number(existing.roleId),
    });
  }

  // 2. Domain validation BEFORE the transaction (fail fast).
  if (await this.emailExists(req.email)) {
    return serviceFailure('Email already exists', { code: 'EMAIL_EXISTS' });
  }
  if (await this.organizationExists(req.organizationName)) {
    return serviceFailure('Organization already exists', { code: 'ORGANIZATION_EXISTS' });
  }

  // 3. The actual work — atomic at the auth-service boundary.
  return withSequelizeTransaction(this.sequelize, async (tx) => {
    const org = await this.organizationModel.create({ ... }, { transaction: tx });
    const user = await this.userModel.create({ ..., organizationId: org.id }, { transaction: tx });
    await this.userRoleModel.create({ userId: user.id, roleId: agencyAdminRole.id }, { transaction: tx });

    // The audit record that makes compensation possible later.
    await this.provisionRecordModel.create(
      { sagaId: req.sagaId, organizationId: org.id, userId: user.id, roleId: agencyAdminRole.id },
      { transaction: tx },
    );

    return serviceSuccess('Provisioned', {
      userId: user.id, organizationId: org.id, userRoleId: agencyAdminRole.id,
    });
  });
}

Three things make this method "saga-safe":

  1. Idempotency check first: If the orchestrator retries (network blip, gRPC timeout), the second call is a no-op that returns the same IDs. No duplicate users.

  2. Validation outside the transaction: Cheap reads first, expensive writes second.

  3. One transaction wraps every write: If any insert fails, the whole thing rolls back automatically. The orchestrator sees a clean failure response and knows nothing was persisted.

The agency_provision_records table is the single most important piece of the participant. It's both the idempotency key and the compensation lookup — keyed by the same sagaId the orchestrator uses.
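
Stripped of the database, the dual role of that table looks roughly like this. The sketch uses a `Map` in place of agency_provision_records. `InMemoryProvisioner` and its methods are hypothetical, and a real participant would rely on a unique constraint on sagaId to stay safe under concurrent retries.

```typescript
interface ProvisionResult {
  userId: number;
  organizationId: number;
}

class InMemoryProvisioner {
  // Plays the role of agency_provision_records: sagaId -> what was created.
  private records = new Map<string, ProvisionResult>();
  private nextId = 1;

  provision(sagaId: string): ProvisionResult {
    const existing = this.records.get(sagaId);
    if (existing) return existing; // retry: same IDs, no new rows

    const result: ProvisionResult = {
      organizationId: this.nextId++,
      userId: this.nextId++,
    };
    this.records.set(sagaId, result);
    return result;
  }

  // Returns true if something was undone, false if there was nothing left.
  compensate(sagaId: string): boolean {
    return this.records.delete(sagaId);
  }

  count(): number {
    return this.records.size;
  }
}
```

The same lookup serves both directions: `provision` uses it to detect a retry, and `compensate` uses it to find what to delete. Calling either one twice with the same sagaId is harmless.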

11. Rollback (Compensation)

Compensation is just another gRPC call. The orchestrator sends the sagaId and the IDs it remembers. The participant deletes everything it created, in reverse dependency order, inside its own DB transaction.

On the Orchestrator Side

private async compensateAuth(saga: AgencyOnboardingSaga, errorCode?: string) {
  if (!saga.authUserId && !saga.authOrganizationId) {
    // Nothing was provisioned — nothing to compensate.
    return;
  }

  // Mark the saga as compensating BEFORE the call, so the row is consistent
  // even if the compensating RPC times out.
  await this.sagaRepository.updateSaga(saga, {
    status: AgencyOnboardingSagaStatus.COMPENSATING,
    currentStep: 'COMPENSATING',
    errorCode,
  });

  try {
    const rollback = await firstValueFrom(this.authClient.compensateAgencyAccount({
      sagaId: saga.sagaId,
      organizationId: saga.authOrganizationId,
      userId: saga.authUserId,
    }));
    if (!rollback.status) {
      this.logger.error(`[${saga.sagaId}] Auth compensation returned failure: ${rollback.message}`);
    }
  } catch (err) {
    this.logger.error(`[${saga.sagaId}] Auth compensation RPC failed: ${err.message}`);
  }

  await this.sagaRepository.updateSaga(saga, {
    status: AgencyOnboardingSagaStatus.COMPENSATED,
    currentStep: 'COMPENSATED',
  });
}

On the Participant Side

private async rollbackProvisionedAuth(req, sagaId: string, tx: Transaction) {
  // Use the saga log as the source of truth — even if the caller forgot IDs.
  const record = await this.provisionRecordModel.findOne({
    where: { sagaId }, transaction: tx,
  });
  const userId         = req.userId         ?? record?.userId;
  const organizationId = req.organizationId ?? record?.organizationId;

  if (userId) {
    const user = await this.userModel.findByPk(userId, { transaction: tx, attributes: ['email'] });
    await this.userRoleModel.destroy({ where: { userId }, transaction: tx });
    if (user?.email) {
      await this.passwordResetTokenModel.destroy({ where: { email: user.email }, transaction: tx });
    }
    await this.userModel.destroy({ where: { id: userId }, transaction: tx });
  }
  if (organizationId) {
    await this.organizationModel.destroy({ where: { id: organizationId }, transaction: tx });
  }
  if (record) {
    await record.destroy({ transaction: tx });
  }
}

Rules of a Good Compensation

  1. Reverse the order of creation: Children first (user_roles, tokens), then parents (users, organizations). The same rule you follow for DROP TABLE statements.

  2. Be idempotent: Receiving the same sagaId twice must be safe — every destroy is a no-op if the row is already gone.

  3. Use the saga log, not just the request: If the caller forgets an ID or sends a partial payload, look it up by sagaId. Defence in depth.

  4. Wrap it in a local transaction: The rollback must itself be atomic — half-undone is worse than not-undone.

  5. Always close the loop on the orchestrator side: Mark COMPENSATED even if the RPC failed. The failure should also be surfaced (log, metric, alert). A stuck COMPENSATING row is an operational landmine.

What Happens if the Compensation Itself Fails?

This is the worst case in any saga design. There are three reasonable strategies:

First, you can retry with exponential backoff. This works for transient failures (network, deadlocks).

Second, you can dead-letter the saga — write it to a "needs human attention" queue and alert.

Third, you can expose a manual rollback endpoint. This reference implementation does that via RollbackAgencyOnboarding gRPC, so an operator can replay compensation with the same sagaId.

A production system should combine all three. The pattern doesn't decide for you. You decide based on your business risk.
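As a rough illustration of how the first two strategies can be combined, here is a minimal sketch. The helper name compensateWithRetry, the deadLetter callback, the RetryOptions shape and the DEAD_LETTERED result are all hypothetical and not part of the reference implementation. The sleep function is injectable so tests don't have to wait on real timers.

```typescript
// Hypothetical helper: retry a compensation with exponential backoff,
// then hand the saga to a dead-letter callback if every attempt fails.
type Sleep = (ms: number) => Promise<void>;

interface RetryOptions {
  maxAttempts: number; // total attempts, including the first one
  baseDelayMs: number; // delay before the second attempt
  sleep?: Sleep;       // injectable so tests can skip real timers
}

const realSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function compensateWithRetry(
  sagaId: string,
  compensate: (sagaId: string) => Promise<void>,
  deadLetter: (sagaId: string, lastError: unknown) => Promise<void>,
  { maxAttempts, baseDelayMs, sleep = realSleep }: RetryOptions,
): Promise<'COMPENSATED' | 'DEAD_LETTERED'> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await compensate(sagaId);
      return 'COMPENSATED';
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Wait 1x, 2x, 4x, ... the base delay between attempts.
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  // Every attempt failed: park the saga for a human instead of dropping it.
  await deadLetter(sagaId, lastError);
  return 'DEAD_LETTERED';
}
```

Because the participant's compensation is idempotent, retrying it with the same sagaId is safe. The manual rollback endpoint then covers whatever lands in the dead-letter queue.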

12. Tracking, Idempotency and Observability

Two tables, both keyed by the same UUID sagaId, give you full traceability across services.

Orchestrator Side — agency_onboarding_sagas

column                                          | purpose
------------------------------------------------|--------------------------------------------------------------------
sagaId (UUID, unique)                           | Propagated to every RPC. The join key across services.
status                                          | Current state in the state machine.
currentStep                                     | Human-readable label for dashboards (PROVISION_AUTH, CREATE_AGENCY…).
payload (JSONB)                                 | Snapshot of the input, used for replay, debug, support.
authOrganizationId, authUserId, authUserRoleId  | Foreign IDs needed for compensation.
agencyId                                        | Set once the agency row exists.
errorCode, errorMessage                         | Filled on failure.
createdAt, updatedAt                            | Timeline for the saga.

A real row in COMPLETED state looks roughly like this:

{
  "sagaId": "0a4f3e2c-7b11-4f8d-9a2c-90b6f5f5b8a1",
  "status": "COMPLETED",
  "currentStep": "COMPLETED",
  "agencyId": 17,
  "authOrganizationId": 42,
  "authUserId": 99,
  "authUserRoleId": 3,
  "errorCode": null,
  "errorMessage": null,
  "payload": { "agencyName": "Acme Education", "email": "admin@acme.com", "...": "..." },
  "createdAt": "2026-05-22T10:14:32.118Z",
  "updatedAt": "2026-05-22T10:14:33.412Z"
}
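Since status is a state in a state machine, it can help to make the legal moves explicit. The sketch below is an assumption about how that could look, not the reference implementation: IN_PROGRESS is a placeholder for whatever the real enum calls a running saga, and allowedTransitions and canTransition are hypothetical names.

```typescript
// Assumed status set. IN_PROGRESS is a placeholder name for the running state;
// the other four values appear in this article.
type SagaStatus = 'IN_PROGRESS' | 'COMPLETED' | 'FAILED' | 'COMPENSATING' | 'COMPENSATED';

// Which states each state may move to. Terminal states have no exits.
const allowedTransitions: Record<SagaStatus, SagaStatus[]> = {
  IN_PROGRESS: ['COMPLETED', 'FAILED', 'COMPENSATING'],
  COMPENSATING: ['COMPENSATED'],
  COMPLETED: [],
  FAILED: [],
  COMPENSATED: [],
};

function canTransition(from: SagaStatus, to: SagaStatus): boolean {
  return allowedTransitions[from].indexOf(to) !== -1;
}
```

A guard like this can run before every status update, so a bug that tries to reopen a COMPLETED saga fails loudly instead of silently corrupting the row.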

Participant Side — agency_provision_records

column                          | purpose
--------------------------------|--------------------------------------------------------
sagaId (unique)                 | Idempotency key. The same sagaId from the orchestrator.
userId, organizationId, roleId  | What to delete on compensation.
createdAt, updatedAt            | Audit timestamps.
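To make the idempotency-key idea concrete, here is a minimal in-memory sketch. It stands in for the real table with a Map, and the class name InMemoryProvisioner and its methods are hypothetical. The point is only the shape of the logic: look up by sagaId first, create second.

```typescript
interface ProvisionRecord {
  sagaId: string;
  userId: number;
  organizationId: number;
  roleId: number;
}

// Hypothetical stand-in for the participant; a Map plays the role of the table.
class InMemoryProvisioner {
  private records = new Map<string, ProvisionRecord>();
  private nextId = 1;

  // A replay with the same sagaId returns the first result and creates nothing new.
  provision(sagaId: string): ProvisionRecord {
    const existing = this.records.get(sagaId);
    if (existing) return existing;
    const record: ProvisionRecord = {
      sagaId,
      userId: this.nextId++,
      organizationId: this.nextId++,
      roleId: this.nextId++,
    };
    this.records.set(sagaId, record);
    return record;
  }

  // Deleting an already-deleted record is a harmless no-op that returns false.
  compensate(sagaId: string): boolean {
    return this.records.delete(sagaId);
  }

  count(): number {
    return this.records.size;
  }
}
```

In a real database the unique constraint on sagaId does the same job under concurrency: two racing inserts cannot both succeed.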

Observability for Free

Because every log line is prefixed with [${sagaId}], a single grep across both services gives the full timeline:

[0a4f3e2c…] PROVISION_AUTH                  agency-service
[0a4f3e2c…] provisionAgencyAccount: ok      auth-service
[0a4f3e2c…] CREATE_AGENCY                   agency-service
[0a4f3e2c…] Agency step failed: ...         agency-service
[0a4f3e2c…] Auth compensation completed     auth-service

In a structured-logging setup (Loki, Elasticsearch, Datadog) this becomes a one-click filter. The sagaId is your distributed trace.
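The filtering itself is trivial once every entry carries the sagaId. The following sketch assumes a simple entry shape (SagaLogEntry) and two hypothetical helpers, formatLine and timelineFor. It is not tied to any particular logging library.

```typescript
// Assumed entry shape: every log record carries the sagaId as a field.
interface SagaLogEntry {
  sagaId: string;
  service: string;
  message: string;
}

// Render one entry with the sagaId prefix.
function formatLine(entry: SagaLogEntry): string {
  return `[${entry.sagaId}] ${entry.message} (${entry.service})`;
}

// Keep only one saga's entries, in their original order, across all services.
function timelineFor(sagaId: string, entries: SagaLogEntry[]): string[] {
  return entries.filter((e) => e.sagaId === sagaId).map(formatLine);
}
```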

13. Testing a Saga

A saga is just a state machine, so the test matrix is finite and small. Cover at least these cases:

# | Scenario                             | Expected end state
--|--------------------------------------|--------------------------------------------------------------
1 | Happy path                           | COMPLETED, agency exists, user exists
2 | Auth step fails (e.g. email exists)  | FAILED, no rows on either side
3 | Agency step fails                    | COMPENSATED, auth rows gone, no agency
4 | Compensation RPC times out           | COMPENSATING → operator-driven recovery
5 | Caller retries with the same sagaId  | Second call returns the first call's result; no duplicate rows
6 | Welcome email fails                  | COMPLETED still: the non-critical step did not cascade

Two practical tips for testing:

First, mock the gRPC client at the orchestrator level, not the network. You want to assert that compensateAgencyAccount was called with the right sagaId, not that bytes hit a socket.

Second, spin up a real Postgres in integration tests (Testcontainers, or a Docker Compose postgres service). The saga state machine is too easy to "test" against a mock and too easy to break against a real DB.
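Here is a minimal sketch of the first tip, covering scenario 3 from the matrix. The AuthClient interface, the runOnboarding function and the FakeAuthClient class are simplified stand-ins written for this example. Only the method names provisionAgencyAccount and compensateAgencyAccount come from the article, and their signatures here are assumptions.

```typescript
// Assumed, simplified view of the auth-service client.
interface AuthClient {
  provisionAgencyAccount(sagaId: string): Promise<{ userId: number }>;
  compensateAgencyAccount(sagaId: string): Promise<void>;
}

// Stripped-down orchestrator: provision auth, create the agency, undo on failure.
async function runOnboarding(
  sagaId: string,
  auth: AuthClient,
  createAgency: () => Promise<void>,
): Promise<'COMPLETED' | 'COMPENSATED'> {
  await auth.provisionAgencyAccount(sagaId);
  try {
    await createAgency();
    return 'COMPLETED';
  } catch {
    await auth.compensateAgencyAccount(sagaId);
    return 'COMPENSATED';
  }
}

// Fake client that records calls instead of touching the network.
class FakeAuthClient implements AuthClient {
  compensatedWith: string[] = [];

  async provisionAgencyAccount(): Promise<{ userId: number }> {
    return { userId: 99 };
  }

  async compensateAgencyAccount(sagaId: string): Promise<void> {
    this.compensatedWith.push(sagaId);
  }
}
```

A test then passes a createAgency that rejects and asserts that compensatedWith contains exactly the sagaId that was used. In a Jest setup, a jest.fn() mock would typically replace the hand-written fake.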

14. When NOT to Use a Saga

Sagas are not free. Skip them when:

  • One service does all the writes. Use a regular DB transaction. Don't reinvent the wheel.

  • The workflow is read-only or analytical. No rollback semantics exist for a SELECT.

  • The "rollback" is impossible. You sent a real email. You charged a credit card and the gateway doesn't support refunds. In those cases, design forward: send an apology email, queue a manual refund. Sagas can't undo side effects that have already left your system.

  • You don't actually have multiple services yet. A saga in a monolith is over-engineering. Wait until the service boundary is real.

A saga adds a state table, a compensation method per step, and an operational habit of grepping by sagaId. That cost is worth paying when the alternative is orphaned data — and not before.

15. Trade-offs and Lessons Learned

Things that worked well in this design:

  • Synchronous orchestration is easier to debug than choreography. A new engineer reads one file and understands the whole flow.

  • Idempotency at the participant is non-negotiable. Retries from the orchestrator must be safe. Build it in from day one — retro-fitting is painful.

  • The saga table replaces tribal knowledge. Ops can answer "what happened to this signup?" with a single SQL query. The payload JSONB is gold during incidents.

  • sagaId as the trace key plays nicely with OpenTelemetry / Datadog / Loki — no extra infra to set up.

Things to know before copying this pattern:

  • A failing compensation is the worst case. If compensateAgencyAccount itself errors, you have inconsistent state. Plan for retries + dead-letter + a manual rollback endpoint from the start.

  • Non-critical steps must be marked explicitly. Here, the welcome email is allowed to fail without rolling back the agency. Don't accidentally compensate over a flaky SMTP provider.

  • Sagas aren't a replacement for local transactions. Inside each service, still use a real DB transaction. The saga only handles the cross-service seam.

  • Synchronous gRPC is simple but couples availability. If auth-service is down, agency creation fails. Swap the gRPC calls for a durable message bus (RabbitMQ / Kafka) and treat each step as a command + reply when you need higher resilience.

  • The orchestrator becomes a critical service. Treat its uptime accordingly — monitor saga durations, alert on stuck COMPENSATING rows, and run more than one replica.
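One way to mark non-critical steps explicitly is a flag on each step. The sketch below is an assumption for illustration rather than the article's code: Step, runSteps and the critical flag are hypothetical names, and the steps are synchronous to keep the example short.

```typescript
interface Step {
  name: string;
  critical: boolean;       // false means a failure is logged, not compensated
  execute: () => void;
  compensate?: () => void; // undo action, only needed for steps that write
}

function runSteps(steps: Step[]): { status: 'COMPLETED' | 'COMPENSATED'; warnings: string[] } {
  const done: Step[] = [];
  const warnings: string[] = [];
  for (const step of steps) {
    try {
      step.execute();
      done.push(step);
    } catch (err) {
      if (!step.critical) {
        // Non-critical failure: record it and keep going.
        warnings.push(`${step.name} failed: ${String(err)}`);
        continue;
      }
      // Critical failure: undo completed steps in reverse order.
      for (const prev of done.reverse()) prev.compensate?.();
      return { status: 'COMPENSATED', warnings };
    }
  }
  return { status: 'COMPLETED', warnings };
}
```

With the welcome email declared as critical: false, a flaky SMTP provider produces a warning and the saga still ends in COMPLETED.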

16. Conclusion

The saga pattern isn't magic. It's a disciplined version of what experienced engineers already do by hand: commit locally, record what you did, and know how to undo it.

In Node.js with NestJS, you only need three ingredients:

  1. A state table to track the saga.

  2. An orchestrator that drives the workflow and writes that state.

  3. A participant that exposes a do and an undo operation, both idempotent and keyed by sagaId.

Get those three right and your microservices can offer the same "all-or-nothing" feel as a monolithic transaction — without the operational pain of distributed locks.

Start simple, use orchestration, make every step idempotent, persist before you call, and always know how to undo. That's the whole pattern.


