Authorization with Zanzibar Tuples: How Google Manages Permissions and How To Build the Same Check in Node.js

The Practica · 2026-05-24 · via The Practical Developer

Your authorization code started simple. if (user.role === 'admin') was enough. Then you added team-level access. Then document-level sharing. Then “users in the engineering group can edit the runbook.” Then “contractors can view unless the doc is marked internal.” Then nested groups inherited permissions from parent groups. Your authorize() function is now 400 lines of nested if statements, requires three database queries to answer a single check, and still returns false for a case nobody thought of.

This is the moment role-based access control (RBAC) dies. What replaces it is not attribute-based access control (ABAC) with a generic rules engine (those are slow, hard to audit, and impossible to cache). What replaces it is relation-based access control, the model Google published in the Zanzibar paper, and the model that now powers authorization at Spotify, Airbnb, Carta, and any team that outgrows roles.

This post is the Zanzibar mental model in plain language, the tuple grammar that replaces your roles table, the three composition operators that let relations build on each other, the three caching rules that keep checks fast, and a compact check engine in Node.js that answers “can user:123 edit doc:456?” by walking the tuple graph.

Why RBAC falls over (and ABAC is not the fix)

RBAC assigns a role to a user. The role maps to a list of permissions. It works until a permission depends on several things at once: the user, the resource, and some context that lives in a different table.

Example: a user can view a document if they are the owner, or if the document is shared with their team, or if their team is a descendant of the document’s owner team in an org chart. In SQL, that is a recursive CTE across three tables with an OR branch for direct ownership. It is slow, hard to index, and you run it on every request.

ABAC says “write a policy function.” That function can query anything. The problem is that the policy is code. You cannot cache code. You cannot list “all documents this user can view” without evaluating the policy for every document. You cannot answer “who can view this document?” without evaluating it for every user. Audit logs become “the policy evaluated to true at this timestamp,” which tells you nothing about why.

Zanzibar’s insight is that authorization is a graph problem, not a logic problem. If you can express every permission as a small set of edges (relation tuples), you can traverse that graph with the same algorithms you already use for social networks or dependency trees. And because the graph is made of immutable tuples, you can cache the results aggressively.

The tuple grammar in five minutes

A relation tuple has three fields:

<object>#<relation>@<user>

Where object is namespace:object_id, relation is a string like owner or viewer, and user is either a direct user (user:123) or another object (group:eng#member), meaning “anyone who is a member of group:eng.”

Examples:

doc:runbook#owner@user:alice
doc:runbook#viewer@group:eng#member
group:eng#member@user:bob
group:eng#member@group:contractors#member

These four tuples mean:

Alice is an owner of doc:runbook.
Anyone who is a member of group:eng is a viewer of doc:runbook.
Bob is a member of group:eng.
Anyone who is a member of group:contractors is also a member of group:eng.
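The grammar is small enough to parse with one regular expression. The sketch below is only an illustration: parseTuple and its return shape are hypothetical names, and it assumes namespaces, IDs, and relations never contain ':', '#', or '@'.

```typescript
// Hypothetical helper types for illustration; not taken from the Zanzibar paper.
interface ObjectRef { namespace: string; objectId: string }
interface ParsedTuple {
  object: ObjectRef;
  relation: string;
  // A subject is a direct user (no relation) or a userset (with a relation).
  subject: ObjectRef & { relation?: string };
}

// Assumption: no part of a tuple contains ':', '#', or '@'.
const TUPLE_RE =
  /^([^:#@]+):([^:#@]+)#([^:#@]+)@([^:#@]+):([^:#@]+)(?:#([^:#@]+))?$/;

function parseTuple(text: string): ParsedTuple {
  const m = TUPLE_RE.exec(text);
  if (!m) throw new Error(`Malformed tuple: ${text}`);
  const [, ns, id, rel, subjNs, subjId, subjRel] = m;
  return {
    object: { namespace: ns, objectId: id },
    relation: rel,
    subject: subjRel === undefined
      ? { namespace: subjNs, objectId: subjId }
      : { namespace: subjNs, objectId: subjId, relation: subjRel },
  };
}

console.log(parseTuple('doc:runbook#viewer@group:eng#member').subject.relation); // member
```

A direct user such as user:alice parses with no subject relation, which is the same direct-versus-set distinction the storage schema later records in user_type.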

To check “can user:bob view doc:runbook?” you traverse:

Does doc:runbook#viewer@user:bob exist directly? No.
Does doc:runbook#viewer@group:eng#member exist? Yes. So: is user:bob a member of group:eng?
Does group:eng#member@user:bob exist? Yes. Access granted.

If group:eng#member had pointed at group:contractors#member, you would recurse one level deeper. The depth is bounded by your group nesting depth, which is usually under five.
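The same walk can be sketched in memory before any database is involved. This is a toy version under stated assumptions: checkInMemory and the edges map are hypothetical names, subjects are plain strings, and user:carol is an extra tuple added here so the nested contractors group has a member to find.

```typescript
// The four example tuples, stored as "object#relation" -> set of subjects.
// A subject is either "user:<id>" or a userset written "object#relation".
const edges = new Map<string, Set<string>>([
  ['doc:runbook#owner', new Set(['user:alice'])],
  ['doc:runbook#viewer', new Set(['group:eng#member'])],
  ['group:eng#member', new Set(['user:bob', 'group:contractors#member'])],
]);

// Hypothetical fifth tuple so the nested group is not empty.
edges.set('group:contractors#member', new Set(['user:carol']));

function checkInMemory(objectRelation: string, user: string, depth = 5): boolean {
  if (depth <= 0) return false;                 // bound the recursion
  const subjects = edges.get(objectRelation);
  if (!subjects) return false;
  if (subjects.has(user)) return true;          // direct tuple
  for (const subject of subjects) {
    // Usersets contain '#'; follow them one level deeper.
    if (subject.includes('#') && checkInMemory(subject, user, depth - 1)) {
      return true;
    }
  }
  return false;
}

console.log(checkInMemory('doc:runbook#viewer', 'user:bob'));   // true
console.log(checkInMemory('doc:runbook#viewer', 'user:carol')); // true, via contractors
console.log(checkInMemory('doc:runbook#viewer', 'user:alice')); // false
```

Alice is an owner but not a viewer here, because nothing in the raw tuples says owners can view. That gap is what the namespace configuration fills.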

That is it. No roles table. No permissions table. No policy engine. Just edges in a graph.

Namespace configuration: union, intersection, and exclusion

Real systems need more than direct tuples. You need computed relations. Zanzibar handles this with a namespace configuration that defines how relations combine.

name: doc
relation {
  name: "owner"
}
relation {
  name: "editor"
  union {
    child { _this {} }
    child { computedUserset { relation: "owner" } }
  }
}
relation {
  name: "viewer"
  union {
    child { _this {} }
    child { computedUserset { relation: "editor" } }
    child { tupleToUserset {
      tupleset { relation: "parent" }
      computedUserset { relation: "owner" }
    }}
  }
}

This says:

owner is set directly by tuples.
editor is anyone who is directly an editor, or anyone who is an owner.
viewer is anyone who is directly a viewer, or anyone who is an editor, or anyone who is an owner of the parent folder (via tupleToUserset, which follows a parent tuple to another object).

The three composition operators are:

Union (union): access if any child grants access. (Editor includes owner.)
Intersection (intersection): access only if all children grant access. (Approver requires both editor and signer.)
Exclusion (exclusion): access if the first child grants access and the second does not. (Viewer unless banned.)

In practice, union and tuple-chasing handle 95% of real use cases. Intersection is for high-sensitivity actions (e.g., releasing to production requires both deployer and oncall). Exclusion is rare and usually better handled by removing tuples than by negative logic.
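To make the three operators concrete, here is a minimal evaluator over a single object. The Rewrite type is an assumed TypeScript encoding of the config above rather than Zanzibar’s actual format; evaluate, DirectTuples, and the banned relation are hypothetical, and tupleToUserset is left out to keep the sketch short.

```typescript
// A rewrite rule as a small expression tree (assumed encoding, for illustration).
type Rewrite =
  | { kind: 'this' }                              // direct tuples
  | { kind: 'computed'; relation: string }        // another relation, same object
  | { kind: 'union'; children: Rewrite[] }
  | { kind: 'intersection'; children: Rewrite[] }
  | { kind: 'exclusion'; base: Rewrite; subtract: Rewrite };

// Direct tuples for one object: relation -> users.
type DirectTuples = Record<string, string[]>;

function evaluate(
  rules: Record<string, Rewrite>,
  direct: DirectTuples,
  relation: string,
  user: string,
  depth = 10,
): boolean {
  if (depth <= 0) return false;
  const walk = (node: Rewrite): boolean => {
    switch (node.kind) {
      case 'this':
        return (direct[relation] ?? []).includes(user);
      case 'computed':
        return evaluate(rules, direct, node.relation, user, depth - 1);
      case 'union':
        return node.children.some(walk);
      case 'intersection':
        return node.children.every(walk);
      case 'exclusion':
        return walk(node.base) && !walk(node.subtract);
    }
  };
  // A relation with no rule is direct-only.
  return walk(rules[relation] ?? { kind: 'this' });
}

const docRules: Record<string, Rewrite> = {
  editor: { kind: 'union', children: [{ kind: 'this' }, { kind: 'computed', relation: 'owner' }] },
  viewer: {
    kind: 'exclusion',
    base: { kind: 'union', children: [{ kind: 'this' }, { kind: 'computed', relation: 'editor' }] },
    subtract: { kind: 'computed', relation: 'banned' },
  },
};
const runbook: DirectTuples = { owner: ['user:alice'], viewer: ['user:bob'], banned: ['user:bob'] };

console.log(evaluate(docRules, runbook, 'viewer', 'user:alice')); // true: owner -> editor -> viewer
console.log(evaluate(docRules, runbook, 'viewer', 'user:bob'));   // false: direct viewer, but banned
```

The Bob case shows why exclusion is awkward to cache: his access depends on a tuple that is absent for everyone else, so a positive result is only valid while the banned set stays unchanged.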

The Node.js check engine

Here is a check engine that stores tuples in Postgres (a natural fit because you already have it), answers checks by walking the graph from application code one query at a time, and adds an in-memory cache with a short TTL so repeated checks cost microseconds, not milliseconds.

Schema:

CREATE TABLE relation_tuples (
  namespace TEXT NOT NULL,
  object_id TEXT NOT NULL,
  relation TEXT NOT NULL,
  user_type TEXT NOT NULL CHECK (user_type IN ('direct', 'set')),
  user_id TEXT NOT NULL,   -- 'user:123' for direct, 'group:eng' for set
  user_relation TEXT       -- NULL for direct, e.g. 'member' for set
);

-- A PRIMARY KEY cannot include the nullable user_relation column, so uniqueness
-- is enforced with an expression index that treats NULL as ''.
CREATE UNIQUE INDEX idx_tuple_object ON relation_tuples
  (namespace, object_id, relation, user_id, COALESCE(user_relation, ''));

CREATE INDEX idx_tuple_user ON relation_tuples(user_type, user_id, user_relation);

The unique index is the forward lookup (what users have access to this object?). The secondary index is for reverse lookups (what objects does this user have access to?), which you need for list queries.

Storing tuples:

import { Pool } from 'pg';

interface Tuple {
  namespace: string;
  objectId: string;
  relation: string;
  user: string | { namespace: string; objectId: string; relation: string };
}

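async function writeTuple(pool: Pool, tuple: Tuple): Promise<void> {
  // Copy to a local so TypeScript can narrow the string | object union.
  const user = tuple.user;
  // Usersets are stored as "namespace:object_id" so check() can split them back apart.
  const userId = typeof user === 'string' ? user : `${user.namespace}:${user.objectId}`;
  const userRelation = typeof user === 'string' ? null : user.relation;
  await pool.query(
    `INSERT INTO relation_tuples (namespace, object_id, relation, user_type, user_id, user_relation)
     VALUES ($1, $2, $3, $4, $5, $6)
     ON CONFLICT DO NOTHING`,
    [
      tuple.namespace,
      tuple.objectId,
      tuple.relation,
      typeof user === 'string' ? 'direct' : 'set',
      userId,
      userRelation,
    ]
  );
}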

Checking access:

interface CheckRequest {
  namespace: string;
  objectId: string;
  relation: string;
  user: string;
}

const CHECK_CACHE = new Map<string, { result: boolean; expiry: number }>();
const CACHE_TTL_MS = 5_000;

function cacheKey(req: CheckRequest): string {
  return `${req.namespace}:${req.objectId}#${req.relation}@${req.user}`;
}

async function check(pool: Pool, req: CheckRequest, maxDepth = 10): Promise<boolean> {
  if (maxDepth <= 0) return false;

  const key = cacheKey(req);
  const cached = CHECK_CACHE.get(key);
  if (cached && cached.expiry > Date.now()) return cached.result;

  // 1. Direct tuple match.
  const direct = await pool.query(
    `SELECT 1 FROM relation_tuples
     WHERE namespace = $1 AND object_id = $2 AND relation = $3
       AND user_type = 'direct' AND user_id = $4
     LIMIT 1`,
    [req.namespace, req.objectId, req.relation, req.user]
  );

  if (direct.rowCount && direct.rowCount > 0) {
    CHECK_CACHE.set(key, { result: true, expiry: Date.now() + CACHE_TTL_MS });
    return true;
  }

  // 2. Userset match: the object delegates this relation to members of a group.
  const usersets = await pool.query(
    `SELECT user_id, user_relation FROM relation_tuples
     WHERE namespace = $1 AND object_id = $2 AND relation = $3
       AND user_type = 'set'`,
    [req.namespace, req.objectId, req.relation]
  );

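  for (const row of usersets.rows) {
    // user_id holds "namespace:object_id"; split on the first colon only,
    // so object IDs that contain ':' survive intact.
    const sep = row.user_id.indexOf(':');
    const memberReq: CheckRequest = {
      namespace: row.user_id.slice(0, sep),
      objectId: row.user_id.slice(sep + 1),
      relation: row.user_relation,
      user: req.user,
    };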
    const memberCheck = await check(pool, memberReq, maxDepth - 1);
    if (memberCheck) {
      CHECK_CACHE.set(key, { result: true, expiry: Date.now() + CACHE_TTL_MS });
      return true;
    }
  }

  // 3. Computed userset: the relation includes another relation on the same object.
  // (In a real implementation this is driven by a namespace config table.)
  const computed = await getComputedRelations(req.namespace, req.relation);
  for (const parentRel of computed) {
    const parentReq: CheckRequest = {
      ...req,
      relation: parentRel,
    };
    const parentCheck = await check(pool, parentReq, maxDepth - 1);
    if (parentCheck) {
      CHECK_CACHE.set(key, { result: true, expiry: Date.now() + CACHE_TTL_MS });
      return true;
    }
  }

  CHECK_CACHE.set(key, { result: false, expiry: Date.now() + CACHE_TTL_MS });
  return false;
}

The getComputedRelations function is a placeholder for your namespace config. In a minimal system, it returns the parent relations from a map:

const NAMESPACE_CONFIG: Record<string, Record<string, string[]>> = {
  doc: {
    editor: ['owner'],
    viewer: ['editor'],
  },
};

async function getComputedRelations(ns: string, rel: string): Promise<string[]> {
  return NAMESPACE_CONFIG[ns]?.[rel] ?? [];
}
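One caveat about CHECK_CACHE as written: a plain Map with expiry timestamps never evicts, so it grows with every distinct check. A minimal way to bound it is an LRU built on Map’s insertion order. CheckCache below is a hypothetical replacement rather than a library API, and the size limit is whatever you choose.

```typescript
// Minimal LRU with TTL. Relies on Map iterating keys in insertion order.
class CheckCache {
  private readonly entries = new Map<string, { result: boolean; expiry: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;

  constructor(maxEntries: number, ttlMs: number) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
  }

  get(key: string, now = Date.now()): boolean | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    this.entries.delete(key);
    if (hit.expiry <= now) return undefined;  // expired: drop it
    this.entries.set(key, hit);               // re-insert to mark as most recent
    return hit.result;
  }

  set(key: string, result: boolean, now = Date.now()): void {
    this.entries.delete(key);
    this.entries.set(key, { result, expiry: now + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Swapping it in would mean replacing CHECK_CACHE.get and CHECK_CACHE.set in check() with calls on an instance such as new CheckCache(10_000, CACHE_TTL_MS). The 10,000 limit is an arbitrary starting point, not a recommendation from the paper.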

Critical fix: set a max depth. Group cycles (group:a#member@group:b#member, group:b#member@group:a#member) will recurse forever without a depth limit. The depth counter used here is the simplest guard to read. In production, prefer a visited set per request, which detects cycles exactly instead of cutting traversal off at an arbitrary depth.
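A visited-set version might look like the following in-memory sketch. Graph and checkWithVisited are hypothetical names, and the graph is a plain map rather than Postgres, so that the cycle handling is the only thing on display.

```typescript
// Membership edges: "object#relation" -> subjects (users or usersets).
type Graph = Map<string, string[]>;

function checkWithVisited(
  graph: Graph,
  objectRelation: string,
  user: string,
  visited: Set<string> = new Set(),
): boolean {
  // A userset already explored in this request cannot add new information,
  // so skip it instead of recursing again.
  if (visited.has(objectRelation)) return false;
  visited.add(objectRelation);

  for (const subject of graph.get(objectRelation) ?? []) {
    if (subject === user) return true;
    if (subject.includes('#') && checkWithVisited(graph, subject, user, visited)) {
      return true;
    }
  }
  return false;
}

// The cycle from the text, plus one hypothetical user so there is something to find.
const cyclic: Graph = new Map<string, string[]>([
  ['group:a#member', ['group:b#member']],
  ['group:b#member', ['group:a#member', 'user:dana']],
]);

console.log(checkWithVisited(cyclic, 'group:a#member', 'user:dana')); // true
console.log(checkWithVisited(cyclic, 'group:a#member', 'user:erin')); // false, and it terminates
```

One design note: when the visited set cuts a branch short, the false returned for that intermediate node only means it was already explored, not that the user lacks access. Caching should therefore be limited to the top-level answer.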

Zanzibar’s three caching rules (and why they matter)

The Zanzibar paper reports trillions of stored tuples, millions of authorization checks per second, and 95th-percentile latency under 10 ms. It gets there with three rules that are easy to overlook and hard to retrofit.

1. New enemy problem: clocks lie.

If Alice removes Bob from group:eng at t=10, and a check at t=11 reads a cache entry written at t=9 that says Bob is a member, the cache is stale. Zanzibar solves this with globally ordered timestamps from Spanner’s TrueTime, which it hands to clients as opaque consistency tokens called zookies. Every write gets a timestamp; every check reads at a timestamp at least as fresh as the client’s zookie. Caches are keyed by (cache_key, timestamp), and the cache is invalidated not by events but by time monotonicity.

In a smaller system without a global clock, you accept a bounded inconsistency window (the 5-second TTL above), or you stamp writes with a Postgres xid and include the transaction ID in the cache key. The practical fix most teams use: keep the TTL under 100ms for active objects, and evict aggressively on write.
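The “include a revision in the cache key” idea can be sketched in a single process. Everything here is hypothetical: RevisionedCache is not a library class, and the integer revision stands in for a Postgres transaction ID or a sequence value bumped inside each tuple write.

```typescript
class RevisionedCache {
  private revision = 0;
  private readonly results = new Map<string, boolean>();

  // Read this before starting a check and pass it to get() and set().
  current(): number {
    return this.revision;
  }

  // Call after every committed tuple write or delete.
  bump(): number {
    this.revision += 1;
    this.results.clear(); // entries from older revisions can never be read again
    return this.revision;
  }

  get(revision: number, check: string): boolean | undefined {
    return this.results.get(`${revision}|${check}`);
  }

  // A result computed against an old revision is dropped rather than stored,
  // so a slow check cannot overwrite the cache with pre-revoke data.
  set(revision: number, check: string, result: boolean): void {
    if (revision !== this.revision) return;
    this.results.set(`${revision}|${check}`, result);
  }
}
```

This trades hit rate for safety, since any write invalidates every cached check. In return, a revoke is visible to the very next check on that instance, which is the property the new enemy problem is about.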

2. Leopard caching: cache the subgraph, not the result.

A naive cache stores check(doc:runbook, viewer, user:bob) -> true. Zanzibar caches group:eng#member -> {user:alice, user:bob, ...} (the full set of members). If the next check asks about user:carol, the subgraph is already cached. This is a trade-off: higher memory use, fewer cache misses, and it makes list queries (“all viewers of this document”) fast.

For most teams, a per-check LRU with a short TTL is enough until you hit 100k+ checks per second. At that point, move to Redis with set-caching for the hot usersets.
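The difference between caching a result and caching a userset shows up clearly in code. The sketch below is an in-process stand-in for the Redis set-caching mentioned above; UsersetCache, put, and isMember are hypothetical names, and the member list is assumed to be the already-flattened membership of the userset.

```typescript
class UsersetCache {
  private readonly sets = new Map<string, { members: Set<string>; expiry: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Store the full member list of a userset such as "group:eng#member".
  put(userset: string, members: Iterable<string>, now = Date.now()): void {
    this.sets.set(userset, { members: new Set(members), expiry: now + this.ttlMs });
  }

  // true or false when the userset is cached and fresh; undefined means
  // "not cached: load the members, then call put()".
  isMember(userset: string, user: string, now = Date.now()): boolean | undefined {
    const entry = this.sets.get(userset);
    if (!entry || entry.expiry <= now) return undefined;
    return entry.members.has(user);
  }
}

const engCache = new UsersetCache(5_000);
engCache.put('group:eng#member', ['user:alice', 'user:bob']);
console.log(engCache.isMember('group:eng#member', 'user:bob'));   // true
console.log(engCache.isMember('group:eng#member', 'user:carol')); // false, with no second load
console.log(engCache.isMember('group:ops#member', 'user:bob'));   // undefined: not cached yet
```

The Carol lookup is the point: a per-check cache would have missed, while the userset cache answers a question nobody had asked before.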

3. Check depth matters more than tuple count.

A check that traverses five levels of group nesting is slow even if there are only 100 tuples total. Flatten group hierarchies aggressively. If group:eng has 500 members through three layers of nesting, materialize the transitive closure in a separate table (group_transitive_members) and update it when group tuples change. This turns a depth-5 traversal into a single index lookup.
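Computing that closure is a graph walk per group. The sketch below works on an in-memory map and assumes a simple shape: each group maps to its direct members, which are either "user:<id>" or a nested "group:<id>". The function name transitiveMembers is hypothetical. Each (group, user) pair in the output corresponds to one row you would write to group_transitive_members.

```typescript
function transitiveMembers(direct: Map<string, string[]>): Map<string, Set<string>> {
  const closure = new Map<string, Set<string>>();

  for (const root of direct.keys()) {
    const users = new Set<string>();
    const seen = new Set<string>([root]); // guards against group cycles
    const stack: string[] = [root];

    while (stack.length > 0) {
      const group = stack.pop() as string;
      for (const member of direct.get(group) ?? []) {
        if (!member.startsWith('group:')) {
          users.add(member);              // a user reachable from root
        } else if (!seen.has(member)) {
          seen.add(member);
          stack.push(member);             // a nested group to expand later
        }
      }
    }
    closure.set(root, users);
  }
  return closure;
}

// Hypothetical data, including a cycle between contractors and agency.
const groups = new Map<string, string[]>([
  ['group:eng', ['user:bob', 'group:contractors']],
  ['group:contractors', ['user:carol', 'group:agency']],
  ['group:agency', ['user:dan', 'group:contractors']],
]);

console.log([...(transitiveMembers(groups).get('group:eng') ?? [])].sort());
// [ 'user:bob', 'user:carol', 'user:dan' ]
```

Each group is walked independently instead of sharing memoized results between groups. That costs more work, but it stays correct when groups contain each other.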

Listing objects: the other half of the problem

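Checking can user:123 view doc:runbook? is one operation. Rendering a dashboard that says “here are the 20 documents you can view, paginated” is another. The Zanzibar paper’s Read API only returns stored tuples, so this reverse lookup is usually exposed as a separate operation: SpiceDB calls it LookupResources and OpenFGA calls it ListObjects.

A simple listing implementation uses the reverse index and the transitive member table: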

SELECT DISTINCT namespace, object_id, relation
FROM relation_tuples
WHERE user_type = 'direct'
  AND user_id = 'user:123'
  AND relation = 'viewer'
UNION
SELECT DISTINCT t.namespace, t.object_id, t.relation
FROM relation_tuples t
JOIN group_transitive_members gtm
  ON t.user_type = 'set'
 AND t.user_id = gtm.group_namespace || ':' || gtm.group_id
 AND t.user_relation = gtm.group_relation
WHERE gtm.member_id = 'user:123'
  AND t.relation = 'viewer'
ORDER BY namespace, object_id
LIMIT 20;

This query is why the reverse index (idx_tuple_user) and the transitive member table matter. Without them, listing user-visible objects requires evaluating the policy for every object in the database.

Production checklist

Set max recursion depth (or cycle detection) on every Check. A single malformed tuple can turn your auth service into a stack overflow.
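Use transactions for writes. Two concurrent writes that touch the same object (a grant racing a revoke, or a tuple change racing the matching update to group_transitive_members) can interleave and leave the graph in a state you never tested.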
Cache negative results. A miss (“user cannot view”) is as cacheable as a hit. Without negative caching, repeated unauthorized requests become expensive database traversals.
Log tuple changes to an outbox. Authorization is audit-critical. Every writeTuple and deleteTuple should emit an event to Kafka or a Postgres outbox table so you can answer “when did Bob gain access to the runbook?”
Avoid exclusion in namespace configs. “Access unless banned” is harder to reason about and cache than “remove the tuple when banned.” Move exclusion logic to tuple writes.
Test with a snapshot of production tuples. Authorization bugs are edge cases in graph shape. Export a sanitized snapshot of your production tuple graph and run property-based tests against it.

When not to use Zanzibar

Single-role systems. If your app has admin/user/guest and no nesting, RBAC is simpler and faster.
Attribute-heavy rules. If access depends on time-of-day, IP geofencing, or dynamic quotas, you need an ABAC engine (like OPA or Cedar) alongside the tuple graph, or the tuple graph becomes a dressed-up policy engine.
Ultra-low latency checks. If you need <1ms checks at millions per second, you need a dedicated service (SpiceDB, Keto, Google’s Zanzibar) with a compiled query planner and a distributed cache. The Postgres implementation above is good for <1k req/s per instance.

A note from Yojji

The kind of work this post describes (replacing a brittle roles matrix with an auditable graph, hardening against recursion cycles, and sizing the cache layer so listing queries stay fast) is the foundational backend engineering that most teams skip until authorization becomes a production incident. It is also the kind of work Yojji’s senior engineers bake into the full-stack products they ship.

Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK. Their teams specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, Google Cloud), and microservices architectures, and they run both dedicated senior outstaffed teams and full-cycle product engagements covering discovery, design, development, QA, and DevOps.
