Invariant-Driven Architecture: 20M transactions on a €80/mo Cloud VM.

📎 This is Part 2. Part 1 — Postgres-grade serializable at 20k ops/s on a laptop (don't try this at home) presented 20,000+ durable, invariant-validated transactions per second — on a MacBook Air M3, 8 cores, fan barely audible.

A laptop number is only half the story. The natural next test is whether the same architecture holds on a small, cheap public VM with network-attached storage and strict fsync(2) durability — three constraints that each, on its own, tend to move the bottleneck by an order of magnitude on most stacks. So that's what I ran.

The Setup

VM: Scaleway POP2-2C-8G — 2 vCPUs AMD EPYC 7543 @ 2.8 GHz, 8 GiB RAM. Yes — two vCPUs.
Disk: SBS Block storage, 100 GB, 15,000 provisioned IOPS. Network-attached, not local NVMe.
OS: Ubuntu 22.04, kernel 5.15. Linux strict fsync(2) — every commit hits the SSD for real, no Apple-style controller-cache shortcut.
Software: the exact same codebase that ran on the M3, with NATS + Mongo + Mongo Express as the side stack in Docker.
Price: ~€80/month order-of-magnitude — ~€54 of compute (POP2-2C-8G at €0.0735/h) plus ~€20–25 of SBS volume with provisioned IOPS. Hourly that's €0.11.

Two names, one stack

Two terms surface across this series. They sit at different layers, so it's worth pinning them now.

Invariant-Driven Architecture (IDA) — the design philosophy. The system, end to end (ingress, sequencer, system, storage), is engineered around a single obsession: validating and enforcing business invariants on every commit, with no compromise on throughput. It's a DDD-based philosophy.
Atomic State Platform — the concrete implementation. The software we're benchmarking right here — that just put down 33,091 sustained items per second on a €80/mo Scaleway VM.

IDA is how. Atomic State is what. The €80/mo VM is where. The rest of this post is how much.

1. The Visual Punch — 10 minutes non-stop, 20 million items 📈

The first test is the one that closes the "but does it sustain?" question forever.

BENCH_DURATION=10m BATCH_SIZE=1000 make cloud-bench-broker

Result, over 600 consecutive seconds on the POP2:

Metric	Value
Throughput sustained	33,091 items/s
Total items committed	19,855,000
WAL bytes written	7.6 GB
p99 round-trip	71 ms
Integrity audit	✅ INTEGRITY_OK (1,000 aggregates)
Durability	FSYNC-ON (Linux strict)

Yes, that's higher than the laptop — and on network-attached storage, not local NVMe. SBS is block storage over the data-center fabric; every fsync round-trips to the SBS backend before it returns. Measured fsync latency on this volume: 2.0 ms (vs 130 µs on the laptop's local NVMe — 15× slower per call). The 33,091 items/s holds despite the network disk, not because of a fast one. Pebble's group commit amortises that 2 ms across the whole batch — roughly 2 µs of effective fsync cost per item. That ratio is the architectural lever.
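The amortisation arithmetic is simple enough to check in a few lines. This is a sketch only: the function name is made up for illustration, and it assumes exactly one fsync per batch, as described above.

```python
def fsync_cost_per_item_us(fsync_latency_us: float, batch_size: int) -> float:
    """Effective fsync cost per item when one fsync covers a whole batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return fsync_latency_us / batch_size

# Latencies quoted in the post: ~2.0 ms on SBS, ~130 µs on the laptop's NVMe.
cloud = fsync_cost_per_item_us(2000.0, 1000)    # 2.0 µs per item
laptop = fsync_cost_per_item_us(130.0, 1000)    # 0.13 µs per item
unbatched = fsync_cost_per_item_us(2000.0, 1)   # 2000 µs per item

print(cloud, laptop, unbatched)
```

Without batching, the same volume would charge the full 2 ms to every item; with batch=1000 the gap between the two disks shrinks to under 2 µs per item.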

More on why the cloud beats the laptop on this same scenario in §4. For now, the headline isn't the number. The headline is the steadiness.

I sampled the engine's /metrics and the host's resource counters every 15 s through the entire run. Three things every senior engineer should care about:

pebble_health.l0_files stays in [0, 6] the whole 10 minutes. Never higher. Compaction kicks in the moment L0 hits 6, completes before the next batch needs the slot.
compactions_in_progress oscillates 0 ↔ 1 — Pebble is keeping up in real time, never queueing.
estimated_debt_bytes never crosses 100 MB, roughly 1% of the 7.6 GB of WAL written.
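A check like the following could turn those three observations into an automated assertion over the sampled values. It is a sketch under assumptions: the dictionary keys mirror the metric names quoted above, but the actual /metrics payload format is not shown in this post, and the thresholds are simply the bands reported here.

```python
# Bands reported for the 10-minute run (thresholds taken from the list above).
MAX_L0_FILES = 6
MAX_COMPACTIONS_IN_PROGRESS = 1
MAX_DEBT_BYTES = 100 * 1000 * 1000  # 100 MB

def out_of_band(samples):
    """Return (index, metric) for every sample that leaves the reported bands."""
    violations = []
    for i, s in enumerate(samples):
        if s["l0_files"] > MAX_L0_FILES:
            violations.append((i, "l0_files"))
        if s["compactions_in_progress"] > MAX_COMPACTIONS_IN_PROGRESS:
            violations.append((i, "compactions_in_progress"))
        if s["estimated_debt_bytes"] >= MAX_DEBT_BYTES:
            violations.append((i, "estimated_debt_bytes"))
    return violations

# Hypothetical samples for illustration, not measurements from the run.
healthy = [
    {"l0_files": 0, "compactions_in_progress": 0, "estimated_debt_bytes": 12_000_000},
    {"l0_files": 6, "compactions_in_progress": 1, "estimated_debt_bytes": 80_000_000},
]
stalling = healthy + [
    {"l0_files": 14, "compactions_in_progress": 3, "estimated_debt_bytes": 900_000_000},
]

print(out_of_band(healthy))   # no violations
print(out_of_band(stalling))  # all three metrics flagged at index 2
```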

At this scale — 34 MB/s sustained ingestion, ~20 GB of raw data committed — a mis-tuned LSM-tree triggers a Write Stall: the engine has to pause writes while compaction catches up, and the curve falls off a cliff. We see none of that. The puts counter climbs linearly from 0 to 19.85 M across the whole 10 minutes.
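One way to test for a stall from the counter alone is to look at per-interval deltas and flag any interval whose rate drops well below the median. This is an illustrative sketch: the function name, the 50% threshold and the sample values are assumptions, not part of the bench harness.

```python
from statistics import median

def stalled_intervals(puts_samples, min_fraction=0.5):
    """Given cumulative puts counts sampled at a fixed interval, return the
    indices of intervals whose delta is below min_fraction of the median delta."""
    deltas = [b - a for a, b in zip(puts_samples, puts_samples[1:])]
    if not deltas:
        return []
    floor = min_fraction * median(deltas)
    return [i for i, d in enumerate(deltas) if d < floor]

# Hypothetical counters sampled every 15 s (about 496k puts per interval at ~33k items/s).
linear = [0, 496_000, 993_000, 1_489_000, 1_986_000]
with_stall = [0, 496_000, 993_000, 1_010_000, 1_506_000]

print(stalled_intervals(linear))      # []
print(stalled_intervals(with_stall))  # [2]
```

A linear counter yields an empty list; a write stall shows up as one or more flagged intervals.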

CPU%, L0 files and debt_MB never drift outside the bands shown above during the active 10-minute window.

[preflight] measured: lat_avg=2 051 µs iops=487
[preflight] Mac ref:  lat_avg=131 µs   iops=7 568 (M-series NVMe, same fio command)
[preflight] gate:     lat_avg <= 5 000 µs (override via SCW_FSYNC_MAX_LAT_US)
✅ PASS - instance is fsync-fast enough for a meaningful bench.

That's what "industrial-grade" looks like on a VM that costs €0.11/h.

2. The Ceiling Demo — batch=1000 vs batch=2000

Now the test that tells us where the wall is. Same VM, same 1-minute window, double the batch:

+2.8% throughput for +59% tail latency. Past 1,000 items per batch, more batching is just queueing. We've squeezed the disk dry.

Let's do the math on what the bottleneck is. The cloud-up preflight measured 2.0 ms average fsync latency on this SBS volume (fio --rw=randwrite --bs=4k --direct=1 --sync=1, the same command everywhere). At 33k items/s and one fsync per batch of 1,000, that's about 33 fsyncs per second, or roughly 66 ms of fsync wall-time per second: under 7% of the budget. Pebble's group commit already amortises fsync across the whole batch.

The remaining ~93% of wall-time is CPU:

HTTP/JSON serialization on the producer
1,000 invariant evaluations per chunk
Pebble's batch building + commit

The network-attached disk is no longer the wall. Two 2.8 GHz AMD EPYC vCPUs at sustained 70% CPU are.

The fact that doubling the batch buys 2.8% is the experimental proof: there's no more disk to amortise, only CPU to share.

The full numbers, side by side

Full side-by-side numbers for batch=1000 vs batch=2000.

Read the bold rows together: we halved the number of fsyncs (33.3 → 17.1 per second) and throughput barely moved (+2.8%). If the disk were the wall, cutting fsyncs in half would have bought a lot more. It didn't — because CPU holds a flat ~70% plateau across both runs. That's the ceiling, in two numbers.
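The same reasoning in numbers, as a sketch: it takes the 2.0 ms preflight latency and the fsync rates quoted above, and assumes fsyncs run serially so that their latencies add up to wall-time. The function name is illustrative.

```python
def fsync_wall_share(fsyncs_per_s, fsync_latency_ms):
    """Fraction of each second spent waiting on fsync, assuming serial fsyncs."""
    return fsyncs_per_s * fsync_latency_ms / 1000.0

share_1000 = fsync_wall_share(33.3, 2.0)  # batch=1000
share_2000 = fsync_wall_share(17.1, 2.0)  # batch=2000

print(f"batch=1000: {share_1000:.1%} of wall-time in fsync")
print(f"batch=2000: {share_2000:.1%} of wall-time in fsync")
print(f"freed by doubling the batch: {share_1000 - share_2000:.1%}")
```

Under these assumptions, halving the fsync count frees only about 3% of wall-time, which is the same order as the +2.8% throughput gain and consistent with the rest of the time being CPU-bound.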

Engine + system samples, every 5 s (abridged — start / mid / end). Same CPU band, same L0 transient peak (10), and acks (= fsyncs) running at exactly half the rate on bs=2000 for the same puts — one fsync per batch, batches twice as big, no throughput dividend.

3. The Proof by Absurdity — 64 Workers on 2 Cores

This is the test every junior engineer expects to help throughput, and every senior engineer expects to kill it. The reflex: "if the engine is slow, scale out the producer."

make cloud-bench-perf-dense   # 64 workers, batch=100

Same VM. Same disk. Same system. The only change is the producer pattern.

Read that twice. CPU usage drops from 70% to 30% — and throughput collapses 5.5×.

That asymmetry is the signature of context-switching hell. The Linux scheduler spends its cycles swapping 64 producer goroutines + 64 HTTP request handlers + the system + 3 Docker containers across 2 vCPUs. The real work-per-cycle drops; the cores look idle because they spend their time saving and restoring register state. The engine's single-writer queue becomes the rendezvous point, workers pile up, and p99 explodes to 1.7 seconds.

This is the design principle of the engine stated as a measurement:

The engine is single-writer by design — one writer to Pebble, no lock contention.
The ingress must batch upstream of the engine to amortise fsync.
More producer threads = less throughput on a CPU-constrained host. Mathematically, not ideologically.
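To make the first two principles concrete, here is a toy model of the pattern. It is not the engine's implementation: it is a standard-library Python sketch in which several producers enqueue items and exactly one writer drains them in batches, counting one simulated fsync per batch. All names are hypothetical.

```python
import queue
import threading

def run_single_writer(n_producers, items_per_producer, max_batch):
    """Many producers enqueue items; exactly one writer drains them in batches.
    Returns (items_committed, commits), where each commit stands for one fsync."""
    q = queue.Queue()
    committed = []  # stands in for the durable log
    commits = 0

    def producer(pid):
        for i in range(items_per_producer):
            q.put((pid, i))

    threads = [threading.Thread(target=producer, args=(p,)) for p in range(n_producers)]
    for t in threads:
        t.start()

    total = n_producers * items_per_producer
    while len(committed) < total:
        batch = [q.get()]  # block until at least one item is available
        while len(batch) < max_batch:
            try:
                batch.append(q.get_nowait())
            except queue.Empty:
                break
        # A real engine would validate invariants here, then fsync once per batch.
        committed.extend(batch)
        commits += 1

    for t in threads:
        t.join()
    return len(committed), commits

items, commits = run_single_writer(n_producers=8, items_per_producer=500, max_batch=100)
print(items, commits)
```

Because only one thread ever appends to the log, no lock is needed around the commit itself, and the number of simulated fsyncs falls as batches fill up.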

4. The Mac Parallel — More Cores Buy More Headroom

The same three scenarios, on the MacBook Air M3 from Part 1 (8 cores, 16 GB):

Three readings:

(a) On a single-worker pattern, the cloud Linux wins. Mac fsync(2) is ~15× faster than the cloud SBS per call (~130 µs vs ~2 ms), but at batch=1000 the per-batch fsync is amortised over 1,000 items — and the rest of the pipeline (HTTP serialization, internal sequencing, scheduler) now dominates.

(b) When the workload has CPU work to spare and concurrency is contained, the extra cores cash in. Going from batch=1000 to batch=2000 adds compute per batch but releases parallelism inside the engine (more items concurrently invariant-checked by the system). The Mac has 6 extra cores to spend on it, so its throughput climbs +54% (23,755 → 36,549). The cloud, pinned at 2 vCPUs, gains only +2.8% on the identical change — it has no spare core to convert the extra parallelism into work.

(c) The Mac does not flinch at 64 workers — it has the cores to absorb them. This is the exact scenario where the 2-vCPU cloud VM collapsed to 5,992 (§3). The 8-core Mac runs the identical 64-worker, batch=100, payload=0 B workload at 43,392 items/s — 7.2× the cloud, and above its own single-worker broker run (23,755). Context-switching only becomes hell when threads vastly outnumber cores; with 8 cores the scheduler keeps up and the engine's single writer stays fed. The perf-dense collapse was never about the workload — it was about core count.
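The ratios quoted in these three readings can be recomputed directly from the throughput figures in the text. A minimal sketch; the helper name is illustrative.

```python
def pct_gain(before, after):
    """Percentage change from before to after."""
    return (after - before) / before * 100.0

# Throughput figures (items/s) quoted in the post.
mac_b1000, mac_b2000 = 23_755, 36_549
mac_dense, cloud_dense = 43_392, 5_992
cloud_broker = 33_091

print(round(pct_gain(mac_b1000, mac_b2000), 1))  # 53.9 (Mac, batch=1000 -> 2000)
print(round(mac_dense / cloud_dense, 1))          # 7.2  (Mac vs cloud, 64 workers)
print(round(cloud_broker / cloud_dense, 1))       # 5.5  (cloud, 1 worker vs 64 workers)
```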

The hierarchy of constraints is universal, regardless of OS, disk brand, or vendor SKU:

batch size > producer concurrency > raw fsync speed

Match those three to your hardware and your real ingress pattern, and the throughput follows. Get them wrong and a top-end laptop loses to a €80/month VM — or wins against one — depending on the day.

The Numbers, at a Glance

€80/month is the order of magnitude — ~€54 of compute, ~€20–25 of provisioned-IOPS SBS volume. Hourly: €0.11. The whole bench session that produced the cloud rows of this table cost ~€0.05 of cloud time — at this scale, validating an architecture decision on a representative VM is essentially free.
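For readers who want to redo the cost arithmetic, a short sketch follows. The 730 hours per month figure is an assumption (8,760 hours divided by 12); the prices are the ones quoted above.

```python
HOURS_PER_MONTH = 730  # assumption: average month length (8,760 h / 12)

compute_monthly = 0.0735 * HOURS_PER_MONTH   # POP2-2C-8G hourly rate from the post
total_hourly = 80 / HOURS_PER_MONTH          # ~€80/month, compute plus storage
session_minutes = 0.05 / total_hourly * 60   # a bench session that cost ~€0.05

print(round(compute_monthly, 1))  # 53.7 euros per month of compute
print(round(total_hourly, 2))     # 0.11 euros per hour
print(round(session_minutes))     # 27 minutes of VM time
```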

How the runs were captured

Each row above came from the same harness: a fresh VM with a clean Pebble, system + Mongo projection started as systemd units, NATS + Mongo + Mongo Express brought up in Docker, and the engine's /metrics endpoint sampled every 5–15 seconds during the run. An fio pre-flight (--rw=randwrite --bs=4k --direct=1 --sync=1, 5 s) gates the run on a configurable fsync latency threshold; on this VM it measured 2,051 µs average, well under the 5 ms gate. Every number in the tables above is either a direct read from /metrics, a count from the bench JSON output, or a delta between consecutive samples — nothing synthetic, no extrapolation.
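The gate logic described here is small. The sketch below is an assumption about how such a check might look, not the harness's actual code: only the SCW_FSYNC_MAX_LAT_US variable name and the 5,000 µs default come from the preflight log in §1; the function names are hypothetical.

```python
import os

DEFAULT_MAX_LAT_US = 5000  # default gate shown in the preflight log

def fsync_gate(lat_avg_us, env=os.environ):
    """Return True when the measured average fsync latency is within the gate."""
    max_lat_us = float(env.get("SCW_FSYNC_MAX_LAT_US", DEFAULT_MAX_LAT_US))
    return lat_avg_us <= max_lat_us

def sync_iops(lat_avg_us):
    """Serial sync writes per second implied by an average latency."""
    return 1_000_000 / lat_avg_us

print(fsync_gate(2051, env={}))                                # True: under the default gate
print(fsync_gate(2051, env={"SCW_FSYNC_MAX_LAT_US": "1000"}))  # False: stricter override
print(int(sync_iops(2051)))                                    # 487, matching the log's iops
```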

Conclusion

20 million durable, invariant-validated transactions in 10 minutes, on a public-cloud VM that costs less per month than a SaaS subscription. Every run ends with INTEGRITY_OK.

What Atomic State Platform does on the M3 in Part 1, it does unchanged on a €80/mo Linux VM:

The single-writer engine means the disk stops being the wall as soon as you batch upstream.
Smaller machines with slower fsync reach the same throughput envelope on cheap cloud as a powerful laptop, provided the producer pattern is cooperative.
Bigger machines buy headroom for concurrency, not raw throughput.

You don't need a 64-core server. You don't need an NVMe array. You don't need a datacenter rack. You need the right pattern, applied to the right SKU.

Part 3 unpacks the deeper claim: how Invariant-Driven Architecture lets Atomic State Platform sidestep the classical database stack outright — no Postgres, no Redis, no event-sourcing scaffolding. Just a system with a single fsync per batch, doing what nothing else does on a €80/mo box.
