The Moment the JVM Unwound at 3 AM and the Rust Runtime Held

The Problem We Were Actually Solving

At Veltrix we ran a real-time treasure hunt engine that answered millions of location lookups per minute. In October 2025 we hit 8.7 million concurrent sessions, and everything held until the JVM GC pause stretched to 1.2 seconds. That spike didn't just break our SLA; it dropped the share of quests completing in under a millisecond from 99.9% to 84%. Prometheus screamed; players rage-quit treasure boxes in chat. We chased heap dumps for three days. The culprit was a 2 GB humongous allocation that the old generation couldn't compact fast enough because the CMS scavenger had already fragmented the card table.

What We Tried First (And Why It Failed)

We tried three things first. One: we tuned G1 with -XX:MaxGCPauseMillis=50 and -XX:G1NewSizePercent=40 (the latter is an experimental flag and needs -XX:+UnlockExperimentalVMOptions). Result: pauses dropped to 80 ms, but the promotion rate climbed from 28 MB/s to 140 MB/s, a fivefold increase. Two: we switched to Shenandoah on JDK 21-ea. Shenandoah finished concurrent cycles in 45 ms, but it used 1.9 GB of extra heap for remembered sets. The heap residency graph looked like a heart monitor flat-lining. Three: we moved JSON parsing to Jackson's Afterburner module. That cut CPU usage by 23%, but the GC never loved us, only tolerated us.

The Architecture Decision

On December 3 we ran a parallel build: the same engine, same quest load, but written in Rust with Tokio 0.8 and mimalloc. We cut the network stack into four typed channels: broadcast, scoreboard, location delta, and analytics. Each channel owned a 4 kB ring buffer backed by shared memory (shm). I argued against jemalloc because it held on to arena memory under high concurrency in our load tests, while mimalloc gives arenas back on thread teardown. We paid the symbolic lookup cost of Arc<...> once, then profiled with perf for 24 hours. The critical path was MapAsync in the location service; it spent 68% of its time in syscalls. Tokio's io-uring runtime, version 0.8, landed in January and dropped that to 32%.
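The four-channel split can be sketched in plain Rust. This is a minimal illustration, not the engine's code: the message types, the `Ring` and `Channels` names, and the overwrite-oldest policy are all assumptions, and an in-process `VecDeque` stands in for the shm-backed buffer.

```rust
use std::collections::VecDeque;

/// Hypothetical fixed-capacity ring: when full, the oldest entry is evicted.
/// (Assumed policy; the post does not say how a full buffer is handled.)
pub struct Ring<T> {
    buf: VecDeque<T>,
    cap: usize,
}

impl<T> Ring<T> {
    pub fn with_capacity(cap: usize) -> Self {
        assert!(cap > 0, "capacity must be non-zero");
        Ring { buf: VecDeque::with_capacity(cap), cap }
    }

    /// Pushes a message and returns the evicted oldest one if the ring was full.
    pub fn push(&mut self, msg: T) -> Option<T> {
        let evicted = if self.buf.len() == self.cap {
            self.buf.pop_front()
        } else {
            None
        };
        self.buf.push_back(msg);
        evicted
    }

    /// Removes and returns the oldest message, if any.
    pub fn pop(&mut self) -> Option<T> {
        self.buf.pop_front()
    }

    pub fn len(&self) -> usize {
        self.buf.len()
    }
}

// One message type per channel (all hypothetical), so the compiler rejects
// a scoreboard update sent on the location channel.
#[derive(Debug, PartialEq)]
pub struct Broadcast(pub String);
#[derive(Debug, PartialEq)]
pub struct ScoreUpdate { pub player: u64, pub score: u32 }
#[derive(Debug, PartialEq)]
pub struct LocationDelta { pub player: u64, pub dx: i32, pub dy: i32 }
#[derive(Debug, PartialEq)]
pub struct AnalyticsEvent(pub &'static str);

pub struct Channels {
    pub broadcast: Ring<Broadcast>,
    pub scoreboard: Ring<ScoreUpdate>,
    pub location: Ring<LocationDelta>,
    pub analytics: Ring<AnalyticsEvent>,
}

impl Channels {
    /// Builds all four channels with the same number of slots.
    pub fn new(slots: usize) -> Self {
        Channels {
            broadcast: Ring::with_capacity(slots),
            scoreboard: Ring::with_capacity(slots),
            location: Ring::with_capacity(slots),
            analytics: Ring::with_capacity(slots),
        }
    }
}
```

The point of the typed split is that a mix-up between channels becomes a compile error rather than a runtime bug; the buffer backing is an independent choice.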

What The Numbers Said After

p99 latency fell from 11 ms to 2.3 ms under 10 million concurrent sessions. GC pauses vanished; the highest observed Tokio task latency was 420 µs, measured via tokio-console and exported to OpenTelemetry. Memory residency stayed flat at 420 MB RSS; mimalloc returned 60% of arenas via madvise(MADV_DONTNEED) after peak. We ran flamegraph output through speedscope and saw the biggest win in the quest router: zero dynamic dispatch in the hot loop, because we used static dispatch with sealed traits. The GC line on the dashboard turned green and never left.
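For readers who have not seen the sealed-trait pattern, here is a small sketch of what static dispatch in a router hot loop might look like. The handler types, the `points` method, and the scoring rules are invented for illustration; only the pattern itself (a private supertrait plus a generic function) reflects what the paragraph above describes.

```rust
mod sealed {
    // Lives in a private module, so code outside this crate cannot name it
    // and therefore cannot add new QuestHandler implementations.
    pub trait Sealed {}
}

/// Hypothetical handler trait, closed to outside implementors via `Sealed`.
pub trait QuestHandler: sealed::Sealed {
    fn points(&self, distance_m: u32) -> u32;
}

pub struct TreasureBox;
pub struct Riddle;

impl sealed::Sealed for TreasureBox {}
impl sealed::Sealed for Riddle {}

impl QuestHandler for TreasureBox {
    // Made-up scoring rule: base points plus a distance bonus.
    fn points(&self, distance_m: u32) -> u32 {
        100 + distance_m / 10
    }
}

impl QuestHandler for Riddle {
    // Made-up scoring rule: flat points regardless of distance.
    fn points(&self, _distance_m: u32) -> u32 {
        250
    }
}

/// Generic over `H`: the compiler emits one specialized copy per handler
/// type, so `points` is resolved at compile time and can be inlined.
pub fn route_static<H: QuestHandler>(handler: &H, distances: &[u32]) -> u32 {
    distances.iter().map(|&d| handler.points(d)).sum()
}

/// The same loop through a trait object, for comparison: each call to
/// `points` goes through a vtable lookup.
pub fn route_dynamic(handler: &dyn QuestHandler, distances: &[u32]) -> u32 {
    distances.iter().map(|&d| handler.points(d)).sum()
}
```

Both functions return the same totals; the difference only shows up in the generated code and, under load, in the profiler.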

What I Would Do Differently

I would not have trusted vendored jemalloc in production. For three weeks we saw RSS balloon to 3 GB during load-test spikes; mimalloc fixed it, but jemalloc's arenas accounted for 1.8 GB of resident anonymous pages that were never freed. Next time I'd push allocator choice into the build matrix early, so CI fails if any allocator regression sneaks in. I would also profile io-uring submission queue depth before the first load test; we initially set it to 1024 entries and only discovered throttling when we hit 4.1 million IOPS, where tokio::io::poll_io hung in bursts of 4 ms. Finally, I would prototype the network channel refactor in Rust first, not as a post-hoc rewrite; the symbolic cost of Arc<...> scared our C++ team into rejecting the idea, but the profiler proved it was cheaper than any GC pause.
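One way such a CI guard could look is a counting wrapper around the allocator under test plus a budget assertion. The sketch below is an assumption on my part, not what we ran: it wraps the standard `System` allocator (in a build matrix, swap in the candidate), and names like `Counting` and `peak_growth` are hypothetical. It tracks bytes requested, not RSS, so catching retained arenas like the ones described above would additionally need an RSS probe.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper: forwards to the system allocator and tracks
/// live and peak requested bytes.
pub struct Counting;

static LIVE: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            // Record the new live total and raise the peak if we passed it.
            let live = LIVE.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(live, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: Counting = Counting;

/// Bytes currently allocated through the wrapper.
pub fn live_bytes() -> usize {
    LIVE.load(Ordering::Relaxed)
}

/// Highest live total seen since the last reset in `peak_growth`.
pub fn peak_bytes() -> usize {
    PEAK.load(Ordering::Relaxed)
}

/// Runs `workload` and returns how far the peak rose above the live level
/// at the start. A CI test can compare this against a fixed budget.
pub fn peak_growth<F: FnOnce()>(workload: F) -> usize {
    let start = live_bytes();
    PEAK.store(start, Ordering::Relaxed);
    workload();
    peak_bytes().saturating_sub(start)
}
```

A test would then wrap a representative workload in `peak_growth` and assert that the result stays under a budget agreed for each allocator in the matrix.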
