Why Your Treasure Hunt Engine Kept Crashing at 1.2M Concurrent Connections

The problem we were actually solving was not how to make the treasure hunt more fun, but how to keep the leaderboard from exploding the heap when 1.2 million players hammered the Redis cluster at exactly 3:17 PM every Tuesday. The marketing team called this peak engagement; I called it a memory avalanche. We were running Veltrixs open-source treasure-hunt engine on a 32 GB RAM instance, and every spike turned the node into a swap-to-death zombie. The leaderboard tier used an in-memory sorted set that Redis advertises as O(log N) per operation, but at N=1.2 M the constant factor was high enough that the Lua scripts were spending more time context-switching than updating scores. We hit 400 MB of RESident memory per process, and once the Go garbage collector paused for 420 ms, the TCP backlog overflowed and dropped 37k ZADD requests. That was the first time the CEO noticed the word cache.

What we tried first (and why it failed)

We started with the obvious: bump the Redis instance to 128 GB, move the leaderboard to a separate node, and slap a read replica in front. The operator docs called this horizontal scaling. What they did not mention was that Redis Cluster splits the sorted set across slots, so a single players score update might fan out to three different primaries. When players near the top of the leaderboard updated their scores, the cluster saw a surge of cross-slot traffic that turned the network card into a bottleneck. We were pushing 8 Gbps of intra-cluster traffic with only 3 Gbps of actual game updates. The Redis cluster bus protocol started dropping gossip messages, and the cluster lost the view of slot ownership for 11 seconds. Those 11 seconds were enough for two dozen nodes to start a new election cycle, and the leaderboard froze while the slot map reconverged.

The architecture decision

We ripped out the Redis sorted set entirely and built a two-tiered leaderboard in Go. The first tier is a write-through sharded map: 64 shards, each an in-memory radix tree keyed by player ID. Each shard is a goroutine that batches writes into a 4 KB buffer and flushes asynchronously to a single PostgreSQL table partitioned by shard ID and date. We chose PostgreSQL not because it was fast—it absolutely is not—but because it gave us a real transaction log. If the process crashes, the WAL guarantees we can restore the last committed batch in under 100 ms. The second tier is a materialized view rebuilt every 30 seconds by a worker that runs a windowed SQL query over the last hours partitions. We serve reads from the materialized view, so leaderboard queries never touch the in-memory tree during the spike.

We added one more trick: a small LRU cache in front of the radix trees. Cache entries have a 5-minute TTL, but we use a version vector embedded in the score itself. When a players score changes by any amount, we increment the version and invalidate the cache entry atomically in the same PostgreSQL transaction that commits the new score. The cache hit rate hovers between 72 % and 78 %, but even on a miss the read path is a single radix lookup plus a single hash lookup instead of scanning the entire set.

What the numbers said after

After the change the 99th percentile leaderboard latency dropped from 420 ms to 3 ms. The peak memory footprint per shard stayed under 200 MB, and the Go process GC pause averaged 1.2 ms even under 1.4 M concurrent connections. The write throughput increased from 2.1k ops/sec to 220k ops/sec, and the cross-shard traffic vanished because every update stayed on one PostgreSQL connection. PostgreSQL CPU utilization peaked at 35 % during the Tuesday spike, which gave us headroom for the next 3× growth without touching the cluster again.

What I would do differently

I would not have trusted the Redis Cluster slot map to stay consistent during network hiccups. If I had to do it again, I would replace the PostgreSQL worker with a streaming job that consumes the write-ahead log directly via pg_logical and builds the materialized view incrementally. That way the view never lags more than a few hundred milliseconds behind the sharded in-memory trees, and we avoid the 30-second staleness window.

I would also instrument the radix tree with a Prometheus histogram that records the number of pointer dereferences per lookup. That metric told us the trees were compact enough that we rarely walked more than six levels, but the histogram made it obvious when a hot shard started growing unpredictably. Without that visibility we would have missed the 4 % of queries that were doing 12 levels of traversal after a few weeks of churn.

推荐订阅源

DEV Community