The Event Store That Survived Black Friday Without a Single 5xx

The Problem We Were Actually Solving

At 02:47 on November 26, 2025, the event feed from the promotions service jumped from 22 k events per second to 127 k in under three minutes. The event processor, running on a three-node Kafka Streams cluster, started throwing RocksDB iterator exceptions every time it tried to compact a 2 GB changelog segment. We had tuned RocksDB with a 1 GB block cache and a 512 MB write buffer manager limit, but the compaction threads were still spending 400 ms per segment, and the lag on partition 7 grew to 38 minutes. The on-call engineer restarted the pod, the lag recovered, and the CFO called within the hour asking why the coupon-redemption graph looked like a hockey stick.

What We Tried First (And Why It Failed)

We began with a hot path: keep the last 30 seconds of events in a Redis cluster and stream everything else to S3 via Firehose. The Redis footprint hit 95 GB at 50 k TPS, and the cluster started evicting keys at 30 k RPS. We switched to DragonflyDB with 64 shards and 256 GB of RAM, but the fork-based persistence still caused 80 ms p99 tail-latency spikes during snapshot writes. Then we tried Kafka Streams with in-memory state stores and exactly-once semantics. The problem wasn't state store size; it was the ten minutes we lost every hour while the JVM paused for 15-second GC cycles. Switching to ZGC reduced GC pauses to 1.2 ms, but the real bottleneck had been hiding in the changelog replication factor. We had set replication.factor=3 for fault tolerance, but every partition leader rebalance triggered a 12-second rebalance storm because the controller log was 300 GB.
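The Redis and GC figures above can be sanity-checked with a little arithmetic. The sketch below assumes the 95 GB footprint consisted entirely of the 30-second window at 50 k events per second, which the text does not state explicitly, so the implied per-event cost includes whatever key, index, and replication overhead was stored. The function names are illustrative only.

```python
# Back-of-envelope check of the hot-window numbers quoted above.
# Assumption: the 95 GB footprint is entirely the 30-second window at 50 k events/s.

WINDOW_SECONDS = 30
FOOTPRINT_BYTES = 95 * 10**9  # 95 GB, decimal gigabytes


def events_in_window(rate_per_s: int, window_s: int) -> int:
    """Number of events resident in the cache at a steady rate."""
    return rate_per_s * window_s


def bytes_per_event(footprint_bytes: int, rate_per_s: int, window_s: int) -> float:
    """Implied memory cost per resident event, overhead included."""
    return footprint_bytes / events_in_window(rate_per_s, window_s)


def footprint_at(rate_per_s: int, window_s: int, per_event_bytes: float) -> float:
    """Projected footprint in bytes if the per-event cost stays constant."""
    return events_in_window(rate_per_s, window_s) * per_event_bytes


def pauses_per_hour(lost_seconds_per_hour: float, pause_seconds: float) -> float:
    """How many stop-the-world pauses add up to the lost time."""
    return lost_seconds_per_hour / pause_seconds


per_event = bytes_per_event(FOOTPRINT_BYTES, 50_000, WINDOW_SECONDS)
projected = footprint_at(127_000, WINDOW_SECONDS, per_event)

print(events_in_window(50_000, WINDOW_SECONDS))  # 1500000 resident events
print(round(per_event))                          # 63333 bytes per event
print(round(projected / 10**9))                  # 241 GB at the 127 k peak
print(pauses_per_hour(10 * 60, 15))              # 40.0 pauses per hour
```

Under that assumption, the same 30-second window at the 127 k peak would need roughly 241 GB, which sits uncomfortably close to the 256 GB DragonflyDB deployment before any headroom for snapshots.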

The Architecture Decision

We threw away the hybrid cache model and went all-in on tiered event sourcing with a custom segment router. Every event carries a header: event_id, source_ts, and segment_id. Segment routers live on the edge and fan out to three layers:

A ring buffer, allocated through jemalloc, holding the last 5 seconds of events (cache-line aligned, no locks).
Sharded RocksDB on NVMe with 256 KB compaction granularity and zstd compression level 19. We set write_buffer_size=128 MB and max_write_buffer_number=4 to shrink compaction time to 70 ms per segment.
S3 Glacier Deep Archive for the immutable audit trail, pushed via PutObject streaming so we never buffer more than 128 KB in RAM.
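To make the header and the first layer concrete, here is a minimal single-threaded sketch. It is not the production structure: the real buffer is described as lock-free and cache-line aligned, which plain Python cannot express. The class names, the fixed slot count, and the choice of seconds for source_ts are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EventHeader:
    """The three header fields named in the text; the types are assumed."""
    event_id: str
    source_ts: float  # assumed to be seconds since the epoch
    segment_id: int


class HotRing:
    """Fixed-capacity ring that overwrites the oldest slot when full."""

    def __init__(self, capacity: int, window_s: float = 5.0) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots: List[Optional[EventHeader]] = [None] * capacity
        self._capacity = capacity
        self._written = 0  # total events ever appended
        self._window_s = window_s

    def append(self, header: EventHeader) -> None:
        # The write position wraps around, so old entries are overwritten.
        self._slots[self._written % self._capacity] = header
        self._written += 1

    def recent(self, now: float) -> List[EventHeader]:
        """Return events still inside the time window, oldest first."""
        count = min(self._written, self._capacity)
        out: List[EventHeader] = []
        for i in range(self._written - count, self._written):
            header = self._slots[i % self._capacity]
            if header is not None and now - header.source_ts <= self._window_s:
                out.append(header)
        return out
```

In a sketch like this the capacity would be sized as peak rate times window length, so that a slot is never overwritten while its event is still inside the 5-second window.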

The key decision was enforcing segment locality: the router pins an event to the same NVMe node for 10 seconds regardless of partition changes. We measured segment-locality hit rate at 99.8 % under Black Friday load, so cross-node traffic stayed under 1.3 Gbps even while the cluster reshuffled.
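The pinning rule can be sketched as a small lookup table with expiry times. This is a hedged illustration rather than the router we run: the modulo placement function, the class and method names, and the hit-rate definition (lookups served from an unexpired pin) are assumptions, and the sketch does not handle a pinned node that has disappeared.

```python
from typing import Dict, List, Tuple


class SegmentRouter:
    """Pins each segment to a node for a fixed time, ignoring reshuffles."""

    def __init__(self, nodes: List[str], pin_seconds: float = 10.0) -> None:
        if not nodes:
            raise ValueError("need at least one node")
        self._nodes = list(nodes)
        self._pin_seconds = pin_seconds
        # segment_id -> (pinned node, time at which the pin expires)
        self._pins: Dict[int, Tuple[str, float]] = {}
        self.lookups = 0
        self.hits = 0

    def set_nodes(self, nodes: List[str]) -> None:
        """Simulate a cluster reshuffle; existing pins are left untouched."""
        if not nodes:
            raise ValueError("need at least one node")
        self._nodes = list(nodes)

    def _place(self, segment_id: int) -> str:
        # Hypothetical placement rule: modulo over the current node list.
        return self._nodes[segment_id % len(self._nodes)]

    def route(self, segment_id: int, now: float) -> str:
        self.lookups += 1
        pin = self._pins.get(segment_id)
        if pin is not None and now < pin[1]:
            self.hits += 1  # served from the pin, so locality is preserved
            return pin[0]
        node = self._place(segment_id)
        self._pins[segment_id] = (node, now + self._pin_seconds)
        return node

    def hit_rate(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0
```

With three nodes, a segment routed at t=0 keeps its node at t=5 even if set_nodes has changed the topology in between, and is only re-placed once the 10-second pin has expired.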

What The Numbers Said After

We replayed Black Friday traffic for 14 days in staging. The new processor handled 164 k TPS at a p99 of 420 ms on the same three bare-metal nodes that had previously choked at 32 k TPS. Compaction CPU dropped from 45 % to 12 % because we shrank segment size from 2 GB to 512 MB and switched to leveled compaction. Recovery time from a full node loss fell from 13 minutes to 89 seconds because we streamed RocksDB snapshots via S3 multipart upload instead of Kafka mirroring.

The business impact: coupon redemptions spiked 4.7×, but our event-driven fraud filter still caught 98.6 % of anomalous patterns. The only outage was a DNS misconfiguration that pointed the endpoint at a stale load balancer, which was an infrastructure problem rather than an event-store problem.

What I Would Do Differently

I would not have chosen Kafka Streams for the hot path. After the Black Friday fix, we migrated to Apache Pulsar with tiered storage enabled and 64 BookKeeper ledgers. The ledger compaction runs in the background and never blocks the critical path. We also sharded the segment routers at the edge so each edge POP owns only the segments it will ever see, cutting cross-region egress by 60 %. The cost delta: we added $18 k/month for extra NVMe nodes but saved $42 k/month in Redis cluster over-provisioning, a net saving of $24 k/month, and reduced on-call pages from five per week to zero during sales.
