The Event Store That Survived Black Friday Without a Single 5xx
Lisa Zulu · 2026-05-27 · via DEV Community

The Problem We Were Actually Solving

At 02:47 on November 26, 2025 the event feed from the promotions service jumped from 22 k events per second to 127 k in under three minutes. The event processor, running on a three-node Kafka Streams cluster, started throwing RocksDB iterator exceptions every time it tried to compact a 2 GB changelog segment. We had tuned RocksDB with block cache at 1 GB and write buffer manager at 512 MB, but the compaction threads were still spending 400 ms per segment and the lag on partition 7 grew to 38 minutes. The on-call engineer restarted the pod, the lag recovered, and the CFO called within the hour asking why the coupon-redemption graph looked like a hockey stick.

What We Tried First (And Why It Failed)

We began with a hot path: keep the last 30 seconds of events in a Redis cluster and stream everything else to S3 via Firehose. The Redis footprint hit 95 GB at 50 k TPS and the cluster started evicting keys at 30 k RPS. We switched to DragonflyDB with 64 shards and 256 GB RAM, but the fork-based persistence still caused 80 ms p99 tail latency spikes during snapshot writes. Then we tried Kafka Streams with in-memory state stores and exactly-once semantics. The problem wasn't state store size; it was the ten minutes we lost every hour while the JVM paused for 15-second GC cycles. Switching to ZGC reduced GC pauses to 1.2 ms, but the real bottleneck had been hiding in the changelog replication factor. We had set replication.factor=3 for fault tolerance, but every partition leader change triggered a 12-second rebalance storm because the controller log was 300 GB.
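To put the GC figures in perspective, here is a back-of-the-envelope calculation using only the numbers quoted above. The one assumption, flagged in the code, is that ZGC pauses occur as often as the old collector's pauses did; the post does not say so.

```python
# Figures quoted in the text above.
LOST_SECONDS_PER_HOUR = 10 * 60   # "ten minutes we lost every hour"
OLD_PAUSE_SECONDS = 15.0          # "15-second GC cycles"
ZGC_PAUSE_SECONDS = 1.2e-3        # "1.2 ms" after switching to ZGC


def pauses_per_hour(lost_seconds: float, pause_seconds: float) -> float:
    """Number of pauses needed to account for the lost time."""
    return lost_seconds / pause_seconds


def stalled_fraction(lost_seconds: float) -> float:
    """Share of each hour spent paused."""
    return lost_seconds / 3600.0


pauses = pauses_per_hour(LOST_SECONDS_PER_HOUR, OLD_PAUSE_SECONDS)

# ASSUMPTION (not from the post): ZGC pauses happen at the same frequency.
zgc_lost_seconds = pauses * ZGC_PAUSE_SECONDS

print(f"{pauses:.0f} pauses/hour, {stalled_fraction(LOST_SECONDS_PER_HOUR):.1%} of wall-clock time")
print(f"with ZGC at the same frequency: {zgc_lost_seconds * 1000:.0f} ms lost per hour")
```

Under that assumption the stall budget drops from roughly a sixth of every hour to well under a second, which is consistent with the bottleneck moving elsewhere.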

The Architecture Decision

We threw away the hybrid cache model and went all-in on tiered event sourcing with a custom segment router. Every event carries a header: event_id, source_ts, and segment_id. Segment routers live on the edge and fan out to three layers:

  1. A ring buffer allocated through jemalloc that holds the last 5 seconds of events (cache-line aligned, lock-free).
  2. Sharded RocksDB on NVMe with 256 KB compaction granularity and zstd compression level 19. We set write_buffer_size=128 MB and max_write_buffer_number=4 to shrink compaction time to 70 ms per segment.
  3. S3 Glacier Deep Archive for the immutable audit trail, pushed via streaming PutObject so we never buffer more than 128 KB in RAM.
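The three-layer fan-out can be sketched roughly as follows. This is a minimal illustration, not the production router: the class names `Event`, `HotWindow`, and `TieredRouter` are hypothetical, a `deque` stands in for the lock-free ring buffer, and plain lists stand in for the RocksDB and S3 tiers. It also assumes events arrive in roughly `source_ts` order, so the newest timestamp can serve as "now".

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass(frozen=True)
class Event:
    # Header fields named in the post.
    event_id: str
    source_ts: float      # seconds; assumed to be set by the producer
    segment_id: int
    payload: bytes = b""  # hypothetical body field


class HotWindow:
    """Layer 1 stand-in: retains only events from the last `window_s` seconds."""

    def __init__(self, window_s: float = 5.0) -> None:
        self.window_s = window_s
        self._events: Deque[Event] = deque()

    def append(self, event: Event) -> None:
        self._events.append(event)
        cutoff = event.source_ts - self.window_s
        # Drop everything older than the window, oldest first.
        while self._events and self._events[0].source_ts < cutoff:
            self._events.popleft()

    def __len__(self) -> int:
        return len(self._events)


class TieredRouter:
    """Fans every event out to all three layers."""

    def __init__(self) -> None:
        self.hot = HotWindow(window_s=5.0)
        self.warm: List[Event] = []  # stand-in for sharded RocksDB on NVMe
        self.cold: List[Event] = []  # stand-in for the S3 audit trail

    def route(self, event: Event) -> None:
        self.hot.append(event)
        self.warm.append(event)
        self.cold.append(event)


router = TieredRouter()
for ts in range(8):  # one event per second, timestamps 0..7
    router.route(Event(event_id=f"e{ts}", source_ts=float(ts), segment_id=1))
```

After the loop, the hot window holds only the events stamped 2 through 7, while the warm and cold stand-ins hold all eight.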

The key decision was enforcing segment locality: the router pins an event to the same NVMe node for 10 seconds regardless of partition changes. We measured segment-locality hit rate at 99.8 % under Black Friday load, so cross-node traffic stayed under 1.3 Gbps even while the cluster reshuffled.
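One way to picture the pinning rule is a router that remembers its first placement for a fixed time window. The sketch below is an assumption-laden illustration: `LocalityRouter`, `set_nodes`, and the hash-based placement are all hypothetical, and it pins per `segment_id`, which may not match how the real router keys its pins.

```python
import hashlib
from typing import Dict, List, Tuple


class LocalityRouter:
    """Pins each segment to one node for `pin_ttl_s` seconds."""

    def __init__(self, nodes: List[str], pin_ttl_s: float = 10.0) -> None:
        self.nodes = list(nodes)
        self.pin_ttl_s = pin_ttl_s
        # segment_id -> (node, time the pin was created)
        self._pins: Dict[int, Tuple[str, float]] = {}

    def set_nodes(self, nodes: List[str]) -> None:
        """Simulates a cluster reshuffle; existing pins are left alone."""
        self.nodes = list(nodes)

    def _pick(self, segment_id: int) -> str:
        # Hypothetical placement: stable hash modulo current node count.
        digest = hashlib.sha256(str(segment_id).encode()).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def route(self, segment_id: int, now: float) -> str:
        pin = self._pins.get(segment_id)
        if pin is not None and now - pin[1] < self.pin_ttl_s:
            return pin[0]  # still pinned: ignore the current layout
        node = self._pick(segment_id)
        self._pins[segment_id] = (node, now)
        return node


router = LocalityRouter(["nvme-a", "nvme-b", "nvme-c"])
first = router.route(segment_id=7, now=0.0)
router.set_nodes(["nvme-x"])               # layout changes mid-window
during = router.route(segment_id=7, now=9.9)
after = router.route(segment_id=7, now=10.0)
```

Here `during` equals `first` even though the layout changed, and `after` follows the new layout once the 10-second pin expires. A real router would also need to break a pin early when the pinned node is lost, which this sketch does not handle.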

What The Numbers Said After

We replayed Black Friday traffic for 14 days in staging. The new processor handled 164 k TPS at a p99 latency of 420 ms on the same three bare-metal nodes that previously choked at 32 k TPS. Compaction CPU dropped from 45 % to 12 % because we shrank segment size from 2 GB to 512 MB and switched to leveled compaction. Recovery time from a full node loss fell from 13 minutes to 89 seconds because we streamed RocksDB snapshots via S3 multipart upload instead of Kafka mirroring.
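Expressed as ratios, the figures above work out as follows. The snippet only restates the quoted numbers; nothing in it is newly measured.

```python
# All inputs are the figures quoted in the paragraph above.
throughput_gain = 164_000 / 32_000          # TPS after / TPS before
recovery_speedup = (13 * 60) / 89           # seconds before / seconds after
compaction_cpu_drop_points = 45 - 12        # percentage points
compaction_cpu_relative_drop = (45 - 12) / 45
segment_shrink = (2 * 1024) / 512           # 2 GB vs 512 MB, in MB

print(f"throughput:     {throughput_gain:.2f}x")
print(f"recovery:       {recovery_speedup:.2f}x faster")
print(f"compaction CPU: -{compaction_cpu_drop_points} points "
      f"({compaction_cpu_relative_drop:.0%} relative)")
print(f"segment size:   {segment_shrink:.0f}x smaller")
```

So the same hardware sustained about five times the throughput, and node recovery was close to nine times faster.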

The business impact: coupon redemptions spiked 4.7× but our event-driven fraud filter still caught 98.6 % of anomalous patterns. The only outage was a DNS misconfiguration that pointed the endpoint at a stale load balancer, which was an infrastructure problem, not an event-store problem.

What I Would Do Differently

I would not have chosen Kafka Streams for the hot path. After the Black Friday fix, we migrated to Apache Pulsar with tiered storage enabled and 64 BookKeeper ledgers. The ledger compaction runs in the background and never blocks the critical path. We also sharded the segment routers at the edge so each edge POP owns only the segments it will ever see, cutting cross-region egress by 60 %. The cost delta: we added $18 k/month for extra NVMe nodes but saved $42 k/month in Redis cluster over-provisioning and reduced on-call pages from five per week to zero during sales.