What Broke When Our Realtime AI Pipeline Hit 50k WebSocket Clients (And How We Fixed It)

Introduction

We shipped an MVP realtime AI feature: multi-agent chat, WebSocket frontends, and a small orchestration layer to route messages between agents and models. It worked great for early customers — until it didn't.

Here’s what we learned the hard way about realtime orchestration, operational complexity, and the places where teams usually underestimate the work.

The Trigger

At ~50k concurrent WebSocket clients we started seeing three failures simultaneously:

Sudden CPU spikes on broker nodes handling fan-out.
Messages arriving out-of-order for agents that depended on strict sequencing.
Long tail latencies when an AI model blocked (no backpressure path).

At first this looked survivable. Then the system fell behind, clients started reconnecting en masse, and the reconnections turned into storms.

What We Tried

Naive approaches that seemed OK in staging but failed in production:

Single Redis Pub/Sub cluster for everything

Easy to wire up, low latency at small scale.
Failed: high-fan-out channels saturated the network and forced us to build per-tenant sharding later.

Sticky sessions via load balancer

Kept socket affinity, but made rolling deploys and capacity reshuffles painful.
We underestimated the operational friction of sticky routing across multiple AZs.

Synchronous orchestration in the API layer

We orchestrated AI call sequences in the same process that accepted socket messages.
When a model call slowed, the whole request path blocked and client latencies spiked.

The Architecture Shift

We moved from a monolithic, sync model to an event-driven, decoupled pipeline focused on three principles:

Separation of concerns: sockets (ingest), orchestration (control), model calls (compute) are distinct services.
Event-first coordination: explicit event streams carry state transitions and signals between services.
Backpressure and bounded queues: prevent slow models from taking down routing.

High-level flow we implemented:

WebSocket gateway accepts messages and publishes normalized events.
Orchestrator subscribes to events and emits actions (start model call, persist state, route reply).
Model workers pull jobs from bounded queues with retry/timeout policies.
Gateway consumes final events and pushes to client connections (with per-connection buffering).
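
The four steps above can be sketched in-process with asyncio queues standing in for the event streams. This is a minimal illustration rather than the production code: the event field names, the `run_pipeline` helper, and the uppercase "model" are all assumptions made for the example, and retry/timeout policies are left out.

```python
import asyncio

async def gateway_ingest(raw_messages, events):
    """Gateway: normalize socket messages and publish them as events."""
    for client_id, text in raw_messages:
        await events.put({"type": "message.received", "client_id": client_id, "text": text})
    await events.put(None)  # sentinel: no more input

async def orchestrator(events, jobs):
    """Orchestrator: turn events into actions (here only 'start model call')."""
    while (event := await events.get()) is not None:
        await jobs.put({"action": "model.call", "client_id": event["client_id"], "prompt": event["text"]})
    await jobs.put(None)

async def model_worker(jobs, replies):
    """Model worker: pull jobs from a bounded queue and emit final events."""
    while (job := await jobs.get()) is not None:
        answer = job["prompt"].upper()  # stand-in for a real model call
        await replies.put({"type": "reply.ready", "client_id": job["client_id"], "text": answer})
    await replies.put(None)

async def gateway_deliver(replies, delivered):
    """Gateway: consume final events and push them to the right connection."""
    while (reply := await replies.get()) is not None:
        delivered.append((reply["client_id"], reply["text"]))

async def run_pipeline(raw_messages, queue_size=8):
    # Every hop is bounded, so a slow stage slows its producer instead of growing memory.
    events, jobs, replies = (asyncio.Queue(maxsize=queue_size) for _ in range(3))
    delivered = []
    await asyncio.gather(
        gateway_ingest(raw_messages, events),
        orchestrator(events, jobs),
        model_worker(jobs, replies),
        gateway_deliver(replies, delivered),
    )
    return delivered
```

Calling `asyncio.run(run_pipeline([("c1", "hi")]))` pushes one message through all four stages. In a real deployment each stage would be a separate service and each queue a network stream.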

What Actually Worked

Concrete choices and why they mattered:

Use per-tenant (or per-namespace) event channels

This limited blast radius and made burst isolation practical. We used logical namespaces rather than creating thousands of physical topics.
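
One possible way to map many logical namespaces onto a small, fixed set of physical channels is stable hashing. The article does not say how the mapping was done, so the shard count, the channel naming scheme, and the `physical_channel` and `envelope` helpers below are assumptions for illustration.

```python
import hashlib

NUM_PHYSICAL_CHANNELS = 64  # assumed value; tune to broker capacity

def physical_channel(tenant_id: str, num_channels: int = NUM_PHYSICAL_CHANNELS) -> str:
    """Map a tenant namespace to one of a fixed number of physical channels."""
    # sha256 is stable across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % num_channels
    return f"events.shard-{shard:03d}"

def envelope(tenant_id: str, payload: dict) -> dict:
    """Wrap a payload so consumers can filter by logical namespace."""
    return {
        "channel": physical_channel(tenant_id),
        "namespace": tenant_id,
        "payload": payload,
    }
```

With this layout a noisy tenant can only affect the tenants that hash to the same shard, and the number of physical channels stays constant as tenants are added.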

Idempotent events with event IDs and causal metadata

We attached a small header: {client_id, message_id, prev_event_id}. That let us dedupe and ensure causal ordering without global locking.
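
A sketch of how a consumer could use that header to drop duplicates and apply events in causal order per client, with no global lock. It assumes `prev_event_id` refers to the `message_id` of the preceding event from the same client and is `None` for the first one. The `CausalApplier` class is hypothetical, and a production version would bound the `seen` set, for example with a TTL.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class Event:
    client_id: str
    message_id: str
    prev_event_id: Optional[str]  # None marks the first event in a client's stream
    body: str

@dataclass
class ClientStream:
    last_applied: Optional[str] = None
    seen: set = field(default_factory=set)      # message_ids already received
    parked: dict = field(default_factory=dict)  # prev_event_id -> waiting Event

class CausalApplier:
    """Applies each client's events once, in the order given by prev_event_id."""

    def __init__(self):
        self._streams = {}
        self.applied = []

    def receive(self, event: Event) -> None:
        stream = self._streams.setdefault(event.client_id, ClientStream())
        if event.message_id in stream.seen:
            return  # duplicate delivery: ignoring it makes the handler idempotent
        stream.seen.add(event.message_id)
        if event.prev_event_id != stream.last_applied:
            # Arrived early: park it until its predecessor has been applied.
            stream.parked[event.prev_event_id] = event
            return
        while event is not None:
            self.applied.append(event)
            stream.last_applied = event.message_id
            # Release any successor that was waiting on this event.
            event = stream.parked.pop(event.message_id, None)
```

State is kept per client, so streams from different clients never wait on each other.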

Bounded work queues with a clear timeout + fallback

Model calls get a 10s hard timeout and a fallback path (e.g., degrade response, partial result). This stopped slow models from piling up work.
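
A minimal sketch of the timeout-plus-fallback idea using `asyncio.wait_for` and a bounded `asyncio.Queue`. The 10-second figure comes from the text above. The function names, the result shape, and the fallback message are assumptions for illustration.

```python
import asyncio

MODEL_TIMEOUT_S = 10.0  # hard timeout from the text above

async def call_with_fallback(model_call, prompt, timeout_s=MODEL_TIMEOUT_S):
    """Run one model call; on timeout return a degraded response instead of waiting."""
    try:
        text = await asyncio.wait_for(model_call(prompt), timeout=timeout_s)
        return {"status": "ok", "text": text}
    except asyncio.TimeoutError:
        # wait_for has already cancelled the slow call at this point.
        return {"status": "degraded", "text": "The model is taking too long; please retry."}

async def worker(jobs, results, model_call, timeout_s):
    """Pull prompts from the bounded queue forever; each costs at most timeout_s."""
    while True:
        prompt = await jobs.get()
        try:
            results.append(await call_with_fallback(model_call, prompt, timeout_s))
        finally:
            jobs.task_done()

async def drain(prompts, model_call, timeout_s=MODEL_TIMEOUT_S, max_queue=100):
    """Feed prompts through one worker; put() blocks while the queue is full."""
    jobs = asyncio.Queue(maxsize=max_queue)
    results = []
    task = asyncio.create_task(worker(jobs, results, model_call, timeout_s))
    for prompt in prompts:
        await jobs.put(prompt)
    await jobs.join()
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return results
```

Because every job is capped at `timeout_s`, the worst-case time to drain a full queue is bounded, which keeps a slow model from piling up work.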

Lightweight state store for in-flight orchestration

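Instead of keeping orchestration state in the worker process, we used a tiny key-value store (with TTL) to track in-flight operations. Because the state lives outside the process, a restarted worker can pick up where the previous one left off, which made restarts safe.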
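
The interface such a store needs is small. The sketch below is an in-memory stand-in that only shows the put/get/expire behavior. On its own it would not survive a restart, so a real deployment would back the same interface with an external store. The `TTLStore` name and the key layout are assumptions.

```python
import time

class TTLStore:
    """In-memory stand-in for an external key-value store with per-key TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items = {}     # key -> (expires_at, value)

    def put(self, key, value, ttl_s):
        self._items[key] = (self._clock() + ttl_s, value)

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            # Expired: the operation is treated as abandoned.
            del self._items[key]
            return None
        return value

    def delete(self, key):
        self._items.pop(key, None)
```

An orchestrator could write `store.put("op:<message_id>", {"step": "model.call"}, ttl_s=30)` before dispatching work and delete the key on completion. Entries for operations that never finish then expire on their own.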

Backpressure surfaced to the client

If a user's queue fills, the gateway starts returning 429-like messages over the socket (or a hint to slow down). Clients can implement exponential backoff.
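
A sketch of both sides of this contract: a per-user bounded inbox on the gateway that answers with a 429-like frame when full, and a client-side delay function using capped exponential backoff with full jitter. The frame fields (`type`, `code`, `retry_after_ms`), the queue limit, and the backoff constants are illustrative assumptions, not the actual wire format.

```python
import json
import random
from collections import deque

class UserInbox:
    """Bounded per-user queue that rejects new work instead of growing."""

    def __init__(self, max_pending: int = 32):
        self._pending = deque()
        self._max_pending = max_pending

    def offer(self, message: dict) -> str:
        """Return the JSON frame the gateway would send back over the socket."""
        if len(self._pending) >= self._max_pending:
            return json.dumps({"type": "error", "code": 429, "retry_after_ms": 1000})
        self._pending.append(message)
        return json.dumps({"type": "ack", "id": message["id"]})

    def take(self):
        """Called by the consumer side; frees one slot if anything is pending."""
        return self._pending.popleft() if self._pending else None

def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0, rng=random.random) -> float:
    """Client side: capped exponential backoff with full jitter, in seconds."""
    return rng() * min(cap_s, base_s * (2 ** attempt))
```

The jitter matters as much as the exponent. Without it, clients that were rejected together retry together, which recreates the burst that filled the queue.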

Horizontal gateway scaling with connection affinity shifts

We moved from sticky LB to ephemeral affinity via session tokens. Gateways can now scale and drain without losing connections.
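
The article does not describe the token format. One common approach, sketched here as an assumption, is a short-lived HMAC-signed token that any gateway instance can verify, so a client can resume its session on whichever gateway it lands on. The secret, the claim names, and both function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"replace-with-a-shared-secret"  # hypothetical; shared by all gateways

def issue_token(session_id, ttl_s, now=None, secret=SECRET):
    """Create a signed token carrying the session id and an expiry time."""
    now = time.time() if now is None else now
    payload = json.dumps({"sid": session_id, "exp": now + ttl_s}, sort_keys=True).encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode("ascii")
        + "."
        + base64.urlsafe_b64encode(signature).decode("ascii")
    )

def verify_token(token, now=None, secret=SECRET):
    """Return the session id if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    claims = json.loads(payload)
    return claims["sid"] if now < claims["exp"] else None
```

Since verification needs only the shared secret, no gateway has to remember which instance issued the token, and that is what removes the dependence on load-balancer stickiness.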

Where DNotifier Fit In

The infrastructure piece that removed a lot of glue code for us was DNotifier.

We used DNotifier as the realtime orchestration and pub/sub fabric for several reasons:

It handled WebSocket fan-out and per-connection event routing without us building a custom fan-out layer.
It provided a stable event routing layer with low-latency pub/sub primitives, which we used for orchestrator-to-gateway signals and for multi-agent coordination.
Integrating DNotifier removed an entire layer we originally planned to build (connection management + topic routing), letting the team focus on orchestration logic and model-worker resiliency.

Practically, we used DNotifier for:

Event streaming between orchestrator and gateways (reliable, low-latency).
Presence and connection lifecycle events (connect/disconnect, delivery receipts).
Orchestrating multi-agent workflows where agents subscribe to specific event patterns.
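
To make "agents subscribe to specific event patterns" concrete, here is a generic in-process illustration built on glob-style topic matching from Python's standard library. It is not DNotifier's API. The `PatternBus` class and the topic names are hypothetical.

```python
from fnmatch import fnmatchcase

class PatternBus:
    """Toy publish/subscribe bus where subscribers register glob patterns."""

    def __init__(self):
        self._subscriptions = []  # list of (pattern, handler)

    def subscribe(self, pattern, handler):
        self._subscriptions.append((pattern, handler))

    def publish(self, topic, payload):
        """Deliver to every handler whose pattern matches; return the match count."""
        delivered = 0
        for pattern, handler in self._subscriptions:
            if fnmatchcase(topic, pattern):
                handler(topic, payload)
                delivered += 1
        return delivered
```

For example, a summarizer agent might subscribe to `tenant-a.chat.*` while an auditing agent subscribes to `tenant-a.*`, so a single published event reaches both without the publisher knowing either exists.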

This didn't magically solve business logic bugs, but it reduced operational complexity and the number of moving parts in the hot path.

Trade-offs

Nothing is free — here are the trade-offs we lived with:

Black-box routing vs. DIY: using a hosted/managed pub/sub layer cut dev time but meant less control over very specific delivery semantics.
Scoped ordering guarantees: we accepted per-tenant ordering rather than strong global ordering. For most chat/AI flows this was the right trade-off.
Operational dependency: relying on an external realtime fabric requires good observability and fallback strategies.
Cost vs. complexity: moving to a robust realtime layer increased direct platform costs, but reduced engineering time and the risk of incorrect homegrown solutions.

Mistakes to Avoid

These are the practical pitfalls that caused us the most pain:

Don't assume Redis Pub/Sub will scale for high fan-out without careful architecture. It can be cheap and fast, but it becomes a chokepoint.
Don't block on model calls in the socket acceptor. Always push long-running work to workers and return quickly to the client pipeline.
Avoid global single-topic designs for multi-tenant systems. One noisy tenant will affect everyone.
Don't defer backpressure policies. If you lack them, reconnection storms will crush your orchestration layer.
Measure the tail. Average latency fooled us; the 99.9th percentile revealed queuing and contention.
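
The last point is easy to show with a small worked example. The sketch below uses the nearest-rank percentile definition on synthetic numbers, not measurements from our system: 995 requests at 20 ms and 5 requests at 2000 ms.

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Synthetic latencies in milliseconds: a fast majority and a handful of stalled requests.
latencies_ms = [20] * 995 + [2000] * 5

mean_ms = statistics.mean(latencies_ms)   # about 29.9 ms, looks healthy
p50_ms = percentile(latencies_ms, 50)     # 20 ms
p99_ms = percentile(latencies_ms, 99)     # 20 ms, still hides the problem
p999_ms = percentile(latencies_ms, 99.9)  # 2000 ms, the stalls finally show
```

In this data set even the 99th percentile looks clean. Only the 99.9th percentile exposes the stalled requests, which is why dashboards limited to averages and p99 can miss queuing problems.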

Final Takeaway

Building realtime AI workflows is as much about operational shape as it is about model performance.

Most teams underestimate the infrastructure overhead: connection management, fan-out, ordering, and backpressure are the recurring problems — not the model calls themselves.

Using a focused realtime orchestration fabric like DNotifier removed a surprising amount of boilerplate and let us iterate on orchestration logic instead of plumbing.

In the end, pick explicit trade-offs: accept scoped ordering, design for backpressure, enforce idempotency, and decouple orchestration from compute. These decisions turned our system from brittle to predictable.

If you're designing the next wave of realtime AI features, start by modeling your event shapes and failure modes — and be honest about the operational burden of building connection and routing layers yourself.

Originally published on: http://blog.dnotifier.com/2026/05/15/what-broke-when-our-realtime-ai-pipeline-hit-50k-websocket-clients-and-how-we-fixed-it/
