How to analyze the cost of Kafka?

Which side are you on: "This is just what Kafka costs at scale" or "We should switch to a cheaper Kafka provider"?

At Conduktor, our field team works inside Kafka environments that have been running for a long time. What we see: most Kafka teams are overpaying by 25 to 40 percent. Not because anyone did anything wrong, but because of how their Kafka estate got built up over time.

The cost drivers of Kafka are weirdly context-dependent: the infrastructure and the provider are a tiny part of the full picture.

How Kafka is being used is the real question.

Five bad patterns eating budget

Below is what we see. The same patterns show up everywhere, and they are the first things we work on with our customers.

1. Partition overprovisioning

"How many partitions?" is the most common question with Kafka. Last week someone told me their org just defaults to 64. I was shocked. Not only may providers price per partition, but on the Kafka side every partition also costs metadata, open file handles, and so on.

The right partition count depends on expected throughput and concurrency (consumer parallelism). If a 64-partition topic is sitting in a cluster with barely any traffic, you're losing money on all sides. Multiply that by dozens or hundreds of topics at scale.
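
A common sizing heuristic (our assumption here, not a rule from any provider) is to take the larger of what producers and consumers need, and never go below the number of consumers you want running in parallel. A minimal Python sketch; the function name, parameters, and sample numbers are hypothetical, and the per-partition throughput figures are ones you would measure yourself:

```python
import math

def suggested_partitions(target_mb_s, producer_mb_s_per_partition,
                         consumer_mb_s_per_partition, max_consumers=1):
    """Rough partition count from throughput and consumer parallelism.

    All inputs are your own measurements; nothing here calls a Kafka API.
    """
    for_producers = math.ceil(target_mb_s / producer_mb_s_per_partition)
    for_consumers = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    # Never fewer partitions than consumers that should work in parallel.
    return max(for_producers, for_consumers, max_consumers, 1)

# A 5 MB/s topic read by at most 3 parallel consumers.
low_traffic = suggested_partitions(5, 10, 5, max_consumers=3)
```

With these sample inputs the heuristic suggests 3 partitions, a long way from a blanket default of 64.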

2. Retention that makes no sense

Long retention on topics that nobody reads past the last few hours. Do you need replay? Default is 7-day retention, but it's often applied uniformly, when some topics only need a couple of hours and others genuinely need weeks.

Tip: with compacted topics and/or Kafka Streams (changelog topics, etc.), data can be stored indefinitely, which can cause security and regulatory issues.
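
To put a rough number on retention for a time-retained topic, disk usage scales with ingress rate, retention window, and replication factor. The sketch below is a back-of-the-envelope estimate under stated assumptions: steady ingress, sizes measured as written to disk, and a replication factor of 3. It ignores index files and the fact that Kafka deletes whole segments, so real usage runs somewhat higher. The function name is hypothetical.

```python
def retained_gb(ingress_mb_s, retention_hours, replication_factor=3):
    """Approximate disk held by a time-retained topic, in GB (1 GB = 1000 MB)."""
    seconds = retention_hours * 3600
    return ingress_mb_s * seconds * replication_factor / 1000

one_week = retained_gb(1, 7 * 24)   # 1 MB/s kept for 7 days
six_hours = retained_gb(1, 6)       # same topic kept for 6 hours
savings_ratio = one_week / six_hours
```

For the same 1 MB/s topic, the 7-day window holds 28 times as much data as a 6-hour window, which is the gap a uniform default hides.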

3. Let's spin up another cluster

One-cluster-per-team was a reasonable isolation strategy a long time ago. We have seen the result multiple times: more than 500 clusters, with tons of mirroring to share data. That is money down the drain.

You're paying for underutilized clusters instead of consolidating onto fewer well-managed ones.

4. Zombie topics

Topics created for experiments, migrations, or one-off tests that were never cleaned up. It's a simple thing, but it costs a lot of money because no one is looking. Every one of them is replicated and has retention costs. We've seen enterprises with hundreds of zombie topics, and they were surprised when we showed them.
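
One way to find candidates is to flag topics with no recent writes and no active consumer groups. The sketch below assumes you have already collected that information from your own monitoring. The dict keys (`name`, `last_produce`, `active_groups`) and the 30-day threshold are hypothetical, not the output of any particular tool.

```python
from datetime import datetime, timedelta, timezone

def find_zombie_topics(topics, now, idle_days=30):
    """Return names of topics with no recent writes and no active consumer groups.

    `topics` is a list of dicts with hypothetical keys:
    name (str), last_produce (datetime or None), active_groups (int).
    """
    cutoff = now - timedelta(days=idle_days)
    zombies = []
    for t in topics:
        no_recent_writes = t["last_produce"] is None or t["last_produce"] < cutoff
        if no_recent_writes and t["active_groups"] == 0:
            zombies.append(t["name"])
    return zombies

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
sample = [
    {"name": "migration-2022-tmp", "last_produce": now - timedelta(days=90), "active_groups": 0},
    {"name": "orders", "last_produce": now - timedelta(days=1), "active_groups": 4},
    {"name": "poc-test", "last_produce": None, "active_groups": 0},
]
candidates = find_zombie_topics(sample, now)
```

Treat the output as a review list for topic owners rather than a delete list: a quiet topic may still be a rarely used but legitimate one.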

5. Runaway egress

We had a customer where egress was running 30x higher than ingress on a single topic because of a misconfigured consumer. Buggy consumers, unnecessary fan-out, and chatty clients create traffic patterns that are invisible without dedicated infra monitoring. Egress is rarely free.
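
A simple check for this pattern compares bytes out to bytes in per topic, allowing for legitimate fan-out: if N consumer groups each read everything once, a ratio of about N is expected. The sketch below assumes you can export those counters from your monitoring. The input shape and the tolerance of 2.0 are hypothetical choices.

```python
def egress_outliers(stats, tolerance=2.0):
    """Flag topics whose read traffic far exceeds what their consumer groups explain.

    stats: {topic: (bytes_in, bytes_out, consumer_groups)}, all measured over
    the same time window. The shape of this dict is hypothetical.
    """
    flagged = {}
    for topic, (bytes_in, bytes_out, groups) in stats.items():
        if bytes_in == 0:
            continue  # no ingress in the window, so the ratio is undefined
        ratio = bytes_out / bytes_in
        expected = max(groups, 1)  # each group reading every byte once
        if ratio > expected * tolerance:
            flagged[topic] = round(ratio, 1)
    return flagged

sample = {
    "orders": (100, 3000, 2),   # 30x egress with only 2 groups
    "clicks": (100, 300, 3),    # 3x egress with 3 groups, as expected
}
suspects = egress_outliers(sample)
```

Here only `orders` is flagged, since 30x egress cannot be explained by two consumer groups.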

How to deal with it

Pick your starting point based on where the waste is concentrated.

Stop the bleeding: better defaults

Low-coordination work that pays off over time. It's better to grant exceptions than to live with wrong defaults you can't roll back.

Set sensible low partition defaults (3) and short retention (1 day). Increase only when necessary.
Enforce client-side compression. (Conduktor Gateway)
Require ownership metadata at topic creation. (Conduktor)
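
These defaults can be expressed as a small check that runs before a topic is created. The sketch below is an illustration, not how Conduktor implements it: the request dict shape, the `owner` and `exception_approved` fields, and the function name are hypothetical. `retention.ms` is the real Kafka topic config, where `-1` means unlimited retention.

```python
MAX_PARTITIONS = 3
MAX_RETENTION_MS = 24 * 60 * 60 * 1000  # 1 day

def check_topic_request(request):
    """Return a list of policy violations for a hypothetical topic request dict."""
    problems = []
    approved = request.get("exception_approved", False)

    if request.get("partitions", MAX_PARTITIONS) > MAX_PARTITIONS and not approved:
        problems.append("partitions above the default without an approved exception")

    configs = request.get("configs", {})
    retention_ms = int(configs.get("retention.ms", MAX_RETENTION_MS))
    too_long = retention_ms == -1 or retention_ms > MAX_RETENTION_MS
    if too_long and not approved:
        problems.append("retention above the default without an approved exception")

    if not request.get("owner"):
        problems.append("missing owner")
    return problems

ok = check_topic_request(
    {"partitions": 3, "configs": {"retention.ms": "86400000"}, "owner": "team-payments"}
)
bad = check_topic_request({"partitions": 64, "configs": {"retention.ms": "-1"}})
```

The first request passes with no violations. The second is rejected on all three counts until someone claims ownership and justifies the exception.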

This won't reduce your bill right away, but it will prevent it from getting worse.

Trim the fat: optimize what's running

Tune retention where it's drifted, analyze consumer patterns.
Retire topics with no active producers or consumers.
Right-size partition counts (this is the hard one, since it means recreating topics and coordinating with every producer and consumer).
Consolidate Kafka clusters, introduce multi-tenancy (Conduktor)

This work easily moves the infrastructure bill: we have seen reductions of $500k from this alone.

Now, keep it clean, be disciplined

After a cleanup, the same drift will set in again.

To stay on course, you need clear visibility into what your Kafka ecosystem contains and what it costs (chargeback is powerful for this), clear ownership so every topic and cluster has a team accountable for it, and a regular review cadence to catch drift before it becomes permanent. Not heavyweight governance. Just enough discipline that the cleanup doesn't have to be repeated every year.
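
As an illustration of chargeback, the simplest model splits a shared cluster's monthly cost across teams in proportion to a usage metric such as stored bytes. This is a minimal sketch under that assumption. The metric, the function name, and the numbers are hypothetical, and real models often blend storage, throughput, and partition counts.

```python
def chargeback(monthly_cost, usage_by_team):
    """Split a shared cost across teams in proportion to their usage."""
    total = sum(usage_by_team.values())
    if total == 0:
        return {team: 0.0 for team in usage_by_team}
    return {team: monthly_cost * used / total
            for team, used in usage_by_team.items()}

# Stored GB per team on one shared cluster costing 1000 per month.
bill = chargeback(1000, {"payments": 3, "analytics": 1})
```

Even a rough split like this gives each team a number to own, which is what makes the regular review meaningful.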

Where to start

The diagnostic question is simple: which of these patterns are present in your environment, and what are they costing you?

The original deep-dive goes further into the four layers of Kafka cost (infrastructure, ecosystem tooling, vendor/licensing, and operational) and includes a framework for sequencing the work.

If you want to look at your own estate, Conduktor's field team does a free cost analysis where they walk through your environment with you and give you concrete numbers.
