What is Kafka Streams?
Stanislav Kozlovski · 2025-09-07 · via Stanislav’s Big Data Stream

Preface

After spending too much time on boring & dry docs describing Kafka Streams concepts, I want to explain it my own way. As concisely and clearly as possible while being technically accurate 👌

This should help you get an understanding of Kafka Streams in under 5 minutes.

Kafka Streams is a stream-processing library you embed in your apps.

import org.apache.kafka.streams.KafkaStreams;

It gives you a high level API to read data from Kafka, process it, and then write it back - it’s basically an abstraction above the regular KafkaProducer and KafkaConsumer classes, with a TON of extra processing, orchestration and stateful logic on top.

Streams lets you do lots of complex stuff. A simple example - count each page’s human views over the last minute in real time, then pick out the pages where that count is too high.

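builder.<String, PageView>stream("page-views")  // PageView is an assumed value type
       .filter((page, pageView) -> !isBot(pageView))
       .groupByKey()  // records must be grouped by key (the page) before windowing
       .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
       .count()
       .toStream((windowedPage, count) -> windowedPage.key())  // drop the window, keep the page as key
       .to("page-traffic-sums", Produced.with(Serdes.String(), Serdes.Long()));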

This code continuously counts human (non-bot) views per page over the last minute and produces each count to a new topic.

Here’s what the data would look like: 👇

Whenever Streams processes a new input message, a new output message is persisted in the output topic.

Any consumer that reads the `page-traffic-sums` topic will now know, in real-time (milliseconds), the latest sum of traffic in the last minute.
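
To make the semantics concrete, here is a minimal plain-Java sketch of what the windowed count computes. It is not Kafka Streams code: `PageView`, `CountUpdate` and `process` are hypothetical names, and the windows are assumed to be fixed one-minute buckets aligned to the epoch. It emits one update per accepted input, whereas a real Streams app may coalesce updates in its record cache before writing them out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WindowedCountSketch {
    // Hypothetical input and output shapes, invented for this illustration.
    record PageView(String page, long timestampMs, boolean bot) {}
    record CountUpdate(String page, long windowStartMs, long count) {}

    static final long WINDOW_MS = 60_000L; // one-minute windows

    // Returns one update per non-bot view, mirroring "new input -> new output".
    static List<CountUpdate> process(List<PageView> views) {
        Map<String, Long> counts = new HashMap<>(); // state: (page, window) -> count
        List<CountUpdate> updates = new ArrayList<>();
        for (PageView view : views) {
            if (view.bot()) {
                continue; // same role as the filter step
            }
            // Align the timestamp to the start of its one-minute window.
            long windowStart = view.timestampMs() - (view.timestampMs() % WINDOW_MS);
            long newCount = counts.merge(view.page() + "@" + windowStart, 1L, Long::sum);
            updates.add(new CountUpdate(view.page(), windowStart, newCount));
        }
        return updates;
    }

    public static void main(String[] args) {
        List<PageView> views = List.of(
                new PageView("/home", 1_000L, false),
                new PageView("/home", 2_000L, true),    // bot, dropped
                new PageView("/home", 59_000L, false),
                new PageView("/home", 61_000L, false)); // falls into the next window
        process(views).forEach(System.out::println);
    }
}
```

The map plays the role of the state that Streams keeps for the aggregation; the returned list plays the role of the `page-traffic-sums` topic.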

To stretch this example further, one could create a separate handler that reacts to high view load by creating an incident and triggering auto-scaling.

Instead of using the regular Consumer API - one could again simply use Streams!

builder.stream("page-traffic-sums", Consumed.with(Serdes.String(), Serdes.Long()))
       .filter((page, viewCount) -> viewCount > 10_000)
       // the lambda parameter must be named viewCount for the call below to compile
       .foreach((page, viewCount) -> handleCriticalAlert(page, viewCount));

// count() produces Long values, so the handler takes a Long
private static void handleCriticalAlert(String page, Long viewCount) {
    System.out.println("CRITICAL: " + page + " exceeded 10,000 views in the last minute!");
    sendSlackMessage(page, viewCount);
    updateLoadBalancer(page, viewCount);
    triggerAutoScaling(page, viewCount);
    createIncident(page, viewCount);
}

The fact KStreams is a simple library makes it stand out. It’s much easier to write such stream processing apps that way - e.g. simply import the library in your `autoscaler` app.

There are a few basic concepts one has to know about when discussing Kafka Streams:

  • Topology - any stream job we create consists of multiple steps, like reading data from the topic, filtering out results that match our predicate, summing the value, emitting the value to a Kafka topic.

    • The series of these steps is called a topology, and it’s a directed acyclic graph (DAG) → it only goes one way.

  • Processor - a step in the topology.

    • e.g. the `stream("page-views")` method creates a Source Processor - one that reads data from the page-views Kafka topic and sends it to the next processor; the `filter()` method creates a Filter Processor, etc.

    • every Processor sends its output to another processor, except the Sink Processor - which sends its output to a Kafka topic
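
The Source, Filter and Sink roles described above can be modelled in a few lines of plain Java. This is a toy sketch, not Kafka Streams' own Processor API: `ToyTopology`, `Processor`, `FilterProcessor`, `SinkProcessor` and `runSource` are all hypothetical names, and in-memory lists stand in for Kafka topics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class ToyTopology {
    // A step in the topology: receives a record and may forward it downstream.
    interface Processor {
        void process(String key, String value);
    }

    // Forwards only the records whose value passes the predicate.
    static class FilterProcessor implements Processor {
        private final Predicate<String> keep;
        private final Processor next;

        FilterProcessor(Predicate<String> keep, Processor next) {
            this.keep = keep;
            this.next = next;
        }

        @Override
        public void process(String key, String value) {
            if (keep.test(value)) {
                next.process(key, value);
            }
        }
    }

    // Terminal step: a list stands in for the output Kafka topic.
    static class SinkProcessor implements Processor {
        final List<String> outputTopic = new ArrayList<>();

        @Override
        public void process(String key, String value) {
            outputTopic.add(key + "=" + value);
        }
    }

    // Entry step: a list of {key, value} pairs stands in for the input topic.
    static void runSource(List<String[]> inputTopic, Processor next) {
        for (String[] entry : inputTopic) {
            next.process(entry[0], entry[1]);
        }
    }

    public static void main(String[] args) {
        SinkProcessor sink = new SinkProcessor();
        Processor filter = new FilterProcessor(value -> !value.equals("bot"), sink);
        runSource(List.of(new String[] {"/home", "human"},
                          new String[] {"/home", "bot"}), filter);
        System.out.println(sink.outputTopic);
    }
}
```

Records only ever flow from the source towards the sink, which is the one-way property of the DAG.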

Here is the full topology of our Streams job

The topology is the logical definition of the job.

The physical definition is a Stream Task.

  • Stream Task - a job executing a topology.

  • 1 Partition == 1 Stream Task - a stream task always works on one Kafka partition.

    • Since Kafka topics are divided into partitions, the unit of parallelism in Kafka Streams is the partition. The execution of stream processing logic on that partition is called a stream task.

    • Two Stream Tasks share nothing between each other - they’re independent.

      • e.g. you can’t have two tasks executing on the same partition.

A Kafka Streams app creates as many Stream Tasks as there are partitions.
This happens internally and is transparent to the user code.

Stream Tasks are CPU-heavy since a lot of data is being deserialized, crunched and serialized again.

A single CPU core can’t process two partitions truly in parallel, so we need… multi-threading!

  • Stream Thread - a thread on which Stream Tasks are run.

    • A Stream Thread can run many Stream Tasks.

    • One thread has a single consumer and producer client. Stream tasks within the thread share these Kafka clients.

    • For maximum performance, you want to have at least one thread per CPU core.

      • In cases where your stream tasks are IO-bound (e.g. calling external APIs), you want to have more threads than cores.
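
As a rough sketch of the thread-sizing advice above, the thread count is set through the `num.stream.threads` setting, which defaults to 1. The snippet below only builds the `Properties` object that you would pass to the `KafkaStreams` constructor together with your topology. The class and method names, the application id and the broker address are placeholders chosen for illustration.

```java
import java.util.Properties;

public class StreamsConfigSketch {
    // Builds Streams settings with one stream thread per CPU core (at least one).
    static Properties buildConfig(String applicationId, String bootstrapServers, int cores) {
        Properties props = new Properties();
        // application.id names the app; Streams also uses it as the consumer group id.
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", bootstrapServers);
        props.put("num.stream.threads", String.valueOf(Math.max(1, cores)));
        return props;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Placeholder values; substitute your own app name and brokers.
        Properties props = buildConfig("page-traffic-app", "localhost:9092", cores);
        System.out.println(props);
    }
}
```

For IO-bound workloads you would pass a number larger than the core count instead.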

Enough about threads - let’s not forget we’re talking about Big Data and Distributed Systems! Kafka was created because its data could not comfortably fit in a single machine, hence the need to shard data into partitions.

The same should apply doubly so to Kafka Streams, since processing is a lot more work than simply storing data on disk. Let’s create a distributed system:

  • Stream App (Node) - an instance of the `KafkaStreams()` class. Practically speaking, you’d run one of these per node (VM).

    • These apps coordinate their work through Kafka.

A Kafka Streams cluster consisting of two stream tasks per thread; three threads per node; two nodes. Note there is one consumer/producer instance per thread, and each task is associated with one partition.

With any distributed system come a thousand problems:

  • what happens if a stream node dies? who takes the tasks and how?

  • what happens if a new stream node comes online? which tasks does it take over, from whom and how?

  • what if you want another node to warm up (build up the same state), so as to allow for future failover?

  • how do you smoothly handle upgrades that change the coordination protocol’s schema?

  • how is partition progress persisted across nodes? if a new stream node takes over another’s work, how does it know from which offset to start?

There’s an elegant solution to this you may be familiar with - Kafka’s Consumer Group protocol! Every consumer in a Streams App uses the same consumer group id, and together they form the overall Kafka Streams Consumer Group.

Within a group rebalance, stream tasks (partitions) are assigned to stream threads (consumer instances). This is how a distributed Kafka Streams application, with many nodes and threads, distributes work throughout the system.
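
The effect of a rebalance can be illustrated with a deliberately simplified plain-Java sketch. It spreads tasks round-robin over whichever threads are alive, which is an assumption made for illustration only: the real Streams assignor is more sophisticated (for example, it tries to keep tasks where their state already lives). `TaskAssignmentSketch`, `assign` and the thread names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TaskAssignmentSketch {
    // Spreads tasks (one per partition) round-robin over the live stream threads.
    static Map<String, List<Integer>> assign(int partitions, List<String> threads) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String thread : threads) {
            assignment.put(thread, new ArrayList<>());
        }
        for (int partition = 0; partition < partitions; partition++) {
            String owner = threads.get(partition % threads.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two nodes with three threads each: every thread gets two tasks.
        List<String> twoNodes = List.of("node1-t1", "node1-t2", "node1-t3",
                                        "node2-t1", "node2-t2", "node2-t3");
        System.out.println(assign(12, twoNodes));

        // node2 dies: a rebalance hands its tasks to the surviving threads.
        List<String> oneNode = List.of("node1-t1", "node1-t2", "node1-t3");
        System.out.println(assign(12, oneNode));
    }
}
```

With 12 partitions this reproduces the two-tasks-per-thread layout of the two-node cluster, and shows each surviving thread picking up four tasks once a node disappears.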

In just ~4 minutes of read time, we covered:

  1. The KafkaStreams library

  2. Core concepts like a Topology, Processor and Stream Task

  3. How the framework distributes work between tasks, threads and apps (nodes)

Here’s a reference-able table to serve as a reminder:

KStreams gets more complex the more you dig into it. Here are a few concepts we haven’t yet dived into:

  • Sub-Topologies

  • KTables

  • RocksDB

  • Changelog Topic

  • Global KTables

  • Standby Tasks

  • Repartitioning

  • Different types of windowing

  • Exactly Once

  • Interactive Queries