Streaming LLM Responses: Make Your AI App Feel Fast
Redis · 2026-04-30 · via Redis

Watch someone use a ChatGPT-style app for the first time and you'll notice they start reading before the response is finished. That reading-as-it-appears behavior is the whole reason streaming exists. It turns a multi-second wait into something that feels like a conversation, even when the underlying generation time hasn't budged.

The first tokens appear within a second or two, even when the full response takes 10 or 15 seconds. Streaming uses that gap to keep users reading while the model finishes the rest.

This guide covers what streaming LLM responses are, why they feel faster, and how to combine streaming with caching to make AI apps feel responsive.

What is a streaming LLM response?

That perceived-speed benefit makes more sense once you look at how streaming works at the transport layer. By default, APIs return full responses. You wait for the whole thing, then get it all at once as a single HTTP payload. For a 500-token response, that can mean several seconds of staring at a blank screen.

With streaming, the server sends each token to the client as soon as it's generated instead of holding output back until the response is complete. This works because LLMs are autoregressive: they generate one token at a time, with each new token depending on everything that came before it. Since generation is already sequential, the server can emit each token immediately rather than buffering the whole sequence.

Most streaming APIs deliver tokens using Server-Sent Events (SSE), a standard way for a server to push data to a client over a single HTTP connection. SSE is unidirectional (server to client only), which is usually all you need for token delivery. Under HTTP/1.1, it's commonly carried via chunked transfer encoding, so the server can start sending data before it knows the total response length. HTTP/2 has no chunked transfer encoding and carries the same event stream in its own DATA frames instead.
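To make the wire format concrete, here is a minimal sketch of how a client might split an SSE text stream into events. It handles only the `event:` and `data:` fields and the blank line that ends each event. The function name `parse_sse` and the sample payload are illustrative assumptions, not any provider's actual API or response shape.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event from an iterable of text lines.

    Illustrative only: handles the `event` and `data` fields and
    ignores `id`, `retry`, and comment lines.
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith("data:"):
            # The format strips a single leading space after the colon.
            data.append(line[5:].removeprefix(" "))
        elif line.startswith("event:"):
            event = line[6:].removeprefix(" ")


# Hypothetical wire payload: three token events.
wire = ["data: Hello\n", "\n", "data: ,\n", "\n", "data:  world\n", "\n"]
tokens = [e["data"] for e in parse_sse(wire)]
print("".join(tokens))
```

Real providers usually wrap each token in a JSON object inside the `data:` field, so a production client would decode that payload after this step.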

Turning streaming on is typically a small change, not an architectural one. In most SDKs it's a single parameter (such as stream=True in Python), and the response shape shifts from a single complete message to a series of incremental updates the client renders as they arrive.

The metrics that matter: TTFT & TPOT

The metric that matters most for streaming is Time to First Token (TTFT), the time between submitting a request and seeing the first piece of output. TTFT is what users experience as "the wait." A complementary metric, Time Per Output Token (TPOT), measures the average time between tokens after the first. Together, they describe the two phases of an LLM response users actually feel: how long before anything shows up, and how fast the rest flows.
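As a rough sketch of how these two numbers could be measured on the client, the helper below timestamps each token as it arrives. The name `measure_stream` and the injectable `clock` parameter are assumptions made for illustration; the token list and fake timestamps stand in for a real streaming response.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and report TTFT and TPOT in seconds."""
    start = clock()
    first = last = None
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    # TPOT averages the gaps between tokens after the first one.
    tpot = (last - first) / (count - 1) if count > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "total": last - start, "tokens": count}


# Fake clock so the example is deterministic: the first token arrives
# at 0.8s, then one token every 0.05s.
fake_times = iter([0.0, 0.8, 0.85, 0.9, 0.95])
stats = measure_stream(["a", "b", "c", "d"], clock=lambda: next(fake_times))
print(stats)
```

Under this definition, total response time is TTFT plus TPOT times the number of tokens after the first, which is why a long response can still feel quick when TTFT is low.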

Why streaming LLMs feel faster than they are

The transport-layer view explains how streaming works. What it doesn't explain is why the gap between actual and perceived speed is so wide.

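Common UX guidelines suggest that a response under 0.1 seconds feels instantaneous, one under 1 second keeps a user's flow of thought uninterrupted, and 10 seconds is roughly the outer limit for holding attention. That makes TTFT especially important in practice: streaming can bring the first visible output close to the 1-second threshold even when the full response takes 10 seconds or more.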

The mechanism behind this is sometimes called the progress bar effect. In one study, an optimized progress bar design made processes feel 11% faster than a plain version. In another, progress bars with more frequent steps led users to underestimate elapsed time.

The effect on patience can also be impressive. In one experiment, users with a moving progress bar were willing to wait about 3 times longer than those with no indicator. There's an important nuance here, though. Streaming does more than shorten how long the wait seems: because users can start reading right away, the wait itself becomes productive.

Streaming vs. other LLM optimization levers

Streaming sits at the presentation layer. Other levers sit underneath it, reducing actual compute time, and they compose well with streaming on top. Here's how the main ones compare:

  • Speculative decoding uses a small draft model to generate candidate tokens that a larger model verifies in parallel. That can cut the time between tokens during decoding, but it doesn't reduce TTFT.
  • Quantization reduces model weight precision, cutting memory bandwidth per decode step.
  • Continuous batching dynamically adds new requests to an active batch as ongoing generations finish, reducing GPU idle time. That can make it an important throughput lever for high-concurrency inference workloads.
  • Prefix caching reuses previously computed attention key-value pairs for repeated prompt prefixes. That matters most for the wait before the first visible output, especially with long prompts.
  • Semantic caching stores full LLM responses keyed by query meaning, bypassing the model entirely on cache hits.

Those levers affect different parts of the latency path. In practice, the right mix depends on whether your bottleneck is first-token delay, generation speed, or repeated work. Streaming doesn't replace them; it makes their gains visible to the user in real time.

When should you use streaming LLM responses?

The deciding factor is whether a human is watching the output appear in real time. When they are, streaming is almost always worth turning on. When they aren't, the overhead usually isn't worth it.

Chat and conversational AI interfaces are the clearest fit. Fast TTFT is important for real-time feel, and users naturally read along as tokens arrive. Code generation tools are another strong match: developers read the generated code as it streams in and can cancel early if the model goes off track.

Batch processing is the opposite scenario. If you're running evaluations, classifying large datasets, or embedding content repositories, batch APIs can offer lower cost than synchronous calls in exchange for a longer turnaround time. Those workloads trade interactivity for lower cost, so streaming is usually not the priority.

A few situations call for non-streaming even in interactive apps. Content moderation is harder with streaming because partial completions are more difficult to evaluate for policy compliance. And if your app needs strict structured JSON output, the document isn't valid JSON until its last token arrives, so parsing chunks as they come in adds extra complexity.

How streaming changes your LLM app architecture

Once you've decided streaming fits the UX, the trade-offs lead straight into architecture. Flipping streaming on in your SDK is a one-line change, but making it work reliably in production touches every layer of your stack. The issues below are the ones most teams run into first.

The reverse proxy problem

The most common failure mode is invisible: your reverse proxy buffers the complete upstream response before forwarding it, making streaming silently degrade back to batch-like behavior. Default proxy timeouts are also often too short for longer LLM generations, which can cut streams off mid-response. Compression middleware can create a similar issue by buffering output before it reaches the client. The fix is conceptual rather than config-specific: anything between your app and the user that buffers or compresses by default has to be told not to on streaming routes.
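One way an app can ask for that behavior is through response headers on its streaming routes. The sketch below is a hedged example rather than a complete fix: `X-Accel-Buffering: no` is honored by nginx, other proxies and load balancers have their own settings, and compression middleware inside your own app usually has to be disabled for the route separately. The function name `sse_response_headers` is hypothetical.

```python
def sse_response_headers() -> dict[str, str]:
    """Headers a streaming route might set to discourage buffering."""
    return {
        # Tells the client this is a Server-Sent Events stream.
        "Content-Type": "text/event-stream",
        # no-cache: don't serve a stored copy of the stream;
        # no-transform: intermediaries should not alter the payload
        # (for example, by compressing it).
        "Cache-Control": "no-cache, no-transform",
        # nginx-specific: disable proxy buffering for this response.
        "X-Accel-Buffering": "no",
    }


headers = sse_response_headers()
for name, value in headers.items():
    print(f"{name}: {value}")
```

Proxy read timeouts are a separate setting and still need to be raised at the proxy itself for long generations.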

Error handling mid-stream

Streaming over a single long-lived HTTP response also changes how you handle failures. Once you've sent the 200 OK status and started streaming, you generally can't use another HTTP status code to signal errors. Errors that happen mid-generation have to be sent as stream events instead, and your frontend has to distinguish a dropped connection from an error the server reported inside the stream. Otherwise, a partial response looks like a successful one.
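A minimal sketch of that idea, assuming the server ends every stream with an explicit terminal event: the event names `done` and `error`, the JSON payload shape, and the helper names are all illustrative choices, not part of the SSE standard or any provider's API.

```python
import json
from typing import Iterable, Iterator


def sse_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap a token generator in SSE frames with an explicit terminal event."""
    try:
        for tok in tokens:
            payload = json.dumps({"delta": tok})
            yield f"data: {payload}\n\n"
    except Exception as exc:
        # The 200 status is already on the wire, so report the failure in-band.
        payload = json.dumps({"message": str(exc)})
        yield f"event: error\ndata: {payload}\n\n"
        return
    yield "event: done\ndata: {}\n\n"


def classify(frames: list[str]) -> str:
    """Client-side view: how did this stream end?"""
    if frames and frames[-1].startswith("event: done"):
        return "complete"
    if frames and frames[-1].startswith("event: error"):
        return "server_error"
    # No terminal event at all: the connection dropped mid-stream.
    return "dropped"


def failing_model() -> Iterator[str]:
    """Stand-in for a model call that fails partway through."""
    yield "Hel"
    yield "lo"
    raise RuntimeError("upstream timeout")


frames = list(sse_frames(failing_model()))
print(classify(frames))       # server reported an error in the stream
print(classify(frames[:1]))   # stream cut off before any terminal event
```

With a terminal event on every stream, the frontend can tell a finished response from a truncated one and decide whether to show an error or retry.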

Connection management at scale

Open connections add state, and that shows up fast at scale. Each streaming client holds an open connection, and if a client reconnects after a drop, it may land on a backend instance that has no memory of that session. The SSE spec supports resumption through event IDs and the Last-Event-ID request header, but your backend has to implement the replay. A decoupled architecture, where partial output lives in an intermediate store rather than in-memory on a single instance, lets any backend serve a reconnecting client without losing what's already been generated.
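The decoupled pattern can be sketched with a small buffer keyed by session. Here an in-memory dictionary stands in for the shared intermediate store so the example runs on its own; in production that role would go to something every backend instance can reach. The class name `StreamBuffer`, its methods, and the 1-based integer event IDs are assumptions for illustration.

```python
from collections import defaultdict


class StreamBuffer:
    """In-memory stand-in for a shared store of partial output."""

    def __init__(self) -> None:
        self._events: dict[str, list[str]] = defaultdict(list)

    def append(self, session_id: str, token: str) -> int:
        """Store a token and return its event ID (1-based position)."""
        self._events[session_id].append(token)
        return len(self._events[session_id])

    def replay_after(self, session_id: str, last_event_id: int) -> list[tuple[int, str]]:
        """Return (event_id, token) pairs the client has not seen yet."""
        events = self._events.get(session_id, [])
        return list(enumerate(events, start=1))[last_event_id:]


buffer = StreamBuffer()
for tok in ["Hello", ", ", "world"]:
    buffer.append("session-1", tok)

# The client saw events 1 and 2, dropped, and reconnects with
# Last-Event-ID: 2. Any backend with access to the store can resume.
missed = buffer.replay_after("session-1", last_event_id=2)
print(missed)
```

Because the generating process writes to the store and the serving process reads from it, a reconnect doesn't have to reach the instance that started the generation.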

Optimizing perceived speed: combining streaming with caching & context

With streaming reliably wired up, caching is the next lever for making AI apps feel fast. Streaming can't make a slow generation finish faster, but caching can sidestep generation entirely on hits. The two techniques complement each other well once you understand how they interact.

Streaming and caching have a natural tension: streaming emits tokens incrementally, but caching needs a complete response to store. The common production pattern resolves this by doing both. On a cache miss, the app streams tokens to the client in real time and asynchronously stores the full response once the stream finishes. On a cache hit, the app returns the complete cached response instantly and skips streaming entirely, because there's nothing to wait for.
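Here is a minimal sketch of that miss-then-store pattern using a plain dictionary as the cache. The names `respond` and `fake_generate` are hypothetical, the cache key is the exact prompt text, and the write happens inline after the last token rather than asynchronously, which keeps the example short.

```python
from typing import Callable, Iterable, Iterator

cache: dict[str, str] = {}  # stand-in for a real cache


def respond(prompt: str, generate: Callable[[str], Iterable[str]]) -> Iterator[str]:
    """Serve from cache on a hit; stream and then store on a miss."""
    cached = cache.get(prompt)
    if cached is not None:
        # Cache hit: return the whole response at once, no streaming.
        yield cached
        return
    parts = []
    for tok in generate(prompt):
        parts.append(tok)
        yield tok  # forward each token to the client immediately
    # Reached only if the stream ran to completion, so partial
    # responses from failed or cancelled streams are never cached.
    cache[prompt] = "".join(parts)


calls = []


def fake_generate(prompt: str) -> Iterator[str]:
    """Stand-in for a streaming model call; records each invocation."""
    calls.append(prompt)
    yield from ["Hi", " there"]


first = list(respond("greet me", fake_generate))
second = list(respond("greet me", fake_generate))
print(first, second, len(calls))
```

The second call never reaches the model, which is the whole point: the cached path has no generation latency to hide.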

Semantic caching widens the door on hits. The pair "What are the features of Product A?" and "Tell me about Product A's features" can map to the same cache entry, turning one cached answer into a hit for many phrasings of the same intent.

This is where Redis fits in an AI stack. Redis provides sub-millisecond latency for many AI workloads, and Redis LangCache adds semantic caching as a managed capability: converting queries to vector embeddings, comparing them against previously cached queries, and returning a cached response when the match is close enough. In benchmarks, cache hits were up to 15x faster, with up to 73% lower LLM inference costs without code changes.
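The embed, compare, and threshold steps can be illustrated with a toy version. This is not the Redis LangCache API: `SemanticCache`, `toy_embed`, the tiny vocabulary, and the 0.9 threshold are all assumptions, and the word-count vectors stand in for a real embedding model so the example runs without external services.

```python
import math
import re

VOCAB = ["product", "a", "features", "price"]  # toy vocabulary


def toy_embed(text: str) -> list[float]:
    """Stand-in for an embedding model: count vocabulary words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(w)) for w in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold: float = 0.9) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

    def get(self, query: str):
        """Return the closest cached response if it clears the threshold."""
        vec = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None


sc = SemanticCache(toy_embed)
sc.put("What are the features of Product A?", "Product A offers X, Y, and Z.")
hit = sc.get("Tell me about Product A's features")
miss = sc.get("What is the price of Product A?")
print(hit, miss)
```

The threshold is the main tuning knob: set it too low and different questions share an answer, set it too high and rephrasings stop matching.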

Context optimization is the other lever for cache misses. In retrieval-augmented generation (RAG) systems, where your app retrieves relevant documents and passes them to the LLM as context, prompt compression techniques like LLMLingua-2 shrink the prefill token count, which reduces TTFT. One benchmark reported prompt processing dropping to 7.5s at 2x compression on a V100 GPU. Ordering your prompt so static content comes before dynamic content also helps the inference engine reuse prefix computations across requests.
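The ordering point is easy to see with a small comparison. The sketch below builds the same prompt two ways and measures how much leading text two requests share. The constant names and prompt text are made up, and character-level prefix length is only a rough proxy, since inference engines match prefixes on tokens.

```python
import os

# Hypothetical static content that is identical on every request.
SYSTEM = "You are a support assistant for Acme."
FEW_SHOT = "Q: Where is my order?\nA: Check the tracking link in your email."


def static_first(question: str) -> str:
    """Static content leads, so requests share a long prefix."""
    return f"{SYSTEM}\n\n{FEW_SHOT}\n\nQuestion: {question}"


def dynamic_first(question: str) -> str:
    """Dynamic content leads, so the shared prefix ends almost immediately."""
    return f"Question: {question}\n\n{SYSTEM}\n\n{FEW_SHOT}"


q1, q2 = "How do refunds work?", "What is the return window?"

# os.path.commonprefix compares strings character by character.
shared_static = len(os.path.commonprefix([static_first(q1), static_first(q2)]))
shared_dynamic = len(os.path.commonprefix([dynamic_first(q1), dynamic_first(q2)]))
print(shared_static, shared_dynamic)
```

With static content first, everything up to the question is a candidate for prefix reuse; with the question first, almost nothing is.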

The fastest token is the one you don't generate

Streaming makes your AI app feel responsive. Caching makes it faster on repeated work. The best production architectures combine both: stream tokens on cache misses so users see immediate progress, and serve cached responses instantly on hits so they skip the wait entirely.

Redis for AI gives you a single real-time data layer for this pattern: native vector search for RAG retrieval, semantic caching through Redis LangCache to cut redundant LLM calls, and the data structures your app already uses for session state and real-time coordination. Instead of running a separate vector database, cache, and operational store, you get all three in one platform.

If you're building an LLM-powered app and want to see how semantic caching and vector search hold up against your workload, try Redis free. For help designing a streaming architecture that fits your scale, contact Redis.