





















Every LLM request runs in two distinct phases: prefill, where the model reads your prompt in one parallel burst, and decode, where it generates the response one token at a time, each one depending on the last. These two phases have different performance characteristics, hit different hardware bottlenecks, and need different optimization strategies.
If your chatbot feels sluggish before the first word appears, that's usually a prefill problem. If it crawls once tokens start coming, that's decode. This guide covers what prefill and decode mean, how they affect time to first token (TTFT) and inter-token latency (ITL), and which optimization levers matter for each phase.
Prefill and decode stress different parts of the GPU, which is why a single optimization rarely improves both.
Specifically, prefill is usually compute-bound, meaning it's limited by how fast the GPU can do math. Decode is usually memory-bandwidth-bound, meaning it's limited by how fast the GPU can move data around. That difference shapes serving decisions like request scheduling and hardware allocation.
Take prefill first. The model reads your entire prompt in one shot, processing every token in parallel. That parallel nature means prefill can fully use the GPU's compute power, but the total work still scales with prompt length. That's what makes it compute-bound.
Attention is the catch. Every token in your prompt has to interact with the tokens around it, so attention work grows roughly with the square of the prompt length, not in step with it. Doubling a long prompt from 16K to 32K tokens roughly quadruples the attention work.
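To make the quadrupling concrete, here's a minimal sketch that counts the query-key pairs attention has to score during prefill. It assumes causal attention (each token attends to itself and earlier tokens) and ignores the rest of the model, so treat it as an illustration of the growth rate, not a FLOP count.

```python
def attention_score_pairs(prompt_tokens: int) -> int:
    """Count query-key pairs scored by causal self-attention during prefill.

    Token i attends to itself and every earlier token, so the total is
    1 + 2 + ... + n = n * (n + 1) / 2, which grows with the square of n.
    """
    return prompt_tokens * (prompt_tokens + 1) // 2


pairs_16k = attention_score_pairs(16_000)
pairs_32k = attention_score_pairs(32_000)
ratio = pairs_32k / pairs_16k

print(f"16K tokens: {pairs_16k:,} pairs")
print(f"32K tokens: {pairs_32k:,} pairs")
print(f"Doubling the prompt multiplies attention pairs by {ratio:.2f}")
```

The same count applies per layer and per attention head, so the ratio holds no matter how large the model is.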
One Llama 3.1 70B benchmark shows the pattern: TTFT rose with prompt length and scaled more than linearly at the longest contexts tested. TTFT grew faster than the prompt itself, and that gap widens as contexts get longer.
From a user's perspective, prefill is the wait between sending a request and seeing the first token. That wait is what TTFT measures. In a streaming chat interface, it's the blank-screen pause before any text appears.
Prefill has to finish before the model can emit anything, so TTFT mostly reflects prefill runtime plus queuing and network overhead. For short prompts, that's usually near-instant. But for retrieval-augmented generation (RAG) workflows that prepend thousands of context tokens, or long-context chats that include the full conversation history, users can wait multiple seconds before the response starts.
After prefill, decode takes over with a slower kind of work. Each token has to be generated one at a time, with each one depending on the ones before it. That sequential nature means decode can't use parallel hardware the way prefill can, and it spends a lot of time moving data around. That's what makes it memory-bandwidth-bound.
Every decode step depends on every prior token, so the model has to remember the full context. That memory is the KV cache. It starts at the size of your prompt and grows by one entry per generated token. At scale, with long responses across many concurrent requests, the cache can balloon to several times the size of the model itself. Every decode step has to read all of that, which is a big reason decode is memory-bandwidth-bound.
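A rough size estimate shows how quickly the cache adds up. The sketch below assumes a Llama-3.1-70B-style shape (80 layers, 8 KV heads, head dimension 128) and 16-bit cache values. Those numbers, along with the request sizes and batch size, are assumptions for illustration and aren't taken from any benchmark in this article.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one sequence.

    Each token stores one key vector and one value vector (the factor of 2)
    for every layer and every KV head.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed model shape (Llama-3.1-70B-style); adjust for your model.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)

# Hypothetical request: 8,000 prompt tokens plus 500 generated tokens.
one_request = kv_cache_bytes(8_000 + 500, LAYERS, KV_HEADS, HEAD_DIM)

# Hypothetical load: 64 such requests in flight at once.
concurrent = 64 * one_request

GIB = 1024 ** 3
print(f"per token:   {per_token / 1024:.0f} KiB")
print(f"one request: {one_request / GIB:.2f} GiB")
print(f"64 requests: {concurrent / GIB:.1f} GiB")
```

Under these assumptions the cache starts at the prompt's size and adds a fixed number of bytes per generated token, which is why long outputs across many concurrent requests dominate memory.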
From a user's perspective, decode is what they see as the response streams in. The time between each token is what ITL measures. In a streaming chat interface, low ITL feels like smooth typing while high ITL feels like pauses between words.
Decode also dominates total response time. A 500-token response with 80 ms ITL spends about 40 seconds in decode alone. TTFT might add another 200 ms, but that's negligible next to the decode total. The longer the output, the more decode drives the experience.
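The arithmetic generalizes into a quick latency budget. This is a minimal sketch that assumes TTFT covers the first token and every later token costs one constant ITL. Real ITL drifts as the context grows, so read the output as an estimate.

```python
def end_to_end_seconds(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    """Estimate total response time: TTFT for the first token, one ITL per later token."""
    return (ttft_ms + itl_ms * (output_tokens - 1)) / 1000


def decode_share(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    """Fraction of the total response time spent after the first token."""
    total_ms = ttft_ms + itl_ms * (output_tokens - 1)
    return (total_ms - ttft_ms) / total_ms


# Same 200 ms TTFT and 80 ms ITL as the example above, at several output lengths.
for n in (10, 100, 500):
    total = end_to_end_seconds(200, 80, n)
    share = decode_share(200, 80, n)
    print(f"{n:>4} tokens: {total:6.2f} s total, {share:.0%} in decode")
```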
Because both phases compete for the same GPU, optimizing for one can degrade the other. Prefill requests can block decode streams and cause visible stuttering for users already receiving tokens. Long prefill workloads can also delay incoming requests and inflate TTFT.
That's why scheduling matters. An early policy in one inference framework prioritized prefills to improve TTFT but starved decode and slowed ITL.
Which phase hurts more depends on your workload's input-to-output ratio, the quickest signal for where latency will show up. Input-heavy workloads tend to surface as long waits before the first token, while output-heavy workloads surface as slow streaming.
That pattern gives you a practical way to map user complaints to the phase that's probably slow. One optimization strategy rarely helps every LLM workload equally.
That makes diagnosis the next step. Figure out which phase is actually slow before you optimize. The wrong fix wastes engineering time and can make the other phase worse. Three signals will get you most of the way to a diagnosis: sequence length distribution, TTFT, and ITL.
Your request length distribution is the fastest signal for which phase matters more. Long inputs with short-to-moderate outputs suggest a prefill bottleneck. Short inputs with long outputs point to decode. If both are long, both phases are likely stressed. Batch size, model architecture, and semantic features shift the picture too, but input/output length is where to start.
TTFT and ITL each diagnose a different phase, and you need both. TTFT is the best proxy for prefill latency, though it also includes queuing and network overhead. If TTFT is high and scales with input length, prefill is likely the constraint. ITL is the decode diagnostic: subtract TTFT from end-to-end latency, then divide by the number of output tokens minus one. End-to-end latency alone hides which phase is slow.
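Here's that ITL calculation as a small helper, plus two made-up requests that share the same end-to-end latency but fail in different phases. The function name and the sample timings are illustrative assumptions, not measurements.

```python
def inter_token_latency_ms(e2e_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Average ITL: time spent after the first token, spread over the remaining tokens."""
    if output_tokens < 2:
        raise ValueError("ITL needs at least two output tokens")
    return (e2e_ms - ttft_ms) / (output_tokens - 1)


# Two hypothetical requests, each 6 seconds end to end with 101 output tokens.
slow_prefill = {"e2e_ms": 6000, "ttft_ms": 4000, "output_tokens": 101}
slow_decode = {"e2e_ms": 6000, "ttft_ms": 200, "output_tokens": 101}

itl_a = inter_token_latency_ms(**slow_prefill)
itl_b = inter_token_latency_ms(**slow_decode)

print(f"slow prefill: TTFT {slow_prefill['ttft_ms']} ms, ITL {itl_a:.0f} ms")
print(f"slow decode:  TTFT {slow_decode['ttft_ms']} ms, ITL {itl_b:.0f} ms")
```

Both requests look identical if you only log total latency. Splitting it into TTFT and ITL is what tells you where to look.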
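Once you know which phase is slow, the kind of fix follows: prefill fixes make prompt processing cheaper or avoid it, while decode fixes move less data per generation step.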
This won't diagnose every workload, but it's a strong first pass before deeper profiling.
If diagnosis points to prefill, you have two options: make prompt processing cheaper, or avoid it entirely. Efficient attention kernels fall in the first camp. Semantic caching falls in the second.
A class of optimizations called efficient attention (FlashAttention is the best-known) restructures the attention computation to cut traffic to and from GPU memory, which makes long-prompt prefill faster. The model produces the same output, just faster. Many modern inference frameworks ship FlashAttention by default, so you may already have it.
Semantic caching operates at the app layer, above the inference pipeline. It caches complete LLM responses and reuses them when a new query is semantically equivalent to a previous one, regardless of exact wording. On a cache hit, the query never reaches the model.
Under the hood, semantic caching is a vector search problem. Incoming queries are embedded as vectors, compared against cached query vectors using a similarity metric, and served from cache when the similarity exceeds a configured threshold. Redis combines vector search with sub-millisecond reads in one real-time data platform, which is exactly what semantic caching needs. Redis LangCache, currently in public preview, is a managed semantic caching service built on Redis.
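The lookup flow can be sketched in a few lines. This is an in-memory toy, not Redis or LangCache code: `embed` is a hypothetical stand-in that uses word counts instead of a real embedding model, the linear scan stands in for a vector index, and the 0.8 threshold is an arbitrary assumption.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []  # (query vector, cached response)

    def get(self, query: str) -> Optional[str]:
        """Return the cached response for the most similar past query, if similar enough."""
        vec = embed(query)
        best_score, best_response = 0.0, None
        for cached_vec, response in self.entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))


cache = SemanticCache()
cache.put("how do i reset my password", "Use the reset link on the sign-in page.")

hit = cache.get("how do i reset my password please")  # near-duplicate wording
miss = cache.get("what is the weather today")         # unrelated query
print("hit:", hit)
print("miss:", miss)
```

On a hit the app returns the cached response and never calls the model. On a miss it runs inference as usual and calls `put` with the result. The threshold is the main tuning knob: too low and users get answers to questions they didn't ask, too high and the cache rarely hits.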
Semantic caching is often confused with prefix caching, but the distinction is simple: prompt-prefix reuse only reduces repeated prompt-processing work, while semantic caching can bypass inference entirely on a cache hit. That makes semantic caching one of the few optimizations that can erase prefill work instead of merely shrinking it. The two are complementary techniques, not substitutes.
Decode optimizations work differently. They live inside the model-serving stack, not the data layer. The goal is to move less data per step or get more work out of each memory load.
Speculative decoding uses a small, fast draft model to guess what tokens the main model would produce, then has the main model verify them in parallel. When the guesses are right, you get multiple tokens for the cost of a single decode step. One benchmark on Llama 3.3 70B reported a 3.55× speedup. The catch: the draft model's own latency becomes the new bottleneck, so it has to be fast first and accurate second.
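The draft-then-verify loop looks roughly like this. It's a simplified greedy sketch: the two "models" are toy functions that map a token list to the next token, the verification loop emulates what a real server does in one batched forward pass, and the bonus token a real implementation gains when every guess is accepted is left out.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target-model step per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of target verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap draft model guesses up to k tokens, one after another.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every guess. A real server scores all positions
        #    in one parallel pass; this loop only emulates the result.
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            correct = target(tokens + accepted)
            accepted.append(correct)  # always keep the target's own token
            if correct != guess:
                break                 # later guesses built on a wrong token
        tokens.extend(accepted)
    return tokens, target_passes


# Toy models: the target counts upward mod 10; the draft agrees except after a 4.
def target_model(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_model(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


baseline = greedy_decode(target_model, [0], 12)
spec_tokens, passes = speculative_decode(target_model, draft_model, [0], 12, k=4)
print("same output as baseline:", spec_tokens == baseline)
print("target passes:", passes, "instead of 12 sequential steps")
```

Because the target's token always wins on a mismatch, the output matches plain greedy decoding. The speedup comes only from how many guesses get accepted per verification pass.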
Quantization shrinks the numeric representation of the model's data, using fewer bits per number to store roughly the same information. Less data means less to move on every decode step. Different formats trade off accuracy and speed differently.
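A back-of-envelope calculation shows why fewer bits help decode. The sketch assumes a 70B-parameter model, a single request, and a hypothetical 2,000 GB/s of memory bandwidth. It ignores the KV cache, multi-GPU sharding, and the small overhead of quantization scales, so the numbers are a lower bound on step time, not a benchmark.

```python
def weight_gigabytes(params_billion: float, bits_per_weight: int) -> float:
    """Size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def min_ms_per_token(weight_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on one decode step: every weight is read from memory once."""
    return weight_gb / bandwidth_gb_per_s * 1000


PARAMS_BILLION = 70       # assumed model size
BANDWIDTH_GB_S = 2000     # hypothetical GPU memory bandwidth

for bits in (16, 8, 4):
    size = weight_gigabytes(PARAMS_BILLION, bits)
    floor = min_ms_per_token(size, BANDWIDTH_GB_S)
    print(f"{bits:>2}-bit weights: {size:5.1f} GB, at least {floor:.1f} ms per token")
```

Halving the bits halves the bytes each step has to move, which halves this floor. Whether accuracy holds up at the lower precision is a separate question that needs its own evaluation.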
The common thread across decode levers is simple: make each generation step ask less of memory. These optimizations pair well with app-layer techniques like semantic caching, which avoids the decode step entirely on a cache hit.
If you don't know which phase is slow, you're probably tuning the wrong thing. Long waits before the first token point to prefill, while slow streaming after that points to decode. Most LLM apps need to watch both TTFT and ITL, because they shape UX in different ways. The optimization families for each phase rarely overlap.
Redis is a real-time data platform for low-latency AI infrastructure. By combining vector search with semantic caching, it can bypass inference entirely on cache hits, eliminating both prefill and decode costs.
Try Redis free to see how semantic caching and vector search perform with your workload, or talk to our team about optimizing your AI infrastructure.