API Latency in LLM Apps: Causes & How to Fix It
Redis · 2026-05-07 · via Redis

You ship a retrieval-augmented generation (RAG) feature, monitoring is green, and every endpoint returns 200. But users keep complaining the app feels sluggish, and your own dogfooding confirms it: there's a multi-second pause before anything renders, and follow-up turns drag. The status code says success. The user experience says otherwise.

That gap is API latency in LLM apps, and it's harder to pin down than in a traditional REST service. This article covers what API latency looks like in LLM apps, how to measure it, what causes it, and practical ways to bring it down.

What is API latency?

API latency is the elapsed time between a client sending a request and receiving a response. It's a composite measurement spanning DNS lookup, connection negotiation, server processing, and network transit in both directions. For traditional REST APIs, it's mostly a single number. LLM apps layer in model inference, vector store retrieval, and context assembly, so any one stage can blow up your tail latency even when the API returns cleanly.

And it's a business metric, not just a performance one. A 0.1-second mobile speed gain on lead-gen sites correlated with a 6.9% increase in page views per session, and 43% of companies estimate losing over $1 million per month from outages, slowdowns, and API issues.

How API latency shows up in your LLM UX

With that baseline, the next step is breaking the experience into the parts users actually feel. Three metrics drive LLM UX latency: time to first token, inter-token latency, and end-to-end latency. Streaming and industry benchmarks frame how those metrics translate to perception.

Time to first token (TTFT)

TTFT measures the gap between sending a prompt and the first token a user sees. This is often the metric users feel most acutely because it's the blank screen before anything happens. TTFT is driven by the prefill phase, where the model processes your entire input prompt in a single forward pass before generation begins. Context length drives a lot of it. For RAG pipelines with large retrieved contexts stuffed into prompts, context size is one of the biggest TTFT levers you control.

Inter-token latency (ITL)

Once that first token arrives, ITL measures the pace of token generation that follows. Slow inter-token times make a response feel sluggish even when TTFT looked acceptable.

End-to-end latency

End-to-end latency is the full wall-clock time from prompt submission to final token, which is TTFT plus the time spent generating the rest of the output. It's what bounds the total interaction.
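The three metrics above can be read off a single token stream. The following is a minimal sketch, not a client for any particular provider: `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` only simulates prefill and decode delays with `time.sleep`. In a real app you would pass the iterator returned by your streaming client instead.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream; return TTFT, mean ITL, and end-to-end latency (seconds)."""
    start = clock()
    arrivals = []
    for _ in tokens:
        arrivals.append(clock())  # timestamp each token as it arrives
    if not arrivals:
        raise ValueError("stream produced no tokens")

    # Gaps between consecutive tokens are the inter-token latencies.
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return {
        "ttft": arrivals[0] - start,
        "mean_itl": sum(gaps) / len(gaps) if gaps else 0.0,
        "end_to_end": arrivals[-1] - start,
        "tokens": len(arrivals),
    }


def fake_stream(n_tokens: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    time.sleep(prefill_s)  # simulated prefill: nothing is visible yet
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)  # simulated decode step
        yield f"tok{i}"


if __name__ == "__main__":
    stats = measure_stream(fake_stream(n_tokens=5, prefill_s=0.05, per_token_s=0.01))
    print({key: round(value, 3) for key, value in stats.items()})
```

Because end-to-end latency is measured from the same start time, it equals TTFT plus the sum of the inter-token gaps, which is the decomposition described above.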

Streaming and perceived latency

Streaming changes perception more than total runtime. Streaming and non-streaming responses can take the same wall-clock time but feel different because streaming lets users see tokens earlier. Without streaming, users see a multi-second blank screen, then everything at once. With streaming, that's a short initial wait plus progressive delivery, which is why TTFT often dominates streaming UX discussions.

Where the industry sets the bar

Once you frame latency in user-visible phases, benchmark targets become easier to interpret. MLPerf Inference v6.0 (March 2026) is the latest round; it expanded coverage with GPT-OSS 120B and a new DeepSeek-R1 interactive scenario for low-latency reasoning workloads. The interactive latency thresholds first introduced in MLPerf v5.0 remain a useful reference point for what "real-time" means in LLM serving: 450ms TTFT at P99 and 40ms time per output token at P99 for Llama 2 70B Interactive.

The RAIL UX guidance is a useful reference for interactive workloads: under 100ms feels immediate, under 1 second feels like natural progression, and anything over 1 second can start losing user focus. For user-facing LLM apps, that often makes TTFT the make-or-break window.

Measuring API latency for LLM & RAG workloads

Once you can see those phases, the next step is measuring them properly. Averages won't surface the latency problems your users actually feel. For LLM and RAG workloads, you need percentile tracking, stage-level instrumentation, and observability tooling that captures what's happening between request and response.

Percentiles over averages

Track tail latency, because averages mask it. If your average TTFT is 400ms but your P99 is 4 seconds, 1 in 100 requests waits 4 seconds or longer. At 10,000 requests per hour, that's at least 100 slow requests every hour: not a statistical rarity, but a continuous problem.

Track P50, P95, and P99 for each metric. Also watch your monitoring setup: histogram bucket configuration can materially affect percentile accuracy, which can distort your read on tail latency.
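To make the average-versus-tail point concrete, here is a small sketch using the nearest-rank percentile definition. The sample values are made up to mirror the 400ms average and 4-second P99 example above; `percentile` is a hypothetical helper, and production systems typically compute percentiles from histograms rather than raw samples.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Illustrative TTFT samples in milliseconds: 96 fast requests, 4 slow ones.
ttft_ms = [250] * 96 + [4000] * 4

print("mean:", statistics.mean(ttft_ms))  # 400
print("P50:", percentile(ttft_ms, 50))    # 250
print("P95:", percentile(ttft_ms, 95))    # 250
print("P99:", percentile(ttft_ms, 99))    # 4000
```

The mean and even P95 look healthy here, while P99 exposes the 4-second requests.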

Break the RAG pipeline into stages

End-to-end latency alone won't tell you where the problem lives. You need per-stage instrumentation. Pipeline-level evaluations of text-query RAG systems show LLM generation accounting for a large share of latency. Similarity search is another stage worth isolating, since retrieval latency varies with vector embedding dimensionality, index configuration, and implementation details.
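One way to get per-stage numbers is a small timing context manager wrapped around each step. This is a standard-library sketch under assumptions: `StageTimer` is a hypothetical helper, the stage names are illustrative, and the `time.sleep` calls stand in for real embedding, search, and generation calls. A tracing library would usually replace this in production.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the time even if the stage raises.
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self):
        """Name of the stage with the largest accumulated time."""
        return max(self.durations, key=self.durations.get)


if __name__ == "__main__":
    timer = StageTimer()
    with timer.stage("embed_query"):
        time.sleep(0.005)  # placeholder for the embedding call
    with timer.stage("vector_search"):
        time.sleep(0.010)  # placeholder for similarity search
    with timer.stage("llm_generate"):
        time.sleep(0.050)  # placeholder for model generation
    print(timer.durations, "slowest:", timer.slowest())
```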

Observability tooling

Observability ties stage-level instrumentation together. Traces and metrics are often treated as the core signals here: traces capture the sequence of events per request along with prompt details, and metrics cover request volume, duration, cost, and token counts.

Use that detail carefully, though. Capturing prompt content, vector embeddings, and detailed tool inputs can add latency because it increases data volume. Use sampling where feasible.
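A simple way to sample is to decide per request, deterministically, whether to capture the heavy payloads. The sketch below hashes a request ID into a bucket so the same request always gets the same decision across services. The function name and the idea of keying on a request ID are assumptions for illustration, not part of any specific observability tool.

```python
import hashlib


def should_capture_payload(request_id: str, sample_rate: float) -> bool:
    """Return True for roughly `sample_rate` of request IDs, consistently per ID."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


# Always record cheap metrics; attach prompts and embeddings only when sampled.
sampled = sum(should_capture_payload(f"req-{i}", 0.10) for i in range(10_000))
print(f"captured full payloads for {sampled} of 10000 requests")
```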

Common causes of high API latency in LLM apps

With instrumentation in place, the question shifts to what's actually slowing things down. High latency in LLM apps rarely comes from one source. It stacks across inference, retrieval, caching gaps, and orchestration. Here are the contributors that show up most often:

  • Autoregressive generation: LLM inference splits into a prefill phase (processes the full input) and a decode phase (generates tokens one at a time). That two-phase structure is why TTFT and token generation speed behave differently.
  • Cold starts and model loading: Idle serverless or auto-scaling endpoints make the first request wait on compute provisioning, container startup, and model weights loading into GPU memory. That startup path can dominate request time.
  • Multi-stage RAG pipeline overhead: Every network round trip to vector indexes, keyword indexes, and relational databases adds to total latency. The more stages your app fans out across, the more those delays accumulate.
  • Missing caching layers: Without caching, each LLM call recomputes attention over the complete prompt, even when large portions repeat. Agentic workflows compound this with repeated calls and duplicate document retrieval.
  • Database query and serialization overhead: LLM apps query databases for user context, conversation history, and metadata. An indexed-read spike can turn a 10ms query into 120ms, and JSON serialization at every API boundary adds more delay than many teams expect.
  • Orchestration framework overhead: Abstraction layers and sequential LLM calls multiply base inference latency. A simpler workflow with fewer LLM calls usually moves the needle.

These contributors rarely show up alone. Once you can see which ones dominate your pipeline, the optimization paths in the next section start to map directly to specific bottlenecks.

Practical ways to reduce API latency in LLM apps

Once those bottlenecks are clear, the most useful optimizations tend to fall into a few patterns: caching, retrieval, prompt size, and model efficiency.

Semantic caching

Semantic caching skips LLM calls when a new query means the same thing as a previous one. Instead of matching exact strings, it compares vector embeddings against previously answered queries and returns the cached response when similarity exceeds a threshold.

Threshold tuning matters because loose thresholds produce false positives. At a 0.7 similarity threshold with the all-MiniLM-L6-v2 embedding model, false positives reached 19.3% before quality controls were added. Semantic caching is high-leverage for repeated-intent workloads, but the threshold and embedding model are decisions worth validating against your own queries.
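The lookup logic can be sketched in a few lines. Everything here is illustrative: `toy_embed` is a bag-of-words stand-in for a real embedding model, the class is a hypothetical in-memory cache with a linear scan (a real deployment would use a vector index), and the thresholds are chosen only to show how a loose setting produces a false positive.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: word counts as a sparse vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold):
        self._embed = embed
        self._threshold = threshold
        self._entries = []  # (vector, cached response) pairs

    def store(self, query, response):
        self._entries.append((self._embed(query), response))

    def lookup(self, query):
        """Return the best cached response if it clears the threshold, else None."""
        vector = self._embed(query)
        best_score, best_response = -1.0, None
        for cached_vector, response in self._entries:
            score = cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self._threshold else None


strict = SemanticCache(toy_embed, threshold=0.9)
loose = SemanticCache(toy_embed, threshold=0.7)
for cache in (strict, loose):
    cache.store("how do i reset my password", "Use the reset link.")

print(strict.lookup("how do i reset my password please"))  # Use the reset link.
print(strict.lookup("how do i reset my router"))           # None
print(loose.lookup("how do i reset my router"))            # Use the reset link. (false positive)
```

On a miss the caller would fall through to the LLM and store the new response. The last line shows the failure mode described above: a different intent that shares most of its wording clears the loose threshold.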

Prompt caching (key-value cache reuse)

Prompt caching reuses key-value tensors computed during the attention pass for static prompt prefixes, so the model skips re-processing the same prefix on every request. It's distinct from semantic caching: this is an inference-layer optimization, not a response cache. Prompt structure matters because only static prefixes get reused. Keeping system prompts and shared context at the front of your prompt is what makes this work.
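The effect of prompt ordering can be seen by comparing how much of the prompt two requests share from the start. This sketch counts shared characters as a rough proxy; real prompt caches operate on tokens and provider-specific rules, and the prompt text and function names are made up for illustration.

```python
import os

SYSTEM_PROMPT = "You are a support assistant. Answer using only the provided context."
SHARED_CONTEXT = "Policy: refunds are available within 30 days of purchase."


def build_static_first(user_query):
    """Static content leads, so every request shares a long reusable prefix."""
    return f"{SYSTEM_PROMPT}\n{SHARED_CONTEXT}\nUser: {user_query}"


def build_dynamic_first(user_query):
    """Per-request content leads, so the shared prefix ends almost immediately."""
    return f"User: {user_query}\n{SYSTEM_PROMPT}\n{SHARED_CONTEXT}"


def shared_prefix_chars(a, b):
    # os.path.commonprefix compares strings character by character.
    return len(os.path.commonprefix([a, b]))


q1, q2 = "Where is my order?", "Can I get a refund?"
print(shared_prefix_chars(build_static_first(q1), build_static_first(q2)))
print(shared_prefix_chars(build_dynamic_first(q1), build_dynamic_first(q2)))  # 6
```

With the static content first, the shared prefix covers the whole system prompt and shared context. With the user query first, only the literal `User: ` label is shared.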

Efficient vector indexing

Vector index choice and implementation drive retrieval latency. Hierarchical Navigable Small World (HNSW) is a graph-based approximate nearest neighbor algorithm that typically scales sub-linearly with dataset size, which is why it's widely used for production vector search. Vector benchmarks have found large latency differences across systems running the same HNSW configuration, so the algorithm choice alone doesn't determine production performance.

Model quantization

Quantization shrinks the memory footprint of model weights and can speed up inference by using lower-precision arithmetic. 8-bit integer (INT8) quantization can lower computational cost by roughly 40% versus 32-bit floating-point (FP32), with 4-bit integer (INT4) reaching about 65% in the same evaluation. Accuracy can degrade depending on the model and task, so workload-specific evaluation and a staged rollout are worth the time.
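As a rough illustration of the idea, the sketch below applies symmetric per-tensor INT8 quantization to a handful of weights using only the standard library. It shows the storage reduction and the rounding error, not the compute savings quoted above; the function names and sample weights are hypothetical, and real frameworks use more elaborate schemes (per-channel scales, calibration).

```python
from array import array


def quantize_int8(weights):
    """Map floats to int8 with a single scale so the largest magnitude lands on 127."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from int8 codes."""
    return [code * scale for code in quantized]


weights = [0.5, -1.27, 0.01, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

print(list(codes))  # [50, -127, 1, 100]
print("bytes per weight:", array("f", weights).itemsize, "->", codes.itemsize)
print("max abs error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

The rounding error per weight is bounded by half the scale, which is why accuracy impact depends on the weight distribution and needs workload-specific evaluation.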

Pre-compute vector embeddings & prune context

Index-time vector embeddings move document embedding work out of the request path, so only the user's query gets embedded at inference time. Pruning helps on the other side: every token in your prompt increases TTFT, so stripping HTML and boilerplate from retrieved chunks and setting hard token limits on retrieved context can keep latency under control.
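The pruning step might look like the following sketch. It strips tags and script/style content with the standard-library HTML parser, then enforces a hard budget across chunks. Whitespace-separated words are used as a stand-in for tokens, which is an assumption: a real pipeline would count with the model's tokenizer. The function names are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    _SKIP = {"script", "style"}
    _BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1
        elif tag in self._BLOCK:
            self.parts.append(" ")  # keep words in adjacent blocks apart

    def handle_endtag(self, tag):
        if tag in self._SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self._BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(chunk):
    """Return visible text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(chunk)
    parser.close()
    return " ".join("".join(parser.parts).split())


def prune_context(chunks, max_tokens):
    """Keep chunks in order until the word budget is spent, truncating the last one."""
    kept, used = [], 0
    for chunk in chunks:
        room = max_tokens - used
        if room <= 0:
            break
        words = strip_html(chunk).split()[:room]
        if words:
            kept.append(" ".join(words))
            used += len(words)
    return kept


chunks = [
    "<div><p>Refunds take <b>5 days</b>.</p><script>track()</script></div>",
    "<p>Orders ship within 2 business days from our warehouse.</p>",
]
print(prune_context(chunks, max_tokens=8))
# ['Refunds take 5 days.', 'Orders ship within 2']
```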

Where Redis helps: turning API latency improvements into app speed

Optimizations only land if your data layer can keep up. Redis supports sub-millisecond latency for many core operations and runs vector search, semantic caching, and operational data structures alongside each other. For latency-sensitive RAG pipelines, that cuts inter-system network hops and reduces the failure modes you have to monitor.

Semantic caching with Redis LangCache

Redis LangCache is a fully managed semantic caching service that handles storage, similarity checks, and LLM fallbacks through a REST API. In Redis-reported benchmarks, LangCache delivered 15x faster responses on cache hits and 73% lower costs under the benchmark's test conditions, without code changes.

For teams that want more control, the RedisVL Python client offers a self-managed SemanticCache with configurable distance thresholds and time-to-live expiration. It integrates with LangChain, LlamaIndex, and LiteLLM.

Vector search without a separate database

Redis runs vector search inside the same instance that holds your operational data. The Redis Query Engine supports FLAT, HNSW, and SVS-VAMANA indexes stored alongside hash or JSON documents. In a Redis-reported benchmark on billion-vector datasets, Redis reached 90% precision at about 200ms median latency, measured with 50 concurrent queries, top-100 neighbors, and round-trip time included. Hybrid search combines dense vector retrieval with sparse keyword retrieval in a single query, with metadata filtering and re-ranking available.

One platform, fewer hops

Many teams manage three systems separately: a vector database, a cache, and an operational store. Putting them behind Redis lets vector storage, semantic caching, session memory, and rate-limiting counters share one instance. For teams balancing latency and infrastructure spend, Redis Flex tiers data across RAM and SSD, with up to 80% lower memory costs in Redis benchmarks.

Faster LLM apps need a faster data layer

Latency in LLM apps is a stack of solvable problems, not a single mystery. TTFT, inter-token pacing, retrieval overhead, missing caches, and serialization costs each have known optimization paths, from semantic caching and prompt structuring to efficient vector indexing and infrastructure consolidation. The work is identifying which contributors dominate your pipeline, then applying the matching technique.

Redis fits into that stack as a memory-first platform for vector search, semantic caching, and operational data. Consolidating those layers means fewer network hops and a simpler architecture to keep responsive under load.

Try Redis free to test semantic caching and vector search with your own workloads, or talk to our team about optimizing your LLM infrastructure.