惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

Microsoft Security Blog
Microsoft Security Blog
Google DeepMind News
Google DeepMind News
P
Privacy International News Feed
www.infosecurity-magazine.com
www.infosecurity-magazine.com
T
Threatpost
GbyAI
GbyAI
V
Visual Studio Blog
H
Help Net Security
Vercel News
Vercel News
P
Palo Alto Networks Blog
Project Zero
Project Zero
AWS News Blog
AWS News Blog
Latest news
Latest news
Cyberwarzone
Cyberwarzone
C
Cybersecurity and Infrastructure Security Agency CISA
The Register - Security
The Register - Security
博客园_首页
WordPress大学
WordPress大学
G
GRAHAM CLULEY
T
Tor Project blog
有赞技术团队
有赞技术团队
Know Your Adversary
Know Your Adversary
AI
AI
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
O
OpenAI News
博客园 - 聂微东
月光博客
月光博客
S
Security Affairs
Webroot Blog
Webroot Blog
L
LangChain Blog
Apple Machine Learning Research
Apple Machine Learning Research
NISL@THU
NISL@THU
N
News and Events Feed by Topic
Blog — PlanetScale
Blog — PlanetScale
S
Securelist
V
Vulnerabilities – Threatpost
aimingoo的专栏
aimingoo的专栏
阮一峰的网络日志
阮一峰的网络日志
Stack Overflow Blog
Stack Overflow Blog
Application and Cybersecurity Blog
Application and Cybersecurity Blog
D
DataBreaches.Net
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
Y
Y Combinator Blog
Cisco Talos Blog
Cisco Talos Blog
The Cloudflare Blog
IT之家
IT之家
博客园 - 三生石上(FineUI控件)
雷峰网
雷峰网
L
Lohrmann on Cybersecurity
T
The Blog of Author Tim Ferriss

Redis

Real-Time Fraud Detection: Latency, Features & Scale Connecting to Redis Cloud with AWS PrivateLink vs. VPC peering | Redis Redis Data Integration in Redis Cloud is now GA in AWS | Redis Why AI Misses Business Context & How Teams Fix It AI Reasoning Explained: Why Context Matters Semantic Layer vs Context Layer: Key Differences Redis array data type: How it works and when to use it Context graphs: when nearest-neighbor search isn't enough What’s new in two – May 2026 edition Redis 8.8 performance improvements: Faster string, hash, streams, SCAN & more Redis 8.8: New array data structure & open source features How Conflict-free Replicated Data Types power active-active database replication Context Orchestration: What It Is & How It Works Context Compaction for AI Agents: A Complete Guide Prompt Bloat: Causes, Costs & Fixes for LLM Apps Agentic Retrieval Techniques: A Complete Guide Single-shot reliable consumers with XREADGROUP CLAIM in Redis 8.4 | Redis Long-Horizon AI Agents: Memory & State Infrastructure What is a context engine? What Is a Context Layer? AI Agent Infrastructure Context Retrieval for AI Agents: What It Is & Why It Matters Context Poisoning: How Bad Data Breaks Agent Reasoning Context is all you need: Introducing Redis Iris | Redis Context Engineering for AI: What It Is & How to Build It Dynamic endpoints: Migrate databases without changing your endpoint | Redis AI Shopping Assistants: How They Work & What to Build Endless Aisle Retail: Infrastructure & Real-Time Data LLM Speed Benchmarks: Metrics & Infrastructure Guide Context Pruning: Cut LLM Tokens Without Losing Quality What’s new in two – April 2026 edition Agentic AI Architecture: 5 Patterns Explained AI Agent vs Chatbot: Key Differences Explained Advantages of Building a Vector Search Solution API Latency in LLM Apps: Causes & How to Fix It Security advisory: [CVE‑2026‑23479] [CVE‑2026‑25243] [CVE-2026-25588] [CVE‑2026‑25589] [CVE-2026-23631] | Redis Edge Computing Latency: Causes & How to Reduce It AI Agents vs Workflows: When to Use Each Streaming LLM Responses: Make Your AI App Feel Fast Active-Active vs Active-Passive Database Architecture Prefill vs Decode: LLM Inference Phases Explained Long-Term Memory Architectures for AI Agents Time to First Byte Test: Tools, Causes & Fixes Speculative decoding: how it works & when to use it P95 Latency: What It Is & Why It Matters Why Multi-Agent LLM Systems Fail & How to Fix Them AI Human in the Loop: Production Oversight Patterns Native OpenTelemetry metrics for Redis client libraries | Redis Client-side geographic failover for Redis Active-Active | Redis Use Redis with SQL | Redis Introducing Redis Feature Form Build Google ADK Agents with persistent, real-time memory on Redis | Redis Startup Spotlight: Neuron Systems API Throttling: Algorithms, Patterns & Mistakes Agentic AI Examples Across 6 Industries Best Chunking Strategies for RAG Pipelines Agentic AI Guardrails: Controls That Work Redis joins AWS at GDC to support the next generation of gaming | Redis Designing a semantic routing system: From static rules to dynamic intelligence with Redis and Java | Redis Real-Time Dispatch System: A Complete Guide P99 Latency: What It Means & How to Fix It Tokenization in LLMs: What AI App Devs Need to Know TTFT Meaning: What is Time to First Token? Atomic slot migration with Redis 8.4 Hybrid search benefits: Why your RAG system needs both keyword & vector search What’s new in two: March 2026 edition Vector embedding generators: How they work & how to use them Throughput-optimizing Redis for L2 KV Cache Reuse What is a data pipeline? Building AI agent pipelines that don't forget, fail, or fall apart Redis achieves Google Cloud Ready, Distributed Cloud status ahead of Google Cloud Next ‘26 | Redis Real-time network monitoring: what your data platform needs to keep up AI agent API: How agents connect to the real world What is multicloud infrastructure? A guide for 2026 What is a transaction monitoring system & how does it work? Why your AI agent fails in production & how tracing helps AI agent benchmarks: Where they fall short & why your infrastructure matters What is a JSON database (and when should you use one)? Introducing the Redis Partner Network: A new foundation for real-time innovation How real-time customer segmentation works in retail Payment orchestration & vault architecture in retail Agentic systems vs. GenAI: when generation isn't enough What is fuzzy matching? Semantic caching & routing: two powerful patterns for vector classification Redis alternatives: Why there are no exact substitutes Connect to Azure Managed Redis with Redis Insight 3.2.0 How to tame the thundering herd problem Redis to Manage Storage Replication | Redis How hierarchical navigable small world (HNSW) algorithms can improve search | Redis How leading financial institutions use Redis to drive growth | Redis What’s new in two: May 2025 | Redis Introducing Model Context Protocol (MCP) for Redis | Redis Redis vs. Elasticsearch: What’s faster for GenAI & vector search? | Redis Build fast, production-worthy AI apps with Spring AI and Redis | Redis Azure Managed Redis is GA today | Redis Redis then & now: Adapting with developers through every era | Redis Supercharge Your AI with OpenShift AI and Redis: Unleash speed and scalability | Redis What’s new in two: April 2025 | Redis Redis 8 is now GA, loaded with new features and more than 30 performance improvements | Redis What is a data strategy? 6 key components explained Data replication explained: types, examples & use cases
Context window in AI: why every token is a budget decision
Redis · 2026-06-10 · via Redis

Some of today's most capable LLMs now support very large context windows. That doesn't mean you should fill them. Context windows have grown fast, but the underlying cost and quality tradeoffs haven't gone away. They've just gotten easier to ignore.

Every token you put into a context window can add cost, and longer contexts can also hurt reasoning quality. Treating the window as a hard limit to fill misses the point. It's a budget, and what you leave out matters as much as what you put in.

This guide covers what context windows actually are, why filling them can degrade both cost and model performance, and how to keep your context lean without losing the information that matters.

What is an AI context window?

A context window is the total number of tokens an LLM can process in a single inference pass. It covers both what you send and what the model generates back. On the input side, that means the system prompt, conversation history, any retrieved documents, and your tool definitions and their outputs; on the output side, the model's own response. Every window has a fixed size limit, and anything that doesn't fit is invisible to the model. It can't reach back for a document you left out, so what you omit shapes the answer as much as what you include.

One common misconception is that the context window and the max output limit are the same thing. They aren't. The total window covers both input and output, while output limits can still cap how much the model returns.

Here's a rough sense of scale for English text:

  • 1,000 tokens: about 750 words, or 3 pages
  • 128,000 tokens: roughly 96,000 words, or 384 pages
  • 1,000,000 tokens: around 750,000 words, or roughly 3,000 pages

Numbers like 384 pages or 3,000 pages sound like more room than any single prompt could ever need, which is why large windows feel so spacious in practice. But that intuition hides the tradeoff: every extra page still competes for cost and attention inside the same finite budget, and models don't always use the extra space as well as the headline number suggests.

That size limit exists because of self-attention: in a standard transformer, every token has to relate to every other token in the window, so the work grows fast as the window fills. A bigger window costs you twice. It costs more to run, and it can cost you reasoning quality.

Redis Iris

Redis Iris serves agent context in milliseconds

Redis Iris connects memory, live data, and retrieval in one place.

The two costs of every token: dollars & degraded reasoning

Start with the part you can see on your bill. You pay by the token, so a longer prompt costs more, every time you send it. The exact rates differ between providers, but the direction never changes: more tokens in, more money out.

Bigger context windows also don't guarantee better reasoning. Several long-context studies report accuracy slipping as context length grows, though the size of the drop depends on the model, the task, the retrieval setup, and where the relevant information sits in the prompt.

Three patterns show up across that research. The first is raw volume. In one study, Llama 3's HumanEval coding accuracy dropped by about half at 30,000 tokens compared to its baseline, with similar declines on GSM8K math reasoning and variable summation. The telling detail: it wasn't where the relevant content sat in the prompt that hurt performance, it was the sheer amount of input.

The second is position. Many transformer models studied so far show a U-shaped attention pattern, attending more to content at the beginning and end of the context while underweighting the middle, the "lost in the middle" problem. If your retrieval-augmented generation (RAG) pipeline drops its most relevant chunks into the middle of a long context block, the model may be less likely to use them.

The third is diminishing returns. A study of 13 long-context models found that, in that benchmark, most peaked around 20,000 tokens for in-context learning and got no better past that point. The advertised limit and the useful limit aren't the same number.

The takeaway is straightforward: longer contexts can mean higher token spend while also risking worse results, and both costs compound with unnecessary tokens.

Where the budget goes: system prompt, history, retrieved data & tool output

If longer context costs more and reasons worse, it helps to know where that budget goes. The context window is a rival resource: every token you give to one component displaces a token from another. The budget usually includes at least six, though some providers add hidden system or routing tokens on top.

  • System prompt: The standing instructions that define how the model behaves. It's part of the input, billed per token and often resent on every API call.
  • Tool schemas: The definitions that tell the model which tools it can call and how. They scale with the number of tools you expose.
  • Conversation history: The running transcript of the exchange so far. It grows every turn unless you trim or summarize it, a common cause of context overflow in long sessions.
  • RAG retrieved chunks: The passages your retrieval step pulls in to ground the answer. Chunk size is a key lever, and sloppy formatting can eat a large share of tokens.
  • Tool call outputs: Whatever a tool returns when the model calls it. These range from a few tokens to very large, and big responses can crowd out the rest of the prompt.
  • Output buffer: The space reserved for the model's own response. Set it aside explicitly rather than treating it as whatever's left over.

Those are where most context pressure comes from. Seeing the budget line by line makes it much easier to decide what belongs in the window and what doesn't.

Spend less by keeping context out of the window until you need it

With the budget mapped, the goal shifts to keeping as much of that material out of the window as possible until it's actually needed. Context engineering treats the context window as something you actively curate rather than passively fill, holding information outside the window until the moment it's relevant.

Several strategies are well-documented, but the common idea is simple: pull in only what the model needs for the current step.

Sliding window

Keep only the most recent conversation turns and discard everything older. It's the simplest approach and a good starting point. The tradeoff is permanent information loss, but for short-task agents and customer service bots where recent context matters most, that's often acceptable.

Lazy context loading

Load tool definitions and reference material only when a specific reasoning step requires them. Dynamic tool gating reduces tool overhead by loading only the tools relevant to the current step, rather than listing every available tool on every call.

Retrieval-on-demand

Keep your knowledge in an external store and retrieve only the top semantically relevant chunks at query time. The context window never sees the full corpus. Passing only related documents cuts the amount of irrelevant content in the prompt.

External memory stores

For agents that need continuity across sessions, move long-term memory entirely outside the context window into a persistent store. Once a conversation ends, the context is gone. External memory systems retrieve only the relevant slice for each turn, preserving continuity across conversations without carrying the full history in context.

Most of these strategies need one thing: storage fast enough to fetch the right context mid-request without stalling the model. Redis Iris is designed for exactly that. Iris is a real-time context engine, and its parts line up with the strategies above. Context Retriever (public preview) pulls the operational data an agent needs at query time. Agent Memory (public preview) keeps long-term recall outside the window and returns only the slice that matters each turn. LangCache catches repeated questions so they don't hit the model twice. All of it runs with sub-millisecond response times, fast enough that retrieval never becomes the bottleneck.

Redis AI Agent Memory

Build agents that remember, not agents that guess

Redis Iris gives every agent fresh context and long-term memory.

The common thread across these strategies is simple: context management is about selection, not stuffing everything into the prompt. You decide what earns a spot in the window for each call, and Iris keeps everything else close by.

How semantic caching cuts repeat spend on the same intent

Curating context reduces input spend, but it doesn't address the other half of the bill: repeated calls for the same intent. Even with a tightly managed context strategy, your app can still make many LLM calls that are semantically identical to previous ones. Semantic caching catches those duplicates and returns the stored response instead of making another model call.

How semantic caching differs from exact-match caching

Semantic caching stores LLM responses indexed by vector embeddings of the input prompt and returns a cached response when a new prompt clears a configured similarity threshold. Unlike exact-match caching, which only catches identical strings, semantic caching works at the intent level. "What's the weather today?" and "Tell me today's temperature" can hit the same cache entry.

Tuning the similarity threshold

The similarity threshold is the dial that decides what counts as "the same question." Set it too loose and unrelated prompts collide, so the cache hands back a wrong answer. Set it too tight and real matches slip through, so you pay for calls you could have cached. Most teams tune it against their own traffic, watching for false hits on one side and missed matches on the other, and settle on the point that catches the most repeats without serving bad answers.

Where Redis LangCache fits

Redis LangCache is the semantic caching service in Iris, built on Redis' vector search to store and retrieve LLM responses at the speed Redis is known for. In Redis-reported results for high-repetition workloads, LangCache showed 73% lower inference costs and, in separate Redis benchmarks, up to 15x faster responses for cache hits.

A caveat for multi-turn conversations

One caveat is worth keeping in mind: standard semantic caching works best for single-turn queries. Multi-turn conversations introduce more complexity because follow-up questions can be falsely matched to earlier, unrelated prompts. Production systems handling multi-turn interactions need to account for conversational context in their caching logic.

Built for speed

Fresh context, every call

Redis Iris keeps agent data current so answers stay accurate.

The cheapest token is the one you never send

Filling a giant context window just because you can is like maxing out a credit card because the limit is high. Reasoning quality can degrade well before a model's nominal context limit, so reliable AI systems treat context as a finite resource: select what goes in, keep everything else in fast external storage, and avoid paying twice for repeated intent.

That's the pattern Redis Iris is built for. Context Retriever, Agent Memory, and LangCache all run on one real-time context engine with sub-millisecond response times, so the storage layer never becomes the reason your app feels slow. Iris retrieves only the relevant context for each call, LangCache returns a stored answer on a cache hit instead of calling the model again, and long-term memory stays outside the window until you need it.

Try Redis free to test semantic caching and context retrieval against your own workloads, or talk to the team about building context infrastructure that scales with your AI apps.