Context Engineering for AI: What It Is & How to Build It

Redis

Real-Time Fraud Detection: Latency, Features & Scale Context window in AI: why every token is a budget decision Connecting to Redis Cloud with AWS PrivateLink vs. VPC peering | Redis Redis Data Integration in Redis Cloud is now GA in AWS | Redis Why AI Misses Business Context & How Teams Fix It AI Reasoning Explained: Why Context Matters Semantic Layer vs Context Layer: Key Differences Redis array data type: How it works and when to use it Context Graphs vs. Vector Search: When RAG Falls Short What’s new in two – May 2026 edition Redis 8.8 performance improvements: Faster string, hash, streams, SCAN & more Redis 8.8: New array data structure & open source features How Conflict-free Replicated Data Types power active-active database replication Context Orchestration: What It Is & How It Works Context Compaction for AI Agents: A Complete Guide Prompt Bloat: Causes, Costs & Fixes for LLM Apps Agentic Retrieval Techniques: A Complete Guide Single-shot reliable consumers with XREADGROUP CLAIM in Redis 8.4 | Redis Long-Horizon AI Agents: Memory & State Infrastructure What is a context engine? What Is a Context Layer? AI Agent Infrastructure Context Retrieval for AI Agents: What It Is & Why It Matters Context Poisoning: How Bad Data Breaks Agent Reasoning Context is all you need: Introducing Redis Iris | Redis Dynamic endpoints: Migrate databases without changing your endpoint | Redis AI Shopping Assistants: How They Work & What to Build Endless Aisle Retail: Infrastructure & Real-Time Data LLM Speed Benchmarks: Metrics & Infrastructure Guide Context Pruning: Cut LLM Tokens Without Losing Quality What’s new in two – April 2026 edition Agentic AI Architecture: 5 Patterns Explained AI Agent vs Chatbot: Key Differences Explained Advantages of Building a Vector Search Solution API Latency in LLM Apps: Causes & How to Fix It Security advisory: [CVE‑2026‑23479] [CVE‑2026‑25243] [CVE-2026-25588] [CVE‑2026‑25589] [CVE-2026-23631] | Redis Edge Computing Latency: Causes & How to Reduce It AI Agents vs Workflows: When to Use Each Streaming LLM Responses: Make Your AI App Feel Fast Active-Active vs Active-Passive Database Architecture Prefill vs Decode: LLM Inference Phases Explained Long-Term Memory Architectures for AI Agents Time to First Byte Test: Tools, Causes & Fixes Speculative decoding: how it works & when to use it P95 Latency: What It Is & Why It Matters Why Multi-Agent LLM Systems Fail & How to Fix Them AI Human in the Loop: Production Oversight Patterns Native OpenTelemetry metrics for Redis client libraries | Redis Client-side geographic failover for Redis Active-Active | Redis Use Redis with SQL | Redis Introducing Redis Feature Form Build Google ADK Agents with persistent, real-time memory on Redis | Redis Startup Spotlight: Neuron Systems API Throttling: Algorithms, Patterns & Mistakes Agentic AI Examples Across 6 Industries Best Chunking Strategies for RAG Pipelines Agentic AI Guardrails: Controls That Work Redis joins AWS at GDC to support the next generation of gaming | Redis Designing a semantic routing system: From static rules to dynamic intelligence with Redis and Java | Redis Real-Time Dispatch System: A Complete Guide P99 Latency: What It Means & How to Fix It Tokenization in LLMs: What AI App Devs Need to Know TTFT Meaning: What is Time to First Token? Atomic slot migration with Redis 8.4 Hybrid search benefits: Why your RAG system needs both keyword & vector search What’s new in two: March 2026 edition Vector embedding generators: How they work & how to use them Throughput-optimizing Redis for L2 KV Cache Reuse What is a data pipeline? Building AI agent pipelines that don't forget, fail, or fall apart Redis achieves Google Cloud Ready, Distributed Cloud status ahead of Google Cloud Next ‘26 | Redis Real-time network monitoring: what your data platform needs to keep up AI agent API: How agents connect to the real world What is multicloud infrastructure? A guide for 2026 What is a transaction monitoring system & how does it work? Why your AI agent fails in production & how tracing helps AI agent benchmarks: Where they fall short & why your infrastructure matters What is a JSON database (and when should you use one)? Introducing the Redis Partner Network: A new foundation for real-time innovation How real-time customer segmentation works in retail Payment orchestration & vault architecture in retail Agentic systems vs. GenAI: when generation isn't enough What is fuzzy matching? Semantic caching & routing: two powerful patterns for vector classification Redis alternatives: Why there are no exact substitutes Connect to Azure Managed Redis with Redis Insight 3.2.0 How to tame the thundering herd problem Redis to Manage Storage Replication | Redis How hierarchical navigable small world (HNSW) algorithms can improve search | Redis How leading financial institutions use Redis to drive growth | Redis What’s new in two: May 2025 | Redis Introducing Model Context Protocol (MCP) for Redis | Redis Redis vs. Elasticsearch: What’s faster for GenAI & vector search? | Redis Build fast, production-worthy AI apps with Spring AI and Redis | Redis Azure Managed Redis is GA today | Redis Redis then & now: Adapting with developers through every era | Redis Supercharge Your AI with OpenShift AI and Redis: Unleash speed and scalability | Redis What’s new in two: April 2025 | Redis Redis 8 is now GA, loaded with new features and more than 30 performance improvements | Redis What is a data strategy? 6 key components explained Data replication explained: types, examples & use cases

Redis · 2026-05-13 · via Redis

Your support agent confidently tells a customer they qualify for a refund under a 60-day return policy. Your actual policy is 30 days. The agent hallucinated the longer window, and the easy reaction is to blame the model. But the model never saw your return policy. The failure happened upstream, in what got loaded into the context window.

Recognizing this is driving a shift in how teams build AI apps, and the practice now has a name: context engineering. It's the discipline of designing and managing everything an LLM receives during inference, not just the prompt but the full set of tokens that land in the context window.

This guide covers what context engineering is, why it's become an important foundation for reliable AI agent systems, and what infrastructure you need to support it.

Context engineering means deciding what goes into the context window at each step of an agent's run. Those inputs include system instructions, conversation history, retrieved documents, tool definitions, tool call results, and working state. The guiding principle is simple: find the smallest set of high-signal tokens that maximizes the likelihood of your desired outcome.

Each of those inputs competes for space in a finite context window, and each affects the quality of the model's output. Where prompt engineering focuses on phrasing a single instruction, context engineering covers everything else that fills the window around it.

Dimension	Prompt engineering	Context engineering
Scope	Crafting the instruction text	Determining everything that fills the context window
Methodology	Often one-off, focused on phrasing	Systematic, repeatable architectural frameworks
Production fit	Effective for single-turn, stateless interactions	Required for multi-step agentic systems
Key question	"How do I phrase this instruction?"	"What information and environment does the model need to succeed?"

Why agents need context engineering to work

Agents need context engineering because they can't function reliably without it. A single-turn chatbot can get by on a well-phrased prompt, but agents run across multiple steps, call tools, accumulate state, and often resume work after delays. Every one of those actions changes what should be in the context window, and there's no prompt clever enough to manage that on its own.

In many agent patterns, tool outputs land directly in the model's context window. Over a multi-step task, that accumulated context can exceed the window's capacity, increase costs and latency, or degrade the agent's reasoning quality. Without a deliberate system for deciding what to keep, retrieve, compress, or discard at each step, agents drift, lose track of earlier decisions, and hallucinate. Most of those failures trace back to context.

These failures show up in a few recurring patterns:

Tool call accumulation: Function outputs flood the window. A single tool call can return thousands of tokens of JSON, and across a multi-step run those outputs pile up until they crowd out instructions, conversation, and retrieved context. The result is context window overflow, where the model is reasoning from raw tool output instead of the task at hand.
Context degradation over long tasks: More tokens don't mean better reasoning. As accumulated text fills a fixed window, recent content crowds out earlier information. Long context works well for retrieval and summarization but can distract during multi-step work.
State persistence gaps: Stateless infrastructure can't hold an agent's working state. Traditional request-response architectures lack structured ways to store, resume, or edit the compound state that an agent accumulates mid-run, which breaks down when human-in-the-loop review adds long delays.
Multi-agent context leakage: Handoffs lose information at the boundary. When one agent escalates a case to another (or to a human) without transferring conversation history, the customer ends up repeating themselves. That's a handoff boundary failure.

Each of these failures sits upstream of the model, in how context is assembled and managed across the run. Context engineering exists to address them at that layer.

Redis Iris

Build fast, accurate AI apps that scale

Get started with Redis for real-time AI context and retrieval.

The four operations of context engineering

Those failure modes lead to a practical question: what do you actually do about them? Most context-engineering approaches map to four operations.

Write: store context externally for later retrieval

Save context to external storage instead of letting it pile up in the window. Scratchpads hold intermediate reasoning, tool outputs go to persistent stores, and long-term memory lives outside the model entirely. One common pattern keeps large responses outside the active context window and replaces them with a lightweight reference plus a short preview, so the agent can pull the full payload back only if it needs to.

Select: retrieve only what's relevant at each step

Pull in just the context the current step needs, not everything you've ever stored. This is where retrieval-augmented generation (RAG) fits in: your app encodes the query into a vector, queries a vector database for the most similar chunks, and passes those results to the model as context. More capable agents call the vector store as a tool across multiple turns, deciding when and what to retrieve as the task unfolds, rather than running a single lookup before the LLM call.

Selection also applies to tools and memory. Agents get overloaded when too many tools are exposed at once, especially when descriptions overlap and the model has to guess which one to use. Applying RAG to the tool descriptions themselves narrows the list to the most relevant options for each task. For memory, selection means retrieving by vector similarity over stored interactions rather than injecting everything wholesale, often combined with recent-context retrieval for short-term relevance and summarization to keep storage bounded.

Compress: reduce token count while preserving signal

Shrink what you can't exclude. Common strategies are summarizing accumulated context at periodic checkpoints with an LLM, trimming the oldest messages as you approach the window limit, and replacing large tool outputs with pointers to persisted files. The trade-off worth knowing up front: even small hallucinations in a summary can contaminate every step that follows, so compression needs to be applied carefully.

Isolate: prevent context pollution across tasks

Put boundaries around context so unrelated work doesn't bleed together. Complex tasks can be broken into focused steps, each with its own optimized context window. Multi-agent architectures get this for free. Each sub-agent works inside its own context boundary and returns only the result to the parent, so intermediate reasoning never crowds out the rest of the system.

Infrastructure requirements for context assembly

Write, select, compress, and isolate all depend on the systems running underneath them. Context engineering puts retrieval directly in the inference path, which means the storage and query layer determines whether your context pipeline runs fast enough to be useful in production.

Multiple query modalities, one latency budget

An agent's context has multiple layers: working state, long-term memory, and structured metadata. Each one needs different storage and retrieval semantics. Short-term working memory benefits from low-latency key-value access by session or thread ID. Long-term semantic memory needs vector-indexed retrieval over vector embeddings. Metadata filtering needs inverted indexes for exact-match and range queries. Teams often run multiple storage and retrieval primitives in parallel rather than forcing all of this through a single general-purpose database.

Hybrid retrieval makes this even more demanding. Combining BM25 keyword ranking with semantic embedding search means the platform has to support multiple query modalities and merge or rerank the results within a shared retrieval latency budget.

Vector retrieval latency

Vector search is the most expensive piece of the pipeline and often the slowest. It's computationally heavier than traditional database lookups, and agentic systems usually need non-blocking I/O so a slow query doesn't stall the rest of the pipeline.

Memory

Give your AI apps real-time context

Run them on Redis for AI, built for fast retrieval and low-latency responses.

Real-time data freshness

Fast retrieval doesn't help if the data behind it is stale. Context loaded through hourly or nightly batch refreshes is already out of date the moment it's consumed, and for agents working with live events or operational data that means reasoning about a world that has already moved on. Streaming ingestion fits better for apps where freshness directly affects whether the model's output is correct.

Semantic caching infrastructure

Semantic caching patterns cut repeated work by recognizing when a new query means the same thing as one you've already answered. The system embeds the incoming query, compares it to cached entries using a similarity metric, and returns the cached result if the match is close enough, skipping retrieval and another LLM call.

Building this typically requires query-time vector embeddings, an approximate nearest neighbor index over cached query embeddings, a configurable similarity threshold, and cache invalidation logic that accounts for model updates and vector drift. That last piece is the tricky one: cache failures are often silent, so your API can return a 200 OK while costs and quality quietly suffer behind the scenes.

Where Redis fits in the context engineering stack

Redis goes far beyond caching. It acts as a real-time context engine that gathers, syncs, and serves the data your AI apps need to respond accurately and at speed. A production context architecture typically means stitching together a vector database, a cache, a messaging layer, and a task queue. Redis combines those primitives in one in-memory platform, so a single system covers the storage, retrieval, and messaging paths a context pipeline depends on.

Agent memory: short-term & long-term in one system

Redis serves agent memory through a dual-tier architecture. Short-term memory uses in-memory data structures for sub-millisecond access to immediate conversational context: agent state, chat history, and running summaries. Long-term memory holds durable facts and user preferences extracted from past sessions, retrieved by conceptual similarity rather than exact keywords.

Vector search for retrieval

Vector retrieval runs directly in the inference path. The Redis Query Engine supports exact vector search with FLAT indexing and approximate search with Hierarchical Navigable Small World (HNSW), alongside full-text and numeric search. Vectors and their associated metadata can be stored inside hashes or JSON documents, so a single query can filter on structured fields and run similarity search at the same time.

Semantic caching with Redis LangCache

Paraphrased queries don't need a fresh LLM call. Redis LangCache provides this as a managed semantic cache, delivered via REST API. Apps get cache hits without building the embedding and similarity logic themselves. In benchmarks, Redis LangCache reported up to 15x faster responses for cache hits and up to 73% lower costs.

Redis Productivity

You've made it this far

Now see how this actually runs in Redis. Power AI apps with real-time context, retrieval, and semantic caching.

Build your context engine on Redis

If prompt engineering is about phrasing, context engineering is about system design. Model quality matters, but what you feed the model matters just as much. Assembling that input reliably is an infrastructure problem spanning retrieval, memory, caching, real-time ingestion, and multi-agent coordination.

Teams that treat context as an engineering surface, with explicit ownership and purpose-built infrastructure underneath, tend to ship more reliable agents than teams optimizing prompts alone. Consolidating those pieces in one system reduces the number of moving parts in the critical path between a user request and a model response.

Try it yourself with a free Redis account, or talk to the team about building your context engineering stack on Redis.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

Redis

Why agents need context engineering to work

Build fast, accurate AI apps that scale

The four operations of context engineering

Write: store context externally for later retrieval

Select: retrieve only what's relevant at each step

Compress: reduce token count while preserving signal

Isolate: prevent context pollution across tasks

Infrastructure requirements for context assembly

Multiple query modalities, one latency budget

Vector retrieval latency

Give your AI apps real-time context

Real-time data freshness

Semantic caching infrastructure

Where Redis fits in the context engineering stack

Agent memory: short-term & long-term in one system

Vector search for retrieval

Semantic caching with Redis LangCache

You've made it this far

Build your context engine on Redis