Long-Term Memory Architectures for AI Agents
Redis · 2026-04-29 · via Redis

Most AI agents start every session from scratch. Without persistent memory, they're stateless responders that reprocess context on every invocation and can't build continuity across interactions.

Long-term memory changes that. It gives agents external storage that retains information across sessions, interactions, and tasks, beyond what fits in a model's context window at any given moment. Instead of cramming everything into a single prompt, agents selectively retrieve what's relevant from a durable store.

This guide focuses on the architecture behind long-term memory: how it fits into a running agent, the pipeline from raw text to retrievable knowledge, and the tradeoffs you'll face between recall, latency, cost, and forgetting. If you're looking for a broader introduction to agent memory, start with our guide on AI agent memory.

Why agentic AI systems need long-term memory

The core problem is the context window. LLMs can only attend to a fixed number of tokens at once, and the cost of attention grows with sequence length, which makes long-range dependencies harder to track. Agents that try to memorize while reading can lose information as fixed-length memory gets overwritten and earlier evidence is compressed or discarded.

Bigger context windows help, but they don't remove the need for memory systems. As human-AI relationships develop over weeks or months, conversation history can exceed even extended windows, and full-context methods still have to reason through irrelevant information.

Without persistent memory, agents hit three concrete walls:

  • Personalization dies between sessions: A user tells your agent they prefer Python and deploy to Railway. Next session, the agent has no idea.
  • Long-horizon tasks break: Agents handling multi-step workflows like research projects, debugging sessions, or multi-day code reviews need enough state to resume successfully.
  • Multi-system context evaporates: Enterprise agents pulling data from a CRM, a ticketing system, and an observability stack lose the thread when each call starts cold.

Those failures point to a simple design question: what should the agent remember, and in what form?

Memory types

Most production systems answer that question by borrowing a taxonomy from cognitive science and splitting long-term memory into three categories:

  • Semantic memory stores facts and concepts independent of time or context: user preferences, domain rules, distilled summaries.
  • Episodic memory records time-indexed experiences and events, like specific conversations or tool calls.
  • Procedural memory captures skills and routines for performing tasks, often encoded in prompts, policies, or agent code.

Most production systems end up using a mix of all three, with episodic memory often getting consolidated into semantic memory over time.
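
One way to picture the three categories is as a tag on every stored record. The sketch below is illustrative only: `MemoryType`, `MemoryRecord`, and `consolidate` are hypothetical names, not part of any Redis API, and the distilled fact is passed in by hand where a real system would likely use an LLM or a rule to extract it.

```python
from dataclasses import dataclass, field
from enum import Enum
import time


class MemoryType(Enum):
    SEMANTIC = "semantic"      # facts and preferences, not tied to a moment
    EPISODIC = "episodic"      # time-indexed events such as conversations or tool calls
    PROCEDURAL = "procedural"  # routines, often kept as prompts or policies


@dataclass
class MemoryRecord:
    memory_type: MemoryType
    text: str
    created_at: float = field(default_factory=time.time)
    metadata: dict = field(default_factory=dict)


def consolidate(episode: MemoryRecord, distilled_fact: str) -> MemoryRecord:
    """Promote an episodic record to a semantic one, keeping a pointer to its source."""
    if episode.memory_type is not MemoryType.EPISODIC:
        raise ValueError("only episodic records can be consolidated")
    return MemoryRecord(
        memory_type=MemoryType.SEMANTIC,
        text=distilled_fact,
        metadata={"source_created_at": episode.created_at},
    )
```

Keeping the type on the record lets a single store serve all three categories while still letting retrieval filter by type.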

How long-term memory fits inside agentic architectures

Once memory types are decided, the next question is where they live in a running agent. A common pattern is a read-before-reasoning, write-after-acting loop.

Frameworks often follow something like this:

  1. Receive input: Accept a request from a user, trigger, or upstream agent.
  2. Memory read: Load working memory, query the long-term store, and assemble the context window.
  3. Reason and plan: Make an LLM call with memory-injected context.
  4. Act: Make tool calls, API requests, or sub-agent delegations.
  5. Observe: Collect results and feedback.
  6. Memory write: Update working memory, extract facts to the long-term store, and optionally summarize old context.
  7. Loop or terminate: Repeat for the next input, or end the session.

That loop looks simple on paper, but retrieval quality and write discipline usually decide whether it works in production. The hardest part is context assembly: given everything that could go into the context window, what should actually go in?
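
The read and write steps of that loop can be sketched in a few lines. This is a minimal sketch under stated assumptions: `MemoryStore` is a hypothetical in-process list standing in for a real long-term store, relevance is naive word overlap, fact extraction is stubbed as saving the raw user input, and `llm` is any callable you supply. The act and observe steps are left out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MemoryStore:
    """Hypothetical long-term store: a plain list standing in for a real database."""
    facts: list[str] = field(default_factory=list)

    def query(self, text: str, limit: int = 3) -> list[str]:
        # Naive relevance: keep facts sharing at least one word with the input.
        words = set(text.lower().split())
        hits = [f for f in self.facts if words & set(f.lower().split())]
        return hits[:limit]

    def write(self, fact: str) -> None:
        if fact not in self.facts:
            self.facts.append(fact)


def run_turn(user_input: str, store: MemoryStore, working: list[str],
             llm: Callable[[str], str]) -> str:
    # Memory read: working memory plus relevant long-term facts.
    recalled = store.query(user_input)
    context = "\n".join(["Known facts:", *recalled, "Recent turns:", *working,
                         "User: " + user_input])
    # Reason and plan with the memory-injected context.
    reply = llm(context)
    # Memory write: update working memory and persist the turn.
    working.append("User: " + user_input)
    working.append("Agent: " + reply)
    store.write(user_input)
    return reply
```

Passing a fresh `working` list with the same `store` simulates a new session: short-term context is gone, but anything written to the store can still be recalled.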

For all of this to work in production, the memory layer underneath has to support four functions in one place: short-term memory, long-term memory, operational state, and coordination. Redis covers all four: short-term memory through in-memory data structures, long-term memory through vector search, operational state through hashes and JSON, and coordination through streams. Cache and state operations stay sub-millisecond, while vector search latency depends on workload and index configuration.

In multi-agent setups, memory gets more complex. Agents can use a shared memory model or keep local memory with explicit synchronization. The right pattern depends on how tightly your agents need to coordinate, and on how much each agent's reasoning depends on what the others already know.

The long-term memory pipeline: from raw text to useful knowledge

Whether memory lives in one agent or many, what arrives at the store is rarely retrieval-ready. Long-term memory works as a pipeline that turns raw interactions into something an agent can retrieve later. Most systems follow the same four stages: chunk the text, embed and index it, retrieve relevant pieces at query time, and consolidate what's worth keeping.

Ingestion & chunking

Chunking is where the pipeline starts. Raw inputs arrive as conversations, documents, or interaction logs, and chunking splits that source text into segments that each get their own vector embedding. That decision shapes retrieval quality more than most teams expect.

Small chunks can improve precision but may split coherent reasoning across boundaries. Large chunks preserve more context but can dilute the signal with irrelevant content. There's another failure mode too: chunking and embedding can represent a nuanced insight differently from how it was originally expressed, so retrieval returns off-target fragments instead of the intended content.
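
To make the size tradeoff concrete, here's a minimal sketch of fixed-size chunking with overlap, one common way to soften the boundary problem. The function name and the default sizes are assumptions for illustration, and it counts whitespace-separated words where a production pipeline would more likely count model tokens.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of chunk_size words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

A larger `overlap` lowers the chance that one idea is cut in half, at the cost of storing and embedding more duplicated text.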

Embedding & indexing

Embedding turns chunks into something a machine can search. Text embeddings are compressed representations where text becomes a fixed-size vector, and similar meanings end up close together in vector space.

Those vectors are then indexed using approximate nearest neighbor (ANN) search structures. Hierarchical Navigable Small World (HNSW) is one common ANN approach at scale, trading a small amount of accuracy for much faster lookups as your dataset grows.
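
The sketch below shows the idea of "close together in vector space" with an exact, brute-force search. It is not HNSW: it scans every chunk, which is the baseline an ANN index approximates. `toy_embed` is a hypothetical stand-in for a real embedding model and only captures word overlap, not meaning.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hypothetical stand-in for an embedding model: a hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def nearest(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Exact (brute-force) nearest-neighbor search over every stored chunk."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]
```

The linear scan costs one similarity computation per stored chunk on every query, which is the cost an ANN index is meant to avoid as the dataset grows.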

Retrieval

Retrieval is where stored memory becomes usable context. Hybrid retrieval tends to be the strongest default: in one evaluation covering roughly 25,000 question answering pairs across four datasets, term-based retrieval combined with dense retrieval outperformed either method alone. A separate study across eight conversational datasets reported similar gains for hybrid methods over vanilla retrieval-augmented generation (RAG).

For many teams, that means combining full-text indexing with vector search from the start, rather than bolting it on later as an optimization.
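
One widely used way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The studies mentioned above aren't described here in enough detail to say which fusion method they used, so treat this as an illustrative choice, not their method. The function name is hypothetical, and `k=60` is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs; each appearance adds 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first; ties fall back to ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Because RRF only looks at rank positions, it sidesteps the problem that full-text scores and vector distances live on different scales.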

Memory consolidation

Consolidation decides what stays as raw episodes and what gets promoted into more durable knowledge. Without it, your memory store grows indefinitely and retrieval quality degrades over time.

Common approaches score memories on recency, importance, and relevance. Beyond scoring, episodic memories often get distilled into semantic knowledge: a fact that stays useful without its original context moves to semantic memory, and the raw episode drops out.
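
A scoring rule along those lines might look like the sketch below. Everything specific in it is an assumption: the `Memory` fields, the 0 to 1 scales, the equal default weights, the 72-hour half-life, and the choice of exponential decay for recency.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    importance: float   # 0..1, assigned at write time (hypothetical scale)
    relevance: float    # 0..1, similarity to the current query
    age_hours: float    # time since the memory was last accessed


def retention_score(m: Memory, half_life_hours: float = 72.0,
                    weights: tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of recency, importance, and relevance; recency halves every half_life_hours."""
    recency = 0.5 ** (m.age_hours / half_life_hours)
    w_rec, w_imp, w_rel = weights
    return w_rec * recency + w_imp * m.importance + w_rel * m.relevance
```

Memories that keep scoring low are candidates for dropping, while high scorers are candidates for distillation into semantic memory.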

Design tradeoffs: latency, cost & forgetting

Once the pipeline is in place, you're left with the part every team has to live with: tradeoffs. Long-term memory can improve continuity, but it also forces choices around accuracy, latency, cost, and retention.

Accuracy vs. latency & cost

Better recall usually means more context, higher latency, and more tokens. That's the core tradeoff every team building long-term memory runs into, and it shows up clearly in published benchmarks.

In a LOCOMO benchmark study, full-context approaches reported 72.9% accuracy, 17.12s p95 latency, and about 26,031 tokens per conversation. Selective external memory reported 66.9% accuracy, 1.44s p95 latency, and about 1,764 tokens under the paper's test conditions.

That's about 92% lower p95 latency and about 93% fewer tokens for a 6-point accuracy trade in that benchmark. For most production workloads, giving up a few accuracy points to cut latency by an order of magnitude is the right call, but the specific threshold depends on how much a wrong answer costs you.
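
If you want to rerun the comparison with your own measurements, the arithmetic is short. This sketch plugs in the benchmark figures quoted above; `percent_reduction` is a helper name introduced here for illustration.

```python
def percent_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0


latency_cut = percent_reduction(17.12, 1.44)    # p95 latency, seconds
token_cut = percent_reduction(26031, 1764)      # tokens per conversation
accuracy_gap = 72.9 - 66.9                      # percentage points

print(f"latency: {latency_cut:.1f}% lower")
print(f"tokens: {token_cut:.1f}% fewer")
print(f"accuracy: {accuracy_gap:.1f} points lower")
```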

Forgetting

Forgetting is one of the least solved parts of memory systems. Storing and retrieving are mostly engineering problems at this point; deciding what to safely drop is still an open research question. Getting it wrong can hurt answer quality, inflate storage costs, or leak stale context into new sessions.

That gap matters in production. Until the research improves, teams still need explicit retention policies and consolidation rules instead of assuming the memory layer will manage itself.
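
An explicit retention policy can be as simple as an age limit with an importance override. The sketch below is one hypothetical rule, not a recommendation: the 30-day window, the 0.8 threshold, and the `StoredMemory` fields are all assumptions you'd tune for your workload.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredMemory:
    text: str
    importance: float       # 0..1, hypothetical score assigned at write time
    last_accessed: float    # Unix timestamp in seconds


def apply_retention(memories: list[StoredMemory], max_age_days: float = 30.0,
                    keep_above: float = 0.8,
                    now: Optional[float] = None) -> list[StoredMemory]:
    """Drop memories older than max_age_days unless their importance is at least keep_above."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return [m for m in memories
            if m.last_accessed >= cutoff or m.importance >= keep_above]
```

Passing `now` explicitly keeps the rule deterministic, which makes it easy to test the policy before running it against a live store.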

Long-term memory is infrastructure, not a feature

Long-term memory can turn agents from stateless responders into systems that preserve context over time. The architecture is what makes it work: a read-before-reasoning, write-after-acting loop, a pipeline that turns raw text into retrievable knowledge, and explicit rules for consolidation and forgetting. Recall quality, latency, and token cost all follow from how carefully you design those pieces.

Redis brings these primitives together in one real-time data platform, so agent memory layers don't get stitched across separate systems. The Redis Agent Memory Server packages this into an open-source memory layer for agents, with configurable extraction strategies, Model Context Protocol (MCP) integration, and multi-provider LLM support through LiteLLM.

If you're building agents that need to remember, try Redis free to see how vector search and memory management work with your workload, or talk to our team about architecting your agent memory layer.
