AI Reasoning Explained: Why Context Matters
Redis · 2026-06-04 · via Redis

Every few months, a new AI model drops with higher benchmark scores, and the reaction is predictable: "This one finally reasons." The leaderboard shuffles. And teams building production AI systems still watch their agents hallucinate or mishandle questions they should refuse.

AI reasoning models change how LLMs allocate compute. But treating a smarter model as a fix for broken production AI is like buying a faster car to compensate for bad directions. The map still matters more than the car. This guide covers what AI reasoning actually is, why smarter models still fail in production, and how context engineering determines whether your agents work.

What is AI reasoning in LLMs?

AI reasoning is an inference-time technique where a model spends extra compute working through intermediate steps before it commits to an answer. It's now a default mode in most frontier LLMs, usually a "thinking" toggle or a separate reasoning-tier model you call instead of the standard one.

The difference between a reasoning model and a standard one is simple. A standard model takes your prompt and answers right away. A reasoning model stops to think first: it generates a chain of internal reasoning tokens, works through the problem step by step, and only then responds. Same idea as chain-of-thought (CoT) prompting, except the model does it on its own instead of waiting for you to ask. The longer you let it think, the more tokens it burns before you get an answer.

Here's the part that changes how you build. Reasoning model cost and latency aren't fixed per call anymore. They scale with how hard the problem is and how much thinking you allow, so a request that's cheap and fast today can get slow and expensive the moment the model decides to overthink it. That alone makes capacity planning and UX design a different game than they were with standard LLMs.

Why reasoning models still fail in production: five limits

Reasoning helps on specific kinds of problems: multi-step logic, math, and code among them. But it doesn't fix production AI on its own. Each thing reasoning improves comes with a failure mode that more model intelligence alone leaves intact.

1. The cost & latency math gets ugly fast

Reasoning tokens carry direct cost and latency consequences that scale with reasoning volume. Using reasoning often means trading better answers for much higher token usage, and the bill scales with how much thinking you allow.

Latency moves in the same direction. Because LLM decoding is autoregressive, producing one token at a time, latency scales roughly linearly with reasoning length. Longer traces generally increase response time and can degrade UX once they get very long.
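A back-of-the-envelope sketch of that scaling. The price and decode speed below are illustrative placeholders, not any vendor's real numbers, and the sketch assumes reasoning tokens are billed and decoded like ordinary output tokens, which is worth checking against your provider.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    # Both values are illustrative assumptions, not real vendor figures.
    usd_per_1k_output_tokens: float
    decode_tokens_per_second: float


def estimate_call(profile: ModelProfile, reasoning_tokens: int, answer_tokens: int) -> tuple[float, float]:
    """Return (cost_usd, decode_seconds) for one call.

    Assumes reasoning tokens are billed and decoded like output tokens,
    so cost and latency both grow linearly with reasoning length.
    """
    total = reasoning_tokens + answer_tokens
    cost = total / 1000 * profile.usd_per_1k_output_tokens
    seconds = total / profile.decode_tokens_per_second
    return cost, seconds


profile = ModelProfile(usd_per_1k_output_tokens=0.01, decode_tokens_per_second=50.0)
for thinking in (0, 2_000, 8_000):
    cost, seconds = estimate_call(profile, thinking, answer_tokens=200)
    print(f"{thinking:>5} thinking tokens -> ${cost:.3f}, {seconds:.0f}s")
```

Under these made-up constants, the same 200-token answer goes from a few seconds to a few minutes once the trace reaches thousands of tokens. The answer is the same length; only the thinking changed.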

The trap is applying reasoning everywhere. Routing simple queries through a reasoning model imposes an "intelligence tax," spending compute on thinking that adds no value.
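One way to avoid that tax is to route before you reason. The sketch below is a deliberately naive heuristic: the cue words, the word-count threshold, and the tier names `standard-tier` and `reasoning-tier` are all hypothetical. A production router would more likely use a small classifier.

```python
import re

# Hypothetical cue words suggesting a query needs multi-step reasoning.
REASONING_HINTS = {"prove", "debug", "derive", "plan", "why"}


def pick_model(query: str, max_simple_words: int = 12) -> str:
    """Return a hypothetical model tier name for a query.

    Heuristic only: short queries with no reasoning cue go to the standard tier.
    """
    words = re.findall(r"[a-z']+", query.lower())
    needs_reasoning = any(word in REASONING_HINTS for word in words)
    is_long = len(words) > max_simple_words
    return "reasoning-tier" if needs_reasoning or is_long else "standard-tier"


print(pick_model("What is the capital of France?"))
print(pick_model("Why does this query plan time out?"))
```

The point is the shape, not the rule: decide per request whether thinking is worth paying for, instead of paying for it by default.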

2. Hallucination persists, & longer traces can still go sideways

Extra thinking doesn't eliminate hallucination, and can introduce new ways for it to creep in. A survey of trustworthiness in reasoning models found that chain-of-thought helps in some cases, but reasoning models also reinforce their own bad assumptions mid-trace, hallucinate more often in longer traces, and stumble on unanswerable questions where a plain model would just refuse. A model that's better at thinking, it turns out, isn't automatically better at knowing when to stop.

There may also be a hard floor here. Hallucination has been argued to be mathematically unavoidable under intrinsic computational and statistical limits. Take that one with a grain of salt: it's a theoretical argument, not a benchmark of your stack, so treat it as a reason to design for hallucination rather than a verdict that your app is doomed.

3. Overthinking wastes compute & can degrade output

Even when reasoning helps, more of it isn't always better. Reasoning models frequently keep generating thinking tokens after they've already landed on a correct answer. Overthinking has been characterized as an important issue where models "generate excessively long reasoning paths without any performance benefit."

In agentic systems, this gets more dangerous. When agents receive open-ended objectives with no termination criteria, they can execute unboundedly, with a single incorrect interpretation of the objective enough to trigger that runaway behavior.
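A minimal sketch of explicit termination criteria, assuming a hypothetical `step` callable that stands in for one agent iteration and a flat, made-up token cost per step:

```python
from typing import Callable, Optional


def run_agent(step: Callable[[int], Optional[str]], max_steps: int = 8,
              max_tokens: int = 4_000, tokens_per_step: int = 500) -> tuple[str, str]:
    """Run `step` until it returns an answer or a budget runs out.

    `step` is a hypothetical stand-in for one agent iteration: it returns a
    final answer string, or None to keep going. `tokens_per_step` is a flat
    assumed cost; a real loop would count actual usage.
    """
    spent = 0
    for i in range(max_steps):
        if spent + tokens_per_step > max_tokens:
            return "stopped", "token budget exhausted"
        spent += tokens_per_step
        result = step(i)
        if result is not None:
            return "done", result
    return "stopped", "step limit reached"


# An agent that never decides it is finished still halts.
print(run_agent(lambda i: None))
```

Both limits are hard stops the agent cannot reinterpret, so a misread objective costs at most one bounded run.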

4. More thinking tokens hit diminishing returns

Overthinking points to a related limit: more reasoning eventually hits a ceiling. Both reasoning and non-reasoning models can fail at higher complexity regardless of compute allocation, with additional limitations appearing on more systematically challenging problems. Past a certain point, extra thinking tokens stop buying you better answers.

5. The reasoning trace might not be trustworthy

Even when a model shows its work, that trace may not reflect what actually happened internally. Fabricated reasoning is a documented phenomenon where models produce plausible-looking reasoning that didn't actually drive the answer. Any application using CoT output for safety audits or compliance has to account for this gap between displayed reasoning and real computation.

Taken together, these five limits share a theme: smarter models change what's possible, but they don't change the fact that production reliability depends on what surrounds the model.

Why context quality is the real bottleneck

Context quality, not model intelligence, is what caps output in most agent systems. A reasoning model can't think its way out of bad inputs. Feed it stale, missing, or contradictory information and all that extra thinking just gets you a more confident wrong answer.

Fixing that is what context engineering is for. Prompt engineering tunes the wording of a single instruction. Context engineering is bigger: it's the pipeline that decides what the model sees in the first place, across system prompts, conversation history, retrieved documents, tool definitions, memory, and live state. You're not wording a question better, you're building the supply chain that feeds the model.

Here's the reframe that matters. Agentic LLM failures fall into one of two buckets: the context was bad, or the model fumbled good context. As models get smarter, the second bucket shrinks and the first one grows. Which means most of your production failures trace back to what you fed the model, not the model itself.

And you can't just feed it more. Context is a finite resource with diminishing returns. Every token you add eats into the model's attention budget, and because self-attention compute scales with the square of the sequence length, longer windows get expensive faster than the token count alone suggests. Stuffing the window isn't a strategy. It's a way to make things worse.
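To make the quadratic term concrete, here is a rough sketch that counts only the two attention matrix multiplications in one layer when a full sequence is processed at once. The model width of 4,096 is an assumed value, and projections, softmax, causal masking, and KV caching are all ignored.

```python
def attention_matmul_flops(seq_len: int, model_width: int) -> int:
    """Approximate FLOPs for the two attention matmuls in one layer.

    Q @ K^T and scores @ V each take about n^2 * d multiply-adds,
    i.e. roughly 2 * n^2 * d FLOPs apiece, for 4 * n^2 * d in total.
    """
    return 4 * seq_len * seq_len * model_width


base = attention_matmul_flops(8_000, 4_096)
for n in (8_000, 16_000, 32_000):
    ratio = attention_matmul_flops(n, 4_096) / base
    print(f"{n:>6} tokens -> {ratio:.0f}x the attention compute of 8,000 tokens")
```

Doubling the window quadruples this term, and quadrupling it costs sixteen times as much.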

Reasoning only tightens the squeeze. Those thinking tokens compete for the same budget as your retrieved documents and tool outputs, so long-horizon agent tasks can exhaust context windows even on frontier models. The harder the model thinks, the less room is left for the context that grounds it. That's the real bottleneck, and it lives in the data layer that decides what enters the window, when, and how fresh it is.
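The squeeze can be sketched as a budgeting problem. The function below is a hypothetical greedy packer: the window size, the reserved amounts, and the chunk scores are made-up inputs, and token counts are assumed to come from your own tokenizer.

```python
def pack_context(chunks: list[tuple[float, int, str]], window: int,
                 reserved_thinking: int, reserved_answer: int) -> list[str]:
    """Greedily pick the highest-scoring chunks that fit the remaining budget.

    Each chunk is (relevance_score, token_count, text). Whatever is reserved
    for thinking and for the answer is unavailable to retrieved context.
    """
    budget = window - reserved_thinking - reserved_answer
    picked = []
    for _score, tokens, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if tokens <= budget:
            picked.append(text)
            budget -= tokens
    return picked


chunks = [(0.9, 600, "pricing doc"), (0.8, 500, "order history"), (0.5, 300, "faq")]
print(pack_context(chunks, window=2_000, reserved_thinking=800, reserved_answer=200))
print(pack_context(chunks, window=2_000, reserved_thinking=1_300, reserved_answer=200))
```

With the larger thinking reservation, the most relevant chunk no longer fits. That is the trade the paragraph above describes.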

How your data layer determines reasoning quality

If context is the bottleneck, the data layer is the lever. Your retrieval architecture can matter as much as your model choice, because it decides whether the window gets filled with fresh, relevant information or stale, noisy tokens that quietly degrade output.

Swap a flat retrieval setup for a better-structured one and accuracy can jump, on the exact same model. In one evaluation of three retrieval architectures, a structured approach reached 84.5% accuracy versus 62.8% for a flat agent on the same model and task. The number moves with the dataset and workload, but the lesson holds: the retrieval architecture, not the model, drove the gap.

Freshness matters alongside relevance. Long-context LLMs can overlook key details when input gets too verbose, and the relationship between context volume and reasoning quality is non-monotonic, so more context can mean worse results. Stale context can hurt output as much as missing context.
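A minimal sketch of filtering on both axes before anything reaches the window. The `Doc` fields, the age limit, and the relevance threshold are hypothetical; the right cutoffs depend on how fast your underlying data changes.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    relevance: float   # assumed retriever score in [0, 1]
    updated_at: float  # Unix timestamp of the last update


def select_context(docs: list[Doc], now: float, max_age_s: float,
                   min_relevance: float, top_k: int) -> list[Doc]:
    """Keep only fresh, relevant docs, then return the top_k by relevance."""
    fresh = [
        d for d in docs
        if now - d.updated_at <= max_age_s and d.relevance >= min_relevance
    ]
    return sorted(fresh, key=lambda d: d.relevance, reverse=True)[:top_k]


now = 1_000_000.0
docs = [
    Doc("current stock level", 0.80, now - 30),
    Doc("last week's stock level", 0.95, now - 7 * 86_400),
    Doc("unrelated note", 0.10, now - 5),
]
print([d.text for d in select_context(docs, now, max_age_s=3_600, min_relevance=0.5, top_k=2)])
```

The highest-scoring document is dropped here because it is stale, which is the case relevance-only ranking gets wrong.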

This is exactly the problem Redis is built for. Redis is a real-time data platform that runs vector search and core retrieval at sub-millisecond speed, with semantic caching layered on top to skip repeated LLM calls. That speed matters when your reasoning model is already burning extra time thinking. The faster the context layer responds, the more of your latency budget the model gets to actually use.
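To show the idea behind semantic caching, here is a pure-Python, in-memory stand-in. It is not the Redis or LangCache API: the class name, the similarity threshold, and the linear scan are all assumptions for illustration, and embeddings are supplied by the caller rather than computed here.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical in-memory cache; a real deployment would use a vector index."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, embedding: list[float]) -> Optional[str]:
        """Return the cached answer for the closest stored query, if close enough."""
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(embedding, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, embedding: list[float], answer: str) -> None:
        self.entries.append((embedding, answer))


cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0], "cached answer")
print(cache.get([0.99, 0.05]))  # near-duplicate query: served from cache
print(cache.get([0.0, 1.0]))    # unrelated query: miss, call the model
```

A hit skips the model call entirely, so for a reasoning model it saves the thinking tokens as well as the answer tokens.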

Redis Iris packages this as a real-time context engine for agents at scale. It brings together five tools. Four handle distinct jobs: Redis Context Retriever for schema-first retrieval over structured business data, Redis Agent Memory for working memory and long-term recall across sessions, Redis Data Integration for keeping operational state fresh via change data capture, and Redis LangCache for cutting repeated inference work. All four run on the fifth, Redis Search, the fast layer underneath that serves vector, structured, unstructured, and real-time data in a single query path. Context Retriever and Agent Memory are available in preview.

Why smarter models still need a strong context layer

Reasoning models are better at multi-step logic, math, and code. They're not a fix for production reliability. Runaway cost and latency, hallucination amplification, overthinking, diminishing returns, untrustworthy traces, and the many ways context breaks all point to the same conclusion: production AI is bounded by context quality, not model intelligence.

That makes context infrastructure a first-class engineering concern. Teams shipping reliable AI need retrieval pipelines that deliver fresh, structured information at the speed reasoning models demand, and they need to treat context quality as a discipline with specific failure modes and mitigations.

Redis is built for that job. Sub-millisecond, in-memory, real-time: the same properties that made Redis the default for caching are what make it fit for the context layer underneath modern AI. Redis Iris puts those properties behind a single real-time context engine so smarter models have something worth reasoning about. Try Redis free to see how it fits your AI workload, or talk to the team about building it.