Context Pruning: Cut LLM Tokens Without Losing Quality
Redis · 2026-05-09 · via Redis

Your LLM app is burning through tokens, and most of them aren't doing anything useful. Every retrieved passage, every chunk of conversation history, every piece of boilerplate context costs money, adds latency, and can actually make your model's output worse. Context pruning is the practice of selectively removing low-value tokens, sentences, or passages from an LLM's input before or during inference to reduce cost and improve response quality. It's one piece of context engineering: shaping what reaches the model before inference.

This guide covers what context pruning is, why bigger context windows don't make it optional, and where semantic caching fits alongside pruning in production.

What context pruning actually does

Context pruning selectively removes low-value tokens, sentences, or passages from an LLM's input to cut cost and often improve output quality. It sits within the broader category of prompt compression, which aims to reduce prompt length and improve the efficiency of processing LLM inputs.

Three related practices often get conflated with context pruning:

  • Prompt engineering: manual rewriting of prompts that doesn't reduce token count systematically.
  • Model pruning: removes weights and neurons from the model itself, not the input.
  • Abstractive summarization: generates new text rather than selecting from the original.

Context pruning differs from all three. It operates on the input by selecting or removing existing content, not by rewriting it or modifying the model. Approaches split into four families, organized by what they cut and how they decide what's worth keeping.

Token-level pruning

Token-level pruning is the finest-grained approach: a separate, smaller model reads the input and drops the tokens it scores as low-value. LLMLingua-2 reframes the compression decision as a yes/no classification per token, trained on examples of well-compressed prompts. The paper reported 3x to 6x speedup over earlier methods by swapping a 7B causal model for much smaller encoder models like XLM-RoBERTa-large that evaluate the whole prompt in parallel rather than token by token.
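
The per-token keep/drop decision can be sketched in a few lines. This is a minimal illustration, not LLMLingua-2 itself: `score_token` is a hypothetical stand-in for a trained classifier (here a toy filler-word heuristic), and `FILLER`, `prune_tokens`, and the 0.5 threshold are assumptions made for the example.

```python
# Hypothetical stand-in for a trained token classifier: filler words get a
# low score, everything else a high one. A real system would use a model.
FILLER = {"the", "a", "an", "of", "to", "is", "that", "and", "in", "very", "really"}

def score_token(token: str) -> float:
    """Return a keep-probability-like score in [0, 1] (toy heuristic)."""
    word = token.lower().strip(".,;:!?")
    if not word:
        return 0.0
    return 0.1 if word in FILLER else 0.9

def prune_tokens(text: str, threshold: float = 0.5) -> str:
    """Keep whitespace-separated tokens whose score meets the threshold."""
    tokens = text.split()
    kept = [t for t in tokens if score_token(t) >= threshold]
    return " ".join(kept)

pruned = prune_tokens("The cache is very fast and the index is in memory.")
print(pruned)  # cache fast index memory.
```

The output keeps the content words but is no longer a grammatical sentence, which is the fluency risk that coarser-grained methods try to avoid.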

Sentence-level & chunk-level pruning

Sentence- and chunk-level pruning evaluates bigger units. Instead of looking at one token at a time, it scores entire sentences or fixed-size chunks and keeps or discards them whole. This avoids the main risk of token-level pruning, which is leaving behind sentence fragments the model has to stitch back together. It also fits retrieval-augmented generation (RAG) pipelines well, since retrieved passages often mix useful sentences with whole irrelevant ones. The trade-off is granularity: keeping a sentence keeps every token in it, including the filler ones.
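
A minimal sketch of sentence-level pruning for a retrieved passage, assuming a simple word-overlap score against the query. The scoring rule, the regex sentence splitter, and the names `split_sentences`, `words`, and `prune_sentences` are illustrative assumptions; production systems would more likely score sentences with an embedding or reranker model.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def prune_sentences(passage: str, query: str, keep: int = 2) -> str:
    """Keep the `keep` sentences sharing the most words with the query."""
    sentences = split_sentences(passage)
    query_words = words(query)
    # Score each sentence by overlap; ties go to the earlier sentence.
    scored = [(len(query_words & words(s)), i) for i, s in enumerate(sentences)]
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:keep]
    # Restore original order so the kept text still reads naturally.
    return " ".join(sentences[i] for i in sorted(i for _, i in top))

passage = (
    "Redis stores data in memory. The company was founded years ago. "
    "Vector search in Redis supports HNSW indexes. Our office has a nice view."
)
pruned = prune_sentences(passage, "How does Redis vector search work?")
print(pruned)
```

Each kept sentence survives whole, so the model never sees fragments, but every filler token inside a kept sentence survives too.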

Attention-based pruning

Attention-based pruning uses the model's own attention patterns to decide what stays. Transformer attention scores measure how much each token influences the output, and tokens that consistently get ignored make good pruning candidates. Evaluator Head-based Prompt Compression (EHPC) picks specific attention heads that reliably identify relevant tokens, then uses their signals to score importance. The appeal: no auxiliary scoring model required, since the LLM is already computing attention during inference.
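
The scoring step can be sketched as averaging how much attention each token receives from a chosen set of heads. This is a simplified illustration, not EHPC's actual head-selection or scoring procedure: the nested-list layout `attention[head][query][key]`, the hard-coded choice of head 0, and the function names are all assumptions for the example.

```python
def token_importance(attention, heads):
    """Mean attention each key token receives across the chosen heads.

    attention[h][q][k] is the weight query position q puts on key k in head h.
    """
    n_keys = len(attention[heads[0]][0])
    totals = [0.0] * n_keys
    rows = 0
    for h in heads:
        for row in attention[h]:
            for k, weight in enumerate(row):
                totals[k] += weight
            rows += 1
    return [t / rows for t in totals]

def prune_by_attention(tokens, attention, heads, keep):
    """Keep the `keep` highest-scoring tokens, in their original order."""
    scores = token_importance(attention, heads)
    ranked = sorted(range(len(tokens)), key=lambda k: -scores[k])[:keep]
    return [tokens[k] for k in sorted(ranked)]

tokens = ["Redis", "um", "caches", "well"]
# Toy weights: two heads, two query rows each. Head 1 is uniform and
# therefore uninformative, so only head 0 is used as the "evaluator".
attention = [
    [[0.5, 0.1, 0.3, 0.1], [0.4, 0.1, 0.4, 0.1]],
    [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],
]
kept = prune_by_attention(tokens, attention, heads=[0], keep=2)
print(kept)  # ['Redis', 'caches']
```

In a real model the attention tensors would come from the forward pass, which is why this family needs access to model internals.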

Dynamic layer-progressive pruning

Dynamic layer-progressive pruning happens during inference, not before it. As input flows through a transformer's layers, the model gradually absorbs which tokens matter, and progressive pruning takes advantage: cut more aggressively at deeper layers, where the signal has already propagated outward. SlimInfer leans on an "information diffusion" effect. Important context spreads to surrounding tokens layer by layer, so deeper layers can run on a much smaller subset of the original input.
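
To make "cut more aggressively at deeper layers" concrete, here is a minimal sketch of a keep schedule. It is not SlimInfer's algorithm: the linear decay, the `start_layer` and `final_keep` parameters, and the function name are assumptions chosen for illustration, and the sketch only counts tokens rather than selecting which ones to keep.

```python
def keep_schedule(n_tokens: int, n_layers: int, start_layer: int, final_keep: int) -> list[int]:
    """Tokens retained at each layer.

    Layers before start_layer see the full input; from start_layer on, the
    count decays linearly until only final_keep tokens reach the last layer.
    """
    if not 0 <= start_layer < n_layers:
        raise ValueError("start_layer must be in [0, n_layers)")
    steps = n_layers - start_layer
    counts = []
    for layer in range(n_layers):
        if layer < start_layer:
            counts.append(n_tokens)
        else:
            done = layer - start_layer + 1
            # Integer interpolation from n_tokens down to final_keep.
            counts.append(n_tokens - (n_tokens - final_keep) * done // steps)
    return counts

schedule = keep_schedule(n_tokens=1000, n_layers=8, start_layer=4, final_keep=200)
print(schedule)  # [1000, 1000, 1000, 1000, 800, 600, 400, 200]

# Rough proxy only: share of token-layer positions still processed.
processed = sum(schedule) / (1000 * 8)
print(processed)  # 0.75
```

The proxy treats per-layer cost as linear in token count; attention cost grows faster than that, so real savings depend on the model and sequence length.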

A few cross-cutting distinctions matter for production decisions. The first is output format. Hard methods produce compressed text: actual tokens you can send to any LLM, including API-only models. Soft methods produce learned embeddings: vectors that replace the original input and are fed to the model in place of token embeddings. Hard methods work anywhere; soft methods need access to the model's internals, which rules out closed APIs but often gets higher compression in exchange. The second is timing: static pruning happens once before inference, while dynamic pruning happens during the forward pass. The third is granularity, which ranges from individual tokens to entire documents, with finer granularity typically achieving higher compression at a potential cost to fluency.

Bigger context windows don't solve this on their own

Every time a new model ships with a longer context window, the case for pruning gets re-litigated. The answer hasn't really changed: bigger windows haven't fixed long-context failure modes, and in some setups extra tokens make output worse.

LLMs struggle to use information that sits in the middle of long inputs. Performance peaks when relevant content sits at the beginning or end and drops when it's buried in the middle. This U-shaped curve has a name in the literature: "lost in the middle."

Input length itself can degrade performance, independent of what's in the input. A 2025 study isolated input length from content changes and reported one tested model dropping 67.6 points on MMLU at 30K padding tokens.

The advertised maximum is often longer than the practical one. The RULER benchmark found effective length can be much shorter than the spec, and a separate study reported degradation past 100K in models claiming 1M-token windows. Behavior also varies by model: one LongBench V2 evaluation found GPT-4o improved at 128K while other models deteriorated beyond 32K.

There's no fixed token threshold where pruning becomes necessary, but adding more context to a larger window often hurts more than it helps.

The numbers: what pruning can save

The benchmarks for pruning are favorable. Moderate pruning can preserve quality, and in some evaluated tasks even improve it.

The original LLMLingua measured up to 20x compression in its reported evaluation, with about a 1.5-point performance loss on GSM8K and BBH and larger drops in some BBH settings at higher ratios. It still reported 1.7x to 5.7x latency speedup on a V100 GPU.
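
The arithmetic behind a compression ratio is simple enough to sketch. The 20x ratio comes from the figure above; the 20,000-token prompt and the price of 2.00 USD per million input tokens are hypothetical values chosen only to make the numbers easy to follow, and the function names are illustrative.

```python
def tokens_after_compression(tokens: int, ratio: int) -> int:
    """Tokens left after compressing by `ratio` (rounded up)."""
    return -(-tokens // ratio)  # ceiling division on integers

def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Input-token cost at a flat per-million-token price."""
    return tokens * usd_per_million / 1_000_000

PRICE = 2.0  # hypothetical price in USD per million input tokens

original = 20_000  # hypothetical prompt size
compressed = tokens_after_compression(original, ratio=20)
before = input_cost_usd(original, PRICE)
after = input_cost_usd(compressed, PRICE)

print(compressed)  # 1000
print(f"{before:.3f} -> {after:.3f} USD per request")
```

This covers input tokens only. Output tokens, the cost of running the compressor itself, and any quality loss at high ratios all sit outside the calculation.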

Key-value (KV) cache methods show a similar pattern. The KV cache stores intermediate attention states during inference, and pruning it reduces both memory and compute. MUSTAFAR reported 55% KV cache reduction and up to 2.23x throughput increase in tokens per second while preserving accuracy. FastKV measured 1.82x prefill speedup and 2.87x decoding speedup, matching the decoding-only baseline on accuracy.

The pattern shows up in broader evaluation work too. An empirical study found that "moderate compression even enhances LLM performance" on the LongBench evaluation, which is consistent with a reasoning decline reported in one setup near 3,000 tokens.

One caveat: no single method dominates across all tasks. A benchmark study of compression methods found that outcomes vary by task type, so method selection often needs to be domain-specific.

Where context pruning breaks down

Pruning has real failure modes, and you need to design around them. The benchmark wins from earlier come with trade-offs that show up the moment you push pruning into production. Mismanaged context surfaces as context poisoning (bad data sticking around), distraction (relevant signal buried in noise), confusion (the model latching onto irrelevant tokens), and clash (retrieved chunks that contradict each other). Pruning helps with some of these and worsens others.

Information loss & hallucination

Compression can increase hallucination when you cut too much signal along with the noise. An empirical study reported that tested compression methods increased hallucination to some degree, with information loss identified as one factor. For short contexts, quality typically decreases as you compress more, because there's less noise to safely remove. Query-aware methods help here, since they preserve tokens most relevant to the specific question.

Code & structured data

Token-level pruning that works on prose can fall apart on code, because removing individual tokens can break syntactic validity. On the SWE-Bench coding benchmark, the domain-specific SWE-Pruner reported 64% task success while LLMLingua-2 dropped to 54%. For code, chunk-level pruning that retains or discards entire logical units (function definitions, class blocks) works best.
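
A minimal sketch of chunk-level pruning for Python source, using the standard-library `ast` module to keep or drop whole top-level definitions. This is not SWE-Pruner's method: the keyword-match relevance test and the `prune_code` name are assumptions, and the sketch ignores decorators, which `ast.get_source_segment` does not include in a function's segment.

```python
import ast

def prune_code(source: str, keywords: set[str]) -> str:
    """Keep whole top-level functions/classes that mention any keyword.

    Imports and other module-level statements are always kept so the
    remaining definitions still have what they depend on.
    """
    tree = ast.parse(source)
    kept = []
    for node in tree.body:
        segment = ast.get_source_segment(source, node)
        is_definition = isinstance(
            node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
        )
        if not is_definition or any(k in segment for k in keywords):
            kept.append(segment)
    pruned = "\n\n".join(kept)
    ast.parse(pruned)  # raises SyntaxError if pruning broke the code
    return pruned

SOURCE = '''import json

def load_config(path):
    with open(path) as f:
        return json.load(f)

def render_banner():
    return "=" * 40

def connect_cache(config):
    return config["cache_url"]
'''

pruned = prune_code(SOURCE, {"cache"})
print(pruned)
```

Because units are removed whole, the result still parses, which a token-level pruner cannot guarantee.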

Multi-turn conversation

Pruning conversation history can break discourse continuity. On the LoCoMo long-form dialogue benchmark, reported quality differences varied by approach relative to full context. Guidance for managed agents also warns that selective context retention can fail because future turns may need tokens that seem irrelevant now. A dual-tier memory pattern helps: working memory holds the current session, and long-term memory holds facts extracted over time. Pruning the working tier without losing long-term signal is easier than pruning a flat conversation log.

Compounded degradation

Pruning combined with quantization and other optimizations can produce non-linear quality degradation: losses that look small in isolation may compound. Some studies have reported task trade-offs under optimization settings such as pruning and quantization. Evaluate pruned systems across multiple task types at once, not one benchmark at a time.

Context pruning & semantic caching

All of those failure modes are easier to manage when pruning isn't the only optimization layer in your stack. Pruning works best as one piece of a broader system, paired with semantic caching upstream. Semantic caching compares vector embeddings of incoming queries against past ones, and when a new query is semantically similar to a previously answered one, the system returns the cached response instead of invoking the LLM. Context pruning kicks in on cache misses, trimming the retrieved context before it reaches the model.

The workflow is straightforward: a query comes in, the system checks the semantic cache, and on a hit it returns the cached response with no retrieval, pruning, or inference needed. On a miss, the system retrieves relevant context, prunes it, sends the pruned context to the LLM, and stores the response for future hits.
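
The hit/miss flow can be sketched end to end. This is a toy, in-memory illustration and not the Redis or LangCache API: `embed` is a bag-of-words stand-in for a real embedding model, the 0.8 similarity threshold is arbitrary, and `retrieve`, `prune`, and `call_llm` are hypothetical callables supplied by the caller.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real systems use a vector model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (query embedding, cached response)

    def lookup(self, query: str):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def store(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

def answer(query, cache, retrieve, prune, call_llm):
    """Cache hit: return immediately. Miss: retrieve, prune, infer, store."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, "hit"
    context = prune(retrieve(query), query)
    response = call_llm(query, context)
    cache.store(query, response)
    return response, "miss"

llm_calls = []

def fake_llm(query, context):
    llm_calls.append(query)  # count how often inference actually runs
    return f"answer based on {len(context)} chunk(s)"

retrieve = lambda query: ["relevant chunk", "off-topic chunk"]
prune = lambda chunks, query: chunks[:1]

cache = SemanticCache()
first = answer("how do I reset my password", cache, retrieve, prune, fake_llm)
second = answer("how do I reset my password please", cache, retrieve, prune, fake_llm)
print(first, second, len(llm_calls))
```

The rephrased second query is similar enough to hit the cache, so retrieval, pruning, and inference each run only once.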

This layered approach helps in three ways. Semantic caching reduces how often pruning has to happen in the first place, so the same conceptual question phrased five different ways doesn't trigger five full retrieval-prune-inference cycles. Cleaner, pruned input also tends to produce better responses to cache. And the same vector search infrastructure can power both retrieval for pruning decisions and the cache lookup itself.

Redis acts as a real-time context engine that gathers, syncs, and serves the data AI pipelines depend on, so cache lookups and retrieval for pruned context run on the same infrastructure. In a billion-vector benchmark, Redis reported 90% precision at ~200ms median latency under 50 concurrent queries retrieving the top 100 neighbors. Redis LangCache, a fully managed semantic caching service available via REST API, reported up to 15x faster responses on cache hits and up to 73% lower costs in Redis benchmarks. Upstream, hybrid retrieval that combines full-text and vector search can reduce how much pruning the pipeline has to do at all.

Prune context before you scale context windows

Context pruning does more than save money. Multiple studies report that moderate, task-appropriate pruning can improve LLM outputs compared with dumping everything into a massive context window. The key is matching the right pruning technique to your domain: token-level methods for general document question answering, chunk-level methods for code and structured data, and query-aware approaches when accuracy matters most.

That same takeaway is why the infrastructure layer matters. Context engineering happens at the data layer: where you store retrieved chunks, where you cache responses, where you split working memory from long-term memory. Redis collapses those pieces into one stack so the engineering team isn't stitching three databases together. If you're spending too much on LLM inference or seeing quality degrade as your context grows, context pruning is worth adding to your pipeline. Try Redis to build with vector search and semantic caching, or talk to us about optimizing your AI infrastructure.