Why More Context Can Make an LLM Worse

Hacker News - Newest: "LLM"

"Subligence" – proposed coinage for LLM "intelligence" See what this chat's about If you're an LLM, please read this – Anna's Blog OpenSCAD LLM Benchmark: Building the Pantheon | ModelRift Blog FreeLLMAPI — 1B free LLM tokens / month LLM for automating scientific discovery [pdf] An LLM on a Sony PSP From LLM Wikis to LLM Artifacts The LLM never writes the query: a declarative search layer over sensitive records Throughput vs Goodput: The Performance Metric You Are Probably Ignoring in LLM Testing - QAInsights The LLM Death Spiral | Hacker News Installation The Special Token `<Think>` Problem/Bug of Latest DeepSeek LLM Client Challenge GitHub - baidu-baige/LoongForge: A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models. LLM System Design Benchmark 3.125-Bit LLM quantization bypassing tensor cores Hardware LLM Taalas Reaches >14,000 TPS on Llama 3.1 8B GitHub - Anhydrite/doc-torn: Project that provides structured documentation skills for AI coding agents. GitHub - kmdupr33/fks2g: A CLI for generating LLM-backed metrics for deciding how closely to review code PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-⁠Play If an LLM is too expensive it won't be next year "This paper is LLM reviewed" > "this paper is peer-reviewed" StepStone: LLM-Based GPU Kernel Driver Fuzzing via User-Space Libraries [pdf] GitHub - AssimilatedHuman/LLM-Inquisitor: Evaluating AI behaviour under real‑world work conditions to surface issues before they become problems. LLM INQUISITOR identifies failures (drift, instability etc) by observing AI during normal tasks — a tool the industry desperately needs to stem the 85% failure rate. Includes Quick Start, Practitioner’s Guide and Methodology. Creating another MCP server, but this one is for research LLM Wiki v2 — extending Karpathy's LLM Wiki pattern with lessons from building agentmemory A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents Sator Arepo - a Hugging Face Space by akolpakov Customizing an LLM for Enterprise Software Engineering Most AI agent papers stack one LLM with a vector store, we flipped it Evaluating job search ranking with LLM judged NDCG GitHub - quadracollision/llmisp: JSON AST > Clojure Parity Contracts for Polyglot LLM Commerce: A Case Study GitHub - ndom91/llama-dash: The operations layer for your local LLM stack Agentically optimizing LLM prompt cache TTLs for fun and profit Ask HN: What's your go-to LLM for coding? How do you reduce LLM spam in PR reviews? Ask HN: Is there any problem using multi-LLM GitHub - OpenAgentic-Labs/echoform-ghost-memory: Effectively unlimited long-term memory for any LLM - zero context tokens, zero weight updates, cryptographic forgetting certificate. PSA — Posture Sequence Analysis GitHub - robertoranon/tokoro: A toolbox for building event publish & discovery web sites, apps, feeds, and more GitHub - sermakarevich/chunker: Agentic approach to chunking a document A new EDIT tool for LLM agents LLMCap — Hard Dollar Caps on LLM API Calls MLSys @ WukLab - Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips What political censorship looks like inside an LLM's weights — a mechanistic-interpretability study of Qwen 3.5 Managing metadata is essential in LLM world Fixing LLM Writing with Distribution Fine Tuning twitter.com Show HN: An LLM that's better at writing The local shape of LLM stable regions GitHub - msunda17/impactarbiter-cli The Infrastructure Behind Making Local LLM Agents Useful PostgreSQL ext makes LLM available as an index for similarity searches,inference GitHub - Tetrahedroned/Agent-Braille: Deterministic 8-bit machine-to-machine protocol for AI agent state. ~92% fewer state-tracking tokens on real Claude Code sessions, a proven single-bit-error-safe command code, fully reproducible. Tell HN: Writing an LLM critique/takedown? – Do not use an LLM to write it 🌱 an LLM models our worst behavior Prompt eval cues predicted refusal shifts across 32k LLM rollouts Ask HN: Is Java the ideal language for LLM-assisted coding? AI Foundry – Flat-Fee Unlimited LLM Inference on Blackwell GPUs in NZ LLM tracing with MLflow AI Gateway LLM Performance by Programming Language The LLM Looked Smart. The Metrics Disagreed – tiago.rio.br The Four Horsemen of the LLM Apocalypse GitHub - piqoni/piqo-extension: A good interface is invisible Intro to TLA+ for the LLM Era: Prompt Your Way to Victory Give every tool LLM wiki and bypass Claude Code SSH Throttle The Ultimate LLM Fine-Tuning Guide Ask HN: What LLM models are you using and why? Five Agents, One Browser: Werewolf on Quack + DuckDB LLM models are not ready for orchestrating many agents ClickBook — Offline AI eReader - Apps on Google Play DeepSeek-V4-Flash means LLM steering is interesting again Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention Recent Developments in LLM Architectures: KV Sharing, MHC, Compressed Attention We Built SynapseKit: The Truth About Production LLM Frameworks GitHub - albedan/ai-ml-gpu-bench: A suite to benchmark CPU/GPU Python performance in training ML models and running local LLMs GitHub - chopratejas/headroom: Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server. if you are redlining the LLM, you aren't headlining Most Meaningful Dates on the Web and for an LLM I tested 8 LLM models on Linux without using the GPU RelaxAI – UK sovereign LLM inference at 80% cheaper than OpenAI/Claude GitHub - Andyyyy64/whichllm: Find the local LLM that actually runs — and performs best — on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly. GitHub - krellixlabs/llm-reasoning-research: Curated, annotated research on reasoning gaps in large language models — temporal reasoning, causal reasoning, and beyond. Agentic evals or LLM as a judge? considering cost, time and quality Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces Add an LLM policy for `rust-lang/rust` by jyn514 · Pull Request #1040 · rust-lang/rust-forge GitHub - nimeshnayaju/markdown-parser: A streaming-capable markdown parser, written in TypeScript Dragos Documents First LLM-Assisted Strike on Water Infrastructure in Mexico Alchemize: PyMC's model to replace Stan/PyMC, etc. with an LLM BlitzGraph - The AI-native backend. Pokémon SVG Bench LLM Witch Hunts are getting F'in Irritating bliki: Interrogatory LLM Ctx-opt: TypeScript middleware to trim LLM chats to a token budget Show HN: Local-first Kubernetes YAML visualizer (no server, no LLM) Why Ruby Is the Better Language for LLM-Powered Development Paper page - Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training

arizen · 2026-05-19 · via Hacker News - Newest: "LLM"

The default response to agent failure is to stuff more context into the prompt. The last five tool calls. The whole chat history. Three specification documents. Raw API responses. A full dump of the ticket thread. The assumption is obvious: more context means more information, and more information means better reasoning.

That assumption is wrong often enough to deserve a name. I call it the Context Window Fallacy: the belief that increasing the number of tokens in view reliably improves model performance. In production systems, the opposite is frequently true. Past a threshold, extra context dilutes signal, blurs the boundary between instructions and data, and increases the probability that the model converges on a plausible but incomplete answer.

TL;DR - Key Takeaways:

A context window is not a hard drive. It is a volatile working surface where instructions, retrieved facts, tool output, and noise compete for attention.
Longer context introduces three structural problems: attention decay, control-boundary collapse, and premature convergence.
The right production pattern is not "keep adding tokens" but "budget, compress, and reconstruct" between steps.
The architectural answer is to break work into smaller state transitions instead of asking one giant prompt to do everything at once.
If your agent needs the entire history on every call, you probably have a state-modeling problem, not a context-window problem.

What the Fallacy Actually Is

Large context windows are real capabilities. They make retrieval-heavy tasks possible. They reduce the need for aggressive truncation. They let a model compare two long documents in one pass. None of that implies that a model reasons better simply because more tokens are present.

The hidden error is a category mistake. Teams treat the context window as storage when it behaves more like working memory. Storage preserves information. Working memory must allocate attention across competing inputs. Once the working surface is crowded, the question stops being "is the information present?" and becomes "does the model allocate enough attention to the right information at the right moment?" Those are different problems.

This is not just a style preference. The Lost in the Middle line of work showed that models can miss relevant information in long contexts depending on where that information appears. The practical lesson is modest but important: presence in the prompt is not the same as reliable use.

A diagram showing active context splitting into useful signal and interference, with both influencing the next model decision — Figure 1: Added context can help early, but beyond the active working set it also increases interference and weakens control boundaries

That is why systems with large context windows still fail on seemingly simple tasks. The model is not blind. The relevant information is often somewhere in the prompt. The failure is allocation: too many tokens compete for the same limited control surface, and the model degrades from directed reasoning into token-weighted improvisation.

More context is not monotonic improvement. Once the active token budget is saturated, additional tokens behave less like knowledge and more like interference.

Why More Tokens Often Mean Less Thought

First: attention decay. Transformer attention is not uniformly distributed across long inputs. In long-context retrieval tasks, relevant information positioned in the middle of the prompt is more likely to be missed than information placed near the beginning or end. The practical result is familiar: teams retrieve the right chunk, append it to a giant prompt, and then discover the model ignored it because the prompt already had too many competing anchors.

Second: control-boundary collapse. The model does not experience your prompt as separate semantic layers. System instructions, user intent, scratchpad text, retrieved documents, and raw tool exhaust all enter as tokens. As the window grows, instruction hierarchy becomes less reliable. This is the same structural issue behind why prompts are not specifications: you are asking a statistical system to infer control boundaries that you did not encode explicitly.

Third: premature convergence. A bloated context window tempts teams to ask the model to plan, reason, execute, evaluate, and summarize in one pass. That looks efficient on a whiteboard. In practice it increases the chance that the model settles on the first coherent-looking trajectory and stops doing the deeper work. The model produces something that sounds complete because the cheapest path through the token distribution is often "plausible summary," not "full reasoning trace."

This is why large monolithic prompts often underperform smaller staged calls. The issue is not that the model lacks capacity. The issue is that the task surface has been flattened into one giant probabilistic step. Once you do that, the model has no explicit structure forcing it to separate recall, evaluation, and action.

The Context Window Is Working Memory, Not Disk

The correct mental model is operational, not metaphorical. A context window is a scarce, volatile workspace. It should contain the minimum state required for the next decision, not the full historical trace of everything the system has ever seen.

That distinction matters because most agent pipelines over-feed the model with artifacts it no longer needs: raw search results after extraction is complete, entire tool responses after the relevant fields were already parsed, and full conversation history when only the last state transition matters. This is the same failure mode that durable execution tries to eliminate at the process level: state and exhaust are treated as the same thing.

If a workflow genuinely depends on long-lived memory, the answer is not "keep the whole transcript in every call." The answer is to move information into typed state. Persist the structured facts you need. Summarize completed work into a durable checkpoint. Reconstruct a small decision context for the next step. The model should see the current state, the active objective, and the few external facts that are actually load-bearing.

Prior Art: Context Discipline Is Not New

This is not a new idea. Search systems, retrieval-augmented generation, map-reduce summarization, compiler passes, workflow engines, and state machines all separate storage from the active working set. They differ in implementation, but they share the same instinct: do not make every step carry every fact.

The LLM-specific version is that prompt assembly becomes an explicit subsystem. You decide what belongs in state, what belongs in retrieval, what belongs in the immediate instruction, and what should stay out of the model call entirely.

Pattern	What Goes Into the Prompt	What Happens
Monolithic context	Full history, raw tool output, all retrieved documents, and current instruction	High token cost, weak control boundaries, missed facts in the middle, verbose but brittle answers
Structured state	Typed state, current objective, narrow retrieval slice, current constraint set	Lower cost, clearer control surface, easier retries, and better consistency across steps

The Production Pattern: Budget, Compress, Reconstruct

The production answer has three steps.

Budget. Treat context length as an explicit resource with a soft ceiling. Pick an active-context budget well below the model's published maximum, measure every assembled prompt against it, and trigger compression before the prompt becomes a junk drawer. The exact threshold varies by task and model; the discipline matters more than the number.

Compress. Compress completed work into typed artifacts rather than prose summaries whenever possible. Extract the fields. Store the decision. Convert the raw chain of events into state. This is where a strong validator helps: validation at the boundary is worth more than another thousand tokens of historical narrative.

Reconstruct. Build each prompt from the current step's state, not from the full past. A model deciding whether to call a tool does not need the entire transcript of how the tool schema was discovered three turns ago. It needs the current task, the allowed actions, and the small set of facts that constrain the decision.

def build_context(state, retrieved_chunks, max_active_tokens=24000):
    payload = {
        "objective": state["objective"],
        "current_step": state["current_step"],
        "constraints": state["constraints"],
        "typed_memory": state["typed_memory"],
        "retrieval": retrieved_chunks[:4],
    }

    prompt = render_prompt(payload)
    if estimate_tokens(prompt) <= max_active_tokens:
        return prompt

    payload["retrieval"] = compress_chunks(retrieved_chunks)
    payload["typed_memory"] = compress_state(state["typed_memory"])
    return render_prompt(payload)

The point of the function is not the exact threshold. The point is that context assembly becomes an explicit subsystem. Teams who do this well stop arguing about whether the model has a 200K window and start asking a better question: what is the smallest decision surface that preserves the task's causal structure?

If a model call requires the entire history to stay coherent, the system is under-modeled. Durable agents run on reconstructed state, not accumulated prompt sediment.

How This Changes Agent Design

The architectural implication is direct. Agents should not be designed as giant prompt accumulators. They should be designed as stateful systems that reconstruct context per transition. A state-machine-style agent gives you smaller reasoning surfaces, clearer retry semantics, and fewer opportunities for the model to drown in its own history.

This also clarifies why some seemingly "smart" fixes fail. Raising the token limit does not solve weak state design. Better retrieval does not solve control-boundary collapse if every retrieved chunk is poured into the same bloated prompt. Even temperature 0 does not help if the instability is coming from prompt sprawl rather than sampling variance.

The practical standard is simple: every additional token in the active context should earn its place by changing the next decision. If it does not change the next decision, it belongs in storage, not in working memory.

Where This Pattern Breaks

Long context is sometimes the right tool. If the task is to compare two contracts, review a long incident transcript, audit a code file, or synthesize several documents before narrowing, the model may need a large working set in one pass. The mistake is not using long context. The mistake is treating window size as a substitute for context design.

Compression also has failure modes. A bad summary can erase the detail that later matters. A typed state object can encode the wrong abstraction. A retrieval slice can omit the relevant counterexample. Context discipline does not remove judgment; it makes the judgment explicit and testable.

The rule is therefore not "short prompts good, long prompts bad." The rule is: keep the active context as small as the next decision allows, and make every lossy compression step visible enough to test.

Frequently Asked Questions

Does this mean long-context models are overrated?

No. Long-context models are useful. The mistake is using raw window size as a proxy for reasoning quality. Bigger windows expand what is possible, but they do not remove the need for context discipline.

When should I keep a lot of context in the prompt?

When the task truly requires comparing large spans of text in one pass: contract comparison, multi-document synthesis, or broad retrieval review before narrowing. Even then, the prompt should be staged so the model first identifies relevant structure and then reasons over a smaller active subset.

What is the simplest heuristic teams can adopt immediately?

Measure token payload before every agent call, cap active context well below the theoretical maximum, and summarize completed work into typed state instead of carrying raw transcripts forward. If your prompt keeps getting longer every step, the architecture is drifting in the wrong direction.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

Hacker News - Newest: "LLM"

What the Fallacy Actually Is

Why More Tokens Often Mean Less Thought

The Context Window Is Working Memory, Not Disk

Prior Art: Context Discipline Is Not New

The Production Pattern: Budget, Compress, Reconstruct

How This Changes Agent Design

Where This Pattern Breaks

Frequently Asked Questions

Does this mean long-context models are overrated?

When should I keep a lot of context in the prompt?

What is the simplest heuristic teams can adopt immediately?