























The default response to agent failure is to stuff more context into the prompt. The last five tool calls. The whole chat history. Three specification documents. Raw API responses. A full dump of the ticket thread. The assumption is obvious: more context means more information, and more information means better reasoning.
That assumption is wrong often enough to deserve a name. I call it the Context Window Fallacy: the belief that increasing the number of tokens in view reliably improves model performance. In production systems, the opposite is frequently true. Past a threshold, extra context dilutes signal, blurs the boundary between instructions and data, and increases the probability that the model converges on a plausible but incomplete answer.
TL;DR - Key Takeaways:
Large context windows are real capabilities. They make retrieval-heavy tasks possible. They reduce the need for aggressive truncation. They let a model compare two long documents in one pass. None of that implies that a model reasons better simply because more tokens are present.
The hidden error is a category mistake. Teams treat the context window as storage when it behaves more like working memory. Storage preserves information. Working memory must allocate attention across competing inputs. Once the working surface is crowded, the question stops being "is the information present?" and becomes "does the model allocate enough attention to the right information at the right moment?" Those are different problems.
This is not just a style preference. The Lost in the Middle line of work showed that models can miss relevant information in long contexts depending on where that information appears. The practical lesson is modest but important: presence in the prompt is not the same as reliable use.
That is why systems with large context windows still fail on seemingly simple tasks. The model is not blind. The relevant information is often somewhere in the prompt. The failure is allocation: too many tokens compete for the same limited control surface, and the model degrades from directed reasoning into token-weighted improvisation.
More context is not monotonic improvement. Once the active token budget is saturated, additional tokens behave less like knowledge and more like interference.
First: attention decay. Transformer attention is not uniformly distributed across long inputs. In long-context retrieval tasks, relevant information positioned in the middle of the prompt is more likely to be missed than information placed near the beginning or end. The practical result is familiar: teams retrieve the right chunk, append it to a giant prompt, and then discover the model ignored it because the prompt already had too many competing anchors.
Second: control-boundary collapse. The model does not experience your prompt as separate semantic layers. System instructions, user intent, scratchpad text, retrieved documents, and raw tool exhaust all enter as tokens. As the window grows, instruction hierarchy becomes less reliable. This is the same structural issue behind why prompts are not specifications: you are asking a statistical system to infer control boundaries that you did not encode explicitly.
Third: premature convergence. A bloated context window tempts teams to ask the model to plan, reason, execute, evaluate, and summarize in one pass. That looks efficient on a whiteboard. In practice it increases the chance that the model settles on the first coherent-looking trajectory and stops doing the deeper work. The model produces something that sounds complete because the cheapest path through the token distribution is often "plausible summary," not "full reasoning trace."
This is why large monolithic prompts often underperform smaller staged calls. The issue is not that the model lacks capacity. The issue is that the task surface has been flattened into one giant probabilistic step. Once you do that, the model has no explicit structure forcing it to separate recall, evaluation, and action.
The correct mental model is operational, not metaphorical. A context window is a scarce, volatile workspace. It should contain the minimum state required for the next decision, not the full historical trace of everything the system has ever seen.
That distinction matters because most agent pipelines over-feed the model with artifacts it no longer needs: raw search results after extraction is complete, entire tool responses after the relevant fields were already parsed, and full conversation history when only the last state transition matters. This is the same failure mode that durable execution tries to eliminate at the process level: state and exhaust are treated as the same thing.
If a workflow genuinely depends on long-lived memory, the answer is not "keep the whole transcript in every call." The answer is to move information into typed state. Persist the structured facts you need. Summarize completed work into a durable checkpoint. Reconstruct a small decision context for the next step. The model should see the current state, the active objective, and the few external facts that are actually load-bearing.
This is not a new idea. Search systems, retrieval-augmented generation, map-reduce summarization, compiler passes, workflow engines, and state machines all separate storage from the active working set. They differ in implementation, but they share the same instinct: do not make every step carry every fact.
The LLM-specific version is that prompt assembly becomes an explicit subsystem. You decide what belongs in state, what belongs in retrieval, what belongs in the immediate instruction, and what should stay out of the model call entirely.
| Pattern | What Goes Into the Prompt | What Happens |
|---|---|---|
| Monolithic context | Full history, raw tool output, all retrieved documents, and current instruction | High token cost, weak control boundaries, missed facts in the middle, verbose but brittle answers |
| Structured state | Typed state, current objective, narrow retrieval slice, current constraint set | Lower cost, clearer control surface, easier retries, and better consistency across steps |
The production answer has three steps.
Budget. Treat context length as an explicit resource with a soft ceiling. Pick an active-context budget well below the model's published maximum, measure every assembled prompt against it, and trigger compression before the prompt becomes a junk drawer. The exact threshold varies by task and model; the discipline matters more than the number.
Compress. Compress completed work into typed artifacts rather than prose summaries whenever possible. Extract the fields. Store the decision. Convert the raw chain of events into state. This is where a strong validator helps: validation at the boundary is worth more than another thousand tokens of historical narrative.
Reconstruct. Build each prompt from the current step's state, not from the full past. A model deciding whether to call a tool does not need the entire transcript of how the tool schema was discovered three turns ago. It needs the current task, the allowed actions, and the small set of facts that constrain the decision.
def build_context(state, retrieved_chunks, max_active_tokens=24000):
payload = {
"objective": state["objective"],
"current_step": state["current_step"],
"constraints": state["constraints"],
"typed_memory": state["typed_memory"],
"retrieval": retrieved_chunks[:4],
}
prompt = render_prompt(payload)
if estimate_tokens(prompt) <= max_active_tokens:
return prompt
payload["retrieval"] = compress_chunks(retrieved_chunks)
payload["typed_memory"] = compress_state(state["typed_memory"])
return render_prompt(payload)
The point of the function is not the exact threshold. The point is that context assembly becomes an explicit subsystem. Teams who do this well stop arguing about whether the model has a 200K window and start asking a better question: what is the smallest decision surface that preserves the task's causal structure?
If a model call requires the entire history to stay coherent, the system is under-modeled. Durable agents run on reconstructed state, not accumulated prompt sediment.
The architectural implication is direct. Agents should not be designed as giant prompt accumulators. They should be designed as stateful systems that reconstruct context per transition. A state-machine-style agent gives you smaller reasoning surfaces, clearer retry semantics, and fewer opportunities for the model to drown in its own history.
This also clarifies why some seemingly "smart" fixes fail. Raising the token limit does not solve weak state design. Better retrieval does not solve control-boundary collapse if every retrieved chunk is poured into the same bloated prompt. Even temperature 0 does not help if the instability is coming from prompt sprawl rather than sampling variance.
The practical standard is simple: every additional token in the active context should earn its place by changing the next decision. If it does not change the next decision, it belongs in storage, not in working memory.
Long context is sometimes the right tool. If the task is to compare two contracts, review a long incident transcript, audit a code file, or synthesize several documents before narrowing, the model may need a large working set in one pass. The mistake is not using long context. The mistake is treating window size as a substitute for context design.
Compression also has failure modes. A bad summary can erase the detail that later matters. A typed state object can encode the wrong abstraction. A retrieval slice can omit the relevant counterexample. Context discipline does not remove judgment; it makes the judgment explicit and testable.
The rule is therefore not "short prompts good, long prompts bad." The rule is: keep the active context as small as the next decision allows, and make every lossy compression step visible enough to test.
No. Long-context models are useful. The mistake is using raw window size as a proxy for reasoning quality. Bigger windows expand what is possible, but they do not remove the need for context discipline.
When the task truly requires comparing large spans of text in one pass: contract comparison, multi-document synthesis, or broad retrieval review before narrowing. Even then, the prompt should be staged so the model first identifies relevant structure and then reasons over a smaller active subset.
Measure token payload before every agent call, cap active context well below the theoretical maximum, and summarize completed work into typed state instead of carrying raw transcripts forward. If your prompt keeps getting longer every step, the architecture is drifting in the wrong direction.
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。