Your AI coding agent doesn't need a smarter model. It needs your backlog.

Here is the uncomfortable thing I have landed on after a year of watching coding agents succeed and fail on real work: the model is almost never the bottleneck. Claude Code and Codex are both more than capable of the feature you are asking for. What breaks the run is that the agent cannot see the truth it is supposed to build against. The story. The acceptance criteria. The architecture decision it is meant to respect. The test that already exists for the thing it is about to rewrite.

So it guesses. The guess is locally reasonable and globally wrong, and you spend the afternoon unwinding it. The instinct is to reach for a smarter model. The fix is to give the model your backlog.

Why pasting context stops working

Most of us feed an agent context by pasting it. You paste the ticket, a few file paths, maybe a paragraph of background, and you let it run. This works for a self-contained task and falls apart the moment the work touches the rest of the system.

The reason is simple. Pasted context is a snapshot, and snapshots go stale inside the same session. The agent makes a change on step three that invalidates the assumption you pasted on step one, but the pasted text does not update, so by step seven it is reasoning about a version of the project that no longer exists. You are not giving it context. You are giving it a photograph of context and asking it to navigate a moving room.

The second problem is that the things that actually matter for a real feature are relationships, not paragraphs. Which architecture decision constrains this story. Which test verifies this acceptance criterion. What defect we last saw in this area. None of that lives in a paragraph you can paste. It lives in the links between artifacts, and a paste flattens all of it into prose the agent has to re-infer.

To be clear, this is not an argument that models do not matter. A better model is genuinely better at reasoning once it has the right inputs. The claim is narrower and more useful: for the failures most teams actually hit on bigger tasks, fixing the inputs beats upgrading the model, and it is cheaper.

What the agent actually needs

It needs a source of truth it can query on demand, not a wall of text you pasted once.

When the agent can query, it pulls the current state at the moment it needs it. It asks "what are the acceptance criteria for this story" right before it writes the code, not at the start of a session that has since drifted. It asks "what tests already cover this module" before it rewrites the module, so it stops breaking things it did not know existed. It asks "which decision governs this boundary" before it crosses the boundary. The context is live because it is fetched, not remembered.

For that to work, two things have to be true. The truth has to exist in a structured, linked form, and the agent has to have a way to reach it. The first is a product problem. The second is a protocol problem, and the protocol now exists.

MCP is the part that just got easy

The Model Context Protocol is the reason this is suddenly practical rather than a research project. MCP is the standard way for an agent like Claude Code or Codex to call out to an external system and read or write structured data. Instead of you copying your backlog into a prompt, the agent connects to a server and queries the backlog directly, the same way it would call any other tool.

// Instead of pasting context, the agent fetches it the moment it needs it:
const story = await mcp.call("get_story", { id: "STR-481" });
// -> { title, acceptanceCriteria, linkedTests, linkedADRs, status }
//
// Now it writes against the real acceptance criteria and the tests that
// already exist, not a snapshot you pasted at the top of the session.

It is worth being precise about why this beats the usual "AI that knows your data" pitch, which almost always means vector search. Embedding your docs and retrieving the most similar passage is fine for "summarize this page" and useless for "which decision constrains this story," because similarity is not the same as relationship. A graph answers the relationship question by traversal: this story, to the decisions in its epic, to the ones touching the same boundary. The retrieval is structural, not statistical, and structure is exactly what a coding agent needs when the task spans more than one file.

A concrete before and after

Take a normal request: add rate limiting to an API endpoint.

In the paste workflow, you copy the ticket, mention the endpoint, and let the agent go. It writes a reasonable rate limiter. It does not know you already have a rate-limiting utility in the codebase because that was not in the paste, so now you have two. It does not know the architecture decision that says limits live at the gateway, not the handler, because that ADR is in a separate tool nobody linked. It writes a test, but not one that matches the acceptance criterion about per-tenant limits, because the AC was three tabs away. The code looks fine in review and is wrong in three quiet ways.

In the queryable workflow, the agent reads the story, sees the per-tenant acceptance criterion, queries the architecture decisions for the area and finds the gateway rule, checks existing tests and finds the utility, and writes against all of it. The pull request that comes back is not just plausible, it is consistent with how your system already works. You review intent, not archaeology.

The model was identical in both runs. The inputs were not.

A quick way to tell if context is your problem

Look at your last five agent failures and sort them. If the agent produced code that was wrong about how your system works, that is a context problem, and plumbing fixes it. If it produced code that was technically fine but solved the wrong thing, that is a clarity problem, and better acceptance criteria fix it. If it produced code that was just low quality on a simple task, that is the one case where a better model actually helps. In my experience the first bucket is the largest by a wide margin, and it is the cheapest to fix.

Where this does not help, and where simpler is right

If your tasks are genuinely small and self-contained, scripts, one-file changes, throwaway prototypes, none of this matters. Paste the context and move on. Wiring up a source of truth for work that fits in one screen is overkill.

If your context problem is actually a clarity problem, no amount of plumbing fixes it. Half of "the agent did the wrong thing" is really "nobody ever defined what done meant in checkable terms." If your acceptance criteria are vague prose, the agent will build vague prose.

And if you live entirely inside one tool that your agent already integrates with deeply, you may have enough of this already. The gap shows up when the truth the agent needs is spread across your tracker, your docs, your diagrams, and your test tool, none of which talk to each other.

The shift in how I think about agents now

I used to treat the agent as the thing to improve. Better prompts, better model, better tooling around the prompt. I now treat the agent as fixed and the context as the variable. Given a capable model, the quality of the output is mostly a function of what the agent can see at the moment it acts. Improve what it can see and the same model gets noticeably better, on the same task, on the same day.

That reframing is freeing, because context is something you control. You cannot make the model smarter this afternoon. You can absolutely give it your backlog this afternoon.

This is the thesis I ended up building Stride around: one connected graph of stories, tests, and architecture decisions, exposed to your coding agents over MCP so they read the real thing instead of a paste. But the idea stands on its own no matter what you use. Give your agent your backlog, not a photograph of it.

What are the rest of you doing to keep agents grounded once the task is bigger than a single file? I am collecting approaches and would genuinely like to hear them.

推荐订阅源

DEV Community