Here is a small failure mode that cost me time for longer than it should have.
The agent would run the test suite. Tests would fail. The agent would announce that tests failed. Then, when it needed to know which tests failed, it would either guess, ask, or run the suite a second time to scrape the output. Sometimes it would run it a third time. The failing tests were right there in the output of the first run; the agent didn't read them carefully enough to remember.
This is not the model being lazy. It is the model being honest about its attention. Test output is voluminous, the failure summary is buried at the bottom, and by the time the agent is reasoning about the next step ("which test failed, what assertion, what file"), the relevant lines have scrolled past whatever the agent actually retained. The cheapest way to get the information back is to run the command again.
Running the command again is the wrong answer. A full suite takes minutes. Doing it twice to learn something the first run already told you is a tax on every red-test loop, and the loops compound.
The fix: tee everything to a gitignored directory
The rule I added is short:
Every test command pipes through tee into .test-output/<scope>.log. The directory is gitignored. When you need to inspect failures, grep the log; do not re-run the suite.
The agent's commands now look like:
make test 2>&1 | tee .test-output/full.log
make test-parallel 2>&1 | tee .test-output/parallel.log
make vitest 2>&1 | tee .test-output/vitest.log
That is it. The fix is one Unix utility, deployed deliberately.
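One caveat with a plain pipeline: the shell reports the exit status of the last command, which here is tee, so a failing test run can still look like success to anything that checks the status. A minimal sketch of a wrapper that keeps the real status, assuming a POSIX shell; run_logged and the .status side file are hypothetical names for illustration, not part of the rule above.

```shell
# Hypothetical helper: run_logged <scope> <command...>
# Tees combined stdout/stderr into .test-output/<scope>.log and returns
# the command's own exit status instead of tee's.
run_logged() {
  scope="$1"
  shift
  mkdir -p .test-output
  {
    rc=0
    "$@" 2>&1 || rc=$?
    # The left side of a pipe runs in a subshell, so the status is handed
    # back through a small side file (an assumption of this sketch).
    echo "$rc" > ".test-output/${scope}.status"
  } | tee ".test-output/${scope}.log"
  return "$(cat ".test-output/${scope}.status")"
}
```

With this in place, the first command above would become run_logged full make test. In bash, set -o pipefail is a shorter way to get the same effect.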
Why tee is the right tool
The trick is that tee does not replace the stream; it forks it. STDOUT still flows to the terminal in real time. The agent still sees the test run as it happens, still gets the streaming feedback that lets it react to a hang or an obvious early failure. Nothing about the foreground experience changes.
What changes is that the run is also persisted. After the command exits, the output exists as a file. The agent can grep -n FAIL .test-output/full.log, jump to the failing test, read three lines of context, and decide what to do without burning another full suite run to recover information.
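As a rough illustration of that lookup, the sketch below writes a tiny made-up log and then searches it. The log lines and the FAIL marker are assumptions; real runners format failures differently, so the pattern would need adjusting. The -A flag is a common grep extension (GNU and BSD), not strict POSIX.

```shell
# Made-up log content, only so the commands below have something to search.
mkdir -p .test-output
printf '%s\n' \
  'PASS test_login' \
  'FAIL test_checkout' \
  '  expected 200, got 500' \
  '  at checkout_test line 42' \
  'PASS test_logout' > .test-output/sample.log

# Line numbers of every failure.
grep -n 'FAIL' .test-output/sample.log

# Each failure plus the three lines that follow it.
failures="$(grep -n -A 3 'FAIL' .test-output/sample.log)"
printf '%s\n' "$failures"
```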
You could imagine alternatives. Redirect everything to a file and tail it (loses interactive feedback). Tell the agent to "remember the test output" (the failure mode this is supposed to fix). Increase the agent's context window (treats the symptom, not the cause). The tee approach is boring, which is its strength. It uses a utility that has shipped with Unix since the 1970s, costs nothing, and survives every framework choice the project might make later.
The gitignored directory matters
.test-output/ is committed nowhere. It is ephemeral: overwritten on every run, never inspected by humans, never reviewed in a PR. Making it a real directory rather than /tmp keeps it scoped to the project, which means a grep from the repo root finds it without thinking.
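A one-time setup for that directory might look like the sketch below. It assumes the commands run from the repo root and that the ignore entry belongs in the top-level .gitignore.

```shell
# Create the log directory; tee fails if the target directory is missing.
mkdir -p .test-output

# Append the ignore entry only if that exact line is not already present,
# so re-running the setup does not duplicate it.
grep -qxF '.test-output/' .gitignore 2>/dev/null || echo '.test-output/' >> .gitignore
```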
I also keep the filenames stable: full.log, parallel.log, vitest.log, with no timestamps. The agent never has to ask "which file is the latest?" The latest is the only one. If you want history, the agent can copy the log before re-running. By default, nobody needs history; they need the most recent failure.
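If history is wanted for a particular run, copying the log first can be sketched like this. keep_log and the timestamped naming are hypothetical, chosen only for illustration; the stable name stays the one the agent greps by default.

```shell
# Hypothetical helper: keep_log <scope>
# Snapshots .test-output/<scope>.log under a timestamped name before the
# next run overwrites it. Does nothing if the log does not exist yet.
keep_log() {
  log=".test-output/$1.log"
  [ -f "$log" ] || return 0
  cp "$log" ".test-output/$1.$(date +%Y%m%d-%H%M%S).log"
}
```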
The general lesson
The interesting thing here is not tee. It is the shape of the fix.
An agent reading a 4,000-line test run and an agent grepping a file for FAIL are doing the same work in principle, but the second one is the work the tool is good at. Long, streaming output is something an agent should read past (most of it is success noise) and only return to when a question demands it. Persisting the output to disk turns the question "what failed?" from "scroll back through my context and hope" into "run a deterministic command and read the answer."
Wherever an agent re-runs a command to recover information that was already produced, there is a tee waiting to be added. Build output. Type-check output. Linter output. Long-running scripts that print progress. Any time the command is expensive and the output is the artifact, persist the artifact. The rule shape (fork the stream, persist the artifact, grep the file) generalizes far past test runs.
The agent does not need a better memory. It needs the harness to stop throwing away information that was already on the screen.

























