This is Part 8 of the ForgeFlow series. Part 7, "The File Modification Boundary," documented the constraint that changed how we structure tasks: every autonomous task target should be a new file. We ended Part 7 at 12 projects, roughly 52 failure patterns, and 71 design rules. Part 7 closed with an open question: "Project 13 will be the first real test of whether CL-071 holds under normal conditions."
Quick terms for new readers:
- FC = Failure Catalog entry (a documented failure pattern)
- CL = Crystallized Lesson (a testable design rule derived from repeated failures)
- DEADLOCK = the system gives up after repeated identical failures
- ForgeFlow = a fully local, TDD-based autonomous coding system running on Apple Silicon
Part 7 ended with a hypothesis and a bet.
The hypothesis: CL-071 (every task targets a new file, never modifies an existing one) might reduce or remove the dominant failure mode we'd been observing. The bet: we'd set formal graduation criteria and run projects until we met them — or discovered why we couldn't.
We ran five more projects (with one intermediate rerun included in the data). On the seventeenth — a blog API with 14 tasks — all 33 tests passed without intervention or deadlock, completing in approximately 12 minutes.
This post is about the five projects between that hypothesis and this result, what the graduation criteria actually measured, and the failure that appeared after we thought we'd addressed all the known ones.
The Graduation Criteria
Before results, here's what we were measuring. We didn't want "it worked once" to count as graduation. We defined four conditions, all of which had to hold on a qualifying run:
| Criterion | Threshold |
|---|---|
| First-run pass rate (tasks passing on the first TDD cycle, no retry) | ≥ 85% |
| New FC yield per project | ≤ 2 |
| Repeat FC rate (previously solved patterns recurring) | ≤ 5% |
| Teacher escalation (human operator interventions mid-task) | Decreasing trend |
The logic: a graduated stack should show repeatable autonomous recovery within the tested scope (criterion 1), stop producing novel failure patterns at a high rate (criterion 2), not regress on already-solved problems (criterion 3), and require less human involvement over time (criterion 4).
We chose 85% rather than 100% for the pass rate deliberately. Occasional retries are expected behavior in a TDD loop — in ForgeFlow's architecture, the system is designed to recover from them. What we track is whether it recovers autonomously.
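As a rough illustration, the four conditions can be written as a single check. This is a minimal sketch, not ForgeFlow's actual code: the `RunMetrics` fields, the function names, and the reading of "decreasing trend" as "never goes back up" are all assumptions, and the escalation counts in the example are made up.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RunMetrics:
    """Hypothetical container for one project's numbers."""
    tasks_total: int
    tasks_passed_first_cycle: int          # passed with no retry
    new_fcs: int                           # failure patterns not seen before
    repeat_fc_rate: float                  # fraction of failures matching solved FCs
    escalations_by_project: Sequence[int]  # oldest first, current run last


def first_run_pass_rate(m: RunMetrics) -> float:
    return m.tasks_passed_first_cycle / m.tasks_total


def non_increasing(counts: Sequence[int]) -> bool:
    # Assumption: "decreasing trend" means the count never goes back up.
    return all(later <= earlier for earlier, later in zip(counts, counts[1:]))


def meets_graduation(m: RunMetrics) -> bool:
    """All four thresholds from the criteria table must hold at once."""
    return (
        first_run_pass_rate(m) >= 0.85
        and m.new_fcs <= 2
        and m.repeat_fc_rate <= 0.05
        and non_increasing(m.escalations_by_project)
    )


# Illustrative run: 13 of 14 tasks pass on the first cycle.
example = RunMetrics(14, 13, 1, 0.0, [3, 2, 1, 0])
print(meets_graduation(example))
```

Because the conditions are joined with `and`, a run that passes 10 of 12 tasks on the first cycle fails the check even if every other number is perfect.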
The Five-Project Path
Here's the longitudinal data from Part 7's endpoint (project 12) through the graduation run. Note: this table tracks the autonomous pass rate — tasks that eventually passed without human intervention, including retries. The graduation criterion uses the stricter first-run pass rate (no retries), which we measured separately for the qualifying run.
| # | Project | Tasks | Autonomous Pass Rate | New FCs | CL Count (at time) |
|---|---|---|---|---|---|
| 13 | comment-api | 12 | 83% | 0 | ~72 |
| 14 | order-api | 16 | 56% | 2 | ~74 |
| 15 | recipe-api | 14 | 57% | 1 | ~75 |
| 16 | bookmark-api v2 | 12 | 83% | 0 | ~76 |
| 16.5 | catalog-api-v2 | 12 | 83% | 1 | ~76 |
| 17 | blog-api | 14 | 100% | 1 | 77 |
The trajectory wasn't smooth. Projects 14 and 15 dropped below 60%. Then the pass rate recovered. In this sequence, plateaus tended to expose a new failure category; the system dipped, the failure got crystallized into a rule, and the next project incorporated the fix.
What changed between project 15 (57%) and project 17 (100%) was not a model upgrade or an engine rewrite. It was three additional design rules, each derived from a specific failure we observed and diagnosed.
The Dip: What Went Wrong on Projects 14 and 15
Projects 14 (order-api) and 15 (recipe-api) both hovered around 56–57% autonomous pass rate. The failures clustered around a few patterns:
Route endpoint isolation. Tasks that bundled multiple endpoints into a single file — GET list and GET detail in the same route module — showed a notably higher failure rate than single-endpoint tasks. The outputs showed scope-related failures: given two endpoints to implement, the model would sometimes complete one and leave the other as a stub, or attempt both and introduce inconsistencies.
We already had CL-043 (one task, one endpoint) from Part 6. But we'd been applying it loosely — allowing two closely related endpoints to share a task. Projects 14 and 15 showed us that "closely related" was too vague for this local execution loop. The rule needed to be absolute: one endpoint, one file, one task.
Import specification gaps. Route tasks that didn't explicitly list every required import in their task description had a high failure rate. The model would guess import paths, often incorrectly. CL-072 crystallized this: every route task description must include a complete "Required imports" block. For example:
Required imports: from fastapi import APIRouter, Depends;
from sqlalchemy.ext.asyncio import AsyncSession;
from app.database import get_db;
from app.schemas.author import AuthorCreate, AuthorRead
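A check like this could be automated before a task reaches the model. The sketch below is hypothetical and is not the validator mentioned later in this post: it assumes the "Required imports:" block is the last thing in the task description and that statements are separated by semicolons, as in the example above. The function names and the "Implement POST /authors." task text are invented for illustration.

```python
import ast

MARKER = "Required imports:"


def is_import_statement(text: str) -> bool:
    """True if text parses as exactly one Python import statement."""
    try:
        tree = ast.parse(text)
    except SyntaxError:
        return False
    return len(tree.body) == 1 and isinstance(
        tree.body[0], (ast.Import, ast.ImportFrom)
    )


def extract_required_imports(description: str) -> list[str]:
    """Return the import statements listed after the marker, if any."""
    _, found, block = description.partition(MARKER)
    if not found:
        return []
    candidates = [part.strip() for part in block.split(";")]
    return [c for c in candidates if c and is_import_statement(c)]


task = (
    "Implement POST /authors.\n"  # hypothetical task text
    "Required imports: from fastapi import APIRouter, Depends;\n"
    "from sqlalchemy.ext.asyncio import AsyncSession;\n"
    "from app.database import get_db;\n"
    "from app.schemas.author import AuthorCreate, AuthorRead"
)
print(len(extract_required_imports(task)))  # 4
```

`ast.parse` only checks syntax, so this runs without FastAPI or SQLAlchemy installed. It confirms that a block exists and is well-formed, not that the import paths are correct.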
Decimal type mismatches. In project 16.5 (catalog-api-v2), a product model with a Numeric(10,2) price column exposed a subtle testing issue. The model wrote assertions comparing float literals to the Decimal values SQLAlchemy returns for Numeric columns, and 999.99 != Decimal('999.99') in Python because the float literal is only a binary approximation of 999.99. CL-076 captured this: any Numeric column test must use Decimal comparisons.
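The mismatch is easy to reproduce with the standard library alone. In this small sketch, `stored_price` stands in for the value a Numeric column would typically hand back; no database is involved.

```python
from decimal import Decimal

# Stand-in for the value read back from a Numeric(10, 2) column.
stored_price = Decimal("999.99")

# The float literal 999.99 is a binary approximation, so equality fails.
print(stored_price == 999.99)             # False

# Building the expected value from a string keeps the comparison exact.
print(stored_price == Decimal("999.99"))  # True
```

In a test, that means asserting against `Decimal("999.99")` rather than the bare literal `999.99`.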
In our diagnosis, these looked less like model-capability failures and more like specification-precision failures — cases where the PRD left enough ambiguity for a 45GB quantized model to make a reasonable-but-wrong choice.
The Failure We Didn't Expect: FC-074
Project 17 (blog-api) was designed as the graduation attempt. We applied all 76 existing rules. The PRD passed our automated validator (50 checks passed, 0 failures). We expected fewer known-pattern failures.
The first three attempts all failed on the very first task — creating the Author model. Same error each time: red_apply_empty — the engine's signal that the RED-phase output contained implementation code rather than a test.
Here's what happened, step by step:
- Our setup script created a minimal model stub file — just the class name and primary key column. This was standard practice per CL-066 ("stubs should be PK-only").
- Before the RED phase (test generation), the engine runs FC-060 cleanup: it deletes the target implementation file so the model writes it fresh.
- FC-060 deleted the stub.
- The model didn't need the file to exist at generation time — the surrounding task context still described enough of the intended model structure (via data_models in the PRD and conftest import references) that it produced implementation code during RED instead of a test.
- The engine detected this as a scope violation and triggered red_apply_empty.
- Three retries. Same result each time.
We called this FC-074: the interaction between two previously validated rules (CL-066: keep stubs minimal, and FC-060: clean target files before RED) producing a new failure when combined.
This is worth pausing on. FC-074 wasn't a gap in any single rule. It was an interaction effect — two rules that had each been validated independently across multiple projects, producing a failure only in a specific sequence of operations.
| Rule | Behavior in isolation | Combined behavior |
|---|---|---|
| CL-066 | Minimal stubs reduce over-complete-stub failures | Creates a target file before RED |
| FC-060 | Deletes implementation target before RED to ensure clean state | Removes the stub CL-066 created |
| Combined | — | RED sees a missing target but enough context to generate implementation instead of a test |
The Fix: Stop Creating Stubs
The first instinct was to adjust the prompt wording — tell the model more explicitly to write a test, not an implementation. We tried that. Same failure. Prompt changes alone didn't resolve it; file-state became the stronger hypothesis.
The second instinct was to refine the stub. But we diagnosed the stub's existence as the likely trigger: FC-060 deleted it, and the residual context information was enough to derail the RED phase.
The third attempt was the simplest: don't create the stub at all.
CL-077: Setup scripts must not create model stub files. Model files are created from scratch by the task that implements them. The conftest wraps model imports in try/except so that earlier tasks can run before the model file exists:
try:
    from app.models.author import Author
except ImportError:
    Author = None
This inverted an assumption we'd held across the previous 16 project iterations. We'd operated under the belief that providing a stub — even a minimal one — helped the model by giving it a starting point. FC-074 suggested that in our current engine architecture, the stub hurt by creating a state that the cleanup logic couldn't handle cleanly.
After applying CL-077, the same blog-api project ran all 14 tasks to completion. 33 tests passed, zero intervention, approximately 12 minutes total.
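If a conftest guards several models this way, the try/except can be folded into one helper. This is a sketch of that idea, not what ForgeFlow ships: `optional_attr` is a hypothetical name, and returning None when the attribute is missing is an assumption chosen to mirror the `Author = None` fallback above.

```python
import importlib
from typing import Any, Optional


def optional_attr(module_path: str, name: str) -> Optional[Any]:
    """Return module_path.name, or None if the module is not importable yet."""
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        # Covers ModuleNotFoundError too: the model file has not been written.
        return None
    return getattr(module, name, None)


# Before the Author task has run, this is None instead of a collection error.
Author = optional_attr("app.models.author", "Author")
```

Tests that need the model can then skip or fail explicitly when `Author` is None, rather than breaking test collection for every earlier task.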
What the Graduation Run Measured
Here's how project 17 scored against the criteria:
| Criterion | Threshold | Project 17 Result |
|---|---|---|
| First-run pass rate | ≥ 85% | 93% (13/14 first-shot, 1 retry) |
| New FC yield | ≤ 2 | 1 (FC-074) |
| Repeat FC rate | ≤ 5% | 0% |
| Teacher escalation | Decreasing | Zero escalations |
Project 17 met all four thresholds. The preceding project (16.5, catalog-api-v2) reached an 83% autonomous pass rate. Since the stricter first-run rate cannot exceed the autonomous rate, that run was close but below the ≥ 85% line. So we are treating project 17 as the graduation point rather than claiming a two-project stable plateau.
To be precise about what this means and what it doesn't:
What it means: On the specific runs we executed — FastAPI + SQLAlchemy async + pytest projects with CRUD-level complexity and 1:N foreign key relationships, using Qwen3-Coder-Next 45GB Q4_K_M on Apple Silicon M5 Max 128GB with 77 design rules — the system completed the full project autonomously within the scope of new-file-creation tasks.
What it doesn't mean: We haven't tested more complex architectural patterns (many-to-many relationships, authentication flows, file uploads, WebSocket endpoints). We haven't tested with different model families or hardware tiers. The 100% figure is for one specific project run; it's a data point, not a guarantee.
77 rules is a lot of rules. Each one was derived from at least one observed problem. But the cumulative load of maintaining 77 interacting rules is substantial. We don't yet know if this scales — whether a 200-rule system would be manageable or would collapse under interaction effects. This matches a concern we are starting to track internally: beyond a certain threshold, adding more constraints may dilute model attention rather than improve output. In our design, we've set a ceiling of 20 CLs per prompt injection bundle to guard against this, but we haven't yet hit a project that tests that limit.
The Rule Accumulation Curve
One pattern we've been tracking across the series is how the rate of new rule discovery changes over time:
Projects 1–3: CL-001 to CL-020 (~7 per project)
Projects 4–6: CL-021 to CL-035 (~5 per project)
Projects 7–9: CL-036 to CL-051 (~5 per project)
Projects 10–12: CL-052 to CL-071 (~7 per project)
Projects 13–17: CL-072 to CL-077 (~1 per project)
The yield dropped from roughly 7 new rules per project to roughly 1. We're cautious about reading too much into this — it could mean we're approaching the boundary of what our current project complexity can reveal, rather than the boundary of what rules exist. More complex projects might expose entirely new failure categories.
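The per-project figures follow directly from the CL ranges listed above. The short calculation below recomputes them, under the assumption that run 16.5 counts as a rerun rather than a separate project, so the last phase spans five projects.

```python
# (first project, last project, first CL, last CL), taken from the ranges above.
phases = [
    (1, 3, 1, 20),
    (4, 6, 21, 35),
    (7, 9, 36, 51),
    (10, 12, 52, 71),
    (13, 17, 72, 77),  # assumption: 16.5 is a rerun, not a sixth project
]


def rules_per_project(first_proj, last_proj, first_cl, last_cl):
    """New rules in the phase divided by projects in the phase."""
    projects = last_proj - first_proj + 1
    rules = last_cl - first_cl + 1
    return rules / projects


rates = [round(rules_per_project(*phase), 1) for phase in phases]
print(rates)  # [6.7, 5.0, 5.3, 6.7, 1.2]
```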
But within the FastAPI + SQLAlchemy + CRUD scope, the flattening is visible in this dataset. The most notable new failure in this stretch was an interaction effect between existing rules — FC-074 — rather than an entirely novel pattern.
The Interaction Effect Problem
FC-074 taught us something we hadn't articulated before: as the rule set grows, the opportunity for interaction effects between rules increases. Each rule is validated independently, but the system runs them all simultaneously.
This resembles a familiar problem in complex systems: the space of pairwise interactions grows faster than the number of components. We can't test all combinations manually.
We don't have a systematic solution for this yet. What we have is a detection mechanism: when a failure occurs that doesn't match any existing FC pattern, we now check whether it could be an interaction between two rules that had both worked in isolation in prior runs. FC-074 was caught this way.
Whether this can be automated — detecting interaction effects without human diagnosis — is an open question. The engine could potentially track which CLs were active when a novel failure occurs and flag the pairwise candidates, but we haven't built that yet.
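To put numbers on the growth, n rules have n(n-1)/2 unordered pairs. The sketch below computes that for a few rule counts and shows one way the pairwise candidates could be listed. `candidate_pairs` is a hypothetical helper, not something the engine has today, and the three active rule IDs are chosen purely for illustration.

```python
from itertools import combinations
from math import comb


def candidate_pairs(active_cls):
    """All unordered pairs of rules that were active when a novel failure occurred."""
    return list(combinations(sorted(active_cls), 2))


# Pair counts for a 20-rule bundle, the current 77 rules, and a 200-rule system.
print(comb(20, 2), comb(77, 2), comb(200, 2))  # 190 2926 19900

# Illustrative only: pretend these three rules were active for the failing task.
pairs = candidate_pairs({"CL-043", "CL-066", "CL-072"})
print(pairs)
```

Restricting the search to rules that were actually active for the failing task keeps the candidate list short: three active rules give 3 pairs to inspect, and a full 20-rule bundle gives 190 rather than 2,926.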
What Comes Next
Graduating from the FastAPI stack opens a question: what do we do with a graduated stack?
We see two directions, each answering a different question:
Direction A: Complexity escalation. Stay on FastAPI but increase project complexity — many-to-many relationships, authentication flows, nested resources, pagination. This tests whether the current 77 rules hold at higher complexity or whether new failure categories emerge.
Direction B: Stack transfer. Move to a different framework and measure how many of the 77 rules transfer. Our rules are categorized by stack tags — 29 are marked "universal," 32 are "fastapi"-specific. A new stack would test whether the universal rules actually are universal.
The question we're most interested in now isn't whether we can achieve another 100% run. It's whether a rule-based agent system can keep growing without becoming harder to reason about than the model it was designed to constrain.
Series Links
- Part 1: 164 Failures Before 35 Tests
- Part 2: We Didn't Migrate from n8n Because n8n Failed
- Part 3: The Determinism War
- Part 4: The Information Design Gap
- Part 5: DCR Wasn't Enough
- Part 6: The Bug Wasn't in the Model
- Part 7: The File Modification Boundary
ForgeFlow runs on a MacBook Pro M5 Max 128GB. Planning uses Claude (cloud API). Execution is fully local — Qwen3-Coder-Next 45GB via Ollama, gemma4:26b for QA, Docker sandbox, no API calls during the coding loop. The methodology and failure data are shared in this series.
If you're building something similar — local AI agents, TDD automation, failure catalog systems — I'd be interested to hear whether you're seeing interaction effects between your own accumulated rules. The comments are open.