Fire-and-forget AI engineering: letting agents ship a production app unsupervised

"An AI agent just built a production landing page, with GDPR audit logs and encryption baked in. I wasn't even at my desk."

That is not a lucky one-shot. It is a repeatable workflow. Piotr Karwatka recorded a full tutorial showing how to go from idea to a production-ready app on Open Mercato - the AI-Engineering Foundation Framework for CRM/ERP - with no babysitting and no ping-pong prompting.

This is the technical version: what the loop actually looks like, why it doesn't fall apart, and which patterns you can lift into your own stack.

The problem with conversational coding

The default AI coding loop is single-threaded and human-bound:

prompt -> generate -> you spot a bug -> correct -> re-prompt -> repeat

It holds for snippets. It collapses the moment the task touches real architecture - multi-tenancy, RBAC, event flow, encryption, audit logging. Corrections pile up in the context window, the agent loses the thread, and you are back to typing. You are the bottleneck, sitting in the inner loop.

The workflow in the tutorial moves you to the outer loop: you review a finished, tested PR instead of every keystroke.

goal -> agent: branch + implement + test + open PR -> you: review PR

This works on Open Mercato because the hard architectural decisions are already encoded as conventions, specs, and agent-readable skills (AGENTS.md, task routing, spec skills). The agent is not inventing how RBAC or GDPR logging should work - it reads the foundation and follows it.

1. Fire-and-forget: the autonomous PR loop

The execution agent owns the full unit of work:

1. create a branch: git checkout -b feat/lead-capture-landing
2. implement against framework conventions
3. run the test suite (Playwright integration tests included)
4. open a structured PR: what changed, why, how it was verified
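
The four steps above can be sketched as a command plan. This is a minimal illustration, not the tutorial's actual tooling: the `PullRequest` and `plan_pr_loop` names are hypothetical, the example PR text is made up, `npx playwright test` is only an assumed test command, and opening the PR through the GitHub CLI (`gh pr create`) is an assumption about the hosting setup. The function only builds the commands; it does not run them.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    title: str
    what_changed: str
    why: str
    how_verified: str

    def body(self) -> str:
        """Render the three sections a reviewer needs, in a fixed order."""
        return (
            f"## What changed\n{self.what_changed}\n\n"
            f"## Why\n{self.why}\n\n"
            f"## How it was verified\n{self.how_verified}\n"
        )


def plan_pr_loop(branch: str, pr: PullRequest, test_cmd: list[str]) -> list[list[str]]:
    """Return the commands for one unit of work, in order, without running them.

    Implementation (step 2) is the agent's own work, so it has no command here.
    """
    return [
        ["git", "checkout", "-b", branch],            # 1. branch
        list(test_cmd),                               # 3. run the test suite
        ["git", "add", "-A"],
        ["git", "commit", "-m", pr.title],
        ["git", "push", "-u", "origin", branch],
        ["gh", "pr", "create", "--base", "main", "--head", branch,
         "--title", pr.title, "--body", pr.body()],   # 4. open the structured PR
    ]


# Illustrative values only.
pr = PullRequest(
    title="Add lead-capture landing page",
    what_changed="New landing page that posts leads to the CRM.",
    why="Capture inbound leads without manual entry.",
    how_verified="Playwright integration tests pass locally.",
)
plan = plan_pr_loop("feat/lead-capture-landing", pr, ["npx", "playwright", "test"])
for cmd in plan:
    print(" ".join(cmd[:3]))
```

Keeping the PR body as a typed structure rather than free text means a PR cannot be opened without stating what changed, why, and how it was verified.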

You are no longer correcting tokens. The deliverable is a reviewable artifact. In the tutorial the output is concrete: a live site capturing leads straight into the Open Mercato CRM, with GDPR audit logs and encryption on by default - not bolted on after a compliance pass.

2. Parallel agents without touching `main`

This is the part most people get wrong. Running one agent is trivial. Running N agents in parallel against the same checkout usually ends in file collisions and a corrupted main branch.

The fix is isolation by design - each agent on its own branch/worktree, never writing to main directly:

main
 |-- agent-a -> feat/landing-page      (worktree A)
 |-- agent-b -> feat/crm-webhook       (worktree B)
 +-- agent-c -> feat/consent-logging   (worktree C)

Parallelism is only useful if it is safe. Safety here is structural (separate branches/worktrees), not "hope the agents stay out of each other's way." This is what turns autonomous coding from a single-threaded demo into something that scales like a team.
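
One way to get that structural isolation is `git worktree add -b <branch> <path> <base>`, which creates a new branch and a separate checkout directory in one step. The sketch below only builds those commands; the `worktree_commands` helper, the `../worktrees` directory, and the agent names are assumptions for illustration, not something the tutorial prescribes.

```python
from pathlib import PurePosixPath


def worktree_commands(tasks: dict[str, str], root: str = "../worktrees",
                      base: str = "main") -> list[list[str]]:
    """Build one `git worktree add` command per agent.

    Each agent gets a new branch cut from `base` and its own checkout
    directory, so no two agents share a working tree or a branch.
    """
    branches = list(tasks.values())
    if len(set(branches)) != len(branches):
        raise ValueError("each agent needs its own branch")
    if base in branches:
        raise ValueError(f"agents must not work directly on {base!r}")

    commands = []
    for agent, branch in tasks.items():
        path = str(PurePosixPath(root) / agent)  # e.g. ../worktrees/agent-a
        commands.append(["git", "worktree", "add", "-b", branch, path, base])
    return commands


agents = {
    "agent-a": "feat/landing-page",
    "agent-b": "feat/crm-webhook",
    "agent-c": "feat/consent-logging",
}
commands = worktree_commands(agents)
for command in commands:
    print(" ".join(command))
```

The two guards encode the rule from the diagram: no shared branches, and nobody writes to `main`.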

3. Two-phase spec refinement (Claude + Codex)

The highest-leverage step happens before any code is written. Autonomous output is only as good as the spec, so the workflow generates the spec in two passes.

Phase 1 - architecture-compliant draft. A spec-writing skill produces a spec that already respects framework conventions instead of fighting them.

spec-skill -> SPEC.md  (modules, data model, routes, events, RBAC scope)

Phase 2 - adversarial / "philosophical" review. A second pass deliberately hunts for hidden gaps the first draft missed before a line of code is committed.

review pass -> checks: routing, caching, edge cases, failure modes, consent flow

Model pairing matters here: Claude and Codex are used across the phases so the spec is both convention-compliant and stress-tested. The cost of a wrong assumption is highest at the start, so that is where the scrutiny goes. By the time code is written, the thinking is done.
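
The two phases can be sketched as a small pipeline. Everything here is an assumption for illustration: `two_phase_spec` and the three stub functions are hypothetical stand-ins for whatever actually calls the models, and the single revision step after the review is my reading of how the findings would be used, not something the tutorial spells out. The checklist is the one named above.

```python
from typing import Callable

REVIEW_CHECKS = ("routing", "caching", "edge cases", "failure modes", "consent flow")


def two_phase_spec(
    goal: str,
    draft: Callable[[str], str],
    review: Callable[[str, tuple[str, ...]], list[str]],
    revise: Callable[[str, list[str]], str],
) -> tuple[str, list[str]]:
    """Draft a spec, review it adversarially, and fold the findings back in."""
    spec = draft(goal)                      # phase 1: convention-compliant draft
    findings = review(spec, REVIEW_CHECKS)  # phase 2: hunt for gaps
    if findings:
        spec = revise(spec, findings)       # assumed step: address gaps before coding
    return spec, findings


# Stubs standing in for model calls (hypothetical).
def draft_stub(goal: str) -> str:
    return f"SPEC for {goal}: modules, data model, routes, events, RBAC scope"


def review_stub(spec: str, checks: tuple[str, ...]) -> list[str]:
    # Flag every checklist item the spec never mentions.
    return [check for check in checks if check not in spec]


def revise_stub(spec: str, findings: list[str]) -> str:
    return spec + "; " + "; ".join(findings)


spec, findings = two_phase_spec("lead-capture landing page",
                                draft_stub, review_stub, revise_stub)
print(findings)
print(review_stub(spec, REVIEW_CHECKS))
```

Passing the drafter and reviewer in as separate callables is what lets two different models fill the two roles.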

4. Multi-hour runs and the coordinator sub-agent

Agents run autonomously for hours, which exposes the real enemy of long agent sessions: context burnout. A single agent grinding a long task fills its window with history and loses coherence.

The fix is hierarchical orchestration:

            +---------------------+
            |  Coordinator agent  |  holds the plan, delegates, keeps context lean
            +----------+----------+
        +--------------+--------------+
        v              v              v
   exec agent A   exec agent B   exec agent C
   (fresh ctx)    (fresh ctx)    (fresh ctx)

The coordinator owns the map; the workers own the tasks and run with fresh, scoped context. That separation is what makes unsupervised multi-hour runs possible without the quality collapse that usually follows.
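
A minimal sketch of that split, assuming a hypothetical `Coordinator` class and a worker passed in as a plain callable: the coordinator keeps the plan and a short summary per task, and each worker call receives only a shared brief plus its own task. Tasks run sequentially here to keep the example small; running them in parallel is a separate concern.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Coordinator:
    plan: list[str]                 # the map: one entry per task
    brief: str                      # shared conventions, e.g. a pointer to AGENTS.md
    summary_limit: int = 200        # assumed cap on what the coordinator retains
    summaries: dict[str, str] = field(default_factory=dict)

    def run(self, worker: Callable[[list[str]], str]) -> dict[str, str]:
        for task in self.plan:
            # Fresh, scoped context: no history from earlier tasks is passed in.
            context = [self.brief, task]
            result = worker(context)
            # Keep only a short summary so the coordinator's own context stays lean.
            self.summaries[task] = result[: self.summary_limit]
        return self.summaries


seen_contexts: list[list[str]] = []


def stub_worker(context: list[str]) -> str:
    """Stand-in for an execution agent; records what it was given."""
    seen_contexts.append(list(context))
    return f"done: {context[-1]}"


coordinator = Coordinator(
    plan=["feat/landing-page", "feat/crm-webhook", "feat/consent-logging"],
    brief="Follow AGENTS.md conventions",
)
summaries = coordinator.run(stub_worker)
print(summaries)
```

The point of the design is that a worker's context size depends on its task, not on how long the overall run has been going.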

What actually generalizes

Strip away the demo and three engineering principles remain:

Foundation beats prompting. Agents are only as good as the decisions encoded around them. Ship conventions + specs + skills and the architecture stops being up for debate.
Specs are the leverage point. Compliant draft, then adversarial review. Front-load the thinking and you stop wasting agent runtime building the wrong thing correctly.
Orchestration is the new skill. Branch isolation + a coordinator sub-agent are the plumbing that turns one agent into a safe, parallel, long-running team. This is the work that does not disappear as models improve - it is what lets you use better models at scale.

The detail that is easy to skip: compliance was not a phase, it was a property of the foundation. For anyone shipping CRM/ERP in regulated markets, that is the whole game.

Try it

What is the longest you have ever let an AI agent run unsupervised? Drop it in the comments.
