Your AI coding agent is a while loop with delusions of grandeur

The first time I used Claude Code to refactor an entire module, I had an almost mystical experience. I described what I wanted, went to grab a coffee, and when I came back there was a pull request with 14 changed files, updated tests, and a decent commit message. "This is magic," I thought.

It's not magic. It's a while loop.

Michael Bolin from OpenAI recently published an article breaking down the internal workings of Codex CLI. And it turns out the secret behind AI coding agents isn't some revolutionary algorithm or mysterious neural network. It's a loop that calls an LLM, executes tools, and repeats until there's nothing left to do.

Let's tear it open.

The state machine: 5 phases and a loop

Every coding agent — Codex, Claude Code, Cursor, whatever — executes the same fundamental pattern. Michael Bolin describes it as a 5-phase loop:

flowchart TD
    A["1. Prompt Assembly\n(build the prompt)"] --> B["2. Inference\n(send to LLM)"]
    B --> C{Tool call?}
    C -->|Yes| D["3. Tool Invocation\n(execute tool)"]
    D --> E["4. Tool Response\n(return result to LLM)"]
    E --> B
    C -->|No| F["5. Assistant Message\n(final response)"]
    F -->|New input| A

    style A fill:#2d3748,stroke:#4a9eed,color:#fff
    style B fill:#2d3748,stroke:#4a9eed,color:#fff
    style C fill:#4a3728,stroke:#ed9a4a,color:#fff
    style D fill:#2d3748,stroke:#4a9eed,color:#fff
    style E fill:#2d3748,stroke:#4a9eed,color:#fff
    style F fill:#283d28,stroke:#4aed5c,color:#fff

In plain terms:

Prompt Assembly: a massive prompt is built containing everything the agent needs to know — your message, system instructions, available tools, files it has read, and the complete conversation history.
Inference: that prompt is tokenized and sent to the model. The model returns a stream of events: internal reasoning, tool calls, or response text.
Tool Invocation: if the model asks to execute a tool (read a file, run a command, write code), it gets executed. If it fails, the error goes back to the model.
Tool Response Loop: the tool's result returns to the model as additional context. Steps 2-4 repeat until the model stops requesting tools.
Assistant Message: when the model decides it's done, it emits a final message and the cycle closes.

That's it. No knowledge graphs, no symbolic planners, no sophisticated architectures. It's a while loop with an LLM inside.

The difference between a good agent and a bad one isn't in the loop architecture — which is identical — but in the details of each phase.

Phase 1: The art of prompt assembly

The first phase is where everything gets cooked. Before the LLM sees a single line of your code, the agent has to build a prompt that includes:

flowchart LR
    subgraph Prompt["Prompt Assembly"]
        direction TB
        SP["System Prompt\n(personality, rules)"]
        Tools["Available tools\n(Read, Write, Bash, MCP...)"]
        Ctx["Files / images\nread previously"]
        Inst["CLAUDE.md / AGENTS.md\n(repo instructions)"]
        Env["Environment info\n(OS, shell, git status)"]
        Hist["Conversation\nhistory"]
        User["User message"]
    end

    SP --> Final["Complete\nprompt"]
    Tools --> Final
    Ctx --> Final
    Inst --> Final
    Env --> Final
    Hist --> Final
    User --> Final

    style Final fill:#283d28,stroke:#4aed5c,color:#fff

Already you can see a critical design decision: order matters. The prompt is built from most stable to least stable. The system prompt goes first (never changes), then tools (rarely change), then files and history (grow with each interaction), and finally your latest message.

Why this order? Prompt caching. Since caching works by exact prefix matching, putting stable content first maximizes the number of tokens read from cache on each iteration. Changing something early invalidates everything that follows. I covered this in detail in my article about prompt caching, but the key idea is: your prompt order isn't cosmetic, it's economic.

Then there are the CLAUDE.md and AGENTS.md files. Both are like leaving a note for the plumber before you leave the house: "the shutoff valve is under the sink, don't touch the blue pipe." The agent reads them on startup and injects them into every prompt. They're your mechanism for providing context without having to repeat yourself every time.

The quadratic problem: why context grows like a snowball

Here comes the reality check. Each loop iteration sends the entire complete conversation to the model. There's no server-side state. Each request is independent, stateless.

Why? Because this way the provider can guarantee Zero Data Retention — your data doesn't persist on their servers between requests. It's a privacy decision, not an efficiency one.

But it has a brutal cost:

flowchart LR
    subgraph Msg1["Iteration 1"]
        S1["System\n10K tok"] --> U1["User\n500 tok"]
    end

    subgraph Msg5["Iteration 5"]
        S5["System\n10K tok"] --> H5["History\n40K tok"] --> U5["User\n500 tok"]
    end

    subgraph Msg20["Iteration 20"]
        S20["System\n10K tok"] --> H20["History\n180K tok"] --> U20["User\n500 tok"]
    end

    style Msg1 fill:#1a2332,stroke:#4a9eed,color:#fff
    style Msg5 fill:#2a2332,stroke:#9a4eed,color:#fff
    style Msg20 fill:#3a1a1a,stroke:#ed4a4a,color:#fff

On iteration 1 you send 10K tokens. On iteration 5, you send 50K. On iteration 20, you send 190K. Each message resends the entire previous history. And since the transformer's self-attention mechanism has quadratic cost relative to the number of tokens, it's not just the amount of data sent that grows — the computational cost of processing it grows too.

Put differently: iteration 20 doesn't cost 20 times more than the first. It costs much more.

Compaction: compression without losing what matters

Both Codex and Claude Code have a solution for runaway context growth: compaction (or automatic compression).

When the history approaches the context window limit, the agent does something clever: it sends the entire history to a special endpoint that generates a compressed representation. Instead of 180K tokens of conversation, you get maybe 20K that capture the decisions made, files modified, and current task state.

flowchart TD
    Full["Complete history\n180K tokens"] --> Check{Near limit?}
    Check -->|No| Continue["Continue normally"]
    Check -->|Yes| Compact["Compaction endpoint"]
    Compact --> Summary["Compressed summary\n~20K tokens"]
    Summary --> NewCtx["New context\n= System + Summary + Last message"]
    NewCtx --> Continue2["Continue with fresh context"]

    style Full fill:#3a1a1a,stroke:#ed4a4a,color:#fff
    style Summary fill:#283d28,stroke:#4aed5c,color:#fff
    style Compact fill:#2d3748,stroke:#4a9eed,color:#fff

Compression isn't free. You lose detail. The model no longer has access to the exact diff you made in step 7, but rather a summary saying "refactored authentication module." For most tasks this is sufficient. For surgical debugging, it can be a problem.

Codex calls it compaction. Claude Code does something equivalent with automatic context compression. The idea is identical: when context gets out of hand, compress the past and move forward with a lighter version.

Sandbox: the golden cage

Both agents execute tools in a sandbox — a restricted environment where network access and filesystem access are limited by default.

This is fundamental. Without a sandbox, an rm -rf / generated by model hallucination would destroy your machine. With a sandbox, the worst case scenario is breaking something within the permitted boundaries.

Claude Code asks for confirmation for each potentially destructive operation (unless you explicitly approve it). Codex CLI operates by default in a similar explicit permissions mode.

The lesson here isn't technical, it's philosophical: an agent that can do anything is an agent you can't trust. Restrictions aren't limitations — they're guarantees.

Codex CLI vs Claude Code: non-identical twins

Now comes the fun part. Both are the same loop inside, but design decisions diverge at interesting points:

flowchart TB
    subgraph Codex["Codex CLI (OpenAI)"]
        direction TB
        CG["Desktop GUI\n(Command Center)"]
        CS["Generic shell\n(bash/terminal)"]
        CA["Automations\n(native scheduling)"]
        CD["Diffs with\ninline comments"]
    end

    subgraph Claude["Claude Code (Anthropic)"]
        direction TB
        CC["CLI-first\n(native terminal)"]
        CT["Dedicated tools\n(Read, Edit, Grep, Glob)"]
        CK["Skills\n(/blog, /improve...)"]
        CF["Conversational\nfeedback"]
    end

    style Codex fill:#1a2332,stroke:#4a9eed,color:#fff
    style Claude fill:#2a1a32,stroke:#9a4eed,color:#fff

Tools: generic vs specialized

Codex gives the model access to a generic shell. If you want to read a file, the model executes cat file.py. If you want to search text, it runs grep -r "pattern" ..

Claude Code does the opposite: it has dedicated tools for each operation. Read for reading files, Edit for editing them (with exact string replacement, not complete rewriting), Grep for searching, Glob for finding files by pattern.

Which is better? Depends how you look at it. The generic shell is more flexible — anything you can do in a terminal, the model can do. But dedicated tools are safer and more efficient. An Edit that only sends the diff of the change is faster and less error-prone than a cat > file.py << 'EOF' that rewrites the entire file.

My experience: dedicated tools win for 90% of cases. The generic shell wins when you need to do something exotic that no tool covers.

GUI vs CLI

Codex bets on a desktop GUI (Command Center) where you see diffs like in a pull request, can add inline comments to changes, and have a graphical view of what the agent is doing.

Claude Code is pure CLI. Your terminal. Your shell. No windows. If you want to review a change, the agent shows it to you in text. If you want to give feedback, you write it as another message in the conversation.

What do I prefer? The CLI, by far. And not out of hacker purism. It's that a CLI integrates with everything: tmux, scripts, cron, CI pipelines, remote control via SSH. A GUI ties you to a specific screen. For interactive sessions the GUI is more visual, yes. But for real work — long tasks, automation, agents running solo — the CLI has no competition.

Scheduling: native vs DIY

Codex has Automations: you can schedule tasks that run automatically (react to a GitHub event, launch an agent every morning, etc.). It's native scheduling within the platform.

Claude Code has none of that. If you want an agent to run every 30 minutes, you set up a cron job or systemd timer. If you want it to react to a webhook, you build the integration yourself.

Here Codex has an objective advantage for teams wanting automation out of the box. But Claude Code's DIY solution has a non-obvious advantage: you control the infrastructure. If Anthropic changes their API, your cron job keeps working because it's your machine. If OpenAI changes Automations, you're stuck.

What really matters

After dissecting the guts of both agents, the conclusion is almost disappointingly simple:

A coding agent is a loop that builds a prompt, calls an LLM, executes tools, and repeats. Period.

The magic isn't in the loop. It's in three things:

Model quality. A while loop with GPT-3 does nothing useful. With Claude Opus or GPT-4o, it refactors entire modules. The loop is the same — the brain inside the loop is what makes the difference.
Context management. The prompt can't grow infinitely. How you order information, when you compress, what you prioritize when compressing — that's where real engineering matters. An agent that loses critical context during compression makes mistakes a human never would.
Tool design. Giving an LLM unrestricted access to bash is like giving car keys to someone who's never driven. Well-designed tools (with validation, constraints and clear error feedback) are the difference between an agent that helps you and one that goes off the rails and deletes node_modules at three in the morning.

Next time your coding agent does something that seems like magic, remember: it's a while True with an LLM inside. Elegant, yes. Powerful, absolutely. But magic? Not quite.

Sources: The main article is "What Actually Happens Inside an AI Coding Agent (We Unrolled It)" by Michael Bolin (OpenAI). The Claude Code comparison comes from direct experience and Anthropic's official documentation. If you're interested in context and caching, read Por qué el 99% de lo que envías a Claude ya lo tiene en caché and El cache de tu LLM te cobra el doble por ahorrarte dinero.

This article was originally written in Spanish and translated with the help of AI.

推荐订阅源

DEV Community