I Let Claude Code Run Unsupervised for 24 Hours. Here's What Happened.

The experiment was simple: point Claude Code at a real project, give it a task list, stand back for 24 hours, and document what actually came back. No hand-holding. No mid-session prompts. Just a system prompt, a tool set, and an instruction file that spelled out what I wanted done by morning.

What came back was not what I expected. Some of it was better. Some of it was a mess that required a full afternoon to untangle. All of it was useful data for anyone trying to build persistent, autonomous agent workflows.

The Setup

The project was a Python-based recon automation tool I had been working on incrementally. Backend was functional but the codebase was disorganized, the output formatting was inconsistent, and I had a backlog of about 15 documented issues ranging from minor refactors to one genuinely unpleasant bug in the rate-limiting logic.

Claude Code ran inside a tmux session on a headless Ubuntu VPS. The CLAUDE.md file at the project root defined the task priority order, which directories were off-limits, what the output format for completed tasks should look like, and a hard rule: if it encountered something that required a decision with more than two plausible outcomes, it should stop and write a BLOCKED.md file describing the ambiguity rather than picking arbitrarily.

Tool permissions were scoped intentionally. File read/write for the project directory. Bash execution limited to the virtual environment. No network access beyond localhost. I used OpenClaw to manage the persistent session so the process survived across any connection drops overnight.

The model was claude-sonnet-4-5. Max tokens per call set to 8192. Task file had 15 items.

What It Completed

By hour 6, it had closed 9 of the 15 tasks. The refactors were clean. Variable naming was consistent with the existing conventions, which it had apparently inferred from the surrounding codebase rather than defaulting to its own preferences. The inconsistent output formatting issue was resolved correctly on the first attempt, and the fix was minimal: four lines changed across two files rather than the wholesale rewrite I had half-expected.

The rate-limiting bug was the interesting one. The original issue was that the backoff logic was recalculating from a stale timestamp when requests were batched close together. Claude Code identified the root cause correctly, wrote a fix, and then did something I had not asked for: it added three targeted unit tests that covered exactly the edge cases the bug had exposed. The tests passed. I ran them manually after the fact against the original broken code to confirm they would have caught the bug before it shipped.

That is the best-case outcome in this kind of workflow. Not just fixing what was asked, but leaving the codebase more defensible.

Where It Got Stuck

Three tasks produced BLOCKED.md entries. One was legitimate: a task asking it to "clean up the config loading logic" was genuinely ambiguous because the config could be refactored in two structurally different directions depending on a product decision I had not documented. The block note was accurate and well-described. That one I appreciated.

The second block was less impressive. The task involved updating a requirements.txt dependency to the latest compatible version. Claude Code flagged this as requiring a decision, but the specific concern it cited was about a version constraint that was not actually present in the file. It had hallucinated a constraint that did not exist and blocked itself on the phantom conflict. This is worth knowing: when it encounters something it is uncertain about, it will sometimes manufacture a reason for the uncertainty rather than saying it does not know.

The third block was a task I had phrased badly. That one is on me.

The Three Tasks It Got Wrong

Of the 12 tasks it attempted to complete, three produced code that required rework.

Two of them were stylistic: the output it generated was functionally correct but did not match the surrounding code's conventions for error handling. Catching Exception broadly where the rest of the codebase used specific exception types. Not a bug, but technical debt being introduced at the same time debt was being reduced elsewhere.

The third was more significant. A task that involved adding a logging call to an existing function resulted in the log statement being placed inside a conditional branch where it would only fire in one of three code paths. The log was there, but it was not where it needed to be to be useful. The test suite did not catch it because the tests covered the happy path. This is the kind of error that happens because the model understood the syntactic task but not the observability intent behind it.

The lesson: tasks that require understanding the operational purpose of a feature, not just its structure, need more context in the task file. "Add logging" is not a task. "Add a DEBUG log entry at the start of process_batch() so every call is traceable regardless of which branch executes" is a task.

The Drift Problem

By hour 18, something subtle had happened. The model was still working, still producing output, but its task selection had drifted. Rather than following the priority order in the task file, it had started making judgment calls about which tasks were "related" and batching them by proximity in the codebase rather than by documented priority.

This is not malicious or chaotic. From a purely local optimization standpoint, it makes sense to fix two things in the same file in one pass. But it meant that task 12 got completed before task 7, and task 7 was the one I actually cared about finishing by morning.

Long unsupervised runs need a constraint on task ordering, not just task content. If priority is important, say so explicitly and repeatedly in the CLAUDE.md. "Complete tasks in numbered order. Do not batch by proximity. Do not skip ahead." Vague priority signals do not survive a 24-hour context.

What the Log Told Me

OpenClaw kept a persistent log of every tool call and model output across the session. Reading through it the next morning was the most valuable part of the experiment.

The log showed that the model made 214 file read operations and 61 file write operations. It ran bash commands 38 times, mostly to invoke the test suite after changes. Three of those bash runs failed because a test I had not written yet was referenced in a test config I had forgotten about. Claude Code handled the failures correctly: it read the error output, identified the missing test file, and skipped the affected test rather than blocking the entire run.

The log also showed something about pacing. The first 8 hours had dense activity. Hours 8 through 16 were slower, with longer gaps between tool calls that I cannot fully explain without deeper inspection of the output. By hour 20 it had picked back up. Whether this reflects something about context window management or just the nature of the remaining tasks, I cannot say with certainty. It is worth monitoring in future runs.

What This Workflow Is Actually Good For

Unsupervised Claude Code runs are not a replacement for thinking about the work. The output quality is directly proportional to the quality of the input: task specificity, codebase context, and the CLAUDE.md constraints all matter more than most people expect going in.

Where the workflow genuinely delivers: well-scoped maintenance tasks on codebases with clear conventions and a test suite. Refactors, consistency fixes, adding coverage to existing functionality, implementing documented interfaces. Tasks where "correct" has a verifiable definition.

Where it fails or requires significant rework: anything that requires understanding intent rather than structure, tasks with ambiguous scope, and anything where the right answer depends on a product or design decision that has not been written down somewhere Claude Code can read.

The 24-hour framing is useful for a specific reason: it forces you to document your intent well enough that a system with no ability to ask clarifying questions can execute it. If you cannot write a task description that would succeed in that constraint, the problem is probably not the agent.

The full methodology for setting up persistent Claude Code agents, including the CLAUDE.md templates, OpenClaw configuration, and task file structure I used for this run, is in two guides at numbpilled.gumroad.com.

OpenClaw + Claude Code: 24/7 Persistent Agent Playbook (2026) covers the session persistence layer, tool scoping, and log management: numbpilled.gumroad.com/l/openclaw-claude-code

Paperclip Method: Replace Your Dev Team With Persistent Claude Agents covers the task file architecture, CLAUDE.md structure, and how to scope work so unsupervised runs don't drift: numbpilled.gumroad.com/l/paperclip-claude-method

Both are short, dense, and written for people who have already run Claude Code at least once and want more structured control over what it does when you are not watching.

推荐订阅源

DEV Community