Plain Harness Engineering in Practice



透明人 · 2026-06-28 · via 明天的乌云

No complex Claude/Codex harness. Just an extremely simple and plain practice of Harness Engineering.

The project is 100% Artificial Intelligence. This post is 100% Artisan Intelligence.

Internal version: bytetech

In February 2026, OpenAI published an article on Harness Engineering.

Build and ship a software product without a single line of code written by humans

Humans steer. Agents execute.

Five months later, the repository had about one million lines of code, covering application logic, infrastructure, tools, documentation, and internal developer tools.

In March, I started a side project for a reading product and planned to practice Harness Engineering with it.

Three months later, again, not a single line of code was written by humans. The project has been fully taken over by agents.

What they took over is not just development, but the entire process including design, development, testing, and operations.

It took three months because this is a side project, and also because tokens are limited.

The whole tech stack is very simple and transparent. It also proves that Harness Engineering can be very plain and simple.

I wrote three of the eight items below myself, so consider this ad time.

Model: GPT-5.5 / DeepSeek-V4-Pro
- GPT-5.5 is the absolute main force. DeepSeek fills in when I run out of tokens.
Agent: Pi Agent
- I used Claude Code very early on and switched to Codex for a short period in the middle, but most of the main work was done with Pi Agent.
Context management: pi-context
- The earliest reason I switched to Pi was actually to iterate on this plugin. After many rounds of iteration, it has been completely rewritten into version 2.0 and is now very pleasant to use. It helps the model keep working for a long time with low context usage and high reasoning quality.
Task management: pi-tmux-task
- Pi does not have a background task feature. For long-running asynchronous tasks, such as sub-agent scheduling or dev-server operations, background tasks are necessary. But there is no need to over-design a tool. Letting agents manage things with tmux is enough.
Web search: pi-web-search
- Essential for searching docs and references.
Sub-agent: Paseo
- Pi does not have a sub-agent feature. In theory this could also be implemented with the Pi CLI, but Paseo already wraps it well and has a UI that makes observation easier.
General skills: waza
- think and hunt are very useful.
IDE: Zed
- Once code became something I only read but no longer write, even VS Code started to feel too heavy. Zed is lighter and makes switching worktrees very convenient. But VS Code’s Git features are much better than Zed’s.
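
The tmux approach described under task management could look roughly like the sketch below. This is not pi-tmux-task's actual interface: the function names and the session-naming scheme are hypothetical, and only the tmux subcommands and flags (`new-session -d -s`, `capture-pane -p -t`, `kill-session -t`) are real. The helpers only build argument lists; actually spawning tmux is left out.

```typescript
// Hypothetical helpers that build tmux argument lists for background tasks.

/** Keep session names simple, because tmux treats "." and ":" specially in targets. */
function sessionName(task: string): string {
  const slug = task.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
  return "task-" + slug;
}

/** Start a detached session that runs `command` in the background. */
function startArgs(task: string, command: string): string[] {
  return ["tmux", "new-session", "-d", "-s", sessionName(task), command];
}

/** Print the current pane contents so an agent can read the task's output. */
function readArgs(task: string): string[] {
  return ["tmux", "capture-pane", "-p", "-t", sessionName(task)];
}

/** Stop the task by killing its session. */
function stopArgs(task: string): string[] {
  return ["tmux", "kill-session", "-t", sessionName(task)];
}

// Example: the argv an agent might run to start a dev server in the background.
console.log(startArgs("Dev Server", "npm run dev"));
```

Three verbs (start, read, stop) are enough for an agent to run a dev server or a sub-agent in the background and check on it later, which matches the point that a heavier tool is not needed.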

The whole project is organized as a monorepo, including both frontend and backend. To keep things simple, the stack stays full-stack Node.js. The early backend was written in Go, and it may switch back to Go in the future after all, because Go uses less memory.

The overall workflow follows Spec-Driven Development, or SDD. Right now, both humans and agents can propose requirements as issues. Those issues then enter the development process. Humans only decide what to build, how to build it, and whether the final result is good enough.

I counted the whole codebase: 300k+ lines in total. Most of it is documentation. Production code and test code each take up about half of the code portion.

Language	Type	Files	Lines
Markdown	Task docs (SDD)	800+	170k+
Markdown	Explanatory docs	100+	10k+
Markdown	Total	900+	180k+
TypeScript/TSX	Production code	400+	70k+
TypeScript/TSX	Test code	200+	70k+
TypeScript/TSX	Total	600+	140k+
I only noticed after counting that their file sizes are surprisingly reasonable, averaging around 200 to 300 lines per file.
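
The per-file average can be checked with quick arithmetic on the table's totals. The sketch below drops the "+" suffixes, so the inputs are lower bounds and the results are only rough.

```typescript
// Totals copied from the table, with the "+" suffixes dropped (so these are lower bounds).
const totals = [
  { label: "Markdown", files: 900, lines: 180000 },
  { label: "TypeScript/TSX", files: 600, lines: 140000 },
];

/** Average lines per file, rounded to the nearest whole line. */
function averageLines(files: number, lines: number): number {
  return Math.round(lines / files);
}

for (const row of totals) {
  console.log(`${row.label}: about ${averageLines(row.files, row.lines)} lines per file`);
}
```

With these inputs the averages come out to about 200 lines per Markdown file and about 233 lines per TypeScript/TSX file.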

Documentation Engineering

The most important part of a harness is probably documentation. After all, documentation is a key part of the context. This does not only mean development docs. It should also include:

Product: feature design, thinking, decision tradeoffs, expectations, and so on
Development: architecture, algorithm flows, data relationships, and so on
Design: design style and taste
Testing: test docs and records of various bug cases
Operations: environment and operations manuals
…

Any knowledge that needs to be known should be documented. As long as there is documentation, agents can work.

Better Models and Lighter SDD

At the beginning I used superpower, but later found the workflow far too heavy.

As model capabilities improved, stuffing too much content or code into documents became much less necessary. So I migrated to a lighter custom SDD workflow.

Every task has only two files: what to do and how to do it.

issue.md: briefly describes the task goal, such as what feature to implement, what design direction to follow, or what bug to fix.
plan.md: records the solution, reasoning, decisions, implementation breakdown, and so on. It is similar to a compressed merge of spec and plan in superpower.

Because the project is purely local, has no dependency on any external system, and does not use GitHub Issues, everything is managed with folders plus Markdown metadata syntax. All content stays local.

- docs/
  - issues/
    - readme.md
    - open/
      - 2026-06-26-xxxx.md
    - close/
      - 2026-06-26-xxxx.md
  - plans/
    - readme.md
    - open/
      - 2026-06-26-xxxx.md
    - close/
      - 2026-06-26-xxxx.md

The readme files record explanations, conventions, and templates. For example, the current issue template uses metadata syntax to establish relationships between documents and code.

The templates were actually designed and gradually iterated by agents themselves.

title: "Short issue title"
status: open
kind: bug | feature | ux | performance | accessibility | security | security-hardening | hardening | tech-debt | docs | test
priority: urgent | high | medium | low | very-low | parked
triage: actionable | planned | needs-plan | needs-design | needs-decision | needs-profile | blocked | parked | wontfix
created: "YYYY-MM-DD"
updated: "YYYY-MM-DD"
areas:
  - web
  - reader
resolution_plan: "docs/plans/open/example.md" # optional
depends_on:                              # optional
  - "docs/issues/open/example.md"
follow_ups:                              # optional
  - "docs/issues/open/example.md"
related:                                 # optional
  - "apps/web/src/example.tsx"
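
Because the template is plain metadata, it is easy to check mechanically. The following is a minimal sketch of such a check, not the project's actual tooling. It assumes the front matter has already been parsed into an object. Treating `title`, `kind`, `priority`, `triage`, the two dates, and `areas` as required is an assumption, and the function and type names are hypothetical. The `status` field is not checked because the template does not list its allowed values.

```typescript
// Allowed values copied from the template above.
const KINDS = ["bug", "feature", "ux", "performance", "accessibility", "security",
  "security-hardening", "hardening", "tech-debt", "docs", "test"] as const;
const PRIORITIES = ["urgent", "high", "medium", "low", "very-low", "parked"] as const;
const TRIAGES = ["actionable", "planned", "needs-plan", "needs-design", "needs-decision",
  "needs-profile", "blocked", "parked", "wontfix"] as const;

// Hypothetical shape: front matter that has already been parsed from YAML.
type IssueMeta = Record<string, unknown>;

function isOneOf(allowed: readonly string[], value: unknown): boolean {
  return typeof value === "string" && allowed.includes(value);
}

function isDate(value: unknown): boolean {
  return typeof value === "string" && /^\d{4}-\d{2}-\d{2}$/.test(value);
}

/** Return a list of problems; an empty list means the metadata looks valid. */
function validateIssueMeta(meta: IssueMeta): string[] {
  const problems: string[] = [];
  if (typeof meta.title !== "string" || meta.title.trim() === "") problems.push("title is required");
  if (!isOneOf(KINDS, meta.kind)) problems.push("kind is not an allowed value");
  if (!isOneOf(PRIORITIES, meta.priority)) problems.push("priority is not an allowed value");
  if (!isOneOf(TRIAGES, meta.triage)) problems.push("triage is not an allowed value");
  if (!isDate(meta.created)) problems.push("created must be YYYY-MM-DD");
  if (!isDate(meta.updated)) problems.push("updated must be YYYY-MM-DD");
  if (!Array.isArray(meta.areas) || meta.areas.length === 0) problems.push("areas must list at least one area");
  return problems;
}

// Example: "soon" is not in the priority list, so exactly one problem is reported.
const draftIssue: IssueMeta = {
  title: "Short issue title", kind: "bug", priority: "soon", triage: "actionable",
  created: "2026-06-26", updated: "2026-06-26", areas: ["web", "reader"],
};
console.log(validateIssueMeta(draftIssue));
```

A check like this could run whenever an agent writes or edits an issue, so that malformed metadata is caught before other documents start linking to it.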

Beyond that, documentation must be kept up to date and maintained like project code. Every plan requires updating the relevant docs.

Using external scheduled tasks to check and update documentation, as OpenAI does, is also a good approach. But I found that if every plan includes a documentation-update step and the documents are linked together, the model is less likely to miss updates when editing docs. Scheduled tasks, by comparison, catch only a limited amount.
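
One way to make linked documents checkable is to verify that every link in the metadata still points at something that exists. The sketch below assumes the link fields from the issue template. The `DocLinks` type, the `danglingLinks` function, and the idea of passing in a precomputed set of existing paths are all hypothetical.

```typescript
// Link fields taken from the issue template. `path` is the document's own location.
interface DocLinks {
  path: string;
  resolution_plan?: string;
  depends_on?: string[];
  follow_ups?: string[];
  related?: string[];
}

/** List every link in `doc` that points at a path missing from `existing`. */
function danglingLinks(doc: DocLinks, existing: ReadonlySet<string>): string[] {
  const targets = [
    ...(doc.resolution_plan ? [doc.resolution_plan] : []),
    ...(doc.depends_on ?? []),
    ...(doc.follow_ups ?? []),
    ...(doc.related ?? []),
  ];
  return targets.filter((target) => !existing.has(target));
}

// Example: the plan exists, but the related source file is not in the known set.
const knownPaths = new Set(["docs/plans/open/example.md"]);
const exampleDoc: DocLinks = {
  path: "docs/issues/open/example.md",
  resolution_plan: "docs/plans/open/example.md",
  related: ["apps/web/src/example.tsx"],
};
console.log(danglingLinks(exampleDoc, knownPaths));
```

A dangling link is a useful signal that a file was moved, closed, or renamed without the related documents being updated.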

Project Quality Assurance

Issue- and plan-driven development itself is simple, but that is still far from good engineering. Models make mistakes very easily.

Humans are the same. They not only make mistakes, they even slack off. Sometimes agents are more reliable than me.

Independent Review Feedback

Right now, every plan gets an independent review after it is written, and every implementation gets another independent review after it is done. The reviews check whether the plan is complete and whether the implementation is complete.

The two review agents look at the work from different perspectives, which greatly improves accuracy and stability.

Complete Tests and Environments

As everyone knows, tests are an important quality guarantee. Every plan needs to add the relevant tests, and every later bug case needs corresponding test coverage so quality can keep improving.

Besides regular unit tests and E2E tests, I also added intent tests. These use the product like a real user would and rely on the model’s “intuition” to discover issues.

A vision model is required here. Acting like a real user means taking many screenshots and looking at them. A text-only model can only guess, and the results are poor.

First, independently build a complete functional inventory of the product and align it with the current product expectations. This also needs to be kept in sync with the docs.

Compared with E2E tests, the documentation should be complete rather than detailed. It needs to list every expected feature clearly, but it does not need to describe every detail. For example, it is enough to say that an add entry point exists in a given place and that it accepts text and links.

Matching test-environment data also needs to be generated. To support parallel testing, for example several test agents sharing one intent-test environment where one tests the home page and another tests the settings page, every irreversible test item, such as deleting a configuration, needs its own independent test data.
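
The isolation rule for irreversible test items can be sketched as a small naming function. This is an illustration under assumptions, not the project's actual setup: the `TestItem` shape, the item ids, and the fixture-naming scheme are hypothetical.

```typescript
interface TestItem {
  id: string;            // hypothetical id, e.g. "settings.delete-config"
  irreversible: boolean; // true if running the item destroys or permanently changes data
}

/** Pick the fixture a given agent should use for a given test item. */
function fixtureFor(item: TestItem, agentId: string): string {
  // Reversible checks can all read the same shared seed data.
  if (!item.irreversible) return "shared";
  // Irreversible checks get a private copy, so no other agent sees the change.
  return `${item.id}--${agentId}`;
}

console.log(fixtureFor({ id: "home.open-feed", irreversible: false }, "agent-1"));
console.log(fixtureFor({ id: "settings.delete-config", irreversible: true }, "agent-2"));
```

Keying the private fixture by both item and agent means two agents can run the same destructive check at the same time without stepping on each other.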

This also ties back to the areas field in the documentation. It marks the scope of a task’s changes, so after a plan is completed, only the tests for those areas need to run, which saves time.
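
Selecting tests from the areas field could be as simple as a lookup table. The sketch below is only an illustration: the mapping and the glob patterns are made up, since the project's real test layout is not given.

```typescript
// Hypothetical mapping from `areas` values to test globs.
const TESTS_BY_AREA: Record<string, string[]> = {
  web: ["apps/web/tests/**/*.test.ts"],
  reader: ["apps/web/tests/reader/**/*.test.ts", "e2e/reader/**/*.spec.ts"],
};

/** Collect the test globs for the areas a plan touched, without duplicates. */
function testsForAreas(areas: string[]): string[] {
  const selected = new Set<string>();
  for (const area of areas) {
    for (const glob of TESTS_BY_AREA[area] ?? []) selected.add(glob);
  }
  return Array.from(selected);
}

console.log(testsForAreas(["reader"]));
```

Unknown areas select nothing in this sketch. A more cautious policy would fall back to the full suite whenever an area has no mapping.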

Human Decisions and Gatekeeping

In the early days of the project, I still read the spec and plan documents, and even did code review. But as the project became more complex, with some parts I could no longer understand, and as the number of tasks grew beyond what I could read, the process changed.

Now humans do only two things in the workflow:

Confirm the plan.
Confirm the code merge.

But I no longer read all kinds of files or the code itself. Instead, I read summaries, check results, and ask questions in the conversation to inspect things. It feels a bit like interviewing.

Every plan must provide information such as the task explanation and key decisions.
Every UI-related plan must provide a design image or an interactive HTML preview.
Every code merge must provide a summary of key changes and a preview URL for the project.
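
These three requirements can be enforced before a human is asked to confirm anything. The sketch below shows one possible check. The type names, field names, and wording of the messages are hypothetical.

```typescript
interface PlanSubmission {
  explanation: string;
  keyDecisions: string[];
  touchesUi: boolean;
  designPreview?: string; // path or URL of a design image or interactive HTML preview
}

interface MergeSubmission {
  changeSummary: string;
  previewUrl: string;
}

/** Items still missing before a human is asked to confirm the plan. */
function missingForPlan(plan: PlanSubmission): string[] {
  const missing: string[] = [];
  if (plan.explanation.trim() === "") missing.push("task explanation");
  if (plan.keyDecisions.length === 0) missing.push("key decisions");
  if (plan.touchesUi && !plan.designPreview) missing.push("design image or HTML preview");
  return missing;
}

/** Items still missing before a human is asked to confirm the merge. */
function missingForMerge(merge: MergeSubmission): string[] {
  const missing: string[] = [];
  if (merge.changeSummary.trim() === "") missing.push("summary of key changes");
  if (merge.previewUrl.trim() === "") missing.push("preview URL");
  return missing;
}

// Example: a UI plan submitted without a preview is sent back before review.
console.log(missingForPlan({ explanation: "Add a reading list", keyDecisions: ["Store locally"], touchesUi: true }));
```

Returning the missing items to the agent, instead of passing an incomplete submission to the human, keeps the human's two checkpoints short.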

This makes it possible to have a clear and accurate understanding of all key logic and processes in the project even without writing a single line of code.

Only when the decision maker understands the matter well enough can they make the right decision.

Agent Concurrency and Management

After the first time I built and ran the full intent test, the agent wrote more than 20 issues at once. I immediately knew no human could keep up with reading all of that.

So I tried letting agents manage and solve these tasks themselves. That is how the multi-agent architecture appeared.

In short: the human only needs to talk to one main agent, and this agent manages all the other agents doing the work.

Many tasks are asynchronous. For example, you can start five threads to research issues and another three threads to develop already approved plans. Whenever a thread becomes available, it can automatically pick another task and assign it.
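
The slot-filling behavior in this example can be sketched as a pure function that decides which queued tasks to start. The lane sizes follow the five-and-three example above. The `Task` shape, the lane names, and the rule that queue order is priority order are assumptions.

```typescript
type Lane = "research" | "develop";

interface Task {
  id: string;
  lane: Lane;
}

// Lane sizes from the example in the text: five research threads, three development threads.
const LIMITS: Record<Lane, number> = { research: 5, develop: 3 };

/** Decide which queued tasks to start now, given what is already running. */
function nextAssignments(queue: Task[], running: Task[]): Task[] {
  const used: Record<Lane, number> = { research: 0, develop: 0 };
  for (const task of running) used[task.lane] += 1;

  const started: Task[] = [];
  for (const task of queue) {
    // Start a task only while its lane still has a free thread.
    if (used[task.lane] < LIMITS[task.lane]) {
      used[task.lane] += 1;
      started.push(task);
    }
  }
  return started;
}
```

Calling this function every time a thread finishes gives the "pick another task when a thread becomes available" behavior without the main agent having to track slot counts in its own context.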

At this point, you will find that the main agent’s context becomes interwoven with many unrelated tasks. This places very high demands on the model, especially instruction-following ability. It needs to switch between tasks, discuss product design, discuss technical solutions, and orchestrate work, including defining priorities, analyzing task conflicts, and so on. It is roughly playing the role of a project manager plus technical leader.

So far, only GPT-5.5 can do this, and even it does not do it very well. GPT-5.4 and DeepSeek are simply unusable for this. Gemini is even less reliable.

It seems that, for now, only humans can do this octopus-like work well.

For example, suppose the agent is currently discussing the design of a plan with the user. At the same time, another agent finishes development and needs approval for merging. Ideally, after the agent finishes the discussion with the user, it should bring up the merge for the user to review.

One problem is losing tasks, though the probability is low. Sometimes the agent simply forgets a task and stops following up, later explaining that it took the message for a distracting notification.

Another problem is insufficient proactiveness, and this happens often. After finishing the discussion with the user, the agent does not proactively bring the matter up. The user has to ask something like “look at the next task.”

The harness can solve the task-loss problem, but after many rounds of iteration, proactiveness has not improved much. This may have to wait for model improvements.
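
The post does not describe how the harness prevents task loss. One possible mechanism is an inbox that keeps every sub-agent notice pending until it is explicitly resolved, and that produces a reminder for the main agent's context. The sketch below is an assumption-laden illustration, and the `Notice` and `Inbox` names are hypothetical.

```typescript
interface Notice {
  id: string;
  summary: string;
  needsUser: boolean; // true if a human decision is required, e.g. a merge approval
}

class Inbox {
  private pending = new Map<string, Notice>();

  /** Record a sub-agent message; it stays pending until explicitly resolved. */
  add(notice: Notice): void {
    this.pending.set(notice.id, notice);
  }

  /** Mark a notice as handled, e.g. after the user approved or rejected a merge. */
  resolve(id: string): boolean {
    return this.pending.delete(id);
  }

  /** Text to append to the main agent's context each turn; empty when nothing is waiting. */
  reminder(): string {
    const waiting = Array.from(this.pending.values()).filter((n) => n.needsUser);
    if (waiting.length === 0) return "";
    return "Waiting for the user:\n" + waiting.map((n) => `- ${n.id}: ${n.summary}`).join("\n");
  }
}
```

Because the pending list lives in the program and not in the model's memory, a notice cannot be dropped just because the agent judged it unimportant. Whether the agent then raises it at the right moment is still up to the model, which is the proactiveness problem.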

Besides capability issues, when concurrency gets high, you can clearly feel that the main agent becomes the new bottleneck. It has to process messages from sub-agents, and then it has no time to talk to the user. This may also be related to sub-agent scheduling. For example, we could split it into two threads, one user-facing and one for background tasks, but that would be a large change. For now, I solve it by opening multiple tabs running main agents.

Another direction is to hand the proactiveness problem to the program, such as a board-driven workflow, similar to the product direction of multica.

How Harness Engineering is adopted and used collaboratively within a team may be similar to the concurrency problem.
If multiple users are allowed to create issues, a concurrent agent should be able to handle one or many of them. Cloud agents can directly share one data source.
Collaboration between local agents and cloud agents may need Git or another synchronization mechanism.

Iterating on the Harness Itself

A harness is still work centered on context. The difference is that the context is now the project documentation, and the agent’s context has moved from being organized by humans to being provided by humans and organized by agents.

A harness is not built in one shot. It is the natural iteration of documentation and workflow.

When the harness iterates on the project, there also needs to be another external loop that iterates on the harness itself and optimizes the existing workflow. In a sense, this is also a kind of project-level self-iteration.

The simplest approach is: if you find that the current conversation is not going well, start another agent and let it inspect what can be improved in the current conversation, then use that to improve the whole workflow.

Of course, this can also be turned into a scheduled task that periodically reviews conversations and iterates on the harness.

Harness is not a standard answer either. It is more like tailoring something to fit this specific project.

Different projects may have different harnesses. But there can be prefabricated patterns, with some customized changes added on top. And this is not limited to software engineering. There will be harnesses for all kinds of fields.

Once there are harnesses for different fields, there will also be Meta Harnesses that apply harness templates.

Suddenly I understand why FDEs, Forward Deployed Engineers, have become popular.

What Humans Still Need to Do

Harness Engineering undoubtedly brings a huge change to software development, but the rule is still “humans steer, agents execute.”

In the short term, on one hand, humans need to stop agents from piling up a mountain of bad code in the project and actively guide AI refactoring. On the other hand, humans still need to do acceptance testing. Model vision capability is still relatively weak, and it is hard for models to find interaction and UI problems. For example, some interaction experiences are continuous animations, and such issues are hard to discover from code or screenshots alone.

In the long term, humans only need to make key product and technical decisions. Agents can provide all kinds of product and technical options for humans to choose from.

As long as the product is for humans, humans are needed to align on what kind of product they want, to propose requirements, and to cut requirements.

There may be doubts about technical decisions. Why can humans make better and more correct decisions? Why can’t agents do it? Are you smarter than the model?

I do not think agents fail at this because of model capability. I think they lack context. Humans carry a mental picture of what the product will become in the future. But this expectation has not been turned into context and written into documentation. Or maybe it is even difficult to turn into a document at all. As Steve Jobs said, “A lot of times, people don’t know what they want until you show it to them.”
