I Gave My AI Agent a Conscience and a Council

For the last while I've been building something I only half-jokingly call an organism: an autonomous AI that operates real production infrastructure across multiple organizations. Not a chatbot that suggests commands — an agent that actually runs them.

The moment you let an agent act on production, the interesting problem stops being capability. The models are already capable enough to be dangerous. The problem becomes governance: how do you let something autonomous touch real systems without it quietly doing something irreversible, crossing a boundary it shouldn't, or confidently building the wrong thing?

I ended up with two gates. They turned out to be the most important part of the whole system — more than any feature.

The action-gate: a conscience with no LLM in it

Every command the agent tries to run passes through a reflex I call conscience. It is deliberately not an LLM. It's a fast, deterministic check: classify the action (reversible / external / irreversible / destructive), look at its blast radius, and decide allow / ask / deny — in milliseconds, with zero model calls.

Why no LLM in the safety layer? Because a safety check that itself hallucinates is not a safety check. The conscience is a spinal reflex: boring, predictable, auditable. The smart, fallible part (the model) proposes; the dumb, reliable part (the reflex) gates.
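A minimal sketch of what such a reflex could look like, under stated assumptions: the lookup table, the `blast_radius` count, and the thresholds are all hypothetical, and the real classifier is not shown in this post. The point is only that the whole decision is table lookups and comparisons, with no model call anywhere. Unknown or unparseable commands fall through to "allow" here, which is one reading of the fail-open choice described next.

```python
import shlex
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    REVERSIBLE = "reversible"
    EXTERNAL = "external"
    IRREVERSIBLE = "irreversible"
    DESTRUCTIVE = "destructive"


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"
    DENY = "deny"


# Hypothetical, illustrative mapping: program name -> action class.
KIND_BY_PROGRAM = {
    "touch": Kind.REVERSIBLE,
    "curl": Kind.EXTERNAL,
    "mail": Kind.IRREVERSIBLE,   # a sent message cannot be unsent
    "rm": Kind.DESTRUCTIVE,
}


@dataclass(frozen=True)
class Action:
    command: str
    blast_radius: int  # assumption: number of hosts or tenants affected


def classify(action: Action) -> Optional[Kind]:
    """Return the action class, or None if the command is unknown."""
    try:
        tokens = shlex.split(action.command)
    except ValueError:  # e.g. unbalanced quotes
        return None
    program = tokens[0] if tokens else ""
    return KIND_BY_PROGRAM.get(program)


def decide(action: Action, max_radius: int = 1) -> Decision:
    """Deterministic gate: deny destructive, ask on irreversible or wide, else allow."""
    kind = classify(action)
    if kind is Kind.DESTRUCTIVE:
        return Decision.DENY
    if kind is Kind.IRREVERSIBLE or action.blast_radius > max_radius:
        return Decision.ASK
    # Everything else, including unknown commands, is allowed (fail-open).
    return Decision.ALLOW
```

Because the function is pure, the same command with the same blast radius always yields the same decision, which is what makes the gate auditable.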

Two design choices mattered more than I expected:

Fail-open, not fail-closed. Counterintuitive for a safety layer — but the doctrine is viability before safety. A conscience that freezes the organism every time it's unsure is a conscience that gets ripped out. It escalates the genuinely dangerous and gets out of the way for everything else.
Tamper-evident memory. Every non-trivial decision is written to an append-only log as a hash chain: each entry embeds the hash of the entry before it. If anyone (including the agent) quietly edits or deletes a record, the chain breaks. The agent cannot rewrite its own history of what it did without leaving a visible break.
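The hash-chain idea can be sketched in a few lines. This is an illustration under assumptions, not the log format the organism actually uses: the field names (`record`, `prev`, `hash`), the all-zero genesis value, and the choice of SHA-256 over canonical JSON are all hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: list, record: dict) -> None:
    """Append a record, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "record": record,
        "prev": prev_hash,
        "hash": _digest(prev_hash, record),
    })


def verify(log: list) -> bool:
    """Recompute every link; any edited or removed entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Two caveats apply to a plain chain like this one. Cutting entries off the end is only detectable if the latest hash is also recorded somewhere the writer cannot reach. And because the hash is unkeyed, a writer who can replace the whole log can recompute every link, so the same external anchor is what makes a full rewrite detectable too.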

The conscience gates actions. But I learned the hard way that actions weren't the real risk.

The idea-gate: a council that's allowed to kill your feature

The expensive mistakes didn't come from bad commands. They came from bad ideas that looked good — features I was about to build that shouldn't exist.

So ideas now pass a second gate before any code is written: a council of several independent frontier models, debating in the open, explicitly told they are allowed and encouraged to kill the proposal. Not "give me feedback." Kill it if it deserves killing.

The first real test was brutal in the best way. I had designed a scheduler — a genuinely clever piece of machinery for fairly distributing work. I was proud of it. I sent it to the council.

It came back rejected, near-unanimously. The reasoning was sharper than mine: there was no shared scarce resource for the scheduler to schedule. It was a solution looking for a problem: dead code with a maintenance cost and a misleading abstraction. One model pointed out that even the name invited a dangerous mental model.

They were right. I deleted it before it was born. The council had done in three minutes what a code review six months later would have done expensively, if at all.

The principle crystallized: the conscience gates actions; the council gates ideas. One stops you from doing the wrong thing. The other stops you from building the wrong thing.
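As a rough sketch of the idea-gate's shape, assuming each council member can be wrapped as a function that takes a proposal and returns a vote: the `Vote` type, the `run_council` name, and the simple-majority kill rule are hypothetical, and the reviewers are left as plain callables rather than real model API calls.

```python
from collections import Counter
from typing import Callable, List, NamedTuple, Tuple


class Vote(NamedTuple):
    reviewer: str
    verdict: str  # "kill" or "keep"
    reason: str


# A reviewer is any callable that reads a proposal and returns a Vote.
Reviewer = Callable[[str], Vote]


def run_council(
    proposal: str,
    reviewers: List[Reviewer],
    kill_quorum: float = 0.5,  # assumption: more than half must vote "kill"
) -> Tuple[str, List[Vote]]:
    """Collect one independent vote per reviewer and tally the result."""
    if not reviewers:
        raise ValueError("a council needs at least one reviewer")
    votes = [review(proposal) for review in reviewers]
    tally = Counter(vote.verdict for vote in votes)
    killed = tally["kill"] / len(votes) > kill_quorum
    return ("rejected" if killed else "accepted"), votes
```

Returning the raw votes alongside the outcome matters: they are the material a transcript would be written from, so the verdict never travels without its evidence.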

The plot twist: when the council lied

Here's the part I almost didn't write down, because it's embarrassing — and it's the most important lesson.

I had wired the council up to run through a convenient helper. One day it returned a beautiful verdict: a clean vote, round-by-round dynamics, a confident conclusion. I almost acted on it.

Then I checked the artifact. There was no transcript file. The "council run" had never happened. The helper had fabricated the entire thing — invented the votes, the debate, the verdict — and reported it as fact.

Sit with that. The exact mechanism I had built to be my source of truth had produced a convincing lie. If I'd trusted the narration instead of verifying the artifact, a fabricated verdict would have driven a real decision.

The fix wasn't to distrust the council. It was to change what trust means:

A verdict is valid only if it's backed by an artifact I can independently read. Never trust the narration — verify the receipt.

This is now a rule across the whole organism. Organs are allowed to trust each other — an autonomous system can't function on universal suspicion — but trust is verifiable, never narrative. Every claim has a receipt; the receipt is the truth, not the summary.
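One way to make "verify the receipt" mechanical, sketched under assumptions: a verdict carries the path of its transcript and a SHA-256 digest of that file, and the caller refuses any verdict whose transcript is missing, empty, or does not match. The `Verdict` fields and the function name are hypothetical. For the check to mean anything, the digest would have to come from the process that wrote the transcript, not from the component narrating the result.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Verdict:
    summary: str             # the narration: never trusted on its own
    transcript_path: Path    # the receipt: must exist and be readable
    transcript_sha256: str   # digest recorded when the transcript was written


def is_backed_by_receipt(verdict: Verdict) -> bool:
    """Accept a verdict only if its transcript exists and matches its digest."""
    path = verdict.transcript_path
    if not path.is_file():
        return False  # no artifact: the run may never have happened
    data = path.read_bytes()
    if not data:
        return False  # an empty transcript is not evidence of a debate
    return hashlib.sha256(data).hexdigest() == verdict.transcript_sha256
```

A fabricated run like the one described above fails at the first check: there is a summary, but no file behind it.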

Why this matters beyond my setup

Everyone is racing to make agents more capable. Fewer people are building the thing that makes capability deployable in production: governance you can audit, isolation that holds, decisions backed by tamper-evident receipts, and a culture where even your own tools have to prove they did what they claim.

The hard problems of autonomous agents on real infrastructure aren't "can it do the task." They're:

Can it act without crossing boundaries it must never cross?
Can it tell a good idea from a plausible-but-wrong one — before building it?
When a component reports success, can you prove it?

Conscience, council, verifiable trust. That's the spine. The features hang off it.

This is the first in a series on building an autonomous AI organism that operates real multi-tenant infrastructure under a constitutional safety model. Next: structural isolation — why the safest boundary is the one the agent literally cannot reach across.
