Per-Turn Evaluation: Dynamic Governance for AI Agents
Hector Flore · 2026-05-25 · via DEV Community

Static governance is fine right up until your agent changes modes mid-session. The same agent can spend turn 1 researching docs, turn 8 editing code, turn 14 fixing failed tests, and turn 20 preparing a production deploy. Pretending one startup-time config should govern all of that is the harness equivalent of hardcoding production policy into a shell alias.

That is why I'm increasingly convinced that per-turn evaluation needs to be a first-class primitive in agentic systems. If you're serious about governed autonomy, you need the harness to ask a fresh question at the start of every turn: given the current state, which rules should be active right now?

This is a core idea behind what I call Harness as Code. In AI Harness, per-turn artifact evaluation ships as a runtime feature in v0.4.0, following the design described in issue #7.

Why Static Rules Aren't Enough

A lot of agent stacks still treat governance as startup configuration: load the prompt, register the tools, inject the rules, and go. That works for short-lived demos. It gets shaky fast in long-running or multi-phase sessions.

There are three failure modes I keep seeing:

  1. You over-load the context window. Every possible rule ships on every turn, even when most of them are irrelevant.
  2. You under-govern risky phases. The same loose rules that were fine during research stay active during writes, approvals, or deployment.
  3. You mix policy with prompt hacks. Instead of the harness making deterministic decisions, the model gets a giant wall of “if you're doing X, remember Y.”

That last part is the killer. Once governance lives primarily inside prompts, you lose clean separation between policy and reasoning. You also lose the discipline that policy systems like Open Policy Agent were built around: keep decisions declarative, versioned, and evaluated against current input.

The better analogy is feature toggles. Martin Fowler's framing still holds: you define behavior once, then let runtime context determine which path is active. AI agents need the same pattern for governance.

What Per-Turn Evaluation Actually Means

Per-turn evaluation is simple in concept:

  • The agent loop starts a new turn.
  • The harness gathers live state for that turn.
  • Every conditional artifact is re-evaluated against that state.
  • Only the active artifacts participate in context composition.

The crucial shift is this: governance becomes a function of state, not a snapshot captured at startup.
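To make the four steps concrete, here is a minimal sketch in Go. It is not the AI Harness source: the `TurnState`, `Artifact`, and `EvaluateTurn` names are hypothetical, and a Go closure stands in for the condition expression so the example runs with the standard library alone.

```go
package main

import "fmt"

// TurnState is the live state gathered at the start of a turn.
// Hypothetical type; the keys mirror the examples used in this post.
type TurnState map[string]any

// Artifact is a hypothetical conditional artifact. A Go closure stands in
// for the condition expression; nil means "always active".
type Artifact struct {
	Name      string
	Condition func(TurnState) bool
	Active    bool
}

// EvaluateTurn re-evaluates every artifact against the current state and
// returns the names of the artifacts that take part in context composition.
func EvaluateTurn(artifacts []*Artifact, state TurnState) []string {
	var active []string
	for _, a := range artifacts {
		a.Active = a.Condition == nil || a.Condition(state)
		if a.Active {
			active = append(active, a.Name)
		}
	}
	return active
}

func main() {
	artifacts := []*Artifact{
		{Name: "core-rules"},
		{Name: "production-guard", Condition: func(s TurnState) bool {
			turn, _ := s["turn"].(int)
			return s["mode"] == "production" && turn > 5
		}},
	}
	// Same artifacts, two different turns, two different active sets.
	fmt.Println(EvaluateTurn(artifacts, TurnState{"turn": 3, "mode": "research"}))
	fmt.Println(EvaluateTurn(artifacts, TurnState{"turn": 20, "mode": "production"}))
}
```

The artifact set never changes between the two calls. Only the state does, and the active set follows it.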

That state can include things like:

turn: 14
mode: "implementation"
active_files: ["artifact/composer.go"]
error_count: 1
tools_called: ["read_file", "edit_file", "run_tests"]

Now your harness can make deterministic decisions such as:

  • enable stricter review guidance after repeated failures
  • load Go-specific conventions only when Go files are active
  • apply extra deployment guardrails only when the agent enters a production path
  • switch to more concise context after a long session to protect the token budget

That is dramatically more precise than one giant prompt trying to anticipate every future branch of execution.
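As a rough illustration of those decisions, the sketch below maps the example state onto artifact names using pure predicates. Everything in it is assumed for the example: the `TurnState` struct, the artifact names, and the thresholds (two errors, thirty turns) are mine, not values taken from AI Harness.

```go
package main

import (
	"fmt"
	"strings"
)

// TurnState mirrors the example fields shown above (hypothetical struct).
type TurnState struct {
	Turn        int
	Mode        string
	ActiveFiles []string
	ErrorCount  int
}

// Rule pairs a hypothetical artifact name with a pure predicate over state.
type Rule struct {
	Artifact string
	When     func(TurnState) bool
}

// hasGoFile reports whether any active file ends in ".go".
func hasGoFile(s TurnState) bool {
	for _, f := range s.ActiveFiles {
		if strings.HasSuffix(f, ".go") {
			return true
		}
	}
	return false
}

// rules lists one predicate per decision; thresholds are illustrative only.
var rules = []Rule{
	{"strict-review", func(s TurnState) bool { return s.ErrorCount >= 2 }},
	{"go-conventions", hasGoFile},
	{"deploy-guardrails", func(s TurnState) bool { return s.Mode == "production" }},
	{"concise-mode", func(s TurnState) bool { return s.Turn > 30 }},
}

// Decide returns the artifacts whose predicates hold, in rule order.
func Decide(s TurnState) []string {
	var out []string
	for _, r := range rules {
		if r.When(s) {
			out = append(out, r.Artifact)
		}
	}
	return out
}

func main() {
	s := TurnState{
		Turn:        14,
		Mode:        "implementation",
		ActiveFiles: []string{"artifact/composer.go"},
		ErrorCount:  1,
	}
	fmt.Println(Decide(s))
}
```

Because each predicate reads only the state it is given, the same state always yields the same active set, which is what makes the decision auditable.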

Why Starlark Fits This Problem

For conditional governance, I want something declarative and constrained. Starlark's specification and Bazel's language overview make it a strong fit for this kind of work: it is Python-like enough to read quickly, intentionally restricted, deterministic, and designed for predictable evaluation, with a strong bias toward immutability.

That matters. A governance condition language should not be a hidden side-effect engine. It should evaluate expressions against input and return a decision.

Here's the kind of artifact-level condition AI Harness supports:

---
name: production-guard
type: override
priority: 100
condition: 'ctx.get("mode") == "production" and ctx.get("turn", 0) > 5'
---
# Production Guard

Require post-write verification.
Block destructive shortcuts.
Confirm the target before deployment actions.

I like this model because the rule is local to the artifact. You don't have to open the runtime and add another hardcoded if statement. You define the condition where the behavior lives.

How AI Harness Implements Dynamic Governance

In AI Harness's current v0.4.0 implementation, per-turn evaluation is not a blog concept. It's wired into the runtime.

At the start of Agent.Run, the loop creates turn-scoped state, increments the turn counter, and sets the current turn in that scratchpad before the model does any work.

a.turnNumber++
scripting.SetTurnState(turnCtx, "turn", a.turnNumber)

if a.composer != nil {
    if err := a.composer.EvaluateConditions(turnCtx); err != nil {
        a.logger.Printf("WARN condition re-evaluation failed: %v", err)
    }
}

That call flows into Composer.EvaluateConditions, which reads the live values from the per-turn scratchpad via TurnStateValues and evaluates each artifact condition against the current turn context.

The registry then updates each artifact's Active field through Registry.UpdateConditions. That's an important implementation detail, because it makes activation status part of the artifact model itself rather than an external side table.

func (r *Registry) UpdateConditions(evalFn func(condition string) (bool, error)) error {
    // re-evaluates every artifact and updates Active in place
}

Even better, the failure mode is sane: per-artifact condition errors are non-fatal. If one expression is malformed, the whole session does not implode. The registry keeps evaluating the rest and preserves the prior active state for the broken artifact. That's the kind of degradation behavior you want in a production harness.

This design also plays nicely with AI Harness's typed artifact model and composition order:

override (100) > harness (80) > builtin (60) > plugin (40) > model (20)

So the runtime isn't just deciding what is active each turn. It's also deciding how active artifacts compose when they conflict.
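A small sketch of that second decision, assuming composition simply orders active artifacts from highest to lowest priority. The priority numbers come from the list above; the `Artifact` struct, the `ComposeOrder` function, and the tie-breaking rule (registration order) are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// typePriority uses the priorities listed above.
var typePriority = map[string]int{
	"override": 100,
	"harness":  80,
	"builtin":  60,
	"plugin":   40,
	"model":    20,
}

// Artifact is a simplified, hypothetical stand-in for a typed artifact.
type Artifact struct {
	Name   string
	Type   string
	Active bool
}

// ComposeOrder drops inactive artifacts and returns the rest with the
// highest-priority type first. The stable sort keeps registration order
// among artifacts of the same type (an assumption of this sketch).
func ComposeOrder(all []Artifact) []Artifact {
	var active []Artifact
	for _, a := range all {
		if a.Active {
			active = append(active, a)
		}
	}
	sort.SliceStable(active, func(i, j int) bool {
		return typePriority[active[i].Type] > typePriority[active[j].Type]
	})
	return active
}

func main() {
	ordered := ComposeOrder([]Artifact{
		{Name: "model-defaults", Type: "model", Active: true},
		{Name: "production-guard", Type: "override", Active: true},
		{Name: "lint-plugin", Type: "plugin", Active: false},
		{Name: "team-harness", Type: "harness", Active: true},
	})
	for _, a := range ordered {
		fmt.Println(typePriority[a.Type], a.Type, a.Name)
	}
}
```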

Patterns This Unlocks

Once governance is evaluated per-turn, a bunch of useful patterns stop being awkward.

Progressive escalation

After repeated failures, activate a recovery artifact that tells the agent to stop retrying blindly and explain what changed.

Phase-aware context

Load language or workflow conventions only when the agent is actually operating in that phase. That is the governance equivalent of lazy loading.

Risk-proportional controls

Keep early research turns lightweight, then tighten verification and approval rules when the session crosses into write-heavy or deployment-heavy work.

Token-aware governance

Long sessions can activate concise-mode artifacts that reduce exploration and prioritize completion before the context window gets messy.

This is also why per-turn evaluation pairs so well with context observability. If you are going to make governance dynamic, you need to be able to inspect which artifacts were active, which were inactive, and why.

Why This Is Better Than Prompt Conditionals

Could you write a giant system prompt that says, “if you are deploying, be more careful”? Sure.

I don't think that's governance.

That's advice.

In my view, real governance means the harness decides what the model is allowed to see and what policy surfaces are active before the next reasoning step begins. The model should not be responsible for remembering which governance branch applies. The harness should.

That's the same reason I'm more interested in governance architectures than in ever-bigger prompts. Prompts are necessary. But when they become your only control plane, you're still building too much of the system on vibes.

The Bigger Shift

Per-turn evaluation is one of those ideas that sounds small until you realize it changes the entire posture of the system.

Instead of asking, “What rules should this agent always have?” you start asking, “What rules should be true now?”

That is a much better question for long-running, stateful, tool-using agents.

It's also a cleaner path toward the broader discipline I care about: harness engineering. The same way DevOps normalized pipelines, policy, observability, and infrastructure definitions as real engineering surfaces, agent systems need their own equivalent control plane. Dynamic governance is part of that.

If you're building agents today, my recommendation is straightforward:

  1. keep the static core small
  2. move situational rules into conditional artifacts
  3. re-evaluate those artifacts every turn
  4. make activation observable
  5. make failures degrade gracefully rather than crash the session

That's the practical lesson behind AI Harness so far. And it's why I think per-turn evaluation is going to look obvious in hindsight.

Not because it's flashy — but because for serious, stateful agent systems, static governance was rarely going to be enough.