llm-nano-vm v0.8.0 — deterministic FSM runtime for LLM pipelines, now with output validation and per-step timeouts

I've been building a deterministic FSM execution kernel for LLM workflows. v0.8.0 just shipped to PyPI. Here's what it is, what's new, and where it's going.

What it is

Most LLM frameworks treat the model as the orchestrator. nano-vm flips that: the runtime is the orchestrator, the model is just one step in a deterministic graph.

δ(S, E) → S'

Current state + validated event = next state. The model cannot skip steps, reorder them, or escape guardrails. The FSM is the source of truth.
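To make the transition rule concrete, here is a minimal sketch of δ as a lookup table. This is not nano-vm's internal code: the `State` enum, the `TRANSITIONS` table, and the `delta` function are hypothetical names chosen for illustration.

```python
from enum import Enum


class State(Enum):
    ANALYZE = "analyze"
    PROCESS_REFUND = "process_refund"
    REJECT = "reject"


# Hypothetical transition table: (current state, validated event) -> next state.
TRANSITIONS = {
    (State.ANALYZE, "yes"): State.PROCESS_REFUND,
    (State.ANALYZE, "no"): State.REJECT,
}


def delta(state: State, event: str) -> State:
    """Return the next state, or raise if the pair is not in the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.value!r} on {event!r}") from None
```

Because every legal move is listed in the table, an unexpected model output can only produce an error. It can never produce a new path through the graph.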

Four step types: llm, tool, condition, parallel. Programs are plain Python dicts. No DSL parser, no heavy framework magic, and zero dependency overhead.

program = Program.from_dict({
    "name": "customer_refund",
    "steps": [
        {
            "id": "analyze",
            "type": "llm",
            "prompt": "Valid refund? Reply 'yes' or 'no'.\nRequest: $user_input",
            "output_key": "decision",
            "allowed_outputs": ["yes", "no"],   # ← v0.8.0
        },
        {
            "id": "guardrail",
            "type": "condition",
            "condition": "'yes' in '$decision'",
            "then": "process_refund",
            "otherwise": "reject",
        },
        {"id": "process_refund", "type": "tool", "tool": "issue_refund",   "is_terminal": True},
        {"id": "reject",         "type": "tool", "tool": "send_rejection", "is_terminal": True},
    ],
})

The guardrail step cannot be bypassed regardless of what the model returns.

What's new in v0.8.0

allowed_outputs — LLM enum guard

Validates the model's raw output against an explicit list before the value touches anything downstream.

{
    "id": "classify",
    "type": "llm",
    "prompt": "Classify. Reply ONLY with: refund / query / other",
    "allowed_outputs": ["refund", "query", "other"],
    "on_error": "skip",   # → falls back to "refund" (first element) on mismatch
}

Three policies on mismatch: fail (default, trace → FAILED), skip (substitute allowed_outputs[0]), retry (retry up to max_retries, then FAILED).
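The three policies can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: `guard_output`, `OutputValidationError`, and the `call_model` callback are hypothetical names. Stripping whitespace before comparison and counting max_retries as extra attempts after the first call are guesses.

```python
from typing import Callable, Sequence


class OutputValidationError(Exception):
    """Raised when the model output matches nothing in the allowed list."""


def guard_output(
    call_model: Callable[[], str],
    allowed: Sequence[str],
    on_error: str = "fail",
    max_retries: int = 2,
) -> str:
    # "retry" re-invokes the model up to max_retries extra times (assumption);
    # "fail" and "skip" call the model exactly once.
    attempts = 1 + (max_retries if on_error == "retry" else 0)
    for _ in range(attempts):
        raw = call_model().strip()  # assumption: whitespace is normalized first
        if raw in allowed:
            return raw
    if on_error == "skip":
        return allowed[0]  # substitute the first allowed value
    raise OutputValidationError(f"output not in {list(allowed)}")
```

The point of the design is that downstream steps only ever see a value from the allowed list, or the run fails.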

timeout_seconds + on_timeout — per-step LLM timeout

Prevents a hung API call from stalling the entire FSM.

{
    "id": "analyze",
    "type": "llm",
    "timeout_seconds": 5.0,
    "on_timeout": "fallback",   # → falls back to allowed_outputs[0] or ''
}

Two policies: fail (default) and fallback. Both features are independent and composable — you can use either or both on any llm step.
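A per-step timeout with a fallback can be sketched with asyncio.wait_for from the standard library. The `call_with_timeout` helper and its parameters are hypothetical; whether nano-vm uses asyncio internally is not stated here, so treat this as one plausible shape.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def call_with_timeout(
    call_model: Callable[[], Awaitable[str]],
    timeout_seconds: float,
    on_timeout: str = "fail",
    allowed_outputs: Sequence[str] = (),
) -> str:
    try:
        # wait_for cancels the pending call once the deadline passes.
        return await asyncio.wait_for(call_model(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        if on_timeout == "fallback":
            # Mirrors the documented fallback: allowed_outputs[0] or ''.
            return allowed_outputs[0] if allowed_outputs else ""
        raise  # "fail": let the caller mark the step as failed
```

Combining it with allowed_outputs means a timed-out step still yields a value the guard would accept.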

What it can do right now

Suspend / resume. Return "PENDING" from any tool and the FSM moves to SUSPENDED with its cursor persisted. Resume on any external event (webhook, approval, settlement). Lifecycle: RUNNING → SUSPENDED → RUNNING → SUCCESS.
Condition branching with ASTEngine. eval() is gone. Conditions are parsed into a validated JSON AST and evaluated by a sandboxed interpreter. No Python builtins are accessible. Method calls such as .lower() raise ASTEvalError at parse time instead of silently evaluating to False.
GDPR tombstoning. Sensitive values stored as CapabilityRef tokens (vault://secret/). On erasure event: ref tombstoned, all projections return [REDACTED_TOMBSTONE], hash chain stays valid.
GovernanceEnvelope. Every successful step produces an immutable, append-only audit record: execution_id, step_id, policy_hash, canonical_snapshot_hash, sanitized payload.
MCP gateway (nano-vm-mcp). Exposes run_program, get_trace, list_programs etc. over stdio or SSE transport with bearer auth and SQLite WAL persistence. Works with Claude Desktop and any MCP client.
Budget guardrails. max_steps, max_tokens, max_stalled_steps — FSM halts with BUDGET_EXCEEDED or STALLED before the next step, not after.
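To illustrate the idea behind eval()-free conditions, here is a small whitelist evaluator. It uses Python's standard ast module as a stand-in for the JSON AST mentioned above, so it is not ASTEngine. `ConditionError`, `parse_condition`, and `evaluate` are hypothetical names, and the sketch assumes variables like $decision have already been substituted into the expression.

```python
import ast


class ConditionError(Exception):
    """Raised when an expression uses syntax outside the whitelist."""


# Only literals, comparisons, and boolean logic are permitted.
_ALLOWED = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not,
    ast.Compare, ast.Eq, ast.NotEq, ast.In, ast.NotIn, ast.Constant,
)


def parse_condition(source: str) -> ast.Expression:
    """Parse and validate; calls, attributes, and names are rejected here."""
    tree = ast.parse(source, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED):
            raise ConditionError(f"disallowed syntax: {type(node).__name__}")
    return tree


def evaluate(node: ast.AST) -> object:
    if isinstance(node, ast.Expression):
        return evaluate(node.body)
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.UnaryOp):  # only `not` survives validation
        return not evaluate(node.operand)
    if isinstance(node, ast.BoolOp):
        values = [bool(evaluate(v)) for v in node.values]
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.Compare):
        left = evaluate(node.left)
        for op, right_node in zip(node.ops, node.comparators):
            right = evaluate(right_node)
            if isinstance(op, ast.Eq):
                ok = left == right
            elif isinstance(op, ast.NotEq):
                ok = left != right
            elif isinstance(op, ast.In):
                ok = left in right
            else:  # ast.NotIn
                ok = left not in right
            if not ok:
                return False
            left = right
        return True
    raise ConditionError(f"unsupported node: {type(node).__name__}")
```

Rejecting unknown syntax at parse time is what turns a silent False into a visible error.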

Benchmark: v0.8.0 (WSL2 · Python 3.12 · MockAdapter · 3×5×10k)
10/10 PASS · 1,096,500 ops · 0 violations

Scenario                   Mean TPS    p95
Refund pipeline            2,200/s     123 ms
Double-execution guard     2,800/s     69 ms
Budget enforcement         2,400/s     97 ms
Parallel throughput        1,000/s     196 ms
MCP store round-trip       11,000/s    0.13 ms
GovernanceEnvelope         2,100/s     108 ms
Crash consistency          11/s        115 ms
Replay equivalence         1,300/s     164 ms
Adversarial retries        2,600/s     87 ms
Long-horizon (1k steps)    95/s        11,887 ms

BM-INT-07 (Crash consistency): crash_rate=100% hash_match=100% — replay after simulated crash produces identical trace hash every time.
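One way to picture the hash_match check: if each trace record is hashed together with the previous hash, replaying the same steps must reproduce the same final digest. The sketch below assumes SHA-256 over canonical JSON. The actual hashing scheme in nano-vm may differ, and `chain_hash` and `trace_hash` are hypothetical names.

```python
import hashlib
import json


def chain_hash(prev_hash: str, record: dict) -> str:
    """Hash one record together with the digest of everything before it."""
    # Canonical JSON (sorted keys, fixed separators) keeps the digest stable
    # regardless of dict insertion order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def trace_hash(records: list[dict]) -> str:
    digest = "0" * 64  # assumed genesis value for an empty trace
    for record in records:
        digest = chain_hash(digest, record)
    return digest
```

Comparing trace_hash of the original run against the replayed run is then a single string comparison.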

BM-INT-10 (Memory footprint): peak RSS 76.5 MB, alloc 3.62 MB for 1,000-step programs — no memory leaks detected.

Validated on real payment APIs

Two PoCs, both 9/9 tests passing with mock adapters:
MoMo Payment API v4 — 3-way condition branch, HMAC-SHA256 IPN verification, polling loop with retry, next_step/is_terminal DSL.
Stripe Payment API v1 — 3DS flow (REQUIRES_ACTION sentinel), refund pipeline with LLM classifier, webhook verification. Found and fixed two bugs in the process: "PENDING" sentinel collision (Stripe was returning it as a domain status, triggering FSM suspend), and silent ASTEvalError for .lower() in condition expressions.

What's coming next
Phase 0 (Immediate): ProgramValidator, a static analysis pass at Program build time. It catches missing then/otherwise/next_step targets, unreachable steps, and cycles. Currently these fail at runtime; for LLM-generated workflows, static analysis is a must.
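A rough sketch of what such a check could look like, covering missing targets and unreachable steps (cycle detection is left out for brevity). `validate_program` is a hypothetical function, not the planned ProgramValidator API. The fall-through rule, where a step without an explicit target continues to the next step in the list, is an assumption based on the refund example.

```python
from collections import deque


def validate_program(steps: list[dict]) -> list[str]:
    """Return a list of human-readable problems; empty means the graph is sound."""
    ids = [step["id"] for step in steps]
    known = set(ids)
    errors: list[str] = []
    edges: dict[str, list[str]] = {}

    for index, step in enumerate(steps):
        targets = [step[key] for key in ("then", "otherwise", "next_step") if key in step]
        for target in targets:
            if target not in known:
                errors.append(f"{step['id']}: unknown target {target!r}")
        # Assumption: a non-terminal step without explicit targets falls through.
        if not targets and not step.get("is_terminal") and index + 1 < len(steps):
            targets = [ids[index + 1]]
        edges[step["id"]] = [t for t in targets if t in known]

    # Breadth-first search from the first step to find unreachable ones.
    seen: set[str] = set()
    queue = deque(ids[:1])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(edges[current])

    errors.extend(f"{sid}: unreachable" for sid in ids if sid not in seen)
    return errors
```

Running this once at build time moves a whole class of failures out of production traces.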

Phase 1 (Gateway Correctness): StateContext persistence between MCP calls in SQLite WAL. Right now, if the gateway process restarts after /create but before polling completes, you get a new requestId — which is a real financial duplicate risk. Closing this with an execution_contexts table + upsert on every step. Up next: TRACE projection to SQLite, GovernedToolExecutor (policy-level tool capability enforcement), idempotency_store, and native vm.step() MCP wiring.
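The execution_contexts idea can be sketched with the standard sqlite3 module. Only the table name comes from the plan above; the column names and the `open_store`, `save_context`, and `load_context` helpers are hypothetical, and the real schema may differ.

```python
import json
import sqlite3


def open_store(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # write-ahead logging
    conn.execute(
        """CREATE TABLE IF NOT EXISTS execution_contexts (
               execution_id TEXT PRIMARY KEY,
               step_cursor  TEXT NOT NULL,
               context_json TEXT NOT NULL
           )"""
    )
    return conn


def save_context(conn: sqlite3.Connection, execution_id: str,
                 step_cursor: str, context: dict) -> None:
    # Upsert: one row per execution, overwritten after every step.
    with conn:
        conn.execute(
            """INSERT INTO execution_contexts (execution_id, step_cursor, context_json)
               VALUES (?, ?, ?)
               ON CONFLICT(execution_id) DO UPDATE SET
                   step_cursor = excluded.step_cursor,
                   context_json = excluded.context_json""",
            (execution_id, step_cursor, json.dumps(context)),
        )


def load_context(conn: sqlite3.Connection, execution_id: str):
    row = conn.execute(
        "SELECT step_cursor, context_json FROM execution_contexts WHERE execution_id = ?",
        (execution_id,),
    ).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))
```

After a restart, the gateway would load the saved context and reuse the stored requestId instead of minting a new one.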

Phase 2 (Dev Agent): nano-vm-dev-agent — the FSM runtime managing its own development stack (read_repo_files → generate_patch(llm) → run_mypy → run_pytest → write_repo_files). DA-1 milestone is done (12/12 tests). DA-2 will be the first live run against a real sprint task (StateContext persistence). Still working on search_code and reproduce_bug tool-functions before launching live.

Phase 3 (Observability): OpenTelemetry span per FSM step + incremental counters in Trace (llm_calls, tool_calls, retries_total).

Install
pip install llm-nano-vm==0.8.0

pip install "llm-nano-vm[litellm]==0.8.0" # LiteLLM provider support (quoted so shells like zsh do not expand the brackets)

pip install nano-vm-mcp # MCP gateway

LLMs are completely optional. The runtime works perfectly fine as a pure, lightweight deterministic workflow engine.

Questions / feedback welcome!
