Agents are workflows. SirenSpec is the workflow tool that admits it.

TL;DR: Most production "agents" are really workflows: a fixed sequence of LLM calls with some branching. SirenSpec is a YAML-first SDK that treats them that way. A whole pipeline lives in one .yaml file that a teammate can read in 30 seconds, that you can validate before you run it, and that you can test in CI without spending a cent on tokens.

Two stories from the last year and a half.

A developer's autonomous agent spent $47,000 on itself in a runaway loop before anyone caught it. A different one burned $4,200 over a single weekend: 63 hours of uncapped inference while its owner was at a wedding.

Neither developer was careless. Both wrote code that looked fine. The bug wasn't in a specific line. It was that the shape of the system (what runs, in what order, and what it's allowed to do) was scattered across a state machine in three Python files. You couldn't look at it. You could only trace it after something went wrong.

That's what gets me about the way most agent frameworks are built. Runaway loops are a known failure mode. But the deeper problem is that calling something an "agent" implies intelligence and autonomy, and that framing leads you to build something opaque by default. What both of those systems actually were, underneath the branding, was a sequence of LLM calls with some branching. A workflow. And workflows should be readable.

I'm Tristan, and I built SirenSpec because most production AI workflows shouldn't need a framework at all. They need a spec.

Most "agents" are just workflows

Worth asking before we go further: is "agent" even the right word for what most of us ship?

A few people arrived at the same answer recently. Anthropic's "Building Effective Agents" tells you to start with the simplest thing that works, usually a plain pipeline, and only reach for real agentic behavior when nothing simpler will do. Temporal's team was blunter: "agents are just workflows, really." And an arXiv survey of multi-agent pain points found the top two developer frustrations were orchestration semantics and policy enforcement, exactly the things that vanish into code in most frameworks.

Strip the branding and a production "agent" is usually a sequence of LLM calls, some shared context, a few conditional branches, and rules about what each step is allowed to do. That's a workflow. And if it's a workflow, the definition should be the first thing you read, not something you reconstruct from a pile of StateGraph.add_node() calls.

What SirenSpec is

SirenSpec is a YAML-first agent orchestration SDK. You write the whole pipeline in one file: the agents (model plus system prompt), the nodes (which agent runs, and where its output goes), the edges (order, plus optional branching), and the guardrails (injection detection, PII redaction, output validation, cost caps). Run it from the CLI and you get a JSON trace of every node, token, and decision.

version: "0.1"
env_file: .env

agents:
  researcher:
    model: "openai:gpt-4o"
    system: "Summarize the following for a non-expert."
  writer:
    model: "anthropic:claude-3-5-sonnet-20241022"
    system: |
      Write a 200-word blog intro from this research:
      {{ research.output }}

nodes:
  research:
    agent: researcher
    writes: working.research
  write:
    agent: writer
    writes: output.draft

edges:
  - from: research
    to: write

guardrails:
  - injection
  - name: length
    config:
      max_chars: 1000
  - name: cost_cap
    config:
      max_usd: 0.10

A two-agent pipeline, and also the complete answer to what it does, what it's allowed to do, and in what order. You don't need Python to read it. That matters more than it sounds, because the person who needs to read it usually isn't the person who wrote it.
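To make the {{ research.output }} placeholder concrete, here's a minimal Python sketch of how a template like that could be resolved against earlier node outputs. It's an illustration only, not SirenSpec's actual implementation: render_prompt and the outputs dictionary are hypothetical names I made up for this example.

```python
import re

# Matches placeholders such as {{ research.output }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\.output\s*\}\}")


def render_prompt(template: str, outputs: dict[str, str]) -> str:
    """Replace each {{ node.output }} with that node's recorded output."""

    def substitute(match: re.Match) -> str:
        node = match.group(1)
        if node not in outputs:
            # Fail loudly instead of sending a half-filled prompt to a model.
            raise KeyError(f"node not found: {node}")
        return outputs[node]

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical output from the "research" node.
outputs = {"research": "Cats sleep 12 to 16 hours a day."}
prompt = render_prompt(
    "Write a 200-word blog intro from this research:\n{{ research.output }}",
    outputs,
)
print(prompt)
```

The point of the sketch is the failure path: a reference to a node that never ran is an error you can raise before any model is called.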

Try it in 30 seconds

pip install sirenspec
sirenspec init            # scaffolds a workflow.yaml
sirenspec run workflow.yaml

Install, scaffold, run. No project setup, no boilerplate, and sirenspec validate will catch a broken workflow before it ever calls a model.

Three things it does differently

1. The whole workflow fits on one screen

Here's the GitHub triage example from the cookbook:

version: "0.1"
env_file: .env

agents:
  classifier:
    model: "openai:gpt-4o-mini"
    system: |
      Classify this GitHub issue. Return JSON with:
      category (bug|feature|question|docs), priority (low|medium|high), needs_repro (bool).
      Issue: {{ inputs.message }}
    guardrails:
      - name: schema
        config:
          schema:
            type: object
            required: [category, priority, needs_repro]
  responder:
    model: "anthropic:claude-haiku-4-5-20251001"
    system: |
      Write a friendly triage response.
      Classification: {{ classify.output }}

nodes:
  classify:
    agent: classifier
    writes: working.classification
  respond:
    agent: responder
    writes: output.response

edges:
  - from: classify
    to: respond

guardrails:
  - injection

The equivalent in Python code means wiring up functions, managing prompt strings separately, and threading context between calls manually. You're past 50 lines before you've written a single system prompt.

This isn't a line-count contest. It's about whether the shape of the workflow survives without a Python interpreter running in your head. Hand github-triage.yaml to your PM, your ops lead, or whoever inherits the project after you leave, and they can see what runs, in what order, and what it's not allowed to do. "Shorter code" and "a non-engineer can read it" are different claims. SirenSpec is going for the second one.

2. `sirenspec validate` fails before you push

Before a single API call fires:

sirenspec validate research-pipeline.yaml

✗ Node 'analyze' references undefined agent 'analyzr' — did you mean 'analyzer'?
✗ agents.verify.system: field required
✗ InterpolationError in '{{ missing_node.output }}': node not found

Each line is a real class of bug. A typo'd agent name gets caught at load by Pydantic instead of throwing a KeyError mid-run, which is a thing people hit in CrewAI. A node missing its system prompt surfaces here, not as a confusing provider error three steps in. A prompt that interpolates a node that doesn't exist fails here too. And if node A's prompt references node B while B's references A, SirenSpec catches the cycle at load. LangGraph lets you build it and tells you at runtime.

validate exits 0 or 1, makes no API calls, and costs nothing to run. The bugs other frameworks find in production, yours finds in CI.
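If you want a feel for what pre-run checks like these involve, here's a rough Python sketch using only the standard library. It is not how SirenSpec implements validate (that goes through Pydantic, and I'm not reproducing it here). The validate function, the dict layout, and the edge-based cycle check are all assumptions for illustration.

```python
import difflib
from graphlib import CycleError, TopologicalSorter


def validate(workflow: dict) -> list[str]:
    """Return a list of problems found without calling any model."""
    errors = []
    agents = workflow.get("agents", {})
    nodes = workflow.get("nodes", {})

    # 1. Every node must point at a defined agent.
    for name, node in nodes.items():
        agent = node.get("agent")
        if agent not in agents:
            hint = difflib.get_close_matches(str(agent), list(agents), n=1)
            suffix = f", did you mean '{hint[0]}'?" if hint else ""
            errors.append(
                f"Node '{name}' references undefined agent '{agent}'{suffix}"
            )

    # 2. Every agent needs a system prompt.
    for name, agent in agents.items():
        if not agent.get("system"):
            errors.append(f"agents.{name}.system: field required")

    # 3. The edges must form an acyclic graph.
    sorter = TopologicalSorter()
    for edge in workflow.get("edges", []):
        sorter.add(edge["to"], edge["from"])  # "to" depends on "from"
    try:
        sorter.prepare()
    except CycleError as exc:
        errors.append(f"cycle detected: {' -> '.join(exc.args[1])}")

    return errors


# A deliberately broken, hypothetical workflow.
workflow = {
    "agents": {
        "analyzer": {"model": "openai:gpt-4o", "system": "Analyze."},
        "verify": {"model": "openai:gpt-4o"},  # no system prompt
    },
    "nodes": {"analyze": {"agent": "analyzr"}, "check": {"agent": "verify"}},
    "edges": [
        {"from": "analyze", "to": "check"},
        {"from": "check", "to": "analyze"},
    ],
}
problems = validate(workflow)
for problem in problems:
    print("FAIL:", problem)
```

Everything here runs on the parsed definition alone, which is why a check like this can sit in CI for free.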

3. Guardrails ship in the box

agents:
  classifier:
    model: "openai:gpt-4o-mini"
    system: "Classify this support ticket."
    guardrails:
      - injection                    # prompt-injection detection
      - name: pii                    # redact before the model sees it
        config:
          entities: [email, phone, ssn]
      - name: length
        config:
          max_chars: 2000

These sit on the agent, right next to the model and the prompt. Not a separate library, not middleware, not a plugin you bolt on later. Cost caps live in the same place:

guardrails:
  - name: cost_cap
    config:
      max_usd: 0.50

That one line is the difference between the $47K story and a run that stops itself. It's optional; skip it for a low-stakes internal tool, but when you want it, it's one line, and anyone can open the file and confirm it's there. You can't say that about a setting buried in a Python state machine.
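For intuition, the mechanism behind a cost cap is small. The sketch below is my own toy version in Python, not SirenSpec's code: CostCap, CostCapExceeded, and the 0.20 per-call price are hypothetical, and a real implementation would derive cost from token counts and provider pricing.

```python
class CostCapExceeded(RuntimeError):
    """Raised when a run has spent more than its budget."""


class CostCap:
    def __init__(self, max_usd: float) -> None:
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def charge(self, usd: float) -> None:
        """Record the cost of one call and stop the run if over budget."""
        self.spent_usd += usd
        if self.spent_usd > self.max_usd:
            raise CostCapExceeded(
                f"spent ${self.spent_usd:.2f}, cap is ${self.max_usd:.2f}"
            )


cap = CostCap(max_usd=0.50)
calls_made = 0
try:
    while True:  # stands in for a runaway loop
        cap.charge(0.20)  # hypothetical cost of one model call
        calls_made += 1
except CostCapExceeded as exc:
    print(f"stopped after {calls_made} calls: {exc}")
```

Note that this version checks after each charge, so it can overshoot the cap by at most one call. That's a design choice of the sketch, and still a very different outcome from 63 uncapped hours.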

Cassettes: tests that don't call the API

sirenspec test records a real run once, then replays it. After that, CI runs against the recording: deterministic, instant, no tokens.

# Record against the live API, once
sirenspec test tests/triage_test.yaml --record --cassette cassettes/run.yaml

# Replay in CI — no live calls
sirenspec test tests/triage_test.yaml --mock --cassette cassettes/run.yaml

The closest comparison is Pydantic AI's TestModel, but that's a mock: you assert against synthetic output. A cassette is the real model's response, run through your real pipeline. So when a model update quietly changes what you get back, it shows up as a failing test in a PR, not as a strange trace in production three weeks later.
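The record-then-replay idea is easy to sketch in plain Python. This is a toy illustration, not SirenSpec's cassette format or code: the Cassette class, the JSON file layout, and the fake_live_model stand-in are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class Cassette:
    """Record model responses once, then replay them without live calls."""

    def __init__(self, path: Path, record: bool) -> None:
        self.path = path
        self.record = record
        self.tape = json.loads(path.read_text()) if path.exists() else {}

    def call(self, prompt: str, live_model) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if self.record:
            # Record mode: hit the real model and save what it said.
            self.tape[key] = live_model(prompt)
            self.path.write_text(json.dumps(self.tape, indent=2))
        elif key not in self.tape:
            # Replay mode never falls back to a live call.
            raise LookupError("no recorded response for this prompt")
        return self.tape[key]


def fake_live_model(prompt: str) -> str:
    """Stand-in for a real provider call, so the sketch runs offline."""
    return f"echo: {prompt}"


def must_not_be_called(prompt: str) -> str:
    raise AssertionError("replay should not reach the live model")


cassette_path = Path(tempfile.mkdtemp()) / "run.json"
recorded = Cassette(cassette_path, record=True).call(
    "classify: app crashes", fake_live_model
)
replayed = Cassette(cassette_path, record=False).call(
    "classify: app crashes", must_not_be_called
)
print(recorded == replayed)
```

Keying the tape on a hash of the prompt means a changed prompt is a cache miss, so the test fails instead of silently replaying a stale answer.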

Render: turn your YAML into a diagram

One command turns any workflow into a Mermaid flowchart:

sirenspec render workflow.yaml --target mermaid

Here's the output for the email triage example, a workflow that fetches your latest unread Gmail, fans out to three classifiers in parallel (urgency, intent, sender reputation), then routes to whichever response agent fits:

graph TD
    fetch_email[fetch_email\npython tool]
    triage[triage\nswrm]
    urgency[urgency]
    intent[intent]
    sender[sender]
    synthesis[synthesis]
    draft_reply[draft_reply]
    forward_note[forward_note]
    archive_reason[archive_reason]

    fetch_email --> triage
    triage --> urgency
    triage --> intent
    triage --> sender
    urgency --> synthesis
    intent --> synthesis
    sender --> synthesis
    synthesis -->|reply| draft_reply
    synthesis -->|forward| forward_note
    synthesis -->|archive| archive_reason

Paste it into any Mermaid renderer and you get a diagram of your pipeline without writing a single line of diagram code.
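Part of why this is cheap to offer: once the edges are data, emitting Mermaid is mostly string formatting. Here's a minimal Python sketch of the idea. It isn't the sirenspec render implementation, and the "when" key I use for branch labels is a made-up name, not a documented field.

```python
def to_mermaid(edges: list[dict]) -> str:
    """Render a list of edges as a Mermaid flowchart definition."""
    lines = ["graph TD"]
    for edge in edges:
        label = edge.get("when")  # hypothetical key for a branch label
        arrow = f"-->|{label}|" if label else "-->"
        lines.append(f"    {edge['from']} {arrow} {edge['to']}")
    return "\n".join(lines)


edges = [
    {"from": "classify", "to": "respond"},
    {"from": "synthesis", "to": "draft_reply", "when": "reply"},
]
diagram = to_mermaid(edges)
print(diagram)
```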

This matters more than it sounds, because your workflow's audience is no longer just you. Your PM wants to know what it does. Your manager wants to audit it. And increasingly, your AI coding tools need to understand it too. Mermaid is significantly more token-efficient than ASCII diagrams for LLMs, with less chance of misinterpretation. Drop a rendered diagram into your CLAUDE.md or project README and Codex, Claude Code, or whatever you're pairing with can orient itself in seconds.

What it isn't

If you're sizing this up, here's where it stops.

No dynamic loops, no autonomous tool selection, no handoffs, no memory layer. You write the graph; the graph runs. Connectors, web browsing, and richer tool integrations are on the roadmap, but they're still in planning.

SirenSpec is for the script you've already written more than once: the one that calls OpenAI, retries on a 429, checks a JSON shape, counts tokens, and hopes. That script, with a spec you can read, a validator, and tests around it.

At a glance

Feature	SirenSpec	Big Agent Frameworks	Raw SDK
Readable by non-engineers	✅	❌	❌
Pre-run validation	✅	❌	❌
Guardrails built in	✅	DIY	DIY
CI tests via cassettes	✅	❌	❌
Dynamic agent loops	❌	✅	✅
Provider-agnostic	✅	Varies	❌

A few questions I get

Does it support loops? Bounded ones, yes, via factory nodes. A factory iterates over a list and runs one agent instance per item, with configurable concurrency. The changelog annotator is a good example: one classifier per commit, then a release writer that aggregates them. Open-ended loops, autonomous tool selection, and handoffs are not supported.
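The fan-out pattern behind a factory looks roughly like this in plain Python. It's a sketch under my own assumptions, not SirenSpec's factory code: run_factory and classify_commit are hypothetical, and the classifier is a stub where a real node would call a model.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_commit(message: str) -> str:
    """Stand-in for one agent call; a real one would ask a model."""
    return "fix" if message.lower().startswith("fix") else "feature"


def run_factory(items: list[str], worker, concurrency: int = 4) -> list[str]:
    """Run one worker per item, at most `concurrency` at a time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() returns results in input order, whichever finishes first.
        return list(pool.map(worker, items))


commits = ["Fix null check in parser", "Add CSV export", "fix typo in docs"]
labels = run_factory(commits, classify_commit, concurrency=2)
print(dict(zip(commits, labels)))
```

The loop runs exactly once per item and then stops, which is what separates it from the open-ended kind: you can read the upper bound off the input.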

Which providers? OpenAI, Anthropic, and Ollama today. Gemini, Bedrock, and Groq are on the list.

Why YAML instead of Python? Because the workflow is the thing you want to read, diff in a PR, and hand to someone who doesn't write Python. When the definition lives inside code, "what does this pipeline actually do?" stops having a quick answer.

How do I run the workflow in production? SirenSpec currently ships a lightweight Python SDK with the install. You can load your workflow into Python and execute it from there.

Final Thoughts

A lot of “agents” in production are really just workflows with retries, branching, and memory layered on top. That realization is what led me to build SirenSpec.

We’re still early at v0.1.1, which makes this a fun stage to experiment in.

I'd love to hear from you:

How much of your company's “agent” stack is actually deterministic underneath?
If you're a non-technical founder, PM, or hobbyist vibecoder, when have you hit a wall building AI workflows or agents?

If any of that sounds familiar, I’d love to hear how your team is approaching it. You can check out SirenSpec on GitHub or browse the docs.
