How This Article Was Built (And Why I'm Showing You the Kitchen)
Disclaimer up front: I'm not entering the Hermes Agent challenge. I noticed the challenge and realized I could use my AI pipeline to write an article about Hermes Agent architecture. So I did. And thought, why not share both the result and the process that created it? What I actually want is your honest criticism.
Who Is The Author?
For the past several months I've been building Bestaiweb, navigating the shift from traditional development to AI. The site runs on Hugo, and the content is generated through what I call an AI content pipeline. The pipeline itself is built in TypeScript, orchestrated through Claude Code, and runs on Anthropic's Claude models. Still in progress.
That phrase — "AI content pipeline" — probably triggered your slop detector. Fair. Let me explain why I think this case is different, and then let you judge.
The Pipeline
BestAIweb currently has 450+ technical articles across 45 topic clusters. Every article goes through a multi-phase pipeline:
- Market scanning — an LLM agent surveys the current tool and framework landscape for each topic, identifying what's leading, what's declining, and what's emerging
- Query fan-out — the pipeline generates the questions a developer would actually search for, not the questions that sound good as headlines
- Research — a dedicated research agent gathers facts, version numbers, benchmark data, and source URLs. Everything gets a structured fact sheet
- Writing: here's where personas come in. The pipeline has four author personas, each with a distinct voice and content type specialization:
  - MAX, the engineer. Writes step-by-step guides. Pragmatic, implementation-focused, opinionated about tool choices
  - MONA, the explainer. Breaks down concepts. Thinks in diagrams and mental models
  - DAN, the reporter. Covers news, market shifts, and what just shipped
  - ALAN, the critic. Writes opinion pieces and ethical assessments
- Claim verification — a separate agent cross-checks every factual claim against the research fact sheet. Unsupported claims get flagged
- Deterministic validation — a Python script runs 30+ structural and quality checks: word count, link integrity, frontmatter completeness, source coverage
- Hugo integration — the article lands in the static site with schema.org markup, generated images, and internal links
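If you're wondering what "deterministic" means in practice, here is a minimal sketch of that kind of check. It is not the actual pipeline script: the function name, the required frontmatter keys, and the 800-word minimum are assumptions picked for illustration, and it covers three checks rather than 30+.

```python
import re

# Hypothetical rule set; the real validator's checks and thresholds are not shown in this post.
REQUIRED_FRONTMATTER = ("title", "date", "description")
MIN_WORDS = 800


def validate_article(text: str, min_words: int = MIN_WORDS) -> list[str]:
    """Return a list of human-readable failures; an empty list means the article passed."""
    failures = []

    # Split Hugo-style frontmatter (between the first two '---' lines) from the body.
    match = re.match(r"^---\n(.*?)\n---\n(.*)$", text, flags=re.DOTALL)
    if not match:
        return ["missing frontmatter block"]
    frontmatter, body = match.groups()

    # Frontmatter completeness: every required key must start a line.
    for key in REQUIRED_FRONTMATTER:
        if not re.search(rf"^{key}\s*:", frontmatter, flags=re.MULTILINE):
            failures.append(f"frontmatter missing '{key}'")

    # Word count on the body only.
    word_count = len(body.split())
    if word_count < min_words:
        failures.append(f"word count {word_count} below minimum {min_words}")

    # Link sanity: markdown links must have a non-empty target.
    for label, target in re.findall(r"\[([^\]]*)\]\(([^)]*)\)", body):
        if not target.strip():
            failures.append(f"empty link target for '{label}'")

    return failures
```

The point of keeping this outside the LLM is that the same input always produces the same verdict, so a failed check is a fact about the article and not an opinion.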
The Hermes Agent guide below was written by MAX using his guide template. His tone of voice is direct, specification-oriented, and allergic to hand-waving. The template enforces a fixed structure: prerequisites, numbered steps, pitfalls table, FAQ, and a deployable artifact at the end.
The Multi-Model Judging Layer
Pipeline generation was step one. Then came a manual judging round. I pasted the draft into ChatGPT, Gemini, and DeepSeek and asked each to evaluate it as a technical reviewer, checking factual accuracy, logical gaps, tone inconsistencies, and whether the advice would actually work if someone followed it. I then reviewed their feedback together with Claude Code and incorporated the changes that held up under scrutiny.
The AI Slop Question
Here's the question I keep circling back to: Is everything AI-generated inherently slop?
The reflexive answer in 2026 is "yes, obviously." And for most AI-generated content, that's correct. GPT-powered blog farms, SEO filler, those LinkedIn posts prompted with "write a thought leadership post about AI" — that is slop. Generated without specification, without sourcing, without verification, and without a quality gate.
But what about content where:
- Every factual claim traces to a documented source (GitHub issues, official docs, arxiv papers)
- A claim verification agent flags unsupported statements before publication
- A deterministic validator enforces structural quality independent of the LLM
- The voice and structure come from a multi-page specification, not a one-line prompt
- Multiple independent models review the output for different failure modes
Is that still slop? Or is it closer to what a well-managed editorial team produces — except the heavy lifting is done by LLMs under human direction?
I genuinely don't know the answer. That's why I'm sharing this.
What I'd Like From You
Criticism. Specifically:
- Does the article below read like AI slop? If yes, what gives it away — the sentence rhythm, the structure, the depth, or something else?
- Is the technical content accurate? If you've deployed Hermes Agent or any persistent agent framework, does the three-layer model match your experience? Did I miss a critical failure mode?
- Does the pipeline approach change anything? Is multi-phase generation with claim verification and multi-model judging enough to produce content worth reading? Or is it just expensive slop with better sourcing?
I'm not looking for "great article!" responses. I'm looking for the engineer who says "this is wrong because..." or "you missed the part where..." That feedback makes the next pipeline iteration better.
More Guides From the Same Pipeline
If you want to judge more output from the same pipeline and the same MAX persona, the full library has 95+ implementation guides from him, covering agents, RAG, training, inference, evaluation, and image generation.
What follows is the article as the pipeline produced it, after multi-model review. Judge for yourself.
How to Architect Always-On AI Agents with Hermes: Decompose, Specify, Deploy
TL;DR
- Persistent agents need three specs your chatbot never did: memory policy, tool boundaries, and session recovery
- Hermes Agent is model-agnostic — the model choice matters less than how you specify context, tools, and failure handling
- Always-on means always-failing-somewhere — build validation into the deployment spec, not as an afterthought
You spun up Hermes Agent on a Friday evening. Gave it access to Slack, a web scraper, and your project database. Told it to "keep the team updated on competitor releases." Monday morning: 47 Slack messages, three of them citing products that don't exist, and a web scraper loop that burned through your OpenRouter credits overnight. The agent ran exactly as specified. The specification was the problem.
Before You Start
You'll need:
- A Linux or macOS server (even a $5 VPS works — Hermes Agent runs on minimal hardware)
- An LLM provider account (OpenRouter, Anthropic, OpenAI, or a local runtime like Ollama)
- Understanding of function calling — how models invoke external tools
- A clear picture of what your agent should do when you're not watching
This guide teaches you: How to decompose a persistent agent deployment into specifiable components so Hermes Agent does what you intended — not what you literally typed.
What this guide does NOT cover:
- Production security hardening (firewall rules, secrets management, network isolation)
- Enterprise compliance (SOC 2, GDPR data residency, audit certification)
- Full evaluation frameworks (systematic benchmarking, regression test suites)
- Model fine-tuning or training (Hermes models are pre-trained; this guide covers the agent framework)
The Agent That Worked Until It Didn't
Here's the pattern. Developer discovers Hermes Agent. Reads that it has persistent memory, self-improving skills, 20+ platform integrations. Installs it. Connects everything. Types a system prompt. Walks away.
Two things happen next. Either the agent does nothing useful because the specification was too vague. Or it does too much because the boundaries were never set.
According to Hermes Agent GitHub Issues, long sessions exceeding 700K tokens trigger environment hallucination — the agent confuses tool descriptions with actual environment state. It starts acting on what it thinks is true rather than what is true. This isn't a bug in the traditional sense. It's a specification gap. You never told the agent when to stop, reset, or ask for help.
Step 1: Map the Three Layers
Hermes Agent is not a single system. It's three systems wearing a trench coat.
Your deployment has these parts:
- The runtime layer — where the agent executes (Docker, SSH, Modal, local terminal). This determines resource limits, restart behavior, and isolation
- The intelligence layer — the LLM provider and model. This determines reasoning quality, context window size, and cost per token
- The integration layer — platform connections (Slack, Telegram, web tools) and the tools the agent can invoke. This determines what the agent can touch in the real world
The Architect's Rule: If you can't draw a clear line between what the agent thinks, where it runs, and what it touches — your spec is incomplete.
According to Hermes Agent Docs, the framework supports 30+ providers and 7 terminal backends. That flexibility is the point — and the trap. Every combination has different failure modes. A Modal serverless backend hibernates when idle. An Ollama local model defaults to 4K context tokens. An SSH backend loses the agent if the connection drops. You need to specify which combination you're using and what happens at each boundary.
One thing the "always-on" framing obscures: what happens when the LLM provider goes down? OpenRouter has outages. API rate limits hit. Local models crash. An always-on agent needs a fallback plan — a secondary provider, a circuit breaker that pauses tool execution after N consecutive failures, or at minimum a notification that the agent is degraded. Specify this in the runtime layer, not as an afterthought.
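The circuit breaker idea can be sketched in a few lines. This is a generic pattern, not a Hermes Agent feature: the class name, the thresholds, and the `primary` / `fallback` callables are all hypothetical, and you would wire something like this into whatever wrapper sits between the agent and its provider.

```python
import time


class CircuitBreaker:
    """Pause calls after `max_failures` consecutive failures, then retry after a cooldown."""

    def __init__(self, max_failures: int = 3, cooldown_seconds: float = 300.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # time the breaker tripped, or None while closed

    def allow(self) -> bool:
        """Return True if a call may proceed right now."""
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe call through (half-open state).
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            self.opened_at = self.clock()


def call_with_breaker(breaker, primary, fallback):
    """Try the primary provider; on failure or an open breaker, use the fallback."""
    if breaker.allow():
        try:
            result = primary()
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback()
```

In a real deployment the `fallback` would be a secondary provider call or a "degraded" notification, and the bare `except Exception` would be narrowed to the provider's timeout and rate-limit errors.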
Step 2: Lock Down the Context Contract
The intelligence layer needs a specification before it sees a single user message. This is where most deployments fail — not in the tools, not in the platform, but in the context that frames every decision the agent makes.
Context checklist:
- System prompt with explicit role boundaries (what the agent does and does NOT do)
- Memory policy: what gets persisted, what gets discarded, and when
- Tool authorization with risk classification (see table below)
- Access control: which platforms and channels can trigger the agent (not every DM deserves a response)
- Session limits: when to compress or reset (per Hermes Agent Docs, the default is auto-compression at 50% of the model's context window, plus a hard ceiling of 400 messages)
- Output format contracts: how the agent reports results on each platform
- Rate limits: maximum messages per minute per platform (an agent with no rate limit is a spam bot waiting to happen)
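The rate-limit item is the easiest one on this checklist to enforce outside the model. A minimal sliding-window sketch, assuming you control the code path that sends messages; the class name and the per-channel keying are illustrative choices, not Hermes Agent settings:

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_events` per `window_seconds`, tracked per key (e.g. per channel)."""

    def __init__(self, max_events: int, window_seconds: float, clock=time.monotonic):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.clock = clock
        self._events = {}  # key -> deque of send timestamps

    def try_acquire(self, key: str) -> bool:
        """Return True and record the event if the key is under its limit."""
        now = self.clock()
        timestamps = self._events.setdefault(key, deque())
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_events:
            return False
        timestamps.append(now)
        return True
```

Usage would look like `limiter = SlidingWindowLimiter(3, 3600)` followed by `limiter.try_acquire("#competitor-monitoring")` before each post. A rejected send should be logged, not silently dropped, so you can see how often the agent hits the ceiling.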
Tool Risk Classification
An always-on agent with database access and Slack permissions is making autonomous decisions about your data and your team's attention. Classify every tool before you enable it.
| Risk Class | Description | Example Tools | Authorization |
|---|---|---|---|
| read-only | Observes, never modifies | web_search, database_query (SELECT), file_read | Auto-approved |
| reversible-write | Creates or modifies, can be undone | file_write, note_create, draft_message | Auto-approved with audit log |
| irreversible-write | Deletes or overwrites permanently | file_delete, database_delete, channel_archive | Requires human confirmation |
| external-send | Sends to humans or external systems | slack_post, email_send, webhook_trigger | Rate-limited + audit log |
| billing-sensitive | Incurs direct cost | api_call (paid), image_generate, compute_spawn | Budget ceiling + alert |
The Spec Test: If your system prompt doesn't mention what happens at 3 AM when the agent encounters an error and no human is online, you've specified a supervised agent and deployed it as unsupervised. If it doesn't classify tool risk levels, the agent treats database_delete and web_search as equally safe. If it doesn't set a compression trigger, the default (50% context window) may or may not match your workload.
Here's what a minimal context contract looks like in practice. This is the MEMORY.md the agent reads on every session start:
# MEMORY.md — Agent Operating Contract
role: "Monitor competitor AI product releases for the engineering team"
boundaries:
  - "NEVER post to channels outside #competitor-monitoring"
  - "NEVER summarize or forward internal company data"
  - "NEVER execute irreversible-write tools without human confirmation"
  - "Maximum 3 Slack messages per hour"
tools:
  auto_approved: [web_search, file_read]
  rate_limited: [slack_post]  # max 3/hour
  requires_confirmation: [file_delete, database_write]
  forbidden: [email_send, channel_archive]
memory_policy:
  persist: "confirmed competitor releases, product names, dates"
  discard: "intermediate search results, draft summaries"
  compress_after: "50%"  # of context window
escalation: "If uncertain about any action, post to #agent-review instead"
A critical distinction: A memory or system-prompt policy is not a security boundary. Writing "NEVER execute irreversible-write tools" in MEMORY.md is a behavioral instruction to the model, not a technical lock. The model can ignore it — especially under long-context degradation or adversarial input. Destructive tools should be blocked or approval-gated at the runtime level (process permissions, API middleware, webhook filters), not merely discouraged in instructions. Treat the YAML above as the agent's intent. Build enforcement outside the model.
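What "enforcement outside the model" can look like: a gate that every tool call passes through before it executes. This is a sketch under assumptions, not a Hermes Agent API. The `authorize` function, the registry, and the flags passed in are hypothetical; the tool names and risk classes come from the table earlier in this step.

```python
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read-only"
    REVERSIBLE_WRITE = "reversible-write"
    IRREVERSIBLE_WRITE = "irreversible-write"
    EXTERNAL_SEND = "external-send"
    BILLING_SENSITIVE = "billing-sensitive"


# Hypothetical registry mirroring the example tools from the risk table.
TOOL_RISK = {
    "web_search": Risk.READ_ONLY,
    "file_read": Risk.READ_ONLY,
    "file_write": Risk.REVERSIBLE_WRITE,
    "file_delete": Risk.IRREVERSIBLE_WRITE,
    "slack_post": Risk.EXTERNAL_SEND,
    "image_generate": Risk.BILLING_SENSITIVE,
}
FORBIDDEN = {"email_send", "channel_archive"}


def authorize(tool: str, human_approved: bool = False,
              within_rate_limit: bool = True, within_budget: bool = True) -> tuple[bool, str]:
    """Decide whether a tool call may run. Unknown tools are denied by default."""
    if tool in FORBIDDEN:
        return False, "forbidden by policy"
    risk = TOOL_RISK.get(tool)
    if risk is None:
        return False, "unregistered tool"
    if risk is Risk.IRREVERSIBLE_WRITE and not human_approved:
        return False, "requires human confirmation"
    if risk is Risk.EXTERNAL_SEND and not within_rate_limit:
        return False, "rate limit exceeded"
    if risk is Risk.BILLING_SENSITIVE and not within_budget:
        return False, "budget ceiling reached"
    return True, f"allowed ({risk.value})"
```

The design choice that matters is deny-by-default: a tool name the model invents, or one you forgot to classify, never runs. Every decision and its reason should also go to the audit log.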
According to Hermes Agent GitHub Issues, the persistent notes layer has a limit of roughly 2,200 characters. That's the manually curated knowledge — not the agent's entire memory. Hermes also maintains a full-text search index over past sessions and a per-person user model that evolves automatically. So the agent isn't blind between sessions. But the notes layer is where you store hard constraints and project-critical context, and 2,200 characters fills up fast across three projects. You still need a compression strategy for notes — what gets stored verbatim, what moves to session history, what gets dropped.
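One way to make the notes budget concrete is to rank entries and keep only what fits. The sketch below assumes you maintain notes as prioritized entries before writing them out; the priority scheme and the `fit_notes` helper are hypothetical, and the 2,200 figure is the approximate limit cited above, not a value read from Hermes configuration.

```python
NOTES_LIMIT = 2200  # approximate character cap reported for the notes layer


def fit_notes(entries: list[tuple[int, str]], limit: int = NOTES_LIMIT) -> tuple[list[str], list[str]]:
    """Keep the highest-priority notes that fit within `limit` characters.

    `entries` holds (priority, text) pairs; a lower number means more important.
    Returns (kept, overflow); overflow must move to session history or be dropped.
    """
    kept, overflow = [], []
    used = 0
    for _, text in sorted(entries, key=lambda entry: entry[0]):
        cost = len(text) + 1  # +1 for the newline that separates entries
        if used + cost <= limit:
            kept.append(text)
            used += cost
        else:
            overflow.append(text)
    return kept, overflow
```

Hard constraints get priority 1 and stay verbatim. Whatever lands in `overflow` is your signal that a project needs its own summary line instead of five detailed ones.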
Step 3: Wire the Components in Order
Deployment order matters. Each layer depends on the one below it.
Build order:
- Runtime first — because everything else crashes without a stable execution environment. Choose your backend, set resource limits, configure restart-on-failure
- Intelligence layer next: because tool and platform behavior depends on the model's capabilities. According to Hermes Agent Docs, vLLM requires explicit --enable-auto-tool-choice and --tool-call-parser flags. Without them, the model outputs tool calls as plain text instead of executing them
- Integration layer last: because platform connections should only activate after the agent can reason and recover from errors. Connect Slack after the agent handles tool failures gracefully, not before
For each component, your specification must cover:
- What it receives (inputs and triggers)
- What it returns (outputs and side effects)
- What it must NOT do (boundaries and prohibitions)
- How it handles failure (retry logic, fallback behavior, human escalation)
The self-improving skills feature is powerful — Hermes Agent automatically creates workflow documents from successful task completions and refines them over time. But the skill creation itself needs a boundary spec. Without one, the agent writes skills for one-off tasks, cluttering the skill library with noise.
Skill boundary example — add this to your system prompt:
skills_policy:
  auto_create: ["competitor-monitoring", "weekly-summary", "data-formatting"]
  never_create: ["one-off-queries", "debugging-sessions", "ad-hoc-searches"]
  review_before_use: ["any skill not used in 14+ days"]
  max_skills: 20  # force deduplication when library exceeds this
Without this, the agent treats every successful task as a reusable pattern. Three months in, you have 200 skills — most of them variations of the same web search with slightly different parameters.
One more thing about skills: they can regress. A skill written for Hermes-3-8B may produce wrong tool calls after switching to a different model. A skill that relies on a specific API endpoint breaks when that endpoint changes. Skills older than 30 days should be re-validated or archived. The review_before_use field above is your safety net — but only if you actually review them.
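The review itself can be scripted so it actually happens. A sketch, assuming you can export each skill's last-used date from wherever your deployment stores skills; the `triage_skills` helper and its input format are hypothetical, and the 14-day and 20-skill defaults mirror the policy example above.

```python
from datetime import date, timedelta


def triage_skills(last_used: dict[str, date], today: date,
                  review_after_days: int = 14, max_skills: int = 20) -> dict[str, list[str]]:
    """Split skills into active and needs-review, and flag library overflow.

    `last_used` maps a skill name to the date it was last used.
    """
    cutoff = today - timedelta(days=review_after_days)
    needs_review = sorted(name for name, used in last_used.items() if used <= cutoff)
    active = sorted(name for name, used in last_used.items() if used > cutoff)
    # If the library is over the cap, the least recently used skills are dedup candidates.
    excess = max(0, len(last_used) - max_skills)
    by_age = sorted(last_used, key=lambda name: last_used[name])
    return {"active": active, "needs_review": needs_review, "dedup_candidates": by_age[:excess]}
```

Run it on a schedule and post the `needs_review` list to the same channel you use for escalations. A list nobody sees is the same as no review.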
Step 4: Prove It's Actually Working
Running the agent is not validation. Validation means you know what "correct" looks like and can detect when the agent drifts from it.
Validation checklist:
- Memory consistency — after 24 hours, does the agent's memory reflect reality? Failure looks like: agent references a "completed" task that was never finished, or forgets a constraint you set yesterday
- Tool call accuracy: are tool invocations well-formed and targeted? Failure looks like: invalid function names, malformed arguments, or calls to tools that aren't registered. This is a general problem with LLM-driven tool use, not Hermes-specific. Any agent framework that delegates tool selection to a model will hit it. Hermes Agent GitHub Issues documents concrete examples like todo:list calls that don't match any schema
- Platform output quality: are messages to Slack/Telegram/Discord useful and accurate? Failure looks like: hallucinated product names, duplicate messages, or empty responses
- Cost trajectory — is daily token usage stable or growing? Failure looks like: runaway context accumulation driving costs up 10x within a week
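The cost check is simple arithmetic once you log daily token totals. A minimal sketch, assuming you already collect one number per day from your provider's usage data; the function name, the 7-day baseline, and the 2x alert ratio are illustrative defaults, not recommendations from the Hermes docs.

```python
from statistics import median


def cost_drift(daily_tokens: list[int], baseline_days: int = 7,
               alert_ratio: float = 2.0) -> tuple[float, bool]:
    """Compare the latest day's token usage with the median of the preceding days.

    Returns (ratio, alert). Needs at least one baseline day before the latest one.
    """
    if len(daily_tokens) < 2:
        raise ValueError("need at least two days of usage data")
    *history, latest = daily_tokens
    baseline = median(history[-baseline_days:])
    if baseline == 0:
        # No usage in the baseline: any usage today is worth a look.
        return (float("inf") if latest > 0 else 0.0), latest > 0
    ratio = latest / baseline
    return ratio, ratio >= alert_ratio
```

The median is used instead of the mean so that one earlier spike doesn't raise the baseline and hide the next one.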
Common Pitfalls
| What You Did | Why the Agent Failed | The Fix |
|---|---|---|
| One-shot system prompt: "monitor competitors" | No boundaries — agent decides scope, frequency, and format | Decompose into: what to monitor, how often, where to report, what format |
| Connected all tools on day one | Agent uses tools in unexpected combinations | Enable tools incrementally, validate each before adding the next |
| Chose a 4K-context local model | Tool schemas + system prompt + memory exceed context | Use minimum 16K–32K context for tool-calling workloads |
| No session hygiene policy | 700K+ token sessions trigger hallucination loops | Use Hermes built-in compression (default: 50% context window) and set a hard message ceiling. Monitor context growth. |
| Skipped memory policy | Agent stores everything, including noise | Specify what gets persisted: decisions, outcomes, blockers. Not intermediate reasoning |
Pro Tip
The specification you write for Hermes Agent is not a prompt. It's an operating manual for an unsupervised system. The same decomposition — runtime, intelligence, integration — works for any persistent agent, regardless of framework. The tools change. The layers don't.
Frequently Asked Questions
Q: How does Hermes Agent's persistent memory differ from conversation history?
A: Conversation history is a raw log that grows until it hits the context window limit. Hermes uses three structured layers: persistent notes you curate manually, a full-text search index over past sessions, and a user model that evolves per-person. The practical difference — session history gets summarized and compressed, while persistent notes survive indefinitely. Watch for the 2,200-character limit on notes: it forces disciplined compression.
Q: Can I run Hermes Agent with local models instead of cloud API providers?
A: Yes. Ollama, vLLM, SGLang, llama.cpp, and LM Studio all work as backends. The catch is context window configuration. Ollama defaults to 4K tokens, which isn't enough once you add tool schemas and system prompts. Set the context window explicitly to at least 16K on the server side. For vLLM, you also need the --enable-auto-tool-choice and --tool-call-parser flags, or tool calls render as text.
Q: What context window size does Hermes Agent need for reliable tool calling?
A: According to Hermes Agent Docs, minimum 16K–32K tokens for agent workloads with tools. The system prompt, tool schemas, memory context, and conversation history all compete for the same window. With 5+ tools registered, 32K is the safer starting point. Below that, the model starts dropping tool definitions mid-session.
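To see why a small window runs out, do the arithmetic. The sketch below is an estimate, not how Hermes Agent accounts for context internally: the function is hypothetical, it assumes the compression trigger applies to total window usage including fixed overhead, and the token counts in the usage note are made-up illustrative values you should replace with measurements from your model's tokenizer.

```python
def remaining_context(context_window: int, system_prompt: int, tool_schemas: list[int],
                      memory: int, compress_at: float = 0.5) -> dict[str, int]:
    """Estimate how many tokens are left for conversation before compression triggers.

    All inputs are token counts. `compress_at` is the fraction of the window
    at which compression starts (0.5 matches the documented default).
    """
    fixed = system_prompt + sum(tool_schemas) + memory
    trigger = int(context_window * compress_at)
    return {
        "fixed_overhead": fixed,
        "compression_trigger": trigger,
        "conversation_budget": max(0, trigger - fixed),
    }
```

With an 800-token system prompt, five 400-token tool schemas, and 600 tokens of memory, the fixed overhead is 3,400 tokens. At a 4,096-token window that already exceeds the 2,048-token trigger, so the conversation budget is zero. At 32,768 tokens the same setup leaves 12,984 tokens before compression.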
Q: How do I prevent hallucination loops in long-running Hermes Agent sessions?
A: Hermes has built-in session compression — by default it triggers at 50% of the model's context window, with a hard ceiling of 400 messages. According to Hermes Agent Docs, these thresholds are configurable. The documented failure zone is 700K+ tokens, where environment hallucination has been observed. Keep compression active, tune the trigger percentage for your workload, and monitor for repeated identical tool calls — that's the earliest signal of a loop forming. Store critical state in persistent notes before any forced reset.
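Monitoring for repeated identical tool calls can be sketched as a small detector fed from your tool-call log. This is an external monitor, not a Hermes Agent feature; the class name and the thresholds are hypothetical.

```python
import json
import time
from collections import deque


class LoopDetector:
    """Flag when the same tool call (name + arguments) repeats N times within a time window."""

    def __init__(self, max_repeats: int = 3, window_seconds: float = 600.0, clock=time.monotonic):
        self.max_repeats = max_repeats
        self.window_seconds = window_seconds
        self.clock = clock
        self._seen = {}  # call fingerprint -> deque of timestamps

    def observe(self, tool: str, arguments: dict) -> bool:
        """Record a call; return True if it looks like a loop."""
        # sort_keys makes the fingerprint independent of argument order.
        fingerprint = tool + ":" + json.dumps(arguments, sort_keys=True, default=str)
        now = self.clock()
        timestamps = self._seen.setdefault(fingerprint, deque())
        # Forget occurrences that fell out of the window.
        while timestamps and now - timestamps[0] > self.window_seconds:
            timestamps.popleft()
        timestamps.append(now)
        return len(timestamps) >= self.max_repeats
```

When `observe` returns True, the safe reaction is to pause tool execution and alert a human. Letting the agent "notice" the loop itself relies on the same degraded context that caused it.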
Your Spec Artifact
By the end of this guide, you should have:
- A three-layer deployment map — runtime, intelligence, and integration with explicit boundaries between each
- A context contract with tool risk classification — system prompt, memory policy, tool authorization by risk class, access control, rate limits, and output format per platform
- A security baseline — tool isolation, rate limiting, audit logging, and escalation paths
- A validation checklist — memory consistency, tool call accuracy, output quality, and cost trajectory checks you run daily
Your Deployment Spec Prompt
This prompt generates a first draft of your agent specification — not a production-ready deployment. Paste it into Claude Code, Cursor, or your preferred AI coding tool. Fill in every bracketed placeholder with your specific values from Steps 1-4.
I'm specifying a Hermes Agent deployment. Generate a first-draft specification
based on these inputs. I will review and harden it before production use.
RUNTIME LAYER:
- Backend: [Docker / SSH / Modal / local — pick one]
- Resource limits: [RAM, CPU cores, disk]
- Restart policy: [on-failure / always / manual]
- Server: [OS, VPS provider, specs]
INTELLIGENCE LAYER:
- LLM provider: [OpenRouter / Anthropic / Ollama / vLLM — pick one]
- Model: [model name and size]
- Context window: [minimum 16K — specify exact value]
- Provider-specific flags: [e.g., --enable-auto-tool-choice for vLLM]
INTEGRATION LAYER:
- Platforms: [Slack / Telegram / Discord — list all]
- Allowed trigger channels: [e.g., only #competitor-monitoring, not DMs]
- Tools by risk class:
  - read-only (auto-approved): [web_search, file_read, database SELECT]
  - reversible-write (auto + audit): [file_write, note_create]
  - irreversible-write (human approval): [file_delete, database DELETE]
  - external-send (rate-limited): [slack_post, max messages/hour]
  - billing-sensitive (budget ceiling): [paid API calls, max $/day]
- Tools forbidden: [list tools the agent must never invoke]
CONTEXT CONTRACT:
- Agent role: [one sentence — what this agent does]
- Explicit boundaries: [what the agent must NOT do, stated as prohibitions]
- Memory policy: [what gets persisted, what gets discarded, compression rules]
- Compression trigger: [percentage of context window — default 50%]
- Hard message ceiling: [number — default 400]
- Output format per platform: [e.g., Slack = bullet points, email = report]
- Skill boundary: [which task categories auto-generate skills, which don't]
SECURITY & PERMISSIONS:
- Access control: [which platforms/channels can trigger the agent]
- Rate limits per platform: [messages per minute/hour]
- Destructive action policy: [never auto-approve / require confirmation / forbidden]
- Audit log location: [where tool calls + results are logged]
OBSERVABILITY:
- Log format: [timestamp, tool name, input summary, output status, cost estimate]
- Loop detection: [alert on N repeated identical tool calls within M minutes]
- Cost alerts: [alert when daily spend exceeds $X]
- Error spike alerts: [alert when tool error rate exceeds X% in Y minutes]
DRY RUN:
- Generate a dry-run mode where all external-send and write tools are simulated
- Include 5 test scenarios that exercise each risk class
VALIDATION:
- How to verify memory consistency after [24h / 48h / 7d]
- Expected daily token usage range: [min–max tokens]
- Escalation trigger: [what condition sends an alert to a human]
RULES FOR GENERATION:
- Do not invent Hermes-specific configuration fields. If Hermes does not support a field natively, label it as "external wrapper / policy layer required".
- For every generated config field, mark one of:
  [native]: Hermes Agent built-in setting
  [prompt]: system prompt / MEMORY.md behavioral instruction
  [external]: requires runtime middleware, API gateway, or wrapper script
  [manual]: operational checklist item, not automatable
- Before generating final output, separate policy from enforcement:
  - What the model is instructed to do (behavioral, can be ignored)
  - What the runtime technically prevents (enforced, cannot be bypassed)
  - What requires human approval (gated)
  - What is only monitored after the fact (observable but not blocked)
Generate:
1. The MEMORY.md agent operating contract (see article for format example)
2. The tool authorization config with risk classifications (each field tagged as native / prompt / external / manual)
3. A daily validation checklist
4. Cost and error monitoring alert thresholds
5. A dry-run test plan with 5 scenarios
Ship It
You now have a framework for specifying persistent agents that doesn't depend on Hermes Agent specifically — the three-layer model works for any long-running AI system. The difference between an agent that helps and one that burns your credits at 3 AM is never the model. It's the spec.
Different Perspectives
From the architecture side: The three-layer decomposition maps cleanly to isolation boundaries in distributed systems. Runtime is the execution substrate. Intelligence is the reasoning process. Integration is the I/O surface. What makes persistent agents architecturally distinct from request-response chatbots is that all three layers maintain state across invocations — and state synchronization between layers is where failure modes cluster. The memory limit finding is telling: the notes layer caps at 2,200 characters while session search and user modeling compensate, but the degradation curve of each layer matters more than the initial capability.
From the market side: The adoption velocity here is real — 157K GitHub stars in under four months signals a market that was waiting for open-source persistent agents. The competitive positioning against Claude Code and OpenAI Agents SDK is smart: Hermes doesn't compete on code quality or API simplicity, it competes on uptime and learning. The $5-80/month self-hosted cost structure undercuts every managed alternative. Watch for the enterprise play — the moment Nous Research ships team memory sharing, this becomes an infrastructure layer, not a developer tool.
From the governance side: The specification gap described above is a governance gap by another name. An always-on agent with tool access and persistent memory is making autonomous decisions on behalf of someone — and the specification determines whose values it encodes. The hallucination loop at 700K tokens is not just a technical failure. It's an agent acting on a reality that doesn't exist, with real-world consequences on the platforms it's connected to. Who reviews the specification before deployment? Who monitors drift between what was specified and what the agent learned? The self-improving skills feature means the agent's behavior changes over time without human approval. At what scale does that become a problem?
Sources
- NousResearch/hermes-agent - Official repository, release notes, community issues
- Hermes Agent Documentation - Provider configuration, deployment backends, platform integrations
- Provider Integration Guide - Context window requirements, vLLM flags, Ollama configuration
- Configuration Reference - Session compression defaults, message ceiling, memory hygiene settings
- GitHub Issue #5563 - Environment hallucination in long sessions, memory limits
- GitHub Issue #8993 - Tool calling instability (general LLM agent problem, documented here with Hermes-specific examples)
- Hermes-2-Pro-Llama-3-8B Model Card - Function calling format, benchmark results
- Hermes 3 Technical Report (arXiv:2408.11857) - Architecture, training approach, benchmark performance

























