Imagine hiring a brilliant software engineer who suffers from complete amnesia every time they blink.
Every time you ask them a question, you have to hand them their entire employment history, the codebase documentation, your style guide, and a summary of every conversation you’ve ever had with them. They process the information, give you a great answer, and then—blink—it’s all gone.
This is the exhausting reality of stateless AI applications.
Most developers building with Large Language Models (LLMs) today are stuck in this stateless paradigm. They write clever prompts, wrap them in an API call, and rely on the application layer to aggressively feed the entire chat history back into the context window with every new turn. It’s expensive, it’s inefficient, and it places a hard ceiling on how smart an agent can actually become.
To build truly autonomous, adaptive, and personalized AI systems, we must cross the chasm from stateless interactions to stateful agents.
In this deep dive, we will explore the architecture of the Hermes Agent—a stateful AI system that possesses persistent memory, a continuous learning loop, and the ability to evolve alongside its user. We will break down the engineering patterns behind statefulness and walk through a complete Python implementation to build your own self-improving agent from scratch.
(The concepts and code demonstrated here are drawn from my ebook Hermes Agent, The Self-Evolving AI Workforce)
The Stateless Ceiling: Why Vending Machines Make Poor Assistants
To understand the power of statefulness, we must first look at why statelessness cripples AI agents.
Think of a stateless system like a vending machine. You insert a dollar, press a button, and get a soda. The vending machine doesn't care who you are, what your health goals are, or that you bought the exact same drink yesterday. Every transaction is an isolated, self-contained event. It has no memory of its past, no context for the present, and no capacity to learn for the future.
Early LLM applications operate exactly like this. You send a prompt, and the model returns a response. The model itself does not change.
# A classic stateless utility call
import datetime
def parse_date(date_string: str) -> datetime.datetime:
return datetime.datetime.strptime(date_string, "%Y-%m-%d")
This simple Python function is a stateless transaction. It takes an input, returns an output, and immediately forgets the operation ever happened. It doesn't learn that you frequently parse dates from European formats, nor does it optimize its parsing logic over time.
When developers try to build "agents" on top of this stateless foundation, they usually resort to an illusion of continuity. They stitch together a chat history array and send the entire history back to the API on every single turn.
This approach has three massive flaws:
- Context Bloat: As the conversation grows, your token usage skyrockets exponentially.
- Memory Horizon Limits: Once the conversation exceeds the model's context window, the agent "forgets" the earliest parts of the interaction.
- Zero Knowledge Accumulation: The agent cannot carry lessons learned in Session A over to Session B. If it figures out a complex bash command to fix a Docker bug today, it will have to re-discover that solution from scratch next week.
A stateful agent breaks this paradigm entirely. It is not just a wrapper around an LLM; it is an evolving entity. It mirrors the workflow of a skilled artisan—like a master carpenter. The carpenter remembers the tools they used yesterday, the specific quirks of the wood they are carving, the preferences of their client, and the hard-won lessons from a project they completed last month. They do not start their education from scratch every morning.
The Triad of Persistent State: Soul, Memory, and Skills
In the Hermes Agent architecture, statefulness is not treated as a single monolithic database. Instead, it is partitioned into a carefully structured triad that mirrors how human professionals organize their own knowledge.
┌────────────────────────────────────────┐
│ SOUL │
│ (Core Identity, Style, Principles) │
└───────────────────┬────────────────────┘
│
┌───────────────────┴────────────────────┐
│ MEMORY │
│ (Episodic Facts, User Preferences) │
└───────────────────┬────────────────────┘
│
┌───────────────────┴────────────────────┐
│ SKILLS │
│ (Procedural Knowledge, Toolkits) │
└────────────────────────────────────────┘
Let’s break down each component of this stateful triad.
1. The Soul (SOUL.md)
This is the agent's core identity and "constitution." It defines who the agent is, its communication style, its behavioral boundaries, and its operational principles. It is not a dynamic log of facts, but a foundational document.
In the codebase, a helper function reads this markdown file and injects it directly into the system prompt. It ensures that whether the agent is writing code or debugging a server, its fundamental persona and safety guardrails remain perfectly consistent.
2. Memory (MEMORY.md and USER.md)
This is the agent's episodic and semantic memory store. Instead of keeping a raw, unorganized transcript of every chat, the agent maintains a curated, structured knowledge base of facts about the user and past interactions.
-
USER.mdtracks durable information about the user (e.g., name, programming language preferences, operating system, working hours). -
MEMORY.mdtracks dynamic, episodic facts learned during tasks (e.g., "The local staging database is hosted on port 5433, not 5432").
This layer is managed by a semantic MemoryStore class. The agent can read from this store to build context and write to it dynamically using custom tools.
3. Skills (~/.hermes/skills/)
If memory is "knowing what," skills are "knowing how." This is the agent's procedural memory.
A skill in Hermes is a reusable, packaged directory containing:
-
SKILL.md: A markdown file describing what the skill does, when to use it, and its input parameters. -
scripts/: Executable scripts (Python, Bash, etc.) that perform the task. -
templates/: Reusable code or text templates.
Instead of writing complex code on the fly every time, the agent can write a script once, save it to its skills directory, and call it as a custom tool in future sessions. It builds its own personalized toolbox.
The Closed Learning Loop: How the Agent Self-Improves
A stateful agent must be able to learn without constant human intervention. The Hermes Agent achieves this through a Closed Learning Loop executed entirely in the background.
This loop consists of two primary engines: Background Review and the Skill Curator.
┌───────────────────────────────────────────────────────┐
│ User Interaction │
└──────────────────────────┬────────────────────────────┘
│ Turn Completes
▼
┌───────────────────────────────────────────────────────┐
│ Background Review Thread │
│ (Spawns quiet, forked agent to analyze conversation) │
└──────────────┬─────────────────────────┬──────────────┘
│ │
▼ Extract Facts ▼ Extract Procedures
┌──────────────────────────┐ ┌───────────────────────┐
│ Memory Store │ │ Skills Engine │
│ (Updates MEMORY.md) │ │ (Creates SKILL.md) │
└──────────────────────────┘ └────────┬──────────────┘
│
▼ Runs asynchronously
┌───────────────────────┐
│ Skill Curator │
│ (Archives stale files)│
└───────────────────────┘
The Background Review (Self-Reflection)
When a conversation turn completes successfully, the agent doesn't just sit idle waiting for your next message. It increments internal counters: _turns_since_memory and _iters_since_skill.
Once these counters hit a configured threshold (e.g., every 5 to 10 iterations), the agent initiates a self-reflection phase:
- Forking the Agent: The system spawns a background thread that instantiates a forked copy of the current agent. This copy is set to
quiet_mode=True, meaning it operates in complete silence without cluttering the user's console. - The Reflection Prompt: The forked agent is fed a specialized prompt (e.g.,
_COMBINED_REVIEW_PROMPT) along with the recent conversation history. It is asked to analyze the transcript and answer two questions:- Did the user share any new preferences or facts that should be saved to long-term memory?
- Did we execute a complex, successful multi-step procedure that should be codified into a reusable skill?
- Autonomous Tool Execution: The silent background agent runs its own mini-reasoning loop. If it identifies new facts, it calls the
memorytool to updateUSER.mdorMEMORY.md. If it identifies a new procedure, it calls theskill_managetool to write a newSKILL.mdto disk. - Reporting Back: Once the background thread finishes, the parent agent prints a clean, non-intrusive summary of what it learned (e.g.,
[System Info: Memory updated - User prefers PyTest over Unittest]).
The Skill Curator
An agent that constantly learns skills will eventually suffer from "tool bloat." If its toolbox has 500 highly specific scripts, the system prompt will become overwhelmed, and the LLM will experience severe context distraction.
To prevent this, a background daemon called the Skill Curator (agent/curator.py) runs periodically.
- It tracks skill usage via a metadata file (
.usage.json). - If a skill hasn't been used for a configurable number of days, the Curator automatically moves it to an
.archive/directory. - Archived skills are removed from the active system prompt but can be restored instantly if the agent needs them again.
- Users can "pin" critical skills to exempt them from archiving.
Building a Stateful Agent from Scratch
Let's put these architectural patterns into practice. Below is a complete, production-grade Python script demonstrating how to initialize and run a stateful AI agent using SQLite-backed session storage and markdown-based long-term memory.
Prerequisites
To run this code, make sure you have the necessary environment variables set up for your LLM provider (we'll use OpenRouter pointing to Claude 3.5 Sonnet in this example):
export OPENROUTER_API_KEY="your-api-key-here"
The Implementation
#!/usr/bin/env python3
"""
stateful_agent_demo.py
A complete, runnable example of a stateful AI agent.
This script demonstrates persistent memory, cross-session database logging,
and semantic recall across separate agent executions.
"""
import os
import uuid
import logging
from datetime import datetime
from pathlib import Path
# --- Core Stateful Agent Architecture Imports ---
# AIAgent: The central orchestrator managing the reasoning loop and tool execution.
from run_agent import AIAgent
# SessionDB: SQLite-backed persistent store for conversation history with FTS5.
from hermes_state import SessionDB
# MemoryStore: Semantic memory engine managing local markdown databases.
from tools.memory_tool import MemoryStore
# Constants: Helper to get standard home directories.
from hermes_constants import get_hermes_home
# Configure clean logging to observe the agent's internal state transitions
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger("StatefulDemo")
# =====================================================================
# Step 1: Establish the State Directories
# =====================================================================
HERMES_HOME = get_hermes_home()
HERMES_HOME.mkdir(parents=True, exist_ok=True)
# Define paths for our SQLite session database and memory files
SESSION_DB_PATH = HERMES_HOME / "sessions" / "stateful_demo.db"
SESSION_DB_PATH.parent.mkdir(parents=True, exist_ok=True)
logger.info(f"Initializing stateful storage at: {HERMES_HOME}")
# =====================================================================
# Step 2: Initialize the SQLite Session Database
# =====================================================================
# SessionDB automatically provisions tables for sessions, messages,
# and full-text search indexes (FTS5) to enable rapid cross-session recall.
session_db = SessionDB(db_path=str(SESSION_DB_PATH))
# =====================================================================
# Step 3: Initialize the Long-Term Memory Store
# =====================================================================
# MemoryStore reads and writes structured facts to memory.md and user.md.
# We set strict character limits to prevent context bloat.
memory_store = MemoryStore(
memory_char_limit=2000,
user_char_limit=1000
)
# Load any existing facts from prior runs
memory_store.load_from_disk()
# =====================================================================
# Step 4: Configure and Run Session 1 (Learning the User)
# =====================================================================
# We generate a unique session ID for our first conversation.
session_id_1 = f"session_{datetime.now().strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:4]}"
logger.info(f"Starting Session 1 ID: {session_id_1}")
# Instantiate the stateful agent
agent_1 = AIAgent(
base_url=os.getenv("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1"),
api_key=os.getenv("OPENROUTER_API_KEY"),
provider="openrouter",
model="anthropic/claude-3.5-sonnet",
max_iterations=30,
session_id=session_id_1,
session_db=session_db,
skip_memory=False,
platform="cli",
)
# Inject our persistent memory store into the agent instance
agent_1._memory_store = memory_store
agent_1._memory_enabled = True
agent_1._user_profile_enabled = True
agent_1._memory_nudge_interval = 1 # Force memory review immediately for this demo
print("\n" + "="*70)
print(" SESSION 1: TEACHING THE AGENT PREFERENCES")
print("="*70)
user_msg_1 = "Hello! My name is Dr. Aris Thorne. I am a bioinformatician, and I prefer code snippets written strictly in Rust."
print(f"\n[User]: {user_msg_1}")
# Execute the conversation loop
result_1 = agent_1.run_conversation(
user_message=user_msg_1,
task_id="task_001"
)
print(f"\n[Agent]: {result_1['final_response']}")
print(f"\n[System]: API calls executed: {result_1['api_calls']}")
# Flush the in-memory changes to disk (persisting user.md and memory.md)
if agent_1._memory_store:
agent_1._memory_store.save_to_disk()
# Explicitly release client connections
agent_1.release_clients()
# =====================================================================
# Step 5: Configure and Run Session 2 (Testing Memory Recall)
# =====================================================================
# To simulate a real-world scenario where the application was closed,
# restarted, or run on a different day, we instantiate a completely
# new agent instance with a fresh session ID.
session_id_2 = f"session_{datetime.now().strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:4]}"
logger.info(f"Starting Session 2 ID: {session_id_2}")
# Reload the database and memory files from disk
session_db_reload = SessionDB(db_path=str(SESSION_DB_PATH))
memory_store_reload = MemoryStore(memory_char_limit=2000, user_char_limit=1000)
memory_store_reload.load_from_disk()
agent_2 = AIAgent(
base_url=os.getenv("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1"),
api_key=os.getenv("OPENROUTER_API_KEY"),
provider="openrouter",
model="anthropic/claude-3.5-sonnet",
max_iterations=30,
session_id=session_id_2,
session_db=session_db_reload,
skip_memory=False,
platform="cli",
)
agent_2._memory_store = memory_store_reload
agent_2._memory_enabled = True
agent_2._user_profile_enabled = True
print("\n" + "="*70)
print(" SESSION 2: VERIFYING KNOWLEDGE RETRIEVAL")
print("="*70)
# We ask a highly ambiguous question that requires previous context to answer correctly.
user_msg_2 = "Can you write a quick function to parse a DNA fasta header?"
print(f"\n[User]: {user_msg_2}")
result_2 = agent_2.run_conversation(
user_message=user_msg_2,
task_id="task_002"
)
print(f"\n[Agent]: {result_2['final_response']}")
if agent_2._memory_store:
agent_2._memory_store.save_to_disk()
agent_2.release_clients()
# =====================================================================
# Step 6: Cross-Session Full-Text Search (FTS5) Demonstration
# =====================================================================
print("\n" + "="*70)
print(" SESSION DATABASE: CROSS-SESSION SEARCH")
print("="*70)
# Search the SQLite database for any reference to "Thorne"
search_query = "Thorne"
search_results = session_db_reload.search_sessions(search_query, limit=5)
print(f"\nSearching database for '{search_query}'...")
print(f"Found {len(search_results)} relevant records:")
for idx, record in enumerate(search_results):
print(f"\n [{idx + 1}] Session: {record.get('session_id', 'Unknown')}")
print(f" Snippet match: ...{record.get('snippet', '')}...")
print("\n" + "="*70)
print(" DEMO COMPLETE: Stateful execution verified.")
print("="*70)
Deep Dive: The Stateful Agent Loop in Practice
How does the agent coordinate all of this state behind the scenes? The magic happens inside the run_conversation() method within run_agent.py. Let’s trace the exact lifecycle of a single turn.
┌───────────────────────────────────────────────────────────────────────────┐
│ 1. Context Assembly │
│ Reads Soul, Memory, Active Skills, and Platform hints to build system │
│ prompt. Caches it to maximize LLM prefix-cache hits. │
└─────────────────────────────────────┬─────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────────┐
│ 2. Preflight Check & Compression │
│ Measures token count. If history exceeds threshold, triggers proactive │
│ context compression before making API calls. │
└─────────────────────────────────────┬─────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────────┐
│ 3. Tool-Calling Loop (Reasoning) │
│ - Calls LLM with stateful prompt. │
│ - Validates and executes tools (e.g., File I/O, Sandbox Execution). │
│ - Monitors guardrails to block infinite loops. │
│ - Checks for mid-turn user steering commands (/steer). │
└─────────────────────────────────────┬─────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────────┐
│ 4. Post-Turn Learning │
│ Spawns background reflection thread to extract memories and skills. │
└─────────────────────────────────────┬─────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────────┐
│ 5. Session Persistence │
│ Writes the entire turn (system, user, tool, assistant messages) to │
│ SQLite DB and local JSON logs. Guaranteed write on crash/interrupt. │
└───────────────────────────────────────────────────────────────────────────┘
1. Context Assembly
When you call run_conversation(), the agent doesn't just construct a simple system message. The _build_system_prompt() method compiles a highly structured, multi-layered environment:
- The Soul: Injected at the top to set the core persona.
- Persistent Memory: The contents of
MEMORY.mdandUSER.mdare dynamically formatted and injected. - Skills Guidance: A dynamic list of currently active skills and their execution templates.
- Context Files: Local environment files like
.cursorrulesorAGENTS.mdare appended.
To keep this process highly performant, the system prompt is compiled and cached (_cached_system_prompt). It is only rebuilt when context compression is triggered, maximizing prefix cache hits on modern LLM APIs (like Anthropic and DeepSeek) and reducing latency by up to 80%.
2. Pre-Turn Context Management
Before sending the payload to the API, the agent checks if the conversation history is approaching the model's limits. If it exceeds the compression threshold, the agent proactively condenses the oldest history into a structured summary. This prevents unexpected context-length failures on the first turn of a resumed session.
3. The Tool-Calling Loop
The agent enters a reasoning loop. It makes an API call, parses the requested tool calls, validates their JSON arguments, executes them, and appends the results back to the message history.
During this loop, two unique stateful safety features are active:
- Tool Guardrails: A controller tracks repeated, non-progressing tool calls (e.g., repeatedly running
lsbecause it can't find a file). If a loop is detected, the guardrail halts execution to prevent runaway API bills. - Steering Injection: The loop checks for
/steerinputs, allowing users to inject guidance mid-turn without interrupting the underlying execution thread.
4. Session Persistence
Finally, the agent persists the entire session. Whether the run succeeded, failed, or was manually aborted via Ctrl+C, the _persist_session() method is guaranteed to run. It commits the exact state to both a local JSON log and the SQLite SessionDB.
Resource Safeguards: The Iteration Budget
Statefulness introduces a major engineering challenge: resource management.
When an agent has the power to call tools, write scripts, read files, and trigger background self-reflection loops, it can easily get caught in an infinite loop. A single unhandled exception in a tool could cause the agent to call the API hundreds of times, burning through thousands of dollars in tokens in minutes.
To solve this, Hermes utilizes a thread-safe IterationBudget class.
class IterationBudget:
def __init__(self, limit: int):
self._remaining = limit
self._lock = threading.Lock()
def consume(self, amount: int = 1) -> bool:
with self._lock:
if self._remaining >= amount:
self._remaining -= amount
return True
return False
def refund(self, amount: int = 1):
with self._lock:
self._remaining += amount
The IterationBudget acts as the agent's fuel gauge.
- Every API call and tool execution consumes a portion of the budget.
- The budget is thread-safe and shared between the parent agent and any spawned background reflection agents. This prevents a background thread from spinning out of control.
- The Refund Mechanism: If the agent executes a highly efficient, cheap programmatic tool (like reading a local file or checking a system variable), the iteration is refunded. If it executes a heavy, slow, or expensive tool (like running a web browser sandbox or calling a sub-agent), the budget is fully consumed.
This programmatic budgeting ensures that statefulness does not come at the expense of financial and computational safety.
Conclusion: The Shift from Tools to Partners
The transition from stateless to stateful AI is more than an engineering upgrade; it is a fundamental shift in how humans interact with software.
A stateless agent is a utility tool. It is a hammer—reliable, but entirely dependent on you picking it up, positioning it, and swinging it correctly every single time.
A stateful agent is a partner. It learns your codebase, remembers your architectural preferences, builds its own library of custom tools, and refines its performance silently while you sleep. By implementing the triad of Soul, Memory, and Skills, and orchestrating them within a closed learning loop, we can build systems that don't just process text—they accumulate wisdom.
The future of software belongs to systems that grow with us. And the foundation of that growth is statefulness.
Let's Discuss
- The Tool Bloat Dilemma: As an agent creates more custom skills, how do you think we should handle semantic search over skills? Should the agent use vector embeddings to dynamically load only the top 3 relevant skills into its prompt, or is the Curator's active/archive model sufficient?
-
The Ethics of Agent Identity: If an agent's "Soul" (
SOUL.md) and "Memory" (MEMORY.md) are continuously modified by background threads, at what point does the agent's behavior drift too far from its original design? How would you implement "identity guardrails" to prevent an agent from editing its core safety principles?
Leave your thoughts and engineering approaches in the comments below!
The concepts and code demonstrated here are drawn directly from the comprehensive roadmap laid out in the ebook Hermes Agent, The Self-Evolving AI Workforce: details link, you can find also my programming ebooks with AI here: Programming & AI eBooks.





















