The production run failed at step 23 out of 47. The logs show "tool call error" but not why. The conversation history that led to step 23 is not in the logs. Reproducing the failure requires setting up the same initial conditions and running all 23 steps again.
agent-debug-replay is a step-through replay for agents that log their runs with agent-step-log. Load a run's JSONL file, advance step by step, inspect state at each step, and find exactly where things went wrong.
The Shape of the Fix
    from agent_debug_replay import DebugReplay

    replay = DebugReplay(log_path="./logs/run-a3f8b2c1.jsonl")

    print(f"Total steps: {replay.total_steps}")
    print(f"Run ID: {replay.run_id}")
    print(f"Result: {replay.result}")

    # Step through
    for step in replay.steps():
        print(f"Step {step.index}: {step.action_type}")
        print(f" Input: {step.input_summary}")
        print(f" Output: {step.output_summary}")
        if step.error:
            print(f" ERROR: {step.error}")
            print(" Full context at this step:")
            print(replay.context_at(step.index))
            break
Load the JSONL. Step through. Find the failure. Fix it.
What It Does NOT Do
agent-debug-replay does not re-execute the steps. It replays the recorded state. If you want to re-run from step 23 with a fix, you need to use agent-resume with a checkpoint up to step 22.
It does not visualize. It is a data access API. Build your own visualization on top (a CLI, a notebook, a web UI) using the step data it provides.
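As one illustration of building on the step data, here is a minimal sketch of a plain-text renderer. It works on raw record dicts using field names from the agent-step-log record format (step, action, tool, error) rather than on the library's Step objects. The function name render_steps is hypothetical and not part of agent-debug-replay, and the "llm_call" action in the sample is made-up data.

```python
from typing import Iterable


def render_steps(records: Iterable[dict]) -> str:
    """Render one line per step; a hypothetical helper, not a library API."""
    lines = []
    for record in records:
        # Mark failed steps so they stand out when scanning the output.
        marker = "!" if record.get("error") else " "
        tool = record.get("tool") or "-"
        line = f"{marker} {record['step']:>3}  {record['action']:<10}  {tool}"
        if record.get("error"):
            line += f"  ERROR: {record['error']}"
        lines.append(line)
    return "\n".join(lines)


# Made-up sample records for illustration only.
sample = [
    {"step": 22, "action": "llm_call", "tool": None, "error": None},
    {"step": 23, "action": "tool_call", "tool": "web_search", "error": "timeout"},
]
print(render_steps(sample))
```

The same loop could feed a notebook table or a web template; only the formatting line would change.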
It does not diff two runs. For comparison between a failing and a passing run of the same task, load both as separate DebugReplay objects and compare step by step yourself.
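A manual comparison along these lines can be fairly short. The sketch below assumes each run is available as a list of record dicts and that two steps count as "the same" when their action, tool, and error fields match. Both the helper name first_divergence and that choice of fields are assumptions for illustration, and "fetch_page" is a made-up tool name.

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_divergence(run_a: Sequence[dict], run_b: Sequence[dict]) -> Optional[int]:
    """Return the list position of the first differing step, or None."""
    keys = ("action", "tool", "error")  # assumed comparison fields
    for position, (a, b) in enumerate(zip_longest(run_a, run_b)):
        # One run ended early: the runs diverge where the shorter one stops.
        if a is None or b is None:
            return position
        if any(a.get(k) != b.get(k) for k in keys):
            return position
    return None


# Made-up sample runs for illustration only.
passing = [
    {"action": "tool_call", "tool": "web_search", "error": None},
    {"action": "tool_call", "tool": "fetch_page", "error": None},
]
failing = [
    {"action": "tool_call", "tool": "web_search", "error": None},
    {"action": "tool_call", "tool": "web_search", "error": "timeout"},
]
print(first_divergence(passing, failing))  # 1
```

The returned value is a 0-based position in the lists, not the step number stored in the log, so map it back to the record before reporting it.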
Inside the Library
Each step in the JSONL is a structured record from agent-step-log:
    {"run_id": "a3f8b2c1", "step": 23, "action": "tool_call",
     "tool": "web_search", "args": {"query": "..."},
     "result": null, "error": "timeout", "ts": 1748107200,
     "messages_count": 46, "tokens_used": 12847}
DebugReplay reads the file and indexes steps by step number. Navigation:
    class DebugReplay:
        def steps(self) -> Iterator[Step]:
            for record in self._records:
                yield Step.from_record(record)

        def step_at(self, n: int) -> Step:
            return self._by_index[n]

        def context_at(self, n: int) -> ReplayContext:
            """Return all state accumulated up to step n."""
            return ReplayContext(
                steps=self._records[:n+1],
                total_tokens=sum(r["tokens_used"] for r in self._records[:n+1]),
                tools_called=[r["tool"] for r in self._records[:n+1] if r.get("tool")],
                errors=[r["error"] for r in self._records[:n+1] if r.get("error")],
            )

        def find_first_error(self) -> Step | None:
            for step in self.steps():
                if step.error:
                    return step
            return None
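The loading step that precedes navigation, reading the JSONL file and indexing records by step number, can be sketched with the standard library alone. This is a minimal illustration that assumes one JSON object per line with the fields shown in the sample record. load_records is a hypothetical helper, not the library's actual loader, and the "ok" result value is made-up sample data.

```python
import json
from typing import Iterable


def load_records(lines: Iterable[str]) -> dict[int, dict]:
    """Parse JSONL lines and index the records by their "step" field."""
    by_step: dict[int, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        record = json.loads(line)
        by_step[record["step"]] = record
    return by_step


# Two sample lines in the shape of the record shown earlier.
raw = [
    '{"run_id": "a3f8b2c1", "step": 22, "action": "tool_call", '
    '"tool": "web_search", "result": "ok", "error": null}',
    '{"run_id": "a3f8b2c1", "step": 23, "action": "tool_call", '
    '"tool": "web_search", "result": null, "error": "timeout"}',
]
records = load_records(raw)
print(records[23]["error"])  # timeout
```

Because the function accepts any iterable of strings, an open file handle can be passed in directly in place of the sample list.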
context_at(n) is the key method for debugging: it aggregates the state accumulated through step n, including total tokens, every tool called, and every error seen. This gives you the full picture at the failure point without scanning the log manually.
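To make the aggregation concrete, here is a small worked example on plain dicts. It mirrors the slicing in the excerpt above and rests on two assumptions that the excerpt implies but does not state: a record's list position equals its 0-based step number, and tokens_used is a per-step count rather than a running total. ContextSketch and context_up_to are stand-in names, not library classes, and the records are made-up data.

```python
from dataclasses import dataclass


@dataclass
class ContextSketch:
    """Stand-in for the library's ReplayContext; field names follow the excerpt."""
    total_tokens: int
    tools_called: list
    errors: list


def context_up_to(records: list, n: int) -> ContextSketch:
    """Aggregate records[0..n] inclusive, mirroring the slicing in the excerpt."""
    window = records[: n + 1]
    return ContextSketch(
        total_tokens=sum(r.get("tokens_used", 0) for r in window),
        tools_called=[r["tool"] for r in window if r.get("tool")],
        errors=[r["error"] for r in window if r.get("error")],
    )


# Made-up sample records for illustration only.
records = [
    {"step": 0, "tool": None, "error": None, "tokens_used": 300},
    {"step": 1, "tool": "web_search", "error": None, "tokens_used": 450},
    {"step": 2, "tool": "web_search", "error": "timeout", "tokens_used": 500},
]
ctx = context_up_to(records, 2)
print(ctx.total_tokens, ctx.tools_called, ctx.errors)
# 1250 ['web_search', 'web_search'] ['timeout']
```

If the log stored cumulative token counts instead, the sum would overcount and the last record's value would be the right total, so it is worth checking which convention a given log follows.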
When to Use It
Use it whenever a production agent run fails and you need to understand why. The workflow:
- Agent runs with agent-step-log recording each step
- Run fails
- Load the JSONL with DebugReplay
- Call find_first_error() to jump to the failure
- Call context_at(error_step.index) to see accumulated state
- Find the root cause
Without step logging, you are working from timestamp-correlated logs and hope. With it, you have a complete, structured record of every agent action.
Skip it for agents that run for under 5 steps. The overhead of step logging is not worth it for short runs.
Install
    pip install git+https://github.com/MukundaKatta/agent-debug-replay
A complete script that jumps to the first error in a run and prints the state around it:
    import sys

    from agent_debug_replay import DebugReplay


    def debug_failed_run(log_path: str) -> None:
        replay = DebugReplay(log_path=log_path)
        first_error = replay.find_first_error()
        if first_error is None:
            print("No errors found in this run.")
            return

        print(f"First error at step {first_error.index}")
        print(f"Action: {first_error.action_type}")
        print(f"Error: {first_error.error}")

        ctx = replay.context_at(first_error.index)
        print("\nContext at failure:")
        print(f" Total tokens used: {ctx.total_tokens}")
        print(f" Tools called: {', '.join(ctx.tools_called)}")
        # Errors before this one; normally empty, since this is the first error.
        print(f" Previous errors: {ctx.errors[:-1]}")

        # Only print the preceding step when one exists.
        if first_error.index > 0:
            prev = replay.step_at(first_error.index - 1)
            print("\nStep before failure:")
            print(f" {prev.action_type}: {prev.output_summary}")


    if __name__ == "__main__":
        if len(sys.argv) != 2:
            sys.exit(f"usage: {sys.argv[0]} <log_path>")
        debug_failed_run(sys.argv[1])
Sibling Libraries
| Library | What it solves |
|---|---|
| agent-step-log | Per-step JSONL logging (produces the log that debug-replay reads) |
| agent-run-id | Run IDs for correlating logs to specific runs |
| agent-resume | Resume from a checkpoint after finding and fixing the failure |
| agenttap | Wire-level capture for even more detailed replay |
| agent-decision-log | WHY-layer logs that complement step-level WHAT logs |
The debugging workflow: agent-step-log records the run, agent-debug-replay lets you navigate it, agent-resume lets you re-run from a checkpoint after the fix.
What's Next
Notebook integration: a Jupyter widget that renders the step-through as an interactive cell. Click forward/backward through steps and see state update in-place.
Diff replay: given two run logs (one passing, one failing), highlight the divergence point, that is, the first step at which the two runs stop matching. This is the most common debugging scenario for regressions.
A web UI: a minimal Flask/FastAPI endpoint that serves a step-through UI in the browser. Load the JSONL, browse steps, click into context. This would be significantly more usable than the programmatic API for non-engineering stakeholders.
Built as part of the agent-stack family: composable Python primitives for production LLM agents.





















