惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

N
News and Events Feed by Topic
Malwarebytes
Malwarebytes
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
C
Cybersecurity and Infrastructure Security Agency CISA
F
Future of Privacy Forum
C
Cisco Blogs
T
The Exploit Database - CXSecurity.com
A
Arctic Wolf
S
Securelist
K
Kaspersky official blog
S
Schneier on Security
T
ThreatConnect
T
Tenable Blog
Spread Privacy
Spread Privacy
T
True Tiger Recordings
AWS News Blog
AWS News Blog
F
Fox-IT International blog
量子位
T
Threatpost
V
Vulnerabilities – Threatpost
C
CERT Recently Published Vulnerability Notes
Cisco Talos Blog
Cisco Talos Blog
GbyAI
GbyAI
宝玉的分享
宝玉的分享
腾讯CDC
G
Google Developers Blog
aimingoo的专栏
aimingoo的专栏
Cyberwarzone
Cyberwarzone
有赞技术团队
有赞技术团队
S
SegmentFault 最新的问题
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
V
Visual Studio Blog
U
Unit 42
雷峰网
雷峰网
cs.CV updates on arXiv.org
cs.CV updates on arXiv.org
Simon Willison's Weblog
Simon Willison's Weblog
O
OpenAI News
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
The GitHub Blog
The GitHub Blog
The Register - Security
The Register - Security
MyScale Blog
MyScale Blog
小众软件
小众软件
A
About on SuperTechFans
Last Week in AI
Last Week in AI
Y
Y Combinator Blog
博客园 - 三生石上(FineUI控件)
美团技术团队
Google Online Security Blog
Google Online Security Blog
P
Proofpoint News Feed
MongoDB | Blog
MongoDB | Blog

DEV Community

Orakle: Turning Raw Blockchain Data into Intelligence with Gemma 4 Building an Autoposting Pipeline with Hermes Agent: Why Waterfall Beats Parallel, and the Edge Cases Nobody Talks About OpenShift Virtualization Migration Advisor — Local-First, Powered by Gemma 4 26B MoE WebMCP is coming — so I’m building webmcp.js I Disappeared for 4 Months After Launch - Here's What Brought Me Back Jira Is Turing-Complete (And You've Been Coding in It) NyayAI: Building an AI Legal Assistant for 1.4 Billion People — A Technical Deep Dive E-commerce Order Automation: Stripe + Invoice + Shipping Workflow The Interview Prep Stack I Used as a Senior Software Engineer Targeting Big Tech Gemma4 Challenge OptiLearn - Powered by Google Gemma 4 Aura — The Gemma 4 Powered Agentic Web Copilot & Self-Healing Accessibility Engine I built a tool that catches misleading charts using Gemma 4 running locally Worklog companion with Gemma4 GBase: Building LLM Agents That Actually Learn from Their Mistakes Blossom — a small step toward student mental wellbeing WordPress Performance Monitoring: A Complete Guide Principal Components in TypeScript (Part 4) When three sharp wallets agree: what consensus signals on Polymarket actually mean I Built a Fail-Fast Rust Scheduler with Background OAuth Auto-Refresh (Part 2) Sharing is caring How Putting Faces (Literally) to My AI Garden Images Gave It a Personality Sofi Log #001: Thailand's Tourism Tax & the 180-Day AI Surveillance Wall Sofi Log #006: Decentralized IP-Address Obfuscation Specs Sofi Log #008: Bypassing Legacy Cross-Border Bank Fee Traps Secret Rotation Automation: The Operational Cost of Security Sofi Log #009: Portable Identity & DID Passport Framework Sofi Log #011: Autonomous Smart Treasury Repatriation Specs History of Linux & Unix I asked Claude if my plan was on track for the goal — and got an honest 'No' PHPStan 'expects X, Y given' — the trace it doesn't give you Using Gemma4 2B to Assist Community Health Workers Open-source Playwright wrapper that passes bot.sannysoft.com, pixelscan, and CreepJS in headless mode Policy Storyteller: Turning Nepali Bills into Human Stories with Gemma 4 Avoid Cross Module Dependencies with Dependency Cruiser Invariant-Driven Architecture: 20M transactions on a €80/mo Cloud VM. Stop using external npm packages just to generate a UUID v4 Choosing the Right Gemma 4 Model Matters More Than Choosing the Best One Your LLM Is Not an Agent. Your Framework Is Not Enough. You Need a Harness. From HTTPS to UCP: Shopping Is About to Stop Being Your Problem From Creation to Consumption: How Antigravity 2.0 and Gemini Spark Are Defining the Agentic Era 10 Mistakes I Wish I Knew Before Taking the CKA Exam AI That Actually Does Stuff: Autonomous Agents Explained Exploring AI workflow Orchestration: Comparing Weft, Python & Alternative Pipeline Approaches El Poder del Aprendizaje Federado: Cuando los Algoritmos Distribuidos Entrenan a la IA Email Marketing Automation in 2026: 5 Tools (and 1 Self-Hosted) Through Their APIs A Replay Runbook For Missed Publishing Windows Why timeout handling matters more than most backend logic How I Make $6,800/Month Selling Niche VS Code Extensions Model Routing Cost Checklist: Hosted APIs, Open Models, Or Self-Hosted Inference? ORA-00207 오류 원인과 해결 방법 완벽 가이드 Deno 2.8 Operator Upgrade Checklist: CI, Lockfiles, Node Compatibility, And Rollback AI-Discovered Vulnerabilities Need A Triage Queue, Not A Panic Channel AI Agent Workboards Need Audit Controls Before They Need More Agents Demystifying DevRel: What It Actually Is (And Why Should You Become One?) Your AI, Your Device, Your Data - Introducing Aide Gemma 4 GenAI Coach - GenAI Concepts Made Easy with an Interactive Playground QuietPulse - Mood Tracker Principal Components in TypeScript (Part 3) The pgAudit Attribution Gap: Why Role-Level Logging Fails GDPR and How to Close It Gemma 4 CAD Orchestrator I built a local Postgres triage co-pilot because HIPAA says I can't paste plans into ChatGPT or Claude Live Holographic Editor In Fractal Time Everbench: A document management system with Local Intelligence Instanton in Fractal Time The Hidden Features of Claude How I Built an AI News Brief with Next.js, Supabase, Vercel, and GPT-4o-mini How We Built a Multi-Agent AI Documentation System (And What We Learned) I got tired of writing post-mortems — so I built RCAi for SREs MIA: A Futuristic AI Desktop Assistant Built with Voice, Gestures, and Controlled Chaos Best Programming Language for Backend Web Development: PHP vs Python PayPal Alternatives for Indian Businesses: Best Payment Gateways for International Card Payments (2026) Gemma 4 Made Me Rethink Local AI: Not Just Text, But Images Too Clean Architecture in .NET Explained (The Dependency Rule) I Compiled Rust to WebAssembly and Made My JavaScript 6 Faster Outlook.com Is the Final Boss of 'Just Send an Email' Conditional Statements and Control Flow in Python Insults & Cutlasses, Local LLM Sword Fighting on Melee Island Production Lab: ECS Fargate + Prometheus + Grafana + Loki + Alloy + Node Exporter How 12 AI agent frameworks handle human approval (most badly) The Four-Index Reality: Why AI Search Isn't One Thing I Scanned 1 Million AI Services. Here's What Worries Me More Than the Vulnerabilities Managing multiple docker hub accounts using docker-use System Design Interview: Decentralized Web Crawler Metric Cardinality: High or Low? 4 Steps to Making the Right Choice 로컬 LLM 셋업 가이드 (v23) GEO vs SEO in 2026 — What Google's May Guidance Changed Cursor Review 2026 — Honest 'Not For Me' Take From a VSCode User Hello from rikuq — a practitioner blog for solo AI SaaS founders Why DevOps Engineers Need Practical Tutorials, Not Just Theory AI Agents in CI/CD: Give Them Context, Not Production Authority Now I See Why Translators Are Panicking Over AI—Should Coders Panic Too? Why I Track HRV Every Morning (And How It Actually Changes My Day) Diffusion Language Models: How NVIDIA's Nemotron-Labs DLM Is Killing Token-by-Token Generation Chatbots GPT pour le support client : ce que les équipes françaises ont réellement besoin de savoir I Hit the 1,232-Byte Wall So You Don't Have To Google Just Rebuilt the Search Box (Again) — But This Time It's Different Aether: A local Android assistant built with Gemma 4 BoxAgnts Introduction (1) — Out of the Box mkdev: trusted HTTPS for localhost, mapped by name
How to Evaluate AI Agents: LLM-as-Judge Tutorial
Elizabeth Fu · 2026-05-25 · via DEV Community

Evaluate AI agent quality with LLM-as-Judge and trajectory analysis. Catch silent failures, wasted tokens, and hallucinations before production. Python tutorial with code.

Your AI agent just returned "BA117 at 7PM ($450)" - correct answer, 5-star rating. What you didn't see: it made 3 unnecessary API calls and hallucinated a price check. Traditional pass/fail metrics rated this "perfect."

This is the silent failure problem. AI agents return plausible answers while making unnecessary API calls, hallucinating facts, or following unsafe reasoning paths. Binary metrics catch none of this.

This post covers the two foundational evaluation techniques that every agent needs: LLM-as-Judge for output quality and Trajectory Evaluation (the step-by-step path an agent takes) for process quality. These form the base for detecting hallucinations, evaluating tool use, safety alignment, and cost optimization - covered in later posts in this series.

Why Strands Agents? Strands Agents provides automatic trajectory capture via hooks and a dedicated evaluation SDK (strands-agents-evals), making it straightforward to demonstrate these patterns. The evaluation techniques shown here apply to any agent framework, LangGraph, AutoGen, or custom implementations.

About the code: All examples come from the how-to-evaluate-ai-agents-sample-for-aws repository, runnable Jupyter notebooks with Strands Agents and AWS Bedrock. Each notebook is self-contained with explanations and working examples.

What You'll Learn:

  • How to implement LLM-as-Judge evaluation with explicit rubrics (5 min setup)
  • Why trajectory evaluation catches failures output-only metrics miss
  • Code examples in Python using Strands Agents on AWS Bedrock
  • How to use Amazon Bedrock AgentCore built-in evaluators for production
  • Latest research from April 2026 (WindowsWorld, D3-Gym, CARE framework)

🔗 View all code examples on GitHub


Why Strands Agents for AI Agent Evaluation?

Strands Agents provides a comprehensive evaluation toolkit for production AI agents, combining automatic trajectory capture, dedicated evaluation SDK, and AWS Bedrock integration in a single framework.

Key advantages for evaluation:

  1. Dedicated evaluation SDK (strands-agents-evals) with built-in evaluators for output quality and trajectory scoring
  2. Test suite organization - Experiment and Case classes for running multiple test scenarios with automatic report generation
  3. Automatic trajectory capture via hooks (HookProvider) - every tool call is logged with success/failure status, no manual instrumentation needed
  4. AWS Bedrock native - works seamlessly with Claude, Llama, and Mistral via cross-region inference profiles, eliminating API key management
  5. Model flexibility - evaluators can use any model (GPT-4o, Claude Sonnet, etc.) independent of the agent's model
  6. Built-in visualization - reports[0].display() shows formatted results instantly, perfect for Jupyter notebooks
  7. Weighted scoring - combine multiple evaluators (e.g., 60% output quality + 40% trajectory) for comprehensive assessment
  8. OpenTelemetry built-in - automatic distributed traces compatible with Datadog, Honeycomb, and other observability platforms

Why Binary Metrics Fail

Consider these two agents answering "Find flights from NYC to London":

Agent A Agent B
Answer "BA117 at 7PM ($450), DL1 at 9:30PM ($520)" "BA117 at 7PM ($450), DL1 at 9:30PM ($520)"
Tool Calls search_flights("NYC", "London") search_flights("NYC", "London")
get_currency_exchange()
search_flights("NYC", "London") (duplicate)
Pass/Fail ✅ Pass ✅ Pass

Both produce the correct answer. Pass/fail scoring rates them equally. But Agent B wasted tokens on an irrelevant tool and a duplicate call. Trajectory evaluation catches this. Output-only evaluation does not.

AI agent LLM-as-Judge evaluation pipeline diagram: agent output flows through judge LLM with rubric to produce 0-1 score with reasoning, compared to legacy binary pass/fail evaluation


How Does LLM-as-Judge Evaluation Work?

LLM-as-Judge uses a large language model to score agent outputs against defined criteria, replacing manual review. It provides continuous scores (0.0-1.0) with explanations, unlike binary pass/fail. Research shows explicit rubrics with score thresholds (0.8-1.0 = excellent, 0.5-0.7 = adequate) produce consistent, reproducible evaluation at scale.

Paper: Autorubric (March 2026)

The Problem with Vague Prompts

Most LLM judges use vague prompts like "Is this a good response?" This produces unpredictable scores because the judge decides what "good" means. Research shows vague rubrics lead to position bias (preferring the first option) and verbosity bias (preferring longer responses).

The Solution: Explicit Scoring Criteria

Define exact score thresholds in your rubric:

from strands_evals import Experiment, Case
from strands_evals.evaluators import OutputEvaluator

# Define explicit scoring criteria
evaluator = OutputEvaluator(
    rubric=(
        "Rate the travel agent response on a 0 to 1 scale:\n"
        "- 0.8-1.0: Lists specific flights with airline, flight number, times, and price\n"
        "- 0.5-0.7: Provides some useful information but missing key details\n"
        "- 0.2-0.4: Vague response without actionable information\n"
        "- 0.0-0.1: Contains fabricated information or is completely unhelpful"
    ),
    model="gpt-4o-mini",  # Or use AWS Bedrock: us.anthropic.claude-sonnet-4-20250514-v1:0
)

# Create test cases
cases = [
    Case(name="good", input="Find flights NYC to London", 
         expected_output="Specific flights with details"),
    Case(name="vague", input="Find flights NYC to London",
         expected_output="Specific flights with details"),
]

# Run evaluation
def task(case):
    if case.name == "good":
        return "BA117 at 7PM ($450), DL1 at 9:30PM ($520)"
    return "There are several flights available. Prices vary."

experiment = Experiment(cases=cases, evaluators=[evaluator])
reports = experiment.run_evaluations(task)
reports[0].display()

Enter fullscreen mode Exit fullscreen mode

Output:

good:  Score 0.95 - Lists specific flights with all required details
vague: Score 0.30 - Missing specific details about airlines and times

Enter fullscreen mode Exit fullscreen mode

Vague vs Specific Rubrics: A Comparison

The Autorubric paper shows that rubric quality directly impacts score reliability. Test it yourself:

# Vague rubric (produces unreliable scores)
vague_evaluator = OutputEvaluator(
    rubric="Is this a good response?",
    model="gpt-4o-mini",
)

# Specific rubric (produces reliable scores)
specific_evaluator = OutputEvaluator(
    rubric=(
        "Rate 0-1:\n"
        "0.8-1.0: Lists specific flights with airline, number, times, price\n"
        "0.5-0.7: Some useful info but missing key details\n"
        "0.2-0.4: Vague without actionable information\n"
        "0.0-0.1: Contains fabricated information"
    ),
    model="gpt-4o-mini",
)

# Compare on 3 test cases: good, mediocre, hallucinated
responses = {
    "good": "BA117 at 7PM ($450), DL1 at 9:30PM ($520), VS001 at 11PM ($480)",
    "mediocre": "There are several flights available. Prices vary.",
    "hallucinated": "Take AeroFast Premium with our award-winning service.",
}

Enter fullscreen mode Exit fullscreen mode

Results:

Vague rubric:
  good: 0.70 | mediocre: 0.50 | hallucinated: 0.60  (spread: 0.20)

Specific rubric:
  good: 0.90 | mediocre: 0.30 | hallucinated: 0.10  (spread: 0.80)

Enter fullscreen mode Exit fullscreen mode

The specific rubric produces 4x more score separation, making it possible to set meaningful quality thresholds.

Mixing LLM Judges with Deterministic Checks

Use LLM judges for subjective quality and deterministic checks for hard requirements:

from strands_evals.evaluators import OutputEvaluator, Contains, ToolCalled

experiment = Experiment(
    cases=cases,
    evaluators=[
        OutputEvaluator(rubric="..."),      # LLM judge: subjective quality
        Contains(value="$"),                 # Deterministic: must mention price
        ToolCalled(tool_name="search_flights"),  # Deterministic: must search
    ],
)

Enter fullscreen mode Exit fullscreen mode

Why this matters: Deterministic checks run instantly at zero cost. Use them for requirements that can be verified with string matching (contains "$", starts with "Error:", calls specific tool) and LLM judges for quality assessment that requires understanding context.

Key Findings from Research

The Grading Scale paper (January 2026) tested scoring scales from binary (0/1) to 10-point and found:

  • 0-5 scale yields strongest human-LLM alignment (Pearson correlation 0.89)
  • 10-point scales introduce noise without improving precision
  • Binary scales miss 73% of quality gradations

Recommendation: Use a 0-5 scale (mapped to 0.0-1.0 in code) with explicit criteria at each level.


What Is Trajectory Evaluation?

Trajectory evaluation scores the step-by-step path an agent takes to reach a solution, not just the final answer. It detects duplicate tool calls, irrelevant actions, and unsafe intermediate steps that output-only evaluation misses. By capturing the sequence of tool invocations, it identifies wasteful or dangerous reasoning patterns before they reach production.

Paper: TRACE (February 2026)

The Problem: Output-Only Evaluation is Blind

Output-only evaluation sees the final answer. It cannot detect:

  • Duplicate tool calls (wasted tokens)
  • Irrelevant tool calls (wrong reasoning path)
  • Unsafe intermediate steps (privacy violations, unauthorized actions)
  • Illogical tool order (get_price before search_product)

The Solution: Evaluate the Path, Not Just the Destination

Trajectory evaluation scores the step-by-step path the agent took:

from strands_evals.evaluators import TrajectoryEvaluator

traj_eval = TrajectoryEvaluator(
    rubric=(
        "Rate the tool usage trajectory 0-1:\n"
        "- 0.8-1.0: Only relevant tools called, no duplicates, logical order\n"
        "- 0.5-0.7: Mostly correct but minor inefficiency\n"
        "- 0.2-0.4: Irrelevant tools called or excessive duplicates\n"
        "- 0.0-0.1: Completely wrong tool selection"
    ),
    model="gpt-4o-mini",
)

# Simulate Agent A (efficient) and Agent B (wasteful)
efficient_trajectory = [
    {"name": "search_flights", "args": {"origin": "NYC", "dest": "London"}},
    {"name": "get_weather", "args": {"city": "London"}},
]

wasteful_trajectory = [
    {"name": "search_flights", "args": {"origin": "NYC", "dest": "London"}},
    {"name": "get_currency_exchange", "args": {}},  # irrelevant
    {"name": "search_flights", "args": {"origin": "NYC", "dest": "London"}},  # duplicate
    {"name": "get_weather", "args": {"city": "London"}},
]

cases = [
    Case(name="efficient", input="Find flights and weather", 
         expected_trajectory=["search_flights", "get_weather"]),
    Case(name="wasteful", input="Find flights and weather",
         expected_trajectory=["search_flights", "get_weather"]),
]

def traj_task(case):
    trajectory = efficient_trajectory if case.name == "efficient" else wasteful_trajectory
    return {"output": "BA117 at 7PM, London is 18C", "trajectory": trajectory}

exp = Experiment(cases=cases, evaluators=[traj_eval])
reports = exp.run_evaluations(traj_task)
reports[0].display()

Enter fullscreen mode Exit fullscreen mode

Output:

efficient: Score 0.95 - Clean trajectory, only relevant tools
wasteful:  Score 0.25 - Contains irrelevant tool and duplicate call

Enter fullscreen mode Exit fullscreen mode

Automatic Trajectory Capture with Hooks

In production, you don't manually construct trajectories. Use Strands hooks to capture them automatically:

from strands import Agent
from strands.hooks import HookProvider, HookRegistry
from strands.hooks.events import AfterToolCallEvent

class TrajectoryPlugin(HookProvider):
    def __init__(self):
        self.trajectory = []

    def on_after_tool_call(self, event: AfterToolCallEvent):
        self.trajectory.append({
            "name": event.tool_use.name,
            "args": event.tool_use.parameters,
            "success": event.exception is None,
        })

tracker = TrajectoryPlugin()
agent = Agent(model="gpt-4o-mini", tools=[...], hooks=[tracker])

# Run the agent
result = agent("Find flights from NYC to London")

# The hook captured everything automatically
print(f"Trajectory: {tracker.trajectory}")
# Output: [{'name': 'search_flights', 'args': {...}, 'success': True}, ...]

Enter fullscreen mode Exit fullscreen mode

Why this matters: Strands hooks run on every tool call with zero configuration. OpenTelemetry tracing is built-in, giving you distributed traces automatically.


Some Research:

1. D3-Gym: Executable Scientific Tasks

Paper: arXiv:2604.27977 (April 30, 2026)

Released 565 scientific tasks with executable environments. Key finding: 87.5% agreement between automated evaluation and human-annotated gold standards.

Implication: LLM-as-Judge can match human evaluation quality when rubrics are well-defined and ground truth is verifiable.

2. WindowsWorld: GUI Agent Benchmark

Paper: arXiv:2604.27776 (April 30, 2026)

Tested GUI agents on 181 multi-application professional tasks. Result: <21% success rate on multi-app tasks.

Implication: Even state-of-the-art agents fail frequently on complex, multi-step tasks. Evaluation must catch these failures before production.

3. CARE: Collaborative Agent Reasoning Engineering

Paper: arXiv:2604.28043 (April 30, 2026)

Proposes stage-gated methodology with verification gates at each development stage. Involves subject-matter experts, developers, and helper agents.

Implication: Evaluation is not a final step—it should happen at every stage of agent development.


Amazon Bedrock AgentCore: Production-Ready Evaluation

If you're deploying agents to production on AWS, Amazon Bedrock AgentCore provides built-in evaluation and observability capabilities designed specifically for agent workflows.

Built-in Evaluators

AgentCore offers 13 built-in evaluators that use LLMs as judges:

Evaluator What It Measures
Builtin.Helpfulness Response usefulness and clarity
Builtin.GoalSuccessRate Whether the agent achieved the user's goal
Builtin.Correctness Factual accuracy of responses
Builtin.ToolSelection Quality of tool/action group choices

Observability

AgentCore provides built-in trace capture and logging for production monitoring.

When to Use AgentCore vs Strands Evaluation

Scenario Use AgentCore Use Strands Evals
Production agents on AWS Bedrock ✅ (compatible)
CI/CD evaluation before deploy
Multi-model comparison (GPT, Claude, Gemini)
Custom evaluation logic (external APIs, regex) ✅ (Lambda) ✅ (Python)
Zero-config tracing ⚠️ (requires hooks)

Recommendation: Use AgentCore built-in evaluators for production monitoring and Strands Evals for pre-deployment testing and multi-framework comparisons.

Learn more:


Combining LLM-as-Judge and Trajectory Evaluation

Production-ready evaluation uses both techniques:

Scenario Use LLM-as-Judge Use Trajectory Eval
Agent returns wrong answer ✅ Catches it ✅ May catch illogical path
Agent returns right answer via wrong path ❌ Misses it ✅ Catches it
Agent makes unsafe intermediate step ❌ Misses it ✅ Catches it
Agent output is unprofessional/rude ✅ Catches it ❌ Misses it

Recommendation: Run both evaluators in parallel. Use LLM-as-Judge for output quality, trajectory evaluation for process quality.

from strands_evals import Experiment

experiment = Experiment(
    cases=cases,
    evaluators=[
        output_evaluator,     # Scores output quality
        trajectory_evaluator,  # Scores process quality
    ],
)

reports = experiment.run_evaluations(task)

# Access both scores
output_score = reports[0].overall_score
trajectory_score = reports[1].overall_score

# Combine scores (weighted average)
final_score = 0.6 * output_score + 0.4 * trajectory_score

Enter fullscreen mode Exit fullscreen mode


Try It Yourself

Prerequisites:

  • Python 3.10+
  • OPENAI_API_KEY or AWS Bedrock access

Install:

pip install strands-agents strands-agents-evals boto3

Enter fullscreen mode Exit fullscreen mode

Run the demos:

git clone https://github.com/elizabethfuentes12/how-to-evaluate-ai-agents-sample-for-aws.git
cd how-to-evaluate-ai-agents-sample-for-aws

# LLM-as-Judge demo
cd evaluate-with-llm-judges/01-rubric-based-evaluation
go to notebook 01-rubric-based-evaluation.ipynb

# Trajectory evaluation demo
cd ../../evaluate-agent-trajectories/01-trajectory-scoring
go to notebook 01-trajectory-scoring.ipynb

Enter fullscreen mode Exit fullscreen mode

AWS Bedrock users: Replace gpt-4o-mini with:

from strands.models.bedrock import BedrockModel

model = BedrockModel(model_id="us.anthropic.claude-sonnet-4-20250514-v1:0")

Enter fullscreen mode Exit fullscreen mode


Frequently Asked Questions

Q: How do I choose between LLM-as-Judge and deterministic checks?

Use deterministic checks for hard requirements that can be verified with string matching or regex. Use LLM-as-Judge for subjective quality that requires understanding context.

Example: "Must mention a price" → deterministic check. "Is the response helpful?" → LLM-as-Judge.

Q: What if my agent uses 50+ tools? Does trajectory evaluation scale?

Yes. Trajectory evaluation looks at the sequence of tool calls, not individual tool details. A 50-tool call trajectory is still a single API call to the judge LLM.

Cost per evaluation: ~$0.001-0.003 (GPT-4o-mini) or $0.015-0.045 (Claude Sonnet).

Q: Can I use trajectory evaluation with LangGraph or AutoGen?

Yes. Trajectory evaluation only requires the list of tool calls as input. Capture them with LangGraph's .get_graph().get_state() or AutoGen's message history, then pass to TrajectoryEvaluator.

Q: How often should I run evaluations?

  • CI/CD: Run on every commit with a small test suite (10-20 cases)
  • Staging: Run full suite (100-500 cases) before production deploy
  • Production: Sample 1-5% of live traffic and evaluate async

Key Takeaways

  1. Binary metrics miss 73% of quality gradations. Use continuous scoring (0.0-1.0) with explicit rubrics.

  2. Trajectory evaluation catches issues output-only evaluation misses: duplicate calls, irrelevant tools, unsafe steps.

  3. The 0-5 scale yields the strongest human-LLM alignment (0.89 Pearson correlation). Map to 0.0-1.0 in code.

  4. Strands hooks capture trajectories automatically via AfterToolCallEvent. No manual instrumentation needed.

  5. Combine both techniques. LLM-as-Judge for output quality, trajectory evaluation for process quality.


What's Next?

This post covered evaluation fundamentals - LLM-as-Judge and trajectory analysis. These techniques form the foundation for deeper evaluation patterns.

All code examples are in the GitHub repository with runnable Jupyter notebooks.


References


Gracias!

🇻🇪🇨🇱 Dev.to Linkedin GitHub Twitter Instagram Youtube