Kareem-Rashed/rubric-eval: Independent framework to test, benchmark, and evaluate LLMs & AI agents locally.

Agent behavior testing for LLM apps.

Test what your agent did — tools called, arguments, trace, latency — not just what it said. Catch regressions in CI before they ship.

The problem

Your agent passed every manual check. Then a prompt tweak shipped, and it quietly stopped calling lookup_order and started answering from memory. The responses still look fine — string-match evals and LLM judges that only see the final output can't catch it.

Rubric tests the behavior: which tools were called, with what arguments, in what order, whether forbidden tools were avoided, how clean the reasoning trace was, and how fast it ran. It has zero required dependencies, runs fully locally, and is MIT-licensed.

Test a LangGraph agent in 60 seconds

No callbacks, no wrappers, no manual wiring. Rubric extracts tool calls, arguments, outputs, errors, the full trace, latency, and token usage from the messages your agent already produces.

import rubriceval as rubric
from langgraph.prebuilt import create_react_agent

# `model` and the three tools are assumed to be defined elsewhere in your app.
agent = create_react_agent(model, tools=[lookup_order, create_ticket, send_email])

report = rubric.evaluate(
    test_cases=rubric.run_langgraph(agent, scenarios=[
        rubric.AgentScenario(
            input="Where is my order #ORD-9821?",
            expected_tools=["lookup_order"],
        ),
        rubric.AgentScenario(
            input="My account is locked, this is urgent.",
            expected_tools=["create_ticket"],
            forbidden_tools=["send_email"],   # must not bypass the ticketing system
        ),
    ]),
    metrics=[
        rubric.ToolCallAccuracy(),            # right tools? no forbidden ones?
        rubric.TraceQuality(),                # no loops, within step budget?
        rubric.LatencyMetric(max_ms=3000),
    ],
    output_html="report.html",
    output_json="report.json",
)

  [2/2] My account is locked, this is urgent.
    ❌ Score: 0.667
        ✗ tool_call_accuracy: 0.000 — Missing expected tools: ['create_ticket']; Called forbidden tools: ['send_email']
        ✓ trace_quality: 1.000 — Trace looks clean. 3 steps taken.
        ✓ latency: 1.000 — Latency 1840ms is within budget (3000ms).
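
The failing result above can be reasoned about with a small standalone sketch. It is not Rubric's implementation: the helper `tool_call_check`, the all-or-nothing scoring, the assumption that the agent called only `send_email`, and the idea that the per-test score is an unweighted mean of the metric scores are all inferred from the sample output rather than stated in this README.

```python
from statistics import mean


def tool_call_check(called, expected, forbidden):
    """Return (score, reason) for one agent run.

    Hypothetical all-or-nothing scoring: 1.0 only if every expected tool
    was called and no forbidden tool was.
    """
    missing = [tool for tool in expected if tool not in called]
    violations = [tool for tool in forbidden if tool in called]
    reasons = []
    if missing:
        reasons.append(f"Missing expected tools: {missing}")
    if violations:
        reasons.append(f"Called forbidden tools: {violations}")
    score = 0.0 if reasons else 1.0
    return score, "; ".join(reasons) or "All tool expectations met."


# Assumed trace for the failing scenario: the agent emailed instead of
# opening a ticket.
score, reason = tool_call_check(
    called=["send_email"],
    expected=["create_ticket"],
    forbidden=["send_email"],
)
print(score, reason)

# Assumption: the per-test score is the unweighted mean of the three
# metric scores (tool_call_accuracy, trace_quality, latency).
overall = round(mean([score, 1.0, 1.0]), 3)
print(overall)
```

Under these assumptions the sketch reproduces both the reason string and the 0.667 score shown in the report.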

Already have a result from agent.invoke()? One call:

case = rubric.from_langgraph(result, expected_tools=["lookup_order"])

Not on LangGraph? rubric.from_messages() accepts any OpenAI-format message list (role / content / tool_calls), so it also works with raw OpenAI tool-calling loops. Langfuse and LangSmith trace exports load via load_langfuse() / load_langsmith(), and you can always construct an AgentTestCase by hand.
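
For readers unfamiliar with the OpenAI message format, here is a minimal sketch of what such a list looks like and how tool calls can be read out of it. The helper `extract_tool_calls` and the `order_id` argument name are hypothetical and exist only for this illustration; Rubric's own extraction may work differently.

```python
import json

# A hand-written OpenAI-format conversation (role / content / tool_calls).
messages = [
    {"role": "user", "content": "Where is my order #ORD-9821?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "lookup_order",
                    # Arguments travel as a JSON-encoded string.
                    "arguments": json.dumps({"order_id": "ORD-9821"}),
                },
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": '{"status": "shipped"}'},
    {"role": "assistant", "content": "Your order has shipped."},
]


def extract_tool_calls(messages):
    """Return (tool_name, parsed_arguments) pairs in call order.

    Hypothetical helper for illustration; it is not part of Rubric.
    """
    calls = []
    for message in messages:
        if message.get("role") != "assistant":
            continue
        # `tool_calls` may be absent or None on plain-text replies.
        for call in message.get("tool_calls") or []:
            function = call["function"]
            calls.append((function["name"], json.loads(function["arguments"])))
    return calls


print(extract_tool_calls(messages))
```

Everything a behavior test needs (tool names, arguments, ordering) is already present in the message list, which is why no extra instrumentation is required.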

Catch regressions in CI

Rubric ships a GitHub Action that runs your evals on every PR, diffs against a baseline, and posts the result as a comment — like Codecov, but for agent behavior.

# .github/workflows/eval.yml
name: Agent Evals
on: [pull_request]

permissions:
  pull-requests: write

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: Kareem-Rashed/rubric-eval@v0.2.0
        with:
          eval-file: evals/regression.py
          baseline: evals/baseline.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

The PR comment looks like this:

🧪 Rubric eval — 🔻 1 regression

	Baseline	Current	Δ
Pass rate	100.0% (12/12)	91.7% (11/12)	🔻 -0.08
Avg score	0.96	0.89	🔻 -0.07

🔻 Regressions (1)

Urgent — account locked — pass → fail (score 1.00 → 0.67)

tool_call_accuracy: 1.00 → 0.00 — Missing expected tools: ['create_ticket']; Called forbidden tools: ['send_email']

The same diff is available locally and in any CI system:

rubric run evals/regression.py --output-json current.json
rubric compare current.json --baseline evals/baseline.json --fail-on-regression

rubric compare flags pass→fail regressions, score drops on still-passing tests, fixed tests, and new/removed tests — with the failing metric's reason inline.
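
The categories listed above can be illustrated with a standalone sketch of baseline diffing. The report shape used here (a mapping from test name to `passed` and `score`) and the function `diff_reports` are hypothetical; they are not Rubric's JSON schema or its comparison code.

```python
def diff_reports(baseline, current):
    """Classify each test by comparing two {name: {"passed", "score"}} maps.

    The report shape is an assumption made for this sketch.
    """
    changes = {"regressions": [], "score_drops": [], "fixed": [], "new": [], "removed": []}
    for name, now in current.items():
        before = baseline.get(name)
        if before is None:
            changes["new"].append(name)
        elif before["passed"] and not now["passed"]:
            changes["regressions"].append(name)
        elif not before["passed"] and now["passed"]:
            changes["fixed"].append(name)
        elif now["passed"] and now["score"] < before["score"]:
            # Still passing, but worse than the baseline.
            changes["score_drops"].append(name)
    changes["removed"] = [name for name in baseline if name not in current]
    return changes


baseline = {
    "order lookup": {"passed": True, "score": 1.00},
    "urgent lockout": {"passed": True, "score": 1.00},
    "refund policy": {"passed": True, "score": 0.95},
    "legacy flow": {"passed": False, "score": 0.40},
}
current = {
    "order lookup": {"passed": True, "score": 1.00},
    "urgent lockout": {"passed": False, "score": 0.67},
    "refund policy": {"passed": True, "score": 0.90},
    "password reset": {"passed": True, "score": 1.00},
}

changes = diff_reports(baseline, current)
print(changes)

# A CI gate in the spirit of --fail-on-regression: non-zero on any pass -> fail.
exit_code = 1 if changes["regressions"] else 0
```

Keeping score drops separate from pass-to-fail regressions lets a CI job fail hard on the latter while only warning on the former.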

Metrics

Agent behavior (the core)

Metric	Checks
`ToolCallAccuracy()`	Expected tools called, forbidden tools avoided, optional order check
`TraceQuality()`	No loops, within step budget
`TaskCompletion()`	The task was actually finished
`ToolCallEfficiency()`	No redundant or wasted tool calls
`SafetyCompliance()`	No unsafe actions in the trace
`ReasoningQuality()`	Coherent multi-step reasoning
`ContextUtilization()`	Provided context was actually used
`LatencyMetric(max_ms=...)`	Within latency budget
`CostMetric(max_cost_usd=...)`	Within cost budget
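
As an illustration of what "no loops, within step budget" can mean in practice, here is a minimal standalone sketch. The function `check_trace`, its thresholds, and its definition of a loop (the same tool called with identical arguments several times in a row) are assumptions for this example, not the rules `TraceQuality()` actually applies.

```python
def check_trace(steps, max_steps=10, max_repeats=2):
    """Flag step-budget overruns and runs of identical consecutive tool calls.

    `steps` is a list of (tool_name, arguments) tuples. The thresholds and the
    definition of a "loop" are assumptions made for this sketch.
    """
    problems = []
    if len(steps) > max_steps:
        problems.append(f"{len(steps)} steps exceed the budget of {max_steps}")
    run_length = 1
    for previous, step in zip(steps, steps[1:]):
        run_length = run_length + 1 if step == previous else 1
        # Report each run once, at the moment it crosses the threshold.
        if run_length == max_repeats + 1:
            problems.append(
                f"{step[0]} called more than {max_repeats} times in a row "
                "with identical arguments"
            )
    return problems


clean = [
    ("lookup_order", {"order_id": "ORD-9821"}),
    ("create_ticket", {"priority": "urgent"}),
]
looping = [("lookup_order", {"order_id": "ORD-9821"})] * 4

print(check_trace(clean))
print(check_trace(looping))
```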

Output quality

Metric	Checks
`LLMJudge(criteria=...)`	Custom LLM-based scoring — works with any callable (OpenAI, Anthropic, Ollama)
`GEval(name=..., criteria=...)`	Chain-of-thought LLM evaluation
`HallucinationScore()`	Output grounded in the provided context (LLM judge or NLI mode)
`SemanticSimilarity(threshold=...)`	Embedding similarity vs expected output (`[semantic]` extra)
`RougeScore()`	ROUGE overlap for summarization (`[rouge]` extra)
`ExactMatch()` / `Contains()` / `NotContains()` / `RegexMatch()`	String checks, zero dependencies

LLM-judge metrics support repeated runs with flakiness detection — Rubric reports the score variance so you know when your judge, not your agent, is the unstable part.
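
The idea behind flakiness detection can be sketched with the standard library alone. The function `summarize_judge_runs` and the `max_variance` threshold are hypothetical and chosen for illustration; the README does not say how Rubric computes or thresholds variance.

```python
from statistics import mean, pvariance


def summarize_judge_runs(scores, max_variance=0.01):
    """Summarize repeated judge scores for a single test case.

    `max_variance` is an arbitrary threshold chosen for this sketch.
    """
    variance = pvariance(scores)
    return {
        "mean": mean(scores),
        "variance": variance,
        "flaky": variance > max_variance,
    }


# Same agent output, judged four times each.
stable = summarize_judge_runs([0.9, 0.9, 0.9, 0.9])
unstable = summarize_judge_runs([1.0, 0.4, 0.9, 0.3])

print(stable["flaky"], unstable["flaky"])
```

If the agent output is held fixed across the repeated runs, any spread in the scores comes from the judge, which is what makes the variance a useful signal.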

Custom metrics

from rubriceval import BaseMetric, MetricResult

class NoApologySpam(BaseMetric):
    name = "no_apology_spam"
    threshold = 1.0

    def measure(self, test_case) -> MetricResult:
        count = test_case.actual_output.lower().count("sorry")
        return MetricResult(
            metric_name=self.name,
            score=1.0 if count <= 1 else 0.0,
            passed=count <= 1,
            reason=f"'sorry' appears {count} time(s).",
        )

pytest integration

Evals as regular tests, via the built-in rubric_eval fixture:

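import rubriceval as rubric

# `agent` is assumed to be the LangGraph agent built in your application code.
def test_agent_routes_urgent_requests(rubric_eval):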
    result = agent.invoke({"messages": [{"role": "user", "content": "Account locked, urgent!"}]})
    rubric_eval.add(
        rubric.from_langgraph(result,
                              expected_tools=["create_ticket"],
                              forbidden_tools=["send_email"]),
        metrics=[rubric.ToolCallAccuracy()],
    )
    # auto-asserts at end of test

CLI

rubric run evals/regression.py                      # run an eval file
rubric run evals/regression.py --output-html report.html --output-json report.json
rubric run evals/regression.py --quiet --fail-on-error
rubric compare current.json --baseline baseline.json --fail-on-regression
rubric version

The HTML report is a single self-contained file with per-test traces, tool calls, and per-metric breakdowns. Open it locally or attach it to CI artifacts; no server is needed.

Why Rubric

Behavior-first. Most eval frameworks score the final answer. Rubric's core abstraction is the agent run — tools, arguments, trace, latency — because that's where agent bugs actually live.
Zero wiring. run_langgraph() / from_messages() turn the messages you already have into test cases. No SDK to thread through your app.
CI-native. Baseline diffing, PR comments, and exit codes are built in, not a paid add-on.
Zero required dependencies, fully local. pip install rubric-eval pulls in nothing else. Your prompts and traces never leave your machine.
Independent and MIT-licensed. Not owned by an AI company or a platform vendor — no pressure to route your evals through anyone's cloud.

Examples

examples/langgraph_eval.py — agent behavior testing end-to-end (runs with zero deps, no API keys)
examples/eval.py — a production-realistic suite: FAQ bot + support agent

Roadmap

See ROADMAP.md. Next up: auto-capture for CrewAI, the OpenAI Agents SDK, and MCP servers; baseline auto-update on merge; dataset loaders.

Contributing

Contributions are welcome — the issues tagged good first issue are genuinely scoped to a first PR.

git clone https://github.com/Kareem-Rashed/rubric-eval
cd rubric-eval
pip install -e ".[dev]"
pytest tests/

See CONTRIBUTING.md for guidelines.
