惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

N
News and Events Feed by Topic
Malwarebytes
Malwarebytes
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
C
Cybersecurity and Infrastructure Security Agency CISA
F
Future of Privacy Forum
C
Cisco Blogs
T
The Exploit Database - CXSecurity.com
A
Arctic Wolf
S
Securelist
K
Kaspersky official blog
S
Schneier on Security
T
ThreatConnect
T
Tenable Blog
Spread Privacy
Spread Privacy
T
True Tiger Recordings
AWS News Blog
AWS News Blog
F
Fox-IT International blog
量子位
T
Threatpost
V
Vulnerabilities – Threatpost
C
CERT Recently Published Vulnerability Notes
Cisco Talos Blog
Cisco Talos Blog
GbyAI
GbyAI
宝玉的分享
宝玉的分享
腾讯CDC
G
Google Developers Blog
aimingoo的专栏
aimingoo的专栏
Cyberwarzone
Cyberwarzone
有赞技术团队
有赞技术团队
S
SegmentFault 最新的问题
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
V
Visual Studio Blog
U
Unit 42
雷峰网
雷峰网
cs.CV updates on arXiv.org
cs.CV updates on arXiv.org
Simon Willison's Weblog
Simon Willison's Weblog
O
OpenAI News
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
The GitHub Blog
The GitHub Blog
The Register - Security
The Register - Security
MyScale Blog
MyScale Blog
小众软件
小众软件
A
About on SuperTechFans
Last Week in AI
Last Week in AI
Y
Y Combinator Blog
博客园 - 三生石上(FineUI控件)
美团技术团队
Google Online Security Blog
Google Online Security Blog
P
Proofpoint News Feed
MongoDB | Blog
MongoDB | Blog

DEV Community

How I Prepared for CKA: Resources, Labs, and Strategy That Worked for Me The Misleading "User is not authorized to access connection" Error in AWS CodeBuild — and Why Your IAM Policy Looks Fine I Resurrected a Dead F1 Project and Accidentally Built a Race Intelligence OS Remix Mini PC: After a Year of Dead Ends, the eMMC Finally Talks Not All Games Are Equal: The Real Difference Between a Trap and a Tool How to add Peppol e-invoicing to your SaaS without making it your team's problem I Built a Hermes Agent to Tell Me Which Hackathons to Enter. It Told Me to Enter This One. The Five Hooks That Change How You Ship With Claude Code Powering Your Progress: Building Robust Solutions with Laravel I built a self-hosted CI/CD platform with persistent queue, encrypted secrets, and rollback UI — here's what I learned Antigravity 2.0 and the $1,000 OS: Why "Agent-First" Feels Like the Direction I've Been Building Toward Anyway I built an AI PR-triage agent in 30 lines of Markdown Core Web Vitals from 74 to 91: A Real Tax Practitioner Site Rebuild I Gave Gemma 4 150 Tools on Windows. Here's What Actually Happened. Beyond the Loop: Why Monolithic AI Agents Fail and How to Build a Microkernel Architecture The Hidden Tax of AI-Assisted Development (And How I Fixed It) I Ditched Cloud LLMs for Gemma 4 4B: A DevOps Engineer's 48-Hour Reality Check Building a Schema.org @graph That Validates on the First Try The "Lift and Shift" Trap: Why Your Integration Layer Needs More Than Just a Cloud Address All 7 OSI Layers Explained with Real-World Analogies Antigravity 2.0 in one day: the four shells and what each is good for Self-Hosting Google Fonts with size-adjust: Zero CLS Web Font Swap The Multi-Provider LLM Problem: Why “One API” Is Not Enough How I indexed 69,000 Claude Code skills (and what I learned doing it) RememberMe CareGrid: Local Gemma 4 for dementia memory and safety Google Is Killing Gemini CLI on June 18. Here Is What to Do Before Then Do Domínio ao Deploy: Hospedando Arquivos de Deep Links no Cloudflare Pages (Parte 7.1) Running Gemma 4 26B on an Old GTX 1080 with llama.cpp Devlog 1: I tried building an SNES game with the super FX chip Why Gemma 4 Feels Like an Important Moment for AI Developers✨ From Zero and Confused, This Is How I Started Learning to Code I Built a Local AI Gateway That Talks to Claude, ChatGPT, DeepSeek and Gemini — Without a Single API Key Bootstrapping with AI: Why Gemma 4 is the Micro-SaaS Founder’s Best Friend MyErp Architecture Series - #02 Cellular Architecture: Mapping Biology to Software Systems NodeJS vs Bun vs Go 🌍 RTL Arabic Style UI How Does an AI Agent Actually Buy Something? Google Just Published the Spec. Google I/O 2026 Is One Uncanny F.R.I.E.N.D.S Group Upgrade I Replaced 70MB Node.js Log Viewer with a 172KB Zig Binary The "MTTR Is All You Need" Trap The Quiet Revolution: How Firebase Became the First Agent-Native Backend at Google I/O 2026 I Built ResuMate! A 100% Private, Local AI Resume Optimizer with Google Gemma 4 Learning DirectX 12 - Part 2 Initialization Theory NeuralHats: I Put Edward de Bono’s Six Thinking Hats on Local LLMs Using Gemma 4 📝 Instant Auto Save Notes Engineering the "App-Like" Experience: A Deep Dive into PWA Architecture I built a local first AI CCTV assistant using Gemma 4 + Frigate CrowdShield AI — Smart Stadium Operating System & Crowd Intelligence Platform I built a free AI observability tool, prove your AI is useful, not just running Beyond Autocomplete: Why Google Antigravity 2.0 Changes the Rules for Indie Builders 터미널 AI 에이전트 구축 (v12) Building Instagram-Powered Apps with HikerAPI (Without Fighting Scrapers) Checkpoints, Not Transcripts: Rethinking AI Coding Agent Memory From Side Project to Student Savior: My AI PPT & Resume Tool Crossed 1.5K+ Users Why Story Points Don’t Work in the AI Era, And What Should Take Their Place Instead. Self-Hosted Document AI: How to Run Document Intelligence On Your Own Infrastructure (2026) How to Extract Tables from PDFs with AI: 4 Methods That Actually Work (2026) IDP vs OCR: What's the Difference — and Which Does Your Business Actually Need? Automated PII Detection and Redaction in Business Documents: A Practical Guide Human-in-the-Loop Document Review: When to Use It and How to Set It Up (2026) Document Processing Without RPA: A Modern Approach for Small Teams Reducto Alternative: When You Need More Than a Document Parser (2026) Hermes Agent vs LangChain vs CrewAI: When to Reach for Each SparshAI: I Built an Offline AI Tutor for Students Using Gemma 4 — Here's What Happened Building NeuroSense AI: A Human-Centered Stress Insight Assistant Powered by Gemma Why I Built a Privacy-First Dev Toolkit GAS Input Tags: Ability Activation Without Hardcoded Bindings AI Legal Document Advisor Supported By Gemm 4 Model Building Convertify in Public Week 10: PDF Cluster + Blog Launch CureNet AI: Decentralized Health Intelligence for India, Powered by Gemma 4 and ABHA Standardization When Open-Weights AI Meets a Broken Healthcare System: Deploying Gemma 4 in Rural India V.A.L.I.D. Google I/O 2026: The Year Google Stopped Building AI Assistants and Started Shipping AI Engineers Bondmap: AI-Powered Relationship Network That Maps How You're Connected to Everyone Using Gemma 4 Gemma 4 challenge inspired me to build my first app! 96. LoRA: Fine-Tune a Billion-Parameter Model on a Laptop From a Student Who Used CircuitVerse to a GSoC Contributor — My Community Bonding Story How Bf-Tree Keeps Mini-Pages Small, Hot, and Cheap to Evict I asked Claude to explain the chip war and ended up understanding modern geopolitics differently Stop Manually Checking for Server Updates: Automate With Email Notifications Nostalgia Meets Cybersecurity: Spotting Modern Scams in a Retro OS Simulator - Forward or Fraud CRACKING CODING INTERVIEW From Python to Production Pipeline :A Practical guide to Apache Airflow Antigravity 2.0: Google Just Changed What It Means to Be an Engineer I Built a Free Sticker Maker Because Every Other One Hid the Export How I bypassed Blazor WebAssembly's Virtual DOM using raw WASM pointers Distributed Tracing for LLM Agents: When MCP Makes Tool Calls Observable The Zero-Budget Memory Setup Behind My AI Agent Workflow No database. No framework. Just files, startup order, correction logs, and discipline. I Built an AI Second Brain with Gemma 4 The Most Exciting Google I/O 2026 Announcement for Me: HTML-in-Canvas CrisisLens: Compressing Disaster Scenes into 200-Byte Emergency Payloads with Gemma 4 I'm 15 and I built a todo app with Telegram Stars payments — only legal way for me to monetize before turning 18 Crypto Branding After the Token Launch Building an on-chain alerts bot in Python without any blockchain library FinePrint — An AI Pocket Lawyer That Decodes Predatory Contracts Using Gemma 4 How to Connect OpenAI with Supabase in 10 Minutes for a Lightning-Fast AI MVP One AI Gateway for AWS Bedrock, Google Vertex AI, Gemini, and Anthropic Reading Log #9 — Aoashi The Tacit Dimension Thinking, Fast and Slow Web3 Onboarding Is Not a Wallet Problem. It Is a Trust Problem. FHE Prompt Privacy: The Metadata Leak Your Demo Still Has
Stop Flying Blind: We Built an LLM Evaluation Framework That Works Across 17+ Agent Frameworks
Anjaiah Meth · 2026-05-25 · via DEV Community

Let me be brutally honest with you.

I've seen teams demo AI agents that look incredible — smooth responses, beautiful UI, stakeholders impressed. Then that same team ships to production and spends the next three weeks firefighting hallucinations they could have caught in testing.

The problem isn't the AI. The problem is nobody evaluated it properly.

Not because they didn't want to. Because the existing tools made it painful.

You're building with LangGraph on Monday. LlamaIndex RAG pipeline on Wednesday. The product team wants CrewAI by Friday. Every framework has different output shapes. Every eval tool wants you to rebuild your stack around it.

So you ship anyway. With fingers crossed.

That's the exact problem I set out to solve with Custom Evals.


What Is Custom Evals?

Custom Evals is an open-source, lightweight evaluation framework for LLM outputs with support for 17+ agent frameworks and a multi-layer metric system — from fast deterministic checks to full LLM-as-judge scoring.

pip install -e ".[dev]"

Enter fullscreen mode Exit fullscreen mode

That's it. No required backend. No dashboard to stand up. No mandatory test runner.

Here's your first evaluation in 10 lines:

from custom.evals import CoherenceEvaluator
from custom.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = CoherenceEvaluator(llm)

score = evaluator.evaluate({
    "input": "What is AI?",
    "output": "AI is artificial intelligence, enabling machines to perform intelligent tasks."
})

print(f"{score.label}: {score.explanation}")
# coherent: The response provides a clear, logical explanation...

Enter fullscreen mode Exit fullscreen mode

A Score object. A label. An explanation. That's the entire interface.


Why Existing Tools Leave Gaps

I want to be fair here — the existing eval tools are genuinely good. But they each have a niche.

Phoenix Evals (Arize) is brilliant if you're deep in the Arize observability ecosystem. The Custom Evals architecture is openly inspired by it. But Phoenix is a full observability platform. If you just want to score outputs without standing up a tracing infrastructure, it's overkill.

DeepEval has 50+ metrics — impressive. But it requires a specific test runner, a specific file format, and an opinionated workflow. It's a comprehensive evaluation suite, not a lightweight library.

RAGAS is surgical and excellent at RAG evaluation specifically. Faithfulness, AnswerRelevancy, ContextPrecision — the research is solid. But it's RAG-first. It doesn't cover general LLM evaluation, agent tool-use quality, or document extraction accuracy.

The gap: none of them give you a single unified interface that works across 17 different frameworks without requiring a backend.


The Architecture: Four Evaluation Layers

The interesting design choice in Custom Evals is that there's no single "evaluator." There are four distinct layers. Use any or all of them.

Custom Evals
├── Layer 1: Code-Based Metrics       (deterministic, zero LLM cost)
├── Layer 2: LLM-Based Evaluators     (LLM-as-judge, semantic quality)
├── Layer 3: NLP Similarity Metrics   (BLEU, ROUGE, cosine, Jaro-Winkler...)
└── Layer 4: OCR / Document Metrics   (for non-LLM extraction pipelines)

Enter fullscreen mode Exit fullscreen mode

Layer 1 — Deterministic Checks (Free & Fast)

No LLM call. No latency. No API cost. Just math.

from custom.evals.metrics import exact_match, sentiment_score

score = exact_match({"output": "Paris", "expected": "Paris"})
# Score(score=1.0, label="exact_match")

score = sentiment_score({"output": "The product is absolutely fantastic!"})
# Score(score=0.9, label="positive")

Enter fullscreen mode Exit fullscreen mode

Want to register your own in 3 lines? Use the decorator:

from custom.evals import create_evaluator, Score

@create_evaluator(name="json_validity", direction="maximize")
def json_validity(output: str) -> Score:
    import json
    try:
        json.loads(output)
        return Score(score=1.0, label="valid", name="json_validity")
    except:
        return Score(score=0.0, label="invalid", name="json_validity")

Enter fullscreen mode Exit fullscreen mode

Layer 2 — LLM-as-Judge (The Semantic Layer)

Four production-ready evaluators ship out of the box:

Evaluator What It Measures Needs Ground Truth?
HallucinationEvaluator Does output contradict its context? No
CorrectnessEvaluator Factually correct vs expected answer? Yes
RelevanceEvaluator Does it actually answer the question? No
CoherenceEvaluator Logical flow and internal consistency? No

Plus two RAG-specific ones: FaithfulnessEvaluator and AnswerRelevancyEvaluator.

One subtle but important detail — every evaluator declares a DIRECTION:

class HallucinationEvaluator(LLMEvaluator):
    DIRECTION = "minimize"  # Lower score = less hallucination = better ✅

Enter fullscreen mode Exit fullscreen mode

class CoherenceEvaluator(LLMEvaluator):
    DIRECTION = "maximize"  # Higher score = more coherent = better ✅

Enter fullscreen mode Exit fullscreen mode

This means your test thresholds work correctly regardless of metric semantics. You don't need to remember "is higher hallucination score good or bad?" — the evaluator tells you.

Layer 3 — NLP Similarity Metrics

Seven industry-standard metrics for reference-based comparison, no LLM required:

  • BLEU Score — N-gram precision, the machine translation standard
  • ROUGE-N / ROUGE-L — Recall-oriented overlap, the summarization gold standard
  • Jaro-Winkler — Edit distance with prefix weighting, great for entity matching
  • Dice Coefficient — Bigram overlap, fast and symmetric
  • Token F1 Score — Precision/recall at the token level
  • Cosine Similarity (TF-IDF) — Vector-space document comparison
from custom.evals.metrics import bleu_score, rouge_n, cosine_similarity_tfidf

result = bleu_score({
    "output": "The model predicts outcomes accurately",
    "expected": "The model accurately predicts outcomes"
})
print(result.score)  # 0.71
print(result.metadata)  # {"brevity_penalty": 1.0, "n_gram_precisions": [...]}

Enter fullscreen mode Exit fullscreen mode

All seven return the same standardized Score object. Mix and match freely.

Layer 4 — Document Extraction & OCR Metrics

This is the most underrated part of the framework. Not everything you evaluate is an LLM.

If AWS Textract, Google Cloud Vision, or Azure Form Recognizer is in your pipeline, you need evaluation metrics for those outputs too:

  • text_extraction_accuracy — Fuzzy sequence similarity
  • character_error_rate (CER) — Standard OCR benchmarking metric
  • word_error_rate (WER) — Used in document parsing and speech-to-text
  • bounding_box_iou — Intersection over Union for spatial accuracy
  • field_extraction_f1 — Precision/recall for structured form fields
from custom.evals.metrics import text_extraction_accuracy, character_error_rate, bounding_box_iou

eval_input = {
    "output": "Invoice Date: 12/31/2025\nTotal: $1,234.56",
    "expected": "Invoice Date: 12/31/2025\nTotal: $1,234.56",
    "output_bbox": {"x": 10, "y": 20, "width": 100, "height": 30},
    "expected_bbox": {"x": 10, "y": 20, "width": 100, "height": 30}
}

print(f"Accuracy: {text_extraction_accuracy(eval_input).score:.2%}")  # 100.00%
print(f"CER: {character_error_rate(eval_input).metadata['raw_cer']:.2%}")  # 0.00%
print(f"IoU: {bounding_box_iou(eval_input).score:.2f}")                   # 1.00

Enter fullscreen mode Exit fullscreen mode

None of the pure-LLM evaluation frameworks address this. Custom Evals does.


17+ Framework Integrations — The Full Picture

The pattern is the same across every framework:

# 1. Run your framework-specific agent (different per framework)
result = your_agent.run(query)
response = extract_response(result)  # framework-specific extraction

# 2. Evaluate with Custom Evals (always the same)
eval_input = {
    "input": query,
    "output": response,
    "context": relevant_context,  # optional
    "expected": ground_truth_answer  # optional
}

score = evaluator.evaluate(eval_input)

Enter fullscreen mode Exit fullscreen mode

The eval_input dict is the universal adapter. Every integration reduces to filling this dict.

Here's a quick tour of what's covered:

☁️ Cloud Platforms

  • AWS Strands Agents (Bedrock + Claude)
  • Google ADK (Gemini 1.5 Flash/Pro)
  • Databricks Agent Bricks SDK (native MLflow experiment tracking included)

🏢 Microsoft Ecosystem

  • Microsoft Agent Framework
  • Semantic Kernel (plugin output boundaries)
  • Autogen (individual turns and full conversation outcomes)

🔗 LangChain & LlamaIndex

  • LangGraph (stateful graph evaluation)
  • LlamaIndex Workflows (event-driven hooks)
  • LangChain RAG + LlamaIndex RAG (full faithfulness/relevancy stack)

🤖 OpenAI

  • OpenAI Agents Framework (tool calls, handoffs)
  • OpenAI Agents SDK (function calling, structured outputs)
  • OpenAI Assistants (threads and run-based responses)
  • OpenAI Swarm (experimental multi-agent)

🌍 Community Frameworks

  • Agno (multi-agent at scale)
  • CrewAI (role-based agent outputs)
  • Pydantic AI (type-safe, structured outputs)

A Real Production Pipeline (Async + Concurrent)

Here's what running evaluation at scale actually looks like:

import asyncio
from custom.evals import (
    HallucinationEvaluator,
    FaithfulnessEvaluator,
    AnswerRelevancyEvaluator,
    RelevanceEvaluator
)
from custom.evals.llm import LLM
from custom.evals.metrics import bleu_score, rouge_n

async def evaluate_rag_batch(queries, rag_pipeline):
    llm = LLM(provider="openai", model="gpt-4o-mini")
    evaluators = {
        "hallucination": HallucinationEvaluator(llm),
        "faithfulness": FaithfulnessEvaluator(llm),
        "answer_relevancy": AnswerRelevancyEvaluator(llm),
        "relevance": RelevanceEvaluator(llm),
    }
    results = []

    for query in queries:
        response = rag_pipeline.query(query.text)
        eval_input = {
            "input": query.text,
            "output": response.answer,
            "context": "\n".join(response.source_nodes),
            "expected": query.expected_answer
        }

        # Run all LLM evaluations concurrently — not one-by-one
        llm_scores = await asyncio.gather(*[
            evaluators[name].evaluate_async(eval_input)
            for name in evaluators
        ])

        row = {"query": query.text}
        for name, score in zip(evaluators.keys(), llm_scores):
            row[f"{name}_score"] = score.score
            row[f"{name}_label"] = score.label

        # Add deterministic metrics at zero cost
        if query.expected_answer:
            row["bleu"] = bleu_score(eval_input).score
            row["rouge_1"] = rouge_n(eval_input).score

        results.append(row)

    return results

results = asyncio.run(evaluate_rag_batch(test_queries, my_rag_pipeline))

# Aggregate and report
import statistics
faithfulness_scores = [r["faithfulness_score"] for r in results]
print(f"Mean Faithfulness: {statistics.mean(faithfulness_scores):.3f}")
print(f"Failing: {sum(1 for s in faithfulness_scores if s < 0.7)}/{len(results)}")

Enter fullscreen mode Exit fullscreen mode

Concurrent async LLM calls + fast deterministic checks. That's how you run evaluation at scale without LLM serial bottlenecks.


The Ground Truth Problem (And How It's Handled)

Here's a question most eval frameworks dodge: what if I don't have ground truth?

In production, users ask unpredictable questions. You can't pre-label every possible answer. Custom Evals explicitly handles three scenarios:

Reference-free (no ground truth needed):

# Hallucination only requires context, not an expected answer
score = hallucination_eval.evaluate({
    "input": "Who wrote Hamlet?",
    "output": "Shakespeare wrote Hamlet in 1600.",
    "context": "William Shakespeare wrote Hamlet circa 1600."
    # No "expected" key — still meaningful evaluation
})

Enter fullscreen mode Exit fullscreen mode

Soft ground truth (intent-based):

# Answer relevancy checks if the answer addresses the question's intent
score = answer_relevancy_eval.evaluate({
    "input": "Who wrote Hamlet?",
    "output": "Shakespeare wrote Hamlet in 1600."
    # No expected answer — evaluates relevance to the question
})

Enter fullscreen mode Exit fullscreen mode

Hard ground truth (known correct answers):

# Full correctness + BLEU/ROUGE when you have labeled data
score = correctness_eval.evaluate({
    "input": "Who wrote Hamlet?",
    "output": "Shakespeare wrote Hamlet in 1600.",
    "expected": "William Shakespeare"
})

Enter fullscreen mode Exit fullscreen mode

This matters. Evaluation infrastructure that only works with labeled datasets is evaluation infrastructure you won't actually use in production.


Observability: Beyond Individual Scores

Custom Evals integrates with Phoenix Tracing (Arize) for production monitoring. One initialization line instruments everything:

from custom.evals import initialize_tracing

initialize_tracing(
    phoenix_endpoint="http://localhost:6006/v1/traces",
    metrics_enabled=True,
    metrics_export_interval=30000  # export every 30 seconds
)

Enter fullscreen mode Exit fullscreen mode

After this, every evaluator call automatically:

  • Creates an OpenTelemetry span with timing + attributes
  • Increments evaluation counters
  • Records score distributions
  • Computes P50/P95/P99 latency histograms

Real-time dashboards showing score distributions over time — proactive monitoring instead of reactive debugging.


Framework Comparison

Feature Custom Evals Phoenix Evals DeepEval RAGAS
Installation Low friction Medium Medium Low
Agent framework support 17+ Limited Limited Limited
LLM-as-judge metrics 6 Many 50+ 4
Deterministic NLP metrics 10+ Few Few Few
OCR/Document evaluation
OpenTelemetry tracing Optional Core
No backend required

The honest take: Custom Evals isn't trying to replace DeepEval or RAGAS. It's designed to be the evaluation layer you can plug into any stack. Run it alongside DeepEval for deeper metric coverage. Run it alongside Phoenix for full observability. It's composable by design.


Get Started in 5 Minutes

# Step 1: Install
pip install -e ".[dev]"

# Step 2: Set your API key
export OPENAI_API_KEY="sk-..."

Enter fullscreen mode Exit fullscreen mode

# Step 3: Run your first real evaluation
from custom.evals import HallucinationEvaluator
from custom.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = HallucinationEvaluator(llm)

score = evaluator.evaluate({
    "input": "What year was Python created?",
    "output": "Python was created in 1991 by Guido van Rossum.",
    "context": "Python was created in 1991 by Guido van Rossum."
})

print(f"Result: {score.label}")        # factual
print(f"Score: {score.score}")         # 0.0 (no hallucination = good)
print(f"Reason: {score.explanation}")

Enter fullscreen mode Exit fullscreen mode

# Step 4: Add free deterministic checks
from custom.evals.metrics import bleu_score, rouge_n

for metric in [bleu_score, rouge_n]:
    result = metric({"output": my_answer, "expected": ground_truth})
    print(f"{result.name}: {result.score:.3f}")

Enter fullscreen mode Exit fullscreen mode


What's Coming Next

The roadmap has some meaningful additions in the pipeline:

  • Context Precision & Recall — The two RAGAS metrics that complete the standard RAG evaluation quadrant
  • Safety Metrics — Bias and toxicity detection
  • Agentic Metrics — Tool call correctness, task completion rate, step efficiency for multi-agent systems
  • Extended Provider Support — Cohere, Mistral, Ollama (the strategy pattern makes this straightforward)

Wrapping Up

The LLM evaluation space is fragmented. Teams are building on different stacks. Frameworks produce different output shapes. Use cases demand different metrics.

Custom Evals is an honest acknowledgment of that fragmentation — and a practical response to it.

It won't be the only eval library you ever use. It will be the one you can actually drop into any stack without rebuilding your infrastructure around it.

Because in a world where you're choosing between 17 agent frameworks on any given sprint, having a single evaluation interface that works across all of them isn't a nice-to-have.

It's the difference between knowing your agent works and hoping it does.


Resources & Links

  • 🔗 GitHub: anjijava16/cust-evals
  • 📖 Framework Index: docs/FRAMEWORK_INDEX.md — all 17+ integrations
  • 🚀 Quick Start: guides/QUICKSTART.md
  • 📄 Non-LLM Evaluation: docs/NON_LLM_EVALUATION_GUIDE.md — OCR and Textract guide
  • 📊 Advanced Metrics: docs/ADVANCED_METRICS_GUIDE.md — BLEU, ROUGE, and beyond

Building evaluation pipelines for AI systems? What metrics have you found most actionable in production? Drop your experience in the comments — genuinely curious what's working for others.