Prompt Engineering for Automated Evaluation: Making LLMs the Judge in AI Builder Solutions
Bala Madhuso · 2026-05-25 · via DEV Community

Intro:
Automated evaluation is fast becoming a necessity as AI-driven agents proliferate across business processes. Accuracy and trust are always top of mind, but manual review of agent responses simply doesn't scale. That's where using a Large Language Model (LLM) as an impartial "judge" comes in: a purpose-built prompt turns the LLM into a rigorous, step-by-step evaluator.
I've previously experimented with extraction and evaluation frameworks that focused on structured data and document extraction.

However, this article is centered on a different challenge: evaluating conversational, Retrieval-Augmented Generation (RAG) based agents.

Below I share a battle-tested evaluation prompt designed for these agents and break down its logic, metrics, and final grading criteria, along with the input template and a sample output. This approach fits naturally into Power Platform AI Builder scenarios, enabling scalable, explainable, and reproducible evaluation.

The prompt:
This prompt is designed for rapid, scalable evaluation of AI responses based solely on the user’s question—prioritizing task relevance, clarity, professionalism, and formatting compliance. The clear rubric, required reasoning, and structured output make it ideal for automated testing and continuous improvement in conversational AI systems.

| **Characteristic** | **Description** |
| --- | --- |
| Evaluator Role | LLM is positioned as an impartial, expert critic to ensure objectivity. |
| Input Scope | Strictly evaluates the AI response based only on the User Question (no external gold standard/reference). |
| Metrics (Rubric) | Four focused metrics: 1. Task Fulfillment (1–5); 2. Conciseness (1–5); 3. Professional Tone (0/1); 4. Formatting (0/1) |
| Step-by-Step Reasoning | Requires reasoning to be explained before each score, enhancing transparency. |
| Scoring System | Combination of graded (1–5) and binary (0/1) scoring for nuanced yet decisive evaluation. |
| Pass/Fail Gating | Strict thresholds: Task Fulfillment ≥ 4; Professional Tone = 1; Formatting = 1. All must be met for PASS. |
| Output Format | Returns evaluation as a structured JSON object for easy automation and integration. |
| Diagnostic Feedback | Provides a one-sentence summary and, if FAIL, specifies exactly which threshold(s) were breached. |
| Formatting Compliance | Explicitly checks if the response adheres to any formatting instructions given in the user question. |

You are an impartial, expert Evaluation AI. Your task is to act as a "Critic" and evaluate an AI-generated response based strictly on the User Question provided.

You will evaluate the response across four specific metrics. For each metric, you must provide a brief step-by-step reasoning BEFORE assigning a score.

THE RUBRIC:

Metric 1: Task Fulfillment (Score: 1 to 5)
* How well does the response address the specific User Question?
* 5 = Comprehensive and perfectly tailored. 3 = Partially answers. 1 = Fails to address the question.

Metric 2: Conciseness (Score: 1 to 5)
* Is the response highly efficient with its words?
* 5 = Dense and to the point. 1 = Rambling, repetitive, or includes unnecessary fluff.

Metric 3: Professional Tone (Score: 0 or 1)
* Is the tone strictly professional, objective, and helpful?
* 1 = Pass. 0 = Fail (Emotional, sarcastic, overly informal, or rude).

Metric 4: Formatting (Score: 0 or 1)
* Did the response follow any explicit formatting instructions requested by the user (e.g., "bullet points", "JSON", "short paragraph")?
* 1 = Pass (or no formatting was requested). 0 = Fail (Explicit formatting instructions were ignored).

FINAL GRADING CRITERIA:
You must assign a final pipeline status of either "PASS" or "FAIL".
To achieve a "PASS", the report MUST meet ALL of the following conditions:
- Task Fulfillment must be 4 or 5.
- Professional Tone must be 1.
- Formatting must be 1.

INPUT DATA:
<user_question>
{{USER_QUESTION}}
</user_question>

<response_to_evaluate>
{{LLM_RESPONSE}}
</response_to_evaluate>

OUTPUT FORMAT:
Output your evaluation strictly as a valid JSON object.
{
  "evaluation_report": {
    "metrics": {
      "task_fulfillment": {"reasoning": "...", "score": <1 to 5>},
      "conciseness": {"reasoning": "...", "score": <1 to 5>},
      "professional_tone": {"reasoning": "...", "score": <0 or 1>},
      "formatting": {"reasoning": "...", "score": <0 or 1>}
    },
    "summary": {
      "overall_verdict": "<One-sentence summary>",
      "pipeline_status": "<PASS or FAIL>",
      "failure_reason": "<If FAIL, state which threshold was breached.>"
    }
  }
}
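
The `{{USER_QUESTION}}` and `{{LLM_RESPONSE}}` placeholders have to be filled before the prompt is sent to the judge. Here is a minimal Python sketch of that step. The function name `render_judge_prompt` and the shortened `JUDGE_TEMPLATE_TAIL` string are illustrative assumptions, not part of AI Builder or of the original setup. A single regex pass is used so that a response which happens to contain `{{...}}` text is not substituted a second time.

```python
import re

# Shortened stand-in for the INPUT DATA part of the prompt (assumption:
# the full prompt text would be stored as one string in the same way).
JUDGE_TEMPLATE_TAIL = """INPUT DATA:
<user_question>
{{USER_QUESTION}}
</user_question>

<response_to_evaluate>
{{LLM_RESPONSE}}
</response_to_evaluate>
"""

_PLACEHOLDER = re.compile(r"\{\{(USER_QUESTION|LLM_RESPONSE)\}\}")


def render_judge_prompt(template: str, user_question: str, llm_response: str) -> str:
    """Fill both placeholders in one pass over the template."""
    values = {"USER_QUESTION": user_question, "LLM_RESPONSE": llm_response}
    # A function replacement inserts the text literally (no backslash escapes),
    # and the inserted text is never rescanned for placeholders.
    return _PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


if __name__ == "__main__":
    print(render_judge_prompt(
        JUDGE_TEMPLATE_TAIL,
        "List three onboarding steps as bullet points.",
        "- Create an account\n- Verify your email\n- Set up MFA",
    ))
```

Keeping the user question and the response inside their own tags, as the prompt does, also makes it easier for the judge to tell the material under evaluation apart from its instructions.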

LLM Output:

{
  "evaluation_report": {
    "metrics": {
      "task_fulfillment": {"reasoning": "...", "score": 5},
      "conciseness": {"reasoning": "...", "score": 5},
      "professional_tone": {"reasoning": "...", "score": 1},
      "formatting": {"reasoning": "...", "score": 1}
    },
    "summary": {
      "overall_verdict": "The agent response matched all core criteria and was well-supported.",
      "pipeline_status": "PASS",
      "failure_reason": null
    }
  }
}
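
Because `pipeline_status` is itself generated by the model, a pipeline may want to recompute the verdict from the individual scores instead of trusting that field. The sketch below does this for the four-metric rubric and the gating thresholds stated in the prompt. The names `SCORE_RANGES`, `PASS_THRESHOLDS`, and `recompute_status` are hypothetical, and treating a missing or out-of-range score as a failure is a design assumption rather than something the prompt specifies.

```python
import json

# Allowed (min, max) score per metric, taken from the rubric in the prompt.
SCORE_RANGES = {
    "task_fulfillment": (1, 5),
    "conciseness": (1, 5),
    "professional_tone": (0, 1),
    "formatting": (0, 1),
}
# Minimum score required for PASS; conciseness is scored but not gated.
PASS_THRESHOLDS = {"task_fulfillment": 4, "professional_tone": 1, "formatting": 1}


def recompute_status(raw_json: str) -> tuple[str, list[str]]:
    """Return ("PASS" | "FAIL", list of problems) from the judge's raw JSON."""
    metrics = json.loads(raw_json)["evaluation_report"]["metrics"]
    problems = []
    for name, (low, high) in SCORE_RANGES.items():
        score = metrics.get(name, {}).get("score")
        # bool is a subclass of int in Python, so reject it explicitly.
        valid = isinstance(score, int) and not isinstance(score, bool)
        if not valid or not low <= score <= high:
            problems.append(f"{name}: missing or out-of-range score {score!r}")
        elif score < PASS_THRESHOLDS.get(name, low):
            problems.append(f"{name}: {score} is below threshold {PASS_THRESHOLDS[name]}")
    return ("FAIL" if problems else "PASS", problems)


if __name__ == "__main__":
    sample = json.dumps({"evaluation_report": {"metrics": {
        "task_fulfillment": {"reasoning": "...", "score": 3},
        "conciseness": {"reasoning": "...", "score": 5},
        "professional_tone": {"reasoning": "...", "score": 1},
        "formatting": {"reasoning": "...", "score": 1},
    }}})
    print(recompute_status(sample))
```

Comparing the recomputed status with the judge's own `pipeline_status` is also a cheap way to spot runs where the model did not follow its grading criteria.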

Closing thoughts:
This 'LLM-as-a-Judge' evaluation prompt is designed to automate the QA of our conversational agent. The framework is valuable because it moves us beyond brittle, exact-word-match testing and allows for nuanced, semantic evaluation.

By using a structured rubric, the Judge grades the agent's responses against the user's question alone, with no 'gold standard' answer required, scoring task fulfillment, conciseness, professional tone, and formatting. Crucially, it outputs structured, machine-readable JSON with a strict Pass/Fail threshold, so we can plug it directly into our automated testing pipeline and catch off-target, unprofessional, or badly formatted responses at scale before they ever reach the user.
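
To illustrate how the JSON verdict might feed an automated test run, here is a rough Python sketch of a batch loop. Everything in it is an assumption made for illustration: `run_suite`, the `question`/`response` keys of each test case, and `fake_judge`, which stands in for the real model call and simply returns canned output. Output that cannot be parsed as the expected report is counted as a failure rather than silently skipped.

```python
import json
from typing import Callable


def run_suite(cases: list[dict], judge: Callable[[str, str], str]) -> dict:
    """Run every case through the judge and collect the failing ones."""
    failed = []
    for case in cases:
        raw = judge(case["question"], case["response"])
        try:
            summary = json.loads(raw)["evaluation_report"]["summary"]
            status = summary["pipeline_status"]
            reason = summary.get("failure_reason")
        except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
            # Malformed judge output is treated as a failed evaluation.
            status, reason = "FAIL", "judge output was not a valid report"
        if status != "PASS":
            failed.append({"question": case["question"], "reason": reason})
    return {"total": len(cases), "failed": failed}


def fake_judge(question: str, response: str) -> str:
    """Hypothetical stand-in for the real judge call; not a real API."""
    if not response:
        return "Sorry, I cannot evaluate this."  # simulates malformed output
    return json.dumps({"evaluation_report": {"summary": {
        "overall_verdict": "Stub verdict.",
        "pipeline_status": "PASS",
        "failure_reason": None,
    }}})


if __name__ == "__main__":
    cases = [
        {"question": "What is our refund window?", "response": "30 days."},
        {"question": "Summarize the policy.", "response": ""},
    ]
    report = run_suite(cases, fake_judge)
    print(f"{len(report['failed'])} of {report['total']} cases failed")
    for failure in report["failed"]:
        print(failure)
```

In a real pipeline the `judge` argument would wrap whatever endpoint hosts the evaluation prompt, and a non-empty `failed` list would fail the build or raise an alert.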

PS - In the next article, I’ll show how we integrated this evaluation approach into a Power Automate Cloud Flow, enabling automated, real-time agent assessment with zero manual intervention.