Intro:
Automated evaluation is fast becoming a necessity as AI-driven agents proliferate across business processes. Accuracy and trust are always top of mind, but manual review of agent responses simply doesn't scale. That’s where the idea of using a Large Language Model (LLM) as an impartial “judge” comes in: a purpose-built prompt turns your LLM into a rigorous, step-by-step evaluator.
I've previously experimented with extraction and evaluation frameworks that focused on structured data and document extraction.
This article, however, centers on a different challenge: evaluating conversational, Retrieval-Augmented Generation (RAG) based agents.
Let me share a battle-tested evaluation prompt designed for these agents and break down its logic, metrics, and final grading criteria, along with sample input/output. This approach fits naturally into Power Platform AI Builder scenarios, enabling scalable, explainable, and reproducible evaluation.
The prompt:
This prompt is designed for rapid, scalable evaluation of AI responses based solely on the user’s question. It prioritizes task relevance, conciseness, professionalism, and formatting compliance. The clear rubric, required reasoning, and structured output make it well suited to automated testing and continuous improvement in conversational AI systems.

| **Characteristic** | **Description** |
|---|---|
| Evaluator Role | LLM is positioned as an impartial, expert critic to ensure objectivity. |
| Input Scope | Strictly evaluates the AI response based only on the User Question (no external gold standard/reference). |
| Metrics (Rubric) | Four focused metrics: (1) Task Fulfillment (1–5), (2) Conciseness (1–5), (3) Professional Tone (0/1), (4) Formatting (0/1). |
| Step-by-Step Reasoning | Requires reasoning to be explained before each score, enhancing transparency. |
| Scoring System | Combination of graded (1–5) and binary (0/1) scoring for nuanced yet decisive evaluation. |
| Pass/Fail Gating | Strict thresholds: Task Fulfillment ≥ 4, Professional Tone = 1, Formatting = 1. All must be met for PASS. |
| Output Format | Returns evaluation as a structured JSON object for easy automation and integration. |
| Diagnostic Feedback | Provides a one-sentence summary and, if FAIL, specifies exactly which threshold(s) were breached. |
| Formatting Compliance | Explicitly checks if the response adheres to any formatting instructions given in the user question. |

You are an impartial, expert Evaluation AI. Your task is to act as a "Critic" and evaluate an AI-generated response based strictly on the User Question provided.
You will evaluate the response across four specific metrics. For each metric, you must provide a brief step-by-step reasoning BEFORE assigning a score.
THE RUBRIC:
Metric 1: Task Fulfillment (Score: 1 to 5)
* How well does the response address the specific User Question?
* 5 = Comprehensive and perfectly tailored. 3 = Partially answers. 1 = Fails to address the question.
Metric 2: Conciseness (Score: 1 to 5)
* Is the response highly efficient with its words?
* 5 = Dense and to the point. 1 = Rambling, repetitive, or includes unnecessary fluff.
Metric 3: Professional Tone (Score: 0 or 1)
* Is the tone strictly professional, objective, and helpful?
* 1 = Pass. 0 = Fail (Emotional, sarcastic, overly informal, or rude).
Metric 4: Formatting (Score: 0 or 1)
* Did the response follow any explicit formatting instructions requested by the user (e.g., "bullet points", "JSON", "short paragraph")?
* 1 = Pass (or no formatting was requested). 0 = Fail (Explicit formatting instructions were ignored).
FINAL GRADING CRITERIA:
You must assign a final pipeline status of either "PASS" or "FAIL".
To achieve a "PASS", the response MUST meet ALL of the following conditions:
- Task Fulfillment must be 4 or 5.
- Professional Tone must be 1.
- Formatting must be 1.
INPUT DATA:
<user_question>
{{USER_QUESTION}}
</user_question>
<response_to_evaluate>
{{LLM_RESPONSE}}
</response_to_evaluate>
OUTPUT FORMAT:
Output your evaluation strictly as a valid JSON object.
{
  "evaluation_report": {
    "metrics": {
      "task_fulfillment": {"reasoning": "...", "score": <1 to 5>},
      "conciseness": {"reasoning": "...", "score": <1 to 5>},
      "professional_tone": {"reasoning": "...", "score": <0 or 1>},
      "formatting": {"reasoning": "...", "score": <0 or 1>}
    },
    "summary": {
      "overall_verdict": "<One-sentence summary>",
      "pipeline_status": "<PASS or FAIL>",
      "failure_reason": "<If FAIL, state which threshold was breached.>"
    }
  }
}
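
Because the PASS/FAIL rule in the prompt's FINAL GRADING CRITERIA is purely mechanical, it can also be recomputed in code rather than taken on trust from the judge. The following is a minimal Python sketch, assuming the judge returned JSON in the shape requested under OUTPUT FORMAT. The function name `recompute_status` and the wording of the breach labels are hypothetical, chosen here for illustration, and are not part of any Power Platform API.

```python
import json


def recompute_status(raw_json: str) -> tuple[str, list[str]]:
    """Re-derive PASS/FAIL from the metric scores in a judge report.

    Assumes the report follows the OUTPUT FORMAT schema from the prompt.
    Returns the status and a list of breached thresholds (empty on PASS).
    """
    metrics = json.loads(raw_json)["evaluation_report"]["metrics"]

    breaches = []
    if metrics["task_fulfillment"]["score"] < 4:
        breaches.append("task_fulfillment < 4")
    if metrics["professional_tone"]["score"] != 1:
        breaches.append("professional_tone != 1")
    if metrics["formatting"]["score"] != 1:
        breaches.append("formatting != 1")

    # Conciseness is scored but does not gate the result, per the prompt.
    return ("PASS" if not breaches else "FAIL", breaches)
```

Comparing this recomputed status with the `pipeline_status` the judge reported is one way to spot cases where the model's verdict disagrees with its own scores.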
LLM Output:
{
  "evaluation_report": {
    "metrics": {
      "factual_consistency": {"reasoning": "...", "score": 1},
      "entity_fabrication": {"reasoning": "...", "score": 1},
      "citation_traceability": {"reasoning": "...", "score": 1},
      "reference_alignment": {"reasoning": "...", "score": 5},
      "task_fulfillment": {"reasoning": "...", "score": 5},
      "completeness": {"reasoning": "...", "score": 5},
      "conciseness": {"reasoning": "...", "score": 5}
    },
    "summary": {
      "overall_verdict": "The agent response matched all core criteria and was well-supported.",
      "pipeline_status": "PASS",
      "failure_reason": null
    }
  }
}
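
Before a report like this is consumed downstream, it may be worth checking that the metric names match the rubric the prompt asked for, since a judge model is not guaranteed to return exactly the requested keys. Here is a minimal Python sketch, assuming the report has already been parsed into a dictionary. `EXPECTED_METRICS` and `metric_key_drift` are hypothetical names chosen for illustration.

```python
# Metric names requested by the four-metric rubric in the prompt.
EXPECTED_METRICS = {
    "task_fulfillment",
    "conciseness",
    "professional_tone",
    "formatting",
}


def metric_key_drift(report: dict) -> tuple[set[str], set[str]]:
    """Return (missing, unexpected) metric names in a parsed judge report."""
    returned = set(report["evaluation_report"]["metrics"])
    return EXPECTED_METRICS - returned, returned - EXPECTED_METRICS
```

If either set is non-empty, the report was probably produced by a different rubric than the one the pipeline expects and may be better routed to manual review than gated automatically.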
Closing thoughts:
This 'LLM-as-a-Judge' evaluation prompt is designed to automate the QA of our conversational agent. The framework is valuable because it moves us beyond brittle, exact-word-match testing and allows for nuanced, semantic evaluation.
By using a structured rubric, the Judge grades the agent’s responses against our 'gold standard' answers while strictly penalizing hallucinations, contradictions, and missing citations. Crucially, it outputs structured, machine-readable JSON with a strict Pass/Fail threshold, meaning we can plug this directly into our automated testing pipeline to catch inaccurate or unsafe responses at scale before they ever reach the user.
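
To give a sense of what plugging the judge into a test pipeline could look like, here is a minimal Python sketch that runs a judge over a batch of test cases and reports the pass rate. The `judge` parameter is a hypothetical callable that sends the filled-in prompt to the LLM and returns the parsed JSON report. How that call is made (AI Builder, a REST endpoint, or something else) is deliberately left out, and `pass_rate` is an illustrative name.

```python
from typing import Callable, Iterable


def pass_rate(
    cases: Iterable[tuple[str, str]],
    judge: Callable[[str, str], dict],
) -> tuple[float, list[dict]]:
    """Run the judge over (question, response) pairs.

    Returns the fraction of PASS results and the summaries of failing cases.
    """
    total = 0
    failures = []
    for question, response in cases:
        summary = judge(question, response)["evaluation_report"]["summary"]
        total += 1
        if summary["pipeline_status"] != "PASS":
            # Keep the question next to the verdict so failures are traceable.
            failures.append({"question": question, **summary})
    rate = (total - len(failures)) / total if total else 0.0
    return rate, failures
```

A build step could then fail when the rate drops below an agreed threshold, with the collected `failure_reason` values serving as the diagnostic log.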
References and Inspiration:
1 - A Survey on LLM-as-a-Judge
2 - Human-like Summarization Evaluation with ChatGPT
3 - Shepherd: A Critic for Language Model Generation
4 - The Rubric Revolution
PS - In the next article, I’ll show how we integrated this evaluation approach into a Power Automate Cloud Flow, enabling automated, real-time agent assessment with zero manual intervention.