Data quality and rubrics: how to build trust in your models

Timothy Speciale · 2025-07-29 · via Snorkel AI

Your AI model is only as good as your data! But how do you measure “good”?

While the AI industry races toward more sophisticated AI applications like agentic systems, a critical question remains top of mind: How do we systematically evaluate and improve the quality of training and evaluation data that powers these systems?

It’s time for enterprises investing in AI to adopt the state-of-the-art approach used by major AI labs and Snorkel: structured evaluation rubrics.

Welcome to Snorkel AI’s five-part blog series on rubrics.

Part 1 introduces rubric-based evaluation and how both automated and human evaluations benefit from using them.
Part 2 deep dives into different types of rubrics and where it makes sense to apply them. We discuss evaluating both final outcomes and an agent’s response steps (traces).
Part 3 explains the science of rubric design, backed by existing literature and Snorkel’s own projects.
Part 4 is another deep dive, this time into Snorkel’s own use of rubrics in a complex, multi-review process that integrates automated and human-in-the-loop evaluations that produce meaningful data outcomes and improvements to the rubrics over time.
Part 5 looks ahead to emerging trends, advanced techniques, and new AI use cases like agentic multi-turn, multi-step reasoning conversations and tool calls, and multi-modal and coding AI applications.

Evaluation matters, but existing methods fall short of real-world utility for generative applications

Evaluation provides clear metrics to track performance, uncover hidden issues early, and build confidence that your AI behaves as expected before it reaches users. Understanding the limitations of agentic systems is crucial for risk management in a successful production deployment. However, most organizations still rely on outdated evaluation methods.

The shift to generative models and agentic systems requires a similar shift in how we approach evaluation. The challenge is building evaluation frameworks that can assess open-ended, generative responses. Closed-ended benchmarks such as MMLU still matter for multiple-choice tasks, but we need a different lens for the open-ended, generative era.

One outdated method, ad hoc evaluation, relies on gut instincts and simple approaches that miss nuance and fail to capture critical edge cases at scale. Ad hoc checks fall short when assessing open‑ended generative outputs in specialized domains.

The “golden” response approach, which uses predefined, ideal answers as points of comparison, has proven brittle. Predefined responses quickly become outdated, and they may not apply at all when there is no single correct response.

We need to shift our mental models of evaluation to include rubrics

The evaluation problem can’t be solved with better data alone.

To build evaluation frameworks that address the unique demands of the generative era, we need to fundamentally transform how we think about quality measurement itself.

Where gut instinct falls short and endless lists of “golden responses” fail to apply to the full variety of agentic interactions, rubrics are up to the challenge.

Ready to begin? Let’s dive in.

Part I: Introduction to rubrics for AI evaluation

The era of vibe-checking AI models (the “it looks good to me” approach) is over. As AI systems tackle increasingly complex, open-ended tasks such as generating code and conducting deep research, the industry is shifting from intuitive, ad hoc assessment methods to systematic, science-backed evaluation frameworks. What started as a necessity for major AI labs is fast becoming an industry standard: structured rubric-based evaluation.

Rubric-based assessments are:

Reliable
Nuanced
Consistent between automated and human annotators
Granular enough to supply the feedback essential for continuous model improvement

Before we explore the evidence for how effective rubrics are for AI system evaluation, we’ll anchor on a clear definition of a rubric.

What is a rubric?

A rubric is a structured guide that spells out what “good” looks like for each response from an AI system.

A rubric consists of:

A list of criteria: Does the code compile? Does it have comments?
How the model performed on each criterion: “Compiles” could be yes/no. It could also be more nuanced: yes/yes with warnings/no.
Scoring rules that turn performance into numbers: Yes = 0. Yes with warnings = 1. No = 2.

The final rubric score is a set of criteria, each associated with numbers or values. A rubric is a mechanism for embedding domain expertise in a checklist. For example, for a code-generating LLM, you’d want someone familiar with coding to decide which criteria to include and the relevant levels of performance for each. We’ll talk more about rubric design in Part 3 of this series.
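To make the three parts concrete, here is a minimal Python sketch of a rubric as a data structure. The `Criterion` class, the `score_response` helper, and the scoring of the “comments” criterion are illustrative assumptions rather than anything prescribed here; only the “compiles” levels and their numbers come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line: a question plus the allowed performance levels."""
    name: str
    question: str
    levels: dict  # maps each allowed performance level to its numeric score

    def score(self, level):
        # Reject levels the rubric does not define instead of guessing.
        if level not in self.levels:
            raise ValueError(f"{level!r} is not a valid level for {self.name!r}")
        return self.levels[level]


# Hypothetical rubric for a code-generating model.
CODE_RUBRIC = [
    Criterion("compiles", "Does the code compile?",
              {"yes": 0, "yes with warnings": 1, "no": 2}),
    # Assumed scoring for illustration: the text does not assign numbers here.
    Criterion("comments", "Does it have comments?", {"yes": 0, "no": 1}),
]


def score_response(rubric, observed):
    """Turn the observed level for each criterion into a per-criterion score."""
    return {c.name: c.score(observed[c.name]) for c in rubric}


result = score_response(
    CODE_RUBRIC, {"compiles": "yes with warnings", "comments": "yes"}
)
print(result)  # {'compiles': 1, 'comments': 0}
```

The output stays a set of per-criterion values rather than a single number, which matches the description of the final rubric score above.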

Both humans and LLMs can use evaluation rubrics. Researchers and program managers can provide rubrics to human annotators, giving all annotators a shared understanding of the rating system for their dataset. The rubric helps to reduce bias and increase alignment between annotators. In automated judging (called LLM-as-a-judge evaluation), the rubric is included in the AI judge’s prompt.

For both human and automated evaluators, the rubric converts fuzzy expectations into repeatable scores that feed both data quality loops and live evaluation dashboards.
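As a rough sketch of the automated case, the snippet below renders a rubric into a judge prompt and validates the judge's reply. The prompt wording, the JSON reply format, and the function names `build_judge_prompt` and `parse_judge_reply` are all assumptions made for illustration. No real model is called; `fake_reply` stands in for whatever your judge returns.

```python
import json

# Criteria reused from the rubric definition above; structure is an assumption.
RUBRIC = [
    {"name": "compiles", "question": "Does the code compile?",
     "allowed": ["yes", "yes with warnings", "no"]},
    {"name": "comments", "question": "Does it have comments?",
     "allowed": ["yes", "no"]},
]


def build_judge_prompt(rubric, task, response):
    """Embed every rubric criterion and its allowed answers in the prompt."""
    lines = [
        "You are grading a model response against a rubric.",
        f"Task: {task}",
        f"Response: {response}",
        "For each criterion, answer with exactly one of the allowed values.",
    ]
    for item in rubric:
        allowed = " / ".join(item["allowed"])
        lines.append(f"- {item['name']}: {item['question']} Allowed: {allowed}")
    lines.append("Reply with a JSON object mapping criterion names to answers.")
    return "\n".join(lines)


def parse_judge_reply(rubric, reply_text):
    """Parse the judge's JSON reply and reject answers outside the rubric."""
    answers = json.loads(reply_text)
    for item in rubric:
        value = answers.get(item["name"])
        if value not in item["allowed"]:
            raise ValueError(f"invalid answer for {item['name']}: {value!r}")
    return {item["name"]: answers[item["name"]] for item in rubric}


prompt = build_judge_prompt(
    RUBRIC,
    "Write a function that reverses a string.",
    "def rev(s): return s[::-1]",
)
fake_reply = '{"compiles": "yes", "comments": "no"}'  # stand-in for a model call
parsed = parse_judge_reply(RUBRIC, fake_reply)
print(parsed)  # {'compiles': 'yes', 'comments': 'no'}
```

Validating the reply against the allowed values keeps a judge that drifts off-format from silently polluting the scores.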

Here’s an example of what a rubric for a weekend planning agent could look like:

Prompt
Plan a two-day weekend trip for a solo traveler who loves surfing, prefers three-star hotels, and has a budget of up to $500 total. Include suggested itinerary items, lodging details, and approximate costs.

Criterion	Response Type	Description	Score
Budget Compliance	Yes or No	Is the total estimated cost within $500?	Yes / No
Hotel Preferences	Yes or No	Did the plan include three-star hotel options?	Yes / No
Surfing Focus	1 to 5	How well does the itinerary incorporate surfing activities?	1 (low) to 5 (high)
Clarity of Itinerary	1 to 5	Is the schedule clear, with times and locations specified?	1 (low) to 5 (high)
Variety of Activities	1 to 5	Does the plan include a balanced mix of leisure and adventure?	1 (low) to 5 (high)
Cost Breakdown Detail	1 to 5	Are lodging, food, transportation and activity costs itemized?	1 (low) to 5 (high)

Here’s an example of applying the weekend planning rubric to a sample generated response:

Prompt
Plan a two-day weekend trip for a solo traveler who loves surfing, prefers three-star hotels, and has a budget of up to $500 total. Include suggested itinerary items, lodging details, and approximate costs.

Model Response
Day 1:
- Surf with Pros Lesson - $60
- Lunch at Taco Shack - $20
- Check-in at Seaview Inn (three-star) - $120/night

Day 2:
- Sunrise surf at Lighthouse Cove - free
- Coastal hike tour - $40
- Picnic lunch - $25

Total Estimated Cost: $265

Criterion	Description	Score	Rationale
Budget Compliance	Is the total estimated cost within $500?	Yes	Total of $265 leaves room under $500.
Hotel Preferences	Did the plan include three-star hotel options?	Yes	Includes Seaview Inn, a confirmed three-star property.
Surfing Focus	How well does the itinerary incorporate surfing activities?	5	Multiple surf sessions and a lesson cover core preference.
Clarity of Itinerary	Is the schedule clear, with times and locations specified?	4	Locations are listed, though exact times could be added.
Variety of Activities	Does the plan include a balanced mix of leisure and adventure?	3	Surfing and hike mix leisure and adventure, but more sightseeing could help.
Cost Breakdown Detail	Are lodging, food, transportation and activity costs itemized?	4	Most costs are listed, though transport fees are assumed free.
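The scored table above maps naturally onto a small validation step. The sketch below checks that every criterion received a value of the right response type and then produces a simple summary. The summary itself (treating the Yes/No criteria as gates and averaging the 1-to-5 criteria) is an assumption chosen for illustration; this article does not prescribe an aggregation rule.

```python
# Response types from the rubric table: Yes/No or a 1-to-5 scale.
RUBRIC_TYPES = {
    "Budget Compliance": "yes_no",
    "Hotel Preferences": "yes_no",
    "Surfing Focus": "scale_1_5",
    "Clarity of Itinerary": "scale_1_5",
    "Variety of Activities": "scale_1_5",
    "Cost Breakdown Detail": "scale_1_5",
}

# Scores from the worked example above.
SCORES = {
    "Budget Compliance": "Yes",
    "Hotel Preferences": "Yes",
    "Surfing Focus": 5,
    "Clarity of Itinerary": 4,
    "Variety of Activities": 3,
    "Cost Breakdown Detail": 4,
}


def validate(scores, rubric_types):
    """Raise if a criterion is missing or its value has the wrong type."""
    missing = set(rubric_types) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name, kind in rubric_types.items():
        value = scores[name]
        if kind == "yes_no":
            ok = value in ("Yes", "No")
        else:
            ok = isinstance(value, int) and 1 <= value <= 5
        if not ok:
            raise ValueError(f"invalid value for {name!r}: {value!r}")


def summarize(scores, rubric_types):
    """Assumed aggregation: Yes/No criteria act as gates, scales are averaged."""
    validate(scores, rubric_types)
    gates = [scores[n] == "Yes" for n, k in rubric_types.items() if k == "yes_no"]
    scales = [scores[n] for n, k in rubric_types.items() if k == "scale_1_5"]
    return {
        "all_gates_passed": all(gates),
        "mean_scale_score": sum(scales) / len(scales),
    }


summary = summarize(SCORES, RUBRIC_TYPES)
print(summary)  # {'all_gates_passed': True, 'mean_scale_score': 4.0}
```

In practice you would usually keep the per-criterion values alongside any summary, since the granular scores are what feed model improvement.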

The evolution from simpler evaluation methods

For decades, AI evaluation relied heavily on intuitive assessment methods. These methods included:

Subject matter experts reviewing outputs against their unique intuition of the ideal output
Accuracy metrics assessed against static benchmarks
Comparisons against “golden datasets” for a finite set of examples

These approaches worked reasonably well for classification tasks with clear right and wrong answers, but they fundamentally break down when applied to modern generative AI (GenAI) systems.

Consider the case of evaluating an AI-generated research summary, or a code solution with multiple valid approaches, or a conversational response where creativity and nuance matter as much as factual accuracy. Traditional metrics like BLEU scores and exact match comparisons fail to capture the multidimensional nature of quality in these open-ended scenarios, often missing critical aspects like coherence, helpfulness, or domain-specific requirements.

Leading AI organizations have moved beyond ad hoc evaluation toward systematic, multidimensional rubric-based frameworks. This shift was driven by necessity. As AI systems became more capable and deployed in more complex applications, the limitations of simple evaluation methods became glaring obstacles. These simple methods lacked either reliability or the ability to generate metrics quickly enough to create a useful iteration loop.

The ability to provide structured, detailed feedback across multiple criteria for each output of a GenAI system has become essential not just for model development, but for building the trust and reliability required for real-world deployment.

The evaluation rubric is doubly useful: it secures labeling consistency among human annotators during the annotation phase, and it later serves as the blueprint for automated grading once the model is trained, closing the loop between data creation and evaluation.

Literature review: how the pros think about evaluation rubrics

The recent literature converges on a simple insight: treat the evaluation plan as the primary objective of the data and let the training pipeline enforce it. Evaluation, and evaluation rubrics in particular, are essential at two foundational stages of the model development loop:

Curating high-quality training datasets; that is, making sure that what goes into making AI systems is up to spec.
Automating scoring of model responses; that is, making sure that what comes out of AI systems is also up to (the same) spec.

Here are examples of rubrics in use during the creation of high-quality training data:

Google’s ML Test Score [1] turns twenty-eight discrete checks into an executable scorecard that teams run on every pull request, a practice that generated double-digit accuracy gains and exposed silent data drift long before it reached production.
Microsoft’s RUBICON [2] extends that philosophy to conversational agents, using a large language model to propose many candidate criteria, then pruning them until the final rubric separates strong and weak dialogues with 18% better precision than baseline heuristics.
Databricks [3] echoes the same theme for retrieval-augmented generation, showing that an explicit rubric prompt fed to GPT-4 can grade thousands of question-answer pairs per hour while keeping expert agreement above 80%.

Speed alone is not enough; automated evaluators must also earn trust.

The Alternative Annotator Test [4] offers a statistical framework for deciding when an LLM can replace human raters, requiring only a small audit set to quantify risk.
Prometheus 2 [5] pushes that frontier further, matching human and GPT-4 preferences across both direct scoring and pairwise ranking, and letting practitioners swap new criteria in or out without a full reward model retrain.

Together, these studies demonstrate that scalable model iteration pipelines are possible. Efficient and effective enterprise AI development depends on these three qualities:

Scalability: Human effort is reserved for edge cases.
Reliability: Rubric questions are concrete enough for prompt-based scoring.
Alignment: Statistical safeguards confirm that automated judges remain aligned with expert opinion.

In practice, automated evaluation follows a layered approach:

An initial LLM-as-a-judge evaluation screens for obvious failures such as toxicity or broken formatting.
A rubric-based second pass scores surviving responses on factual accuracy, coherence, and domain alignment.
Periodically, humans perform spot checks to recalibrate the first two automated layers.

This structure is scalable because most items flow through the first two layers without human touch, reliable because the rubric prompt grounds the judge’s decisions in transparent criteria, and flexible because teams can edit or add rubrics as product goals evolve, all without redefining an immutable ground truth that never truly existed for many open-ended tasks.
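The layered flow can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: `toy_screen` and `toy_rubric_judge` replace real LLM-as-a-judge calls, the placeholder scores are not real judgments, and the random sampling for human review is one assumed way to schedule spot checks.

```python
import random


def run_layers(responses, screen, rubric_judge, spot_check_rate, seed=0):
    """Route each response through the screen, the rubric pass, and sampling."""
    rng = random.Random(seed)  # seeded so the spot-check sample is repeatable
    results = []
    for response in responses:
        record = {
            "response": response,
            "passed_screen": screen(response),  # layer 1: obvious failures
            "scores": None,
            "needs_human_review": False,
        }
        if record["passed_screen"]:
            # Layer 2: only surviving responses get rubric scores.
            record["scores"] = rubric_judge(response)
        # Layer 3: sample a fraction of all items for human spot checks.
        record["needs_human_review"] = rng.random() < spot_check_rate
        results.append(record)
    return results


def toy_screen(response):
    """Stand-in for a first-pass judge: reject empty responses only."""
    return bool(response.strip())


def toy_rubric_judge(response):
    """Stand-in for a rubric-prompted judge; returns placeholder scores."""
    return {"factual_accuracy": 3, "coherence": 3, "domain_alignment": 3}


results = run_layers(
    ["A plausible answer.", "   "],
    toy_screen,
    toy_rubric_judge,
    spot_check_rate=0.1,
)
for record in results:
    print(record["passed_screen"], record["scores"])
```

Passing the screen and the judge in as functions keeps the pipeline itself unchanged when teams edit or add rubrics.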

To understand why rubrics work to align human annotators with LLM judges, we turn to an analysis of the cognition that takes place when humans use a rubric during manual annotation.

Psychology of human manual annotation

Rubric-guided evaluation reshapes how human annotators think. A clear rubric externalizes the criteria that would otherwise sit in working memory, letting raters focus on one dimension at a time and cutting the mental juggling that drives inconsistency.

Education studies report [6] sharper agreement when analytic rubrics replace holistic judgments, with human graders themselves crediting lower cognitive load for the improvement.
Recent benchmarking on explanation quality echoes that finding in AI datasets: the CUBE paper [7] showed that once annotators followed a structured rubric, agreement with expert adjudicators increased even on subjective tasks such as reasoning clarity and stance.
A 2024 experiment [8] went further, pairing crowd workers with GPT-4 labels. It found that the complementary strengths of humans and the model surfaced only after the workers aligned on the same rubric, pushing accuracy past either source alone.
Consistency needs proof, which is where inter-rater reliability metrics enter. Measures such as Cohen’s Kappa, Fleiss’ Kappa, and the intraclass correlation coefficient quantify how often raters converge beyond chance expectations [9].
Scale design matters. Most rubrics use five- or seven‑point Likert scales, but without clear anchors, these can introduce central tendency and interpretation bias. In fact, Best-Worst Scaling [10] has been shown to yield significantly higher inter‑rater reliability than traditional Likert ratings, and continuous (slider‑style) scales can further improve consistency over discrete options in dialogue evaluation tasks [11].
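For a sense of what “beyond chance expectations” means, here is a small from-scratch implementation of Cohen’s Kappa for two raters. The labels are made up for illustration and the function name is our own.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Illustrative labels from two annotators on eight items.
annotator_1 = ["Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]
annotator_2 = ["Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)  # 0.75
```

Here the raters agree on 7 of 8 items (0.875), but half of that agreement would be expected by chance given how often each rater says “Yes,” so kappa comes out at 0.75 rather than 0.875.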

Rubrics do more than raise inter-annotator agreement; they also act as guardrails against bias and annotator fatigue.

The authors of [12] designed assessment rubrics for human evaluation of generated long-form answers to medical questions; the rubrics incorporated multiple dimensions of bias with the potential to contribute to equity-related harm, exposing failure modes missed by generic toxicity screens in LLMs.
On the more practical side, a Pareto.ai article [13] reports that well-structured instructions combined with task rotation slow the onset of annotation fatigue and maintain quality over many hours of work.

The same rubric that guides automated judges must first persuade human annotators to grade consistently. This guarantees that manual annotation becomes a controlled experiment rather than a leap of faith.

Conclusion

To conclude, the evidence argues for a layered, rubric-driven evaluation process:

Define rubrics that surface failure modes in both training data and model responses.
Train annotators until inter-rater scores stabilize.
Rely on automated judges for faster grading of responses during model iteration.
Run periodic expert spot checks to catch drift in both annotators and judges.
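One way to make “until inter-rater scores stabilize” operational is a simple stopping rule over per-round agreement scores. The sketch below is an assumption on our part: the window size, the tolerance, and the example scores are all illustrative values, not recommendations.

```python
def has_stabilized(scores, window=3, tolerance=0.02):
    """True when the last `window` scores span at most `tolerance`."""
    if len(scores) < window:
        return False
    recent = scores[-window:]
    return max(recent) - min(recent) <= tolerance


# Made-up inter-rater agreement scores, one per annotator training round.
round_scores = [0.41, 0.58, 0.69, 0.74, 0.75, 0.75]

stable_round = None
for round_number in range(1, len(round_scores) + 1):
    if has_stabilized(round_scores[:round_number]):
        stable_round = round_number
        break

print(stable_round)  # 6
```

The same check can be reused for the last step: run it on agreement between the automated judge and expert spot checks, and treat a widening spread as a sign of drift.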

Following these steps delivers an evaluation methodology that provides nuanced, multi-faceted metrics that apply to real-world scenarios, and that scales reliably to the volume and pace needed for fast AI system development.

Check out Part 2 of our rubric series, where we unpack different types of rubrics and discuss when to apply them on various types of datasets.

At Snorkel AI, we don’t just write about high-quality evaluation; we deliver it. Through our Expert Data-as-a-Service, we partner with frontier model developers to curate world-class datasets for training and evaluating LLMs. Think: high-signal, expert-labeled data tailored to your most ambitious use cases, such as complex tool-using agents and reasoning-heavy workflows.

Interested in bringing rubric-based rigor to your models? Let’s talk.

Snorkel Expert Data-as-a-Service

Accelerate the evaluation and development of frontier AI models with a scalable, white-glove service that provides model development teams with high quality, expert data.

References

[1] Breck, Eric, et al. “The ML test score: A rubric for ML production readiness and technical debt reduction.” 2017 IEEE international conference on big data (big data). IEEE, 2017.
[2] Biyani, Param, et al. “Rubicon: Rubric-based evaluation of domain-specific human ai conversations.” Proceedings of the 1st ACM International Conference on AI-Powered Software. 2024.
[3] Leng, Quinn. “Best Practices for LLM Evaluation of RAG Applications.” Databricks, 2023, www.databricks.com/blog/LLM-auto-eval-best-practices-RAG.
[4] Calderon, Nitay, Roi Reichart, and Rotem Dror. “The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs.” arXiv preprint arXiv:2501.10970 (2025).
[5] Kim, Seungone, et al. “Prometheus 2: An open source language model specialized in evaluating other language models.” arXiv preprint arXiv:2405.01535 (2024).
[6] “Empirical exploration into academic grading and feedback approaches.” Edexia, 4 Apr. 2025, www.edexia.ai/news/empirical-explration-into-academic-grading-and-feedback-approaches.
[7] Galvan-Sosa, Diana, et al. “Rubrik’s Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset.” arXiv preprint arXiv:2503.23899 (2025).
[8] He, Zeyu, et al. “If in a crowdsourced data annotation pipeline, a gpt-4.” Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 2024.
[9] “Inter-rater Reliability: Definition, Examples, Calculation.” Encord, encord.com/blog/inter-rater-reliability/.
[10] Kiritchenko, Svetlana, and Saif M. Mohammad. “Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation.” arXiv preprint arXiv:1712.01765 (2017).
[11] Santhanam, Sashank, and Samira Shaikh. “Towards best experiment design for evaluating dialogue system output.” arXiv preprint arXiv:1909.10122 (2019).
[12] Pfohl, Stephen R., et al. “A toolbox for surfacing health equity harms and biases in large language models.” Nature Medicine 30.12 (2024): 3590-3600.
[13] Parti, Ayush. “Annotation fatigue: Why human data quality declines over time.” Pareto.ai, 6 Feb. 2025, pareto.ai/blog/annotation-fatigue.
