Data quality and rubrics: how to build trust in your models
2025-07-29 · via Snorkel AI

Your AI model is only as good as your data! But how do you measure “good”?

While the AI industry races toward more sophisticated AI applications like agentic systems, a critical question remains top of mind: How do we systematically evaluate and improve the quality of training and evaluation data that powers these systems?

It’s time for enterprises investing in AI to adopt the state-of-the-art approach used by major AI labs and Snorkel: structured evaluation rubrics.

Welcome to Snorkel AI’s five-part blog series on rubrics.

  1. Part 1 introduces rubric-based evaluation and how both automated and human evaluations benefit from using them.
  2. Part 2 deep dives into different types of rubrics and where it makes sense to apply them. We discuss evaluating both final outcomes and an agent’s response steps (traces).
  3. Part 3 explains the science of rubric design, backed by existing literature and Snorkel’s own projects.
  4. Part 4 is another deep dive, this time into Snorkel’s own use of rubrics in a complex, multi-review process that integrates automated and human-in-the-loop evaluations to produce meaningful data outcomes and improve the rubrics over time.
  5. Part 5 looks ahead to emerging trends, advanced techniques, and new AI use cases like agentic multi-turn, multi-step reasoning conversations and tool calls, and multi-modal and coding AI applications.

Evaluation matters, but existing methods fall short of real-world utility for generative applications

Evaluation provides clear metrics to track performance, uncover hidden issues early, and build confidence that your AI behaves as expected before it reaches users. Understanding the limitations of agentic systems is crucial for risk management in a successful production deployment. However, most organizations still rely on outdated evaluation methods.

The shift to generative models and agentic systems requires a similar shift in how we approach evaluation. The challenge is building evaluation frameworks that can assess open-ended, generative responses. Closed-ended benchmarks such as MMLU still matter for multiple-choice tasks, but we need a different lens for the open-ended, generative era.

One outdated method, ad hoc evaluation, relies on gut instincts and simple approaches that miss nuance and fail to capture critical edge cases at scale. Ad hoc checks fall short when assessing open‑ended generative outputs in specialized domains.

The “golden response” approach, which compares outputs against predefined, ideal answers, has proven brittle. The predefined responses quickly become outdated, and they may not apply at all when there is no single correct response.

We need to shift our mental models of evaluation to include rubrics

The evaluation problem can’t be solved with better data alone.

To build evaluation frameworks that address the unique demands of the generative era, we need to fundamentally transform how we think about quality measurement itself.

Where gut instinct falls short and endless lists of “golden responses” fail to apply to the full variety of agentic interactions, rubrics are up to the challenge.

Ready to begin? Let’s dive in.

Part I: Introduction to rubrics for AI evaluation

The era of vibe-checking AI models (the “it looks good to me” approach) is over. As AI systems tackle increasingly complex, open-ended tasks such as generating code and conducting deep research, the industry is shifting from intuitive, ad hoc assessment methods to systematic, science-backed evaluation frameworks. What started as a necessity for major AI labs is fast becoming the industry standard: structured rubric-based evaluation.

Rubric-based assessments are:

  • Reliable
  • Nuanced
  • Consistent between automated and human annotators
  • Granular enough to supply the feedback essential for continuous model improvement

Before we explore the evidence for how effective rubrics are for AI system evaluation, we’ll anchor on a clear definition of a rubric.

What is a rubric?

A rubric is a structured guide that spells out what “good” looks like for each response from an AI system. 

A rubric consists of:

  • A list of criteria: Does the code compile? Does it have comments?
  • How the model performed on each criterion: “Compiles” could be yes/no. It could also be more nuanced: yes/yes with warnings/no.
  • Scoring rules that turn performance into numbers: Yes (clean) = 0. Yes with warnings = 1. No = 2.

The final rubric score is a set of criteria, each associated with numbers or values. A rubric is a mechanism for embedding domain expertise in a checklist. For example, for a code-generating LLM, you’d want someone familiar with coding to decide which criteria to include and the relevant levels of performance for each. We’ll talk more about rubric design in Part 3 of this series.
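To make the three parts concrete, here is a minimal Python sketch of a rubric as a data structure. The `Criterion` class, the `score_response` helper, and the scores assigned to the comments criterion are illustrative assumptions; only the compile levels and their 0/1/2 scores come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric criterion: a question plus its allowed performance levels."""
    name: str
    question: str
    levels: dict[str, int]  # performance level -> numeric score

    def score(self, level: str) -> int:
        """Apply the scoring rule, rejecting levels the rubric does not define."""
        if level not in self.levels:
            raise ValueError(f"{level!r} is not a valid level for {self.name!r}")
        return self.levels[level]


CODE_RUBRIC = [
    # Levels and scores taken from the text above.
    Criterion("compiles", "Does the code compile?",
              {"yes": 0, "yes with warnings": 1, "no": 2}),
    # Hypothetical scores, chosen only for illustration.
    Criterion("has_comments", "Does it have comments?", {"yes": 0, "no": 1}),
]


def score_response(rubric: list[Criterion], observed: dict[str, str]) -> dict[str, int]:
    """Turn the observed level for each criterion into a per-criterion score."""
    return {c.name: c.score(observed[c.name]) for c in rubric}


scores = score_response(
    CODE_RUBRIC, {"compiles": "yes with warnings", "has_comments": "yes"}
)
print(scores)  # {'compiles': 1, 'has_comments': 0}
```

Keeping the result as one score per criterion, rather than collapsing it into a single number, mirrors the idea that the final rubric score is a set of criteria with associated values.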

Both humans and LLMs can use evaluation rubrics. Researchers and program managers can provide rubrics to human annotators, giving all annotators a shared understanding of the rating system for their dataset. The rubric helps to reduce bias and increase alignment between annotators. In automated judging (called LLM-as-a-judge evaluation), the rubric is included in the AI judge’s prompt.

For both human and automated evaluators, the rubric converts fuzzy expectations into repeatable scores that feed both data quality loops and live evaluation dashboards. 

Here’s an example of what a rubric for a weekend planning agent could look like:

Prompt
Plan a two-day weekend trip for a solo traveler who loves surfing, prefers three-star hotels, and has a budget of up to $500 total. Include suggested itinerary items, lodging details, and approximate costs.
| Criterion | Response Type | Description | Score |
| --- | --- | --- | --- |
| Budget Compliance | Yes or No | Is the total estimated cost within $500? | Yes / No |
| Hotel Preference | Yes or No | Did the plan include three-star hotel options? | Yes / No |
| Surfing Focus | 1 to 5 | How well does the itinerary incorporate surfing activities? | 1 (low) to 5 (high) |
| Clarity of Itinerary | 1 to 5 | Is the schedule clear, with times and locations specified? | 1 (low) to 5 (high) |
| Variety of Activities | 1 to 5 | Does the plan include a balanced mix of leisure and adventure? | 1 (low) to 5 (high) |
| Cost Breakdown Detail | 1 to 5 | Are lodging, food, transportation and activity costs itemized? | 1 (low) to 5 (high) |
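For LLM-as-a-judge evaluation, a rubric like this one has to be rendered into the judge's prompt. The sketch below shows one possible way to do that for three of the rows above, and to validate the judge's reply. The function names, the prompt wording, and the JSON reply format are assumptions made for illustration, and no model is actually called.

```python
import json

# Three rows copied from the rubric above: (criterion, response type, description).
RUBRIC = [
    ("Budget Compliance", "Yes or No", "Is the total estimated cost within $500?"),
    ("Surfing Focus", "1 to 5", "How well does the itinerary incorporate surfing activities?"),
    ("Clarity of Itinerary", "1 to 5", "Is the schedule clear, with times and locations specified?"),
]


def allowed_values(response_type: str) -> list:
    """Translate a rubric response type into the values a judge may return."""
    if response_type == "Yes or No":
        return ["Yes", "No"]
    if response_type == "1 to 5":
        return [1, 2, 3, 4, 5]
    raise ValueError(f"unknown response type: {response_type!r}")


def build_judge_prompt(task_prompt: str, model_response: str) -> str:
    """Render the task, the response, and the rubric into a single judge prompt."""
    lines = [
        "You are grading a model response against a rubric.",
        "",
        "Task prompt:",
        task_prompt,
        "",
        "Model response:",
        model_response,
        "",
        "Rubric:",
    ]
    for name, response_type, description in RUBRIC:
        lines.append(f"- {name} ({response_type}): {description}")
    lines.append("")
    lines.append("Reply with a JSON object mapping each criterion name to its score.")
    return "\n".join(lines)


def parse_judge_reply(reply: str) -> dict:
    """Check that the judge scored every criterion with an allowed value."""
    scores = json.loads(reply)
    for name, response_type, _ in RUBRIC:
        if name not in scores:
            raise ValueError(f"missing criterion: {name}")
        value = scores[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or value not in allowed_values(response_type):
            raise ValueError(f"invalid score for {name}: {value!r}")
    return scores


print(build_judge_prompt("Plan a two-day weekend trip...", "Day 1: ..."))
```

Validating the reply against the rubric's own response types is a design choice: it keeps a malformed judge output from silently entering a dashboard or a data quality loop.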

Here’s an example of applying the weekend planning rubric to a sample generated response:

Prompt
Plan a two-day weekend trip for a solo traveler who loves surfing, prefers three-star hotels, and has a budget of up to $500 total. Include suggested itinerary items, lodging details, and approximate costs.
Model Response
Day 1:
- Surf with Pros Lesson - $60
- Lunch at Taco Shack - $20
- Check-in at Seaview Inn (three-star) - $120/night

Day 2:
- Sunrise surf at Lighthouse Cove - free
- Coastal hike tour - $40
- Picnic lunch - $25

Total Estimated Cost: $295

| Criterion | Description | Score | Rationale |
| --- | --- | --- | --- |
| Budget Compliance | Is the total estimated cost within $500? | Yes | Total of $295 leaves room under $500. |
| Hotel Preference | Did the plan include three-star hotel options? | Yes | Includes Seaview Inn, a confirmed three-star property. |
| Surfing Focus | How well does the itinerary incorporate surfing activities? | 5 | Multiple surf sessions and a lesson cover core preference. |
| Clarity of Itinerary | Is the schedule clear, with times and locations specified? | 4 | Locations are linked, though exact times could be added. |
| Variety of Activities | Does the plan include a balanced mix of leisure and adventure? | 3 | Surfing and hike mix leisure and adventure, but more sightseeing could help. |
| Cost Breakdown Detail | Are lodging, food, transportation and activity costs itemized? | 4 | Most costs are listed, though transport fees are assumed free. |
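Once a response has been scored, the per-criterion results can be summarized for a dashboard or a feedback loop. The sketch below takes the scores from the table above and reports whether all yes/no requirements were met, the mean of the 1-to-5 criteria, and the weakest criterion. This particular summary is an assumption for illustration; the rubric itself does not prescribe how scores should be aggregated.

```python
# Scores copied from the evaluation table above.
SCORES = {
    "Budget Compliance": "Yes",
    "Hotel Preference": "Yes",
    "Surfing Focus": 5,
    "Clarity of Itinerary": 4,
    "Variety of Activities": 3,
    "Cost Breakdown Detail": 4,
}


def summarize(scores: dict) -> dict:
    """Summarize a scored rubric without collapsing it to a single number."""
    # Yes/No criteria act as requirements that are either met or not.
    binary = {name: value == "Yes" for name, value in scores.items()
              if value in ("Yes", "No")}
    # 1-to-5 criteria are graded; keep them separate from the requirements.
    graded = {name: value for name, value in scores.items()
              if isinstance(value, int)}
    return {
        "all_requirements_met": all(binary.values()),
        "mean_graded_score": sum(graded.values()) / len(graded),
        "lowest_criterion": min(graded, key=graded.get),
    }


summary = summarize(SCORES)
print(summary)
```

Here the summary points at "Variety of Activities" as the weakest criterion, which is the kind of granular feedback a single overall score would hide.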

The evolution from simpler evaluation methods

For decades, AI evaluation relied heavily on intuitive assessment methods. These methods included:

  • Subject matter experts reviewing outputs against their unique intuition of the ideal output
  • Accuracy metrics assessed against static benchmarks
  • Comparisons against “golden datasets” for a finite set of examples

This approach worked reasonably well for classification tasks with clear right and wrong answers, but it fundamentally breaks down when applied to modern generative AI (GenAI) systems.

Consider the case of evaluating an AI-generated research summary, or a code solution with multiple valid approaches, or a conversational response where creativity and nuance matter as much as factual accuracy. Traditional metrics like BLEU scores and exact match comparisons fail to capture the multidimensional nature of quality in these open-ended scenarios, often missing critical aspects like coherence, helpfulness, or domain-specific requirements.

Leading AI organizations have moved beyond ad hoc evaluation toward systematic, multidimensional rubric-based frameworks. This shift was driven by necessity. As AI systems became more capable and deployed in more complex applications, the limitations of simple evaluation methods became glaring obstacles. These simple methods lacked either reliability or the ability to generate metrics quickly enough to create a useful iteration loop.

The ability to provide structured, detailed feedback across multiple criteria for each output of a GenAI system has become essential not just for model development, but for building the trust and reliability required for real-world deployment.

The evaluation rubric does double duty: it secures labeling consistency among human annotators in the annotation phase, and it later serves as the blueprint for automated grading once the model is trained, closing the loop between data creation and evaluation.

Literature review: how the pros think about evaluation rubrics

The recent literature converges on a simple insight: treat the evaluation plan as the primary objective of the data and let the training pipeline enforce it. Evaluation, and evaluation rubrics in particular, are essential to two foundational blocks of the model development loop:

  • Curating high-quality training datasets; that is, making sure that what goes into making AI systems is up to spec.
  • Automating scoring of model responses; that is, making sure that what comes out of AI systems is also up to (the same) spec.

Here are examples of rubrics in use during the creation of high-quality training data:

  • Google’s ML Test Score [1] turns twenty-eight discrete checks into an executable scorecard that teams run on every pull request, a practice that generated double-digit accuracy gains and exposed silent data drift long before it reached production.
  • Microsoft’s RUBICON [2] extends that philosophy to conversational agents, using a large language model to propose many candidate criteria, then pruning them until the final rubric separates strong and weak dialogues with 18% better precision than baseline heuristics.
  • Databricks [3] echoes the same theme for retrieval-augmented generation, showing that an explicit rubric prompt fed to GPT-4 can grade thousands of question-answer pairs per hour while keeping expert agreement above 80%.

Speed alone is not enough; automated evaluators must also earn trust.

  • The Alternative Annotator Test [4] offers a statistical framework for deciding when an LLM can replace human raters, requiring only a small audit set to quantify risk.
  • Prometheus 2 [5] pushes that frontier further, matching human and GPT-4 preferences across both direct scoring and pairwise ranking, and letting practitioners swap new criteria in or out without a full reward model retrain.

Together, these studies demonstrate that scalable model iteration pipelines are possible. Efficient and effective enterprise AI development depends on these three qualities:

  1. Scalability: Human effort is reserved for edge cases.
  2. Reliability: Rubric questions are concrete enough for prompt-based scoring.
  3. Alignment: Statistical safeguards confirm that automated judges remain aligned with expert opinion.

In practice, automated evaluation follows a layered approach:

  1. An initial LLM-as-a-judge evaluation screens for obvious failures such as toxicity or broken formatting.
  2. A rubric-based second pass scores surviving responses on factual accuracy, coherence, and domain alignment.
  3. Periodically, humans perform spot checks to recalibrate the first two automated layers.

This structure is scalable because most items flow through the first two layers without human touch, reliable because the rubric prompt grounds the judge’s decisions in transparent criteria, and flexible because teams can edit or add rubrics as product goals evolve, all without redefining an immutable ground truth that never truly existed for many open-ended tasks.
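The layered approach above can be sketched as a small pipeline. In this illustration, `layered_evaluate`, `toy_screen`, `toy_rubric_score`, and the `audit_rate` parameter are all hypothetical names; the two toy functions are trivial stand-ins for what would be LLM-as-a-judge calls in a real system.

```python
import random
from typing import Callable, Dict, List


def layered_evaluate(
    responses: List[str],
    passes_screen: Callable[[str], bool],
    rubric_score: Callable[[str], Dict[str, int]],
    audit_rate: float = 0.05,
    seed: int = 0,
) -> List[dict]:
    """Run the screen, then rubric scoring, and sample items for human review."""
    rng = random.Random(seed)  # seeded so the audit sample is reproducible
    results = []
    for text in responses:
        if passes_screen(text):
            # Layer 2: only surviving responses get the rubric-based pass.
            status, scores = "scored", rubric_score(text)
        else:
            # Layer 1 rejected this response as an obvious failure.
            status, scores = "rejected", None
        results.append({
            "response": text,
            "status": status,
            "scores": scores,
            # Layer 3: a random fraction of all items goes to human spot checks.
            "needs_human_audit": rng.random() < audit_rate,
        })
    return results


def toy_screen(text: str) -> bool:
    """Stand-in screen: treat empty output as broken formatting."""
    return bool(text.strip())


def toy_rubric_score(text: str) -> Dict[str, int]:
    """Stand-in rubric judge: a crude proxy for a coherence score."""
    return {"coherence": 5 if text.endswith(".") else 3}


results = layered_evaluate(
    ["A clear answer.", "   ", "An unfinished answer"],
    toy_screen,
    toy_rubric_score,
    audit_rate=0.0,
)
for item in results:
    print(item["status"], item["scores"])
```

Passing the screen and the rubric judge in as functions is what keeps the structure flexible: a team can swap in a new rubric without touching the pipeline.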

To understand why rubrics work to align human annotators with LLM judges, we turn to an analysis of the cognition that takes place when humans use a rubric during manual annotation.

Psychology of human manual annotation

Rubric-guided evaluation reshapes how human annotators think. A clear rubric externalizes the criteria that would otherwise sit in working memory, letting raters focus on one dimension at a time and cutting the mental juggling that drives inconsistency. 

  • Education studies report [6] sharper agreement when analytic rubrics replace holistic judgments, with human graders themselves crediting lower cognitive load for the improvement.
  • Recent benchmarking on explanation quality echoes that finding in AI datasets: the CUBE paper [7] showed that once annotators followed a structured rubric, agreement with expert adjudicators increased even on subjective tasks such as reasoning clarity and stance.
  • A 2024 experiment [8] went further, pairing crowd workers with GPT-4 labels. It found that the complementary strengths of humans and the model surfaced only after the workers aligned on the same rubric, pushing accuracy past either source alone.
  • Consistency needs proof, which is where inter-rater reliability metrics enter. Measures such as Cohen’s Kappa, Fleiss’ Kappa, and the intraclass correlation coefficient quantify how often raters converge beyond chance expectations [9].
  • Scale design matters. Most rubrics use five- or seven‑point Likert scales, but without clear anchors, these can introduce central tendency and interpretation bias. In fact, Best-Worst Scaling [10] has been shown to yield significantly higher inter‑rater reliability than traditional Likert ratings, and continuous (slider‑style) scales can further improve consistency over discrete options in dialogue evaluation tasks [11].
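As a worked example of one of the inter-rater metrics mentioned above, Cohen's Kappa compares the observed agreement between two raters, p_o, with the agreement expected by chance, p_e, as κ = (p_o − p_e) / (1 − p_e). The sketch below computes it from scratch; the two rating lists are made-up data for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # p_o: fraction of items on which the raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: chance agreement, from each rater's own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label everywhere; agreement is total.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical 1-to-5 scores from two annotators on the same eight responses.
annotator_1 = [5, 4, 4, 3, 5, 2, 4, 3]
annotator_2 = [5, 4, 3, 3, 5, 2, 4, 4]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.652
```

In this example the raters agree on 6 of 8 items (p_o = 0.75), but chance alone would produce agreement of about 0.28, so kappa lands at roughly 0.65 rather than 0.75.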

Rubrics do more than raise inter-annotator agreement; they also act as guardrails against bias and annotator fatigue.

  • The authors of [12] designed assessment rubrics for human evaluation of generated long-form answers to medical questions; the rubrics incorporated multiple dimensions of bias with the potential to contribute to equity-related harm, exposing failure modes missed by generic toxicity screens in LLMs.
  • On the more practical side, Parti [13] shows that well-structured instructions combined with task rotation slow the onset of annotation fatigue and maintain quality over many hours of work.

The same rubric that guides automated judges must first persuade human annotators to grade consistently. This makes manual annotation a controlled experiment rather than a leap of faith.

Conclusion

To conclude, the evidence argues for a layered, rubric-driven evaluation process:

  1. Define rubrics that surface failure modes in both training data and model responses.
  2. Train annotators until inter-rater scores stabilize.
  3. Rely on automated judges for faster grading of responses during model iteration.
  4. Run periodic expert spot checks to catch drift in both annotators and judges.
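The spot checks in the last step can be sketched as a simple agreement check between the automated judge and an expert on each audited batch. The function names, the sample batches, and the 0.8 threshold below are assumptions for illustration; an appropriate threshold depends on the task and the metric a team has chosen.

```python
from typing import List, Sequence, Tuple


def agreement_rate(judge: Sequence, expert: Sequence) -> float:
    """Fraction of audited items where the judge and the expert gave the same score."""
    if len(judge) != len(expert) or not judge:
        raise ValueError("judge and expert must score the same non-empty batch")
    return sum(j == e for j, e in zip(judge, expert)) / len(judge)


def drifting_batches(
    batches: List[Tuple[Sequence, Sequence]], min_agreement: float = 0.8
) -> List[int]:
    """Return the indices of spot-check batches that fell below the threshold."""
    return [
        index
        for index, (judge, expert) in enumerate(batches)
        if agreement_rate(judge, expert) < min_agreement
    ]


# Hypothetical spot-check batches: (judge scores, expert scores).
spot_checks = [
    ([5, 4, 3, 4, 5], [5, 4, 3, 4, 4]),  # 4 of 5 agree
    ([5, 5, 3, 4, 5], [4, 4, 3, 4, 4]),  # 2 of 5 agree
]

flagged = drifting_batches(spot_checks)
print(flagged)  # [1]
```

A flagged batch is a signal to revisit the rubric wording, the judge prompt, or annotator training before relying on further automated grades.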

Following these steps yields an evaluation methodology that provides nuanced, multifaceted metrics applicable to real-world scenarios, and that scales reliably to the volume and pace needed for fast AI system development.

Check out Part 2 of our rubric series, where we unpack different types of rubrics and discuss when to apply them on various types of datasets.

At Snorkel AI, we don’t just write about high-quality evaluation, we deliver it. Through our Expert Data-as-a-Service, we partner with frontier model developers to curate world-class datasets for training and evaluating LLMs. Think: high-signal, expert-labeled data tailored to your most ambitious use cases, such as complex tool-using agents and reasoning-heavy workflows.

Interested in bringing rubric-based rigor to your models? Let’s talk.

Snorkel Expert Data-as-a-Service

Accelerate the evaluation and development of frontier AI models with a scalable, white-glove service that provides model development teams with high quality, expert data.

References

  1. Breck, Eric, et al. “The ML test score: A rubric for ML production readiness and technical debt reduction.” 2017 IEEE international conference on big data (big data). IEEE, 2017.
  2. Biyani, Param, et al. “Rubicon: Rubric-based evaluation of domain-specific human ai conversations.” Proceedings of the 1st ACM International Conference on AI-Powered Software. 2024.
  3. Leng, Quinn. “Best Practices for LLM Evaluation of RAG Applications.” Databricks, Databricks, 2023, www.databricks.com/blog/LLM-auto-eval-best-practices-RAG.
  4. Calderon, Nitay, Roi Reichart, and Rotem Dror. “The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs.” arXiv preprint arXiv:2501.10970 (2025).
  5. Kim, Seungone, et al. “Prometheus 2: An open source language model specialized in evaluating other language models.” arXiv preprint arXiv:2405.01535 (2024).
  6. “Empirical exploration into academic grading and feedback approaches.” Edexia, 4 Apr. 2025, www.edexia.ai/news/empirical-explration-into-academic-grading-and-feedback-approaches
  7. Galvan-Sosa, Diana, et al. “Rubrik’s Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset.” arXiv preprint arXiv:2503.23899 (2025).
  8. He, Zeyu, et al. “If in a crowdsourced data annotation pipeline, a gpt-4.” Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 2024.
  9. “Inter-rater Reliability: Definition, Examples, Calculation.” Encord, encord.com/blog/inter-rater-reliability/.
  10. Kiritchenko, Svetlana, and Saif M. Mohammad. “Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation.” arXiv preprint arXiv:1712.01765 (2017).
  11. Santhanam, Sashank, and Samira Shaikh. “Towards best experiment design for evaluating dialogue system output.” arXiv preprint arXiv:1909.10122 (2019).
  12. Pfohl, Stephen R., et al. “A toolbox for surfacing health equity harms and biases in large language models.” Nature Medicine 30.12 (2024): 3590-3600.
  13. Parti, Ayush. “Annotation fatigue: Why human data quality declines over time.” Pareto.ai, 6 Feb. 2025, pareto.ai/blog/annotation-fatigue