How LLM evaluation drives better models in Snorkel Flow
2024-12-17 · via Snorkel AI

Evaluating large language model (LLM) systems can be a labyrinthine process. At Snorkel AI, we’ve refined a methodical workflow that can help streamline this task. Our process helps organizations not only spot where their customized LLM falls short but also boost model performance to reach organizational goals 10-100 times faster than standard manual approaches.

Below, I’ll walk you through our structured approach using Snorkel Flow.

Introduction to the workflow

Our workflow consists of four critical steps:

  1. Onboarding
  2. Running a benchmark
  3. Refining artifacts
  4. Developing the LLM system

Each phase serves a unique purpose in establishing a robust evaluation process for large language model systems.

Step 1: Onboarding

The onboarding step is all about foundation-laying. Here, you’ll define four essential artifacts that will guide the rest of your evaluation journey:

  1. Criteria: Define the axes you care about—completeness, robustness, and other custom needs like not returning PII in responses. These criteria are pivotal for assessing model performance.
  2. Evaluators: These functions will assess each data point against the set criteria. Ideally, you’ll establish an evaluator for every criterion.
  3. Reference prompts: These are the prompts that you will feed your LLM each cycle. While they remain stable, the responses they elicit will evolve as your model improves.
  4. Data slices: Turn business objectives into data subsets, or data slices. This can include language preferences or question topics that departments are particularly interested in.

Using smart defaults available in Snorkel Flow accelerates setup, but the platform empowers users to fully customize their approach for specific use cases.
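
As a rough illustration of how the four artifacts fit together, here is a minimal Python sketch. It is not Snorkel Flow's API: the class names, the toy evaluator logic, and the example prompts are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    description: str


@dataclass(frozen=True)
class Evaluator:
    criterion: str                       # name of the criterion this evaluator scores
    passes: Callable[[str, str], bool]   # (prompt, response) -> True if the response passes


@dataclass(frozen=True)
class DataSlice:
    name: str
    contains: Callable[[str], bool]      # prompt -> True if the prompt belongs to the slice


# Toy evaluator logic, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def is_complete(prompt: str, response: str) -> bool:
    return len(response.split()) >= 5    # crude proxy: at least five words


def has_no_pii(prompt: str, response: str) -> bool:
    return EMAIL.search(response) is None  # only checks for email addresses


criteria = [
    Criterion("completeness", "The response addresses the whole question."),
    Criterion("no_pii", "The response does not return PII."),
]
evaluators = [
    Evaluator("completeness", is_complete),
    Evaluator("no_pii", has_no_pii),
]
reference_prompts = ["How do I reset my password?", "¿Cómo cambio mi dirección?"]
data_slices = [
    DataSlice("spanish", lambda p: p.startswith("¿")),    # hypothetical language heuristic
    DataSlice("account", lambda p: "password" in p.lower()),
]

# Ideally every criterion has an evaluator; this set should be empty.
criteria_without_evaluator = {c.name for c in criteria} - {e.criterion for e in evaluators}
print(criteria_without_evaluator)
```

Keeping evaluators and slices as plain functions makes them easy to swap out as the criteria change during later refinement.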

[Figure: The Snorkel Flow LLM evaluation workflow]

Step 2: Running the benchmark

Once users complete onboarding, they run an initial benchmark. This process generates a detailed evaluation report.

In Snorkel Flow, you’ll visualize data slices as rows and evaluators as columns. This matrix setup offers a comprehensive view of initial model performance.
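
To make the slice-by-evaluator matrix concrete, the sketch below computes a pass rate for every (slice, evaluator) cell from a handful of benchmark records. The record layout and the slice and evaluator names are hypothetical, and the report Snorkel Flow produces may be structured differently.

```python
from collections import defaultdict

# Hypothetical benchmark output: each record lists the slices its prompt
# belongs to and the pass/fail verdict from each evaluator.
records = [
    {"slices": ["billing"], "results": {"completeness": True, "no_pii": True}},
    {"slices": ["billing"], "results": {"completeness": False, "no_pii": True}},
    {"slices": ["spanish", "billing"], "results": {"completeness": False, "no_pii": False}},
    {"slices": ["spanish"], "results": {"completeness": True, "no_pii": True}},
]


def build_matrix(records):
    """Return {(slice, evaluator): pass rate} over all records."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        for slice_name in record["slices"]:
            for evaluator, ok in record["results"].items():
                total[(slice_name, evaluator)] += 1
                passed[(slice_name, evaluator)] += int(ok)
    return {cell: passed[cell] / total[cell] for cell in total}


matrix = build_matrix(records)

# Print slices as rows and evaluators as columns.
slice_names = sorted({s for s, _ in matrix})
evaluator_names = sorted({e for _, e in matrix})
print("slice".ljust(10) + "".join(name.ljust(14) for name in evaluator_names))
for s in slice_names:
    print(s.ljust(10) + "".join(f"{matrix[(s, e)]:.2f}".ljust(14) for e in evaluator_names))
```

A prompt can belong to more than one slice, so the same record may contribute to several rows.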

Step 3: Refining artifacts

Establishing the initial benchmark is only the beginning. In step 3, users refine their artifacts to ensure they reflect real-world expectations and use cases.

In this phase, users may do some or all of the following:

  • Adjust criteria to reflect new insights or managerial feedback.
  • Collect ground truth labels and iterate on your evaluators to ensure they perform as expected.
  • Collect or generate additional data to ensure that all data slices include enough examples for reasonable evaluation.
  • Adjust or update data slices to ensure that the evaluation matrix includes the set of tasks most important to your team.

Through these “mini” iterations, you improve your artifact definition and ensure that everyone involved in the project trusts the artifacts to align your larger model.
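
One of the checks above, whether every slice has enough examples, is simple to sketch in code. The threshold and slice names below are assumptions chosen for illustration; a sensible minimum depends on the task.

```python
from collections import Counter

MIN_EXAMPLES_PER_SLICE = 3  # hypothetical threshold, not a Snorkel Flow default


def undercovered_slices(slice_tags, expected_slices, minimum=MIN_EXAMPLES_PER_SLICE):
    """Return {slice: count} for every expected slice with fewer than `minimum` examples."""
    counts = Counter(tag for tags in slice_tags for tag in tags)
    # Counter returns 0 for slices that never appear, so empty slices are flagged too.
    return {name: counts[name] for name in expected_slices if counts[name] < minimum}


# One list of slice tags per reference prompt.
slice_tags = [["billing"], ["billing"], ["billing", "claims"], ["claims"], ["billing"]]
gaps = undercovered_slices(slice_tags, ["billing", "claims", "spanish"])
print(gaps)
```

Slices that appear in the result are candidates for collecting or generating additional data before the next benchmark run.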

Step 4: System development

With the “mini iteration” on refining artifacts out of the way, Snorkel Flow users can turn their attention to the “big iteration” of developing the LLM system.

The slice and evaluator matrix directs users’ attention to the areas where their system needs the most work. Users then either adjust their prompt template or craft an updated training set for LLM fine-tuning. In the case of fine-tuning, users might employ a quality model to filter synthetically generated examples or return to their team to collect additional “golden” responses.

After refining the system, the user generates another round of LLM evaluation. The resulting report will show where the system improved and by how much. More importantly, it will show where the system can improve further, directing the next round of iteration.
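
The comparison between two evaluation rounds can be sketched as a per-cell difference plus a lookup of the weakest cell. The scores and names below are made up for illustration and do not come from a real report.

```python
# Hypothetical pass rates keyed by (slice, evaluator) for two evaluation rounds.
baseline = {
    ("billing", "completeness"): 0.40,
    ("billing", "robustness"): 0.85,
    ("spanish", "completeness"): 0.10,
    ("spanish", "robustness"): 0.70,
}
latest = {
    ("billing", "completeness"): 0.65,
    ("billing", "robustness"): 0.90,
    ("spanish", "completeness"): 0.30,
    ("spanish", "robustness"): 0.70,
}


def improvements(before, after):
    """Per-cell change in pass rate, rounded to avoid floating-point noise."""
    return {cell: round(after[cell] - before[cell], 4) for cell in before if cell in after}


def weakest_cell(report):
    """The (slice, evaluator) cell with the lowest pass rate."""
    return min(report, key=report.get)


deltas = improvements(baseline, latest)
next_focus = weakest_cell(latest)
print(deltas)
print("Focus next on:", next_focus)
```

The deltas show where the system improved and by how much, and the weakest cell suggests where the next round of iteration could start.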

In just a few iterations, your system should be ready for production deployment according to your enterprise-specific evaluations applied to your most important tasks.

Example application: healthcare insurance chatbot

To bring these steps to life, imagine a healthcare insurance company striving to enhance its customer support chatbot. We’ll call this company “Be Healthy.”

Be Healthy’s initial setup looks like this:

  • Criteria: Initial criteria include accuracy, politeness, and responses free of competitor mentions.
  • Evaluators: Using Snorkel Flow’s integrated Jupyter notebook, Be Healthy’s team defines several evaluators, including a heuristic function that searches for competitor mentions, a response quality model, and an LLM-judge.
  • Reference prompts: Be Healthy’s team uses a collection of prompts users have historically submitted to the company’s chatbot.
  • Data slices: Slices include several topics, such as questions about coverage as well as Spanish-language queries as the company prepares to expand to South America.
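
A heuristic evaluator like the competitor-mention check could look roughly like the sketch below. The competitor names are invented for this example, and the function is an assumption about how such a check might be written, not Be Healthy's or Snorkel Flow's actual code.

```python
import re

# Hypothetical competitor names; a real list would come from the business team.
COMPETITORS = ["Acme Health", "WellCo"]

_COMPETITOR_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in COMPETITORS) + r")\b",
    re.IGNORECASE,
)


def free_of_competitor_mentions(response: str) -> bool:
    """Return True when the response mentions none of the listed competitors."""
    return _COMPETITOR_PATTERN.search(response) is None


print(free_of_competitor_mentions("Your plan covers annual checkups."))
print(free_of_competitor_mentions("You could try acme health instead."))
```

A keyword heuristic like this is cheap and transparent, but it misses paraphrases and misspellings, which is one reason to pair it with a quality model or an LLM judge.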

Running a baseline evaluation and refining benchmarks

The Be Healthy team runs a baseline evaluation of the open-source model it plans to customize. This initial run exposes a big gap in the team’s suite of LLM prompts: the corpus includes zero questions written in Spanish.

This finding makes sense. The company conducts business exclusively in English, but it needs questions in Spanish to adapt its LLM to its forward-looking plans.

Short on time and resources, the team at Be Healthy selects prompts written in English and uses an LLM to translate them into Spanish. The team then uploads the newly translated prompts to the active corpus in Snorkel Flow, where the Spanish-language slice classifier automatically identifies and integrates them.

After asking subject matter experts (SMEs) to annotate a small amount of ground truth, the team also spots another problem: the LLM-as-judge in the workflow accepted answers at a much higher rate than the SMEs did. The data scientists and SMEs collaborate to iterate on their LLM-as-judge prompt template until the SMEs and the LLM judge agree with each other roughly 80% of the time.
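
The two numbers in this step, the acceptance rate of each reviewer and the rate at which they agree, can be computed as in the sketch below. The ten verdicts are invented so that the arithmetic is easy to follow.

```python
# Hypothetical verdicts on the same ten responses (True = accepted).
sme_labels = [True, False, False, True, False, True, False, False, True, False]
judge_labels = [True, True, False, True, True, True, False, False, True, False]


def accept_rate(labels):
    """Fraction of responses that were accepted."""
    return sum(labels) / len(labels)


def agreement_rate(a, b):
    """Fraction of responses on which two reviewers gave the same verdict."""
    if len(a) != len(b):
        raise ValueError("label lists must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


print("SME accept rate:  ", accept_rate(sme_labels))
print("Judge accept rate:", accept_rate(judge_labels))
print("Agreement:        ", agreement_rate(sme_labels, judge_labels))
```

In this toy data the judge accepts 6 of 10 responses while the SMEs accept 4, and the two agree on 8 of 10. Raw agreement does not correct for agreement that would occur by chance, so a chance-corrected statistic such as Cohen's kappa may be worth tracking alongside it.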

Feeling more comfortable with their task coverage and their LLM judge, the Be Healthy team can move on to improving the LLM itself by iterating on its fine-tuning data.

Developing the system

With a baseline established, Be Healthy’s data scientists work with the organization’s subject matter experts (SMEs) to hand-label 100 model responses from a 10,000-item uploaded dataset. For each example, the data scientists ask the SMEs to not only approve or reject the response but also supply an explanation for their decision.


This project serves two purposes:

  1. To sanity-check the system’s programmatic classifiers.
  2. To gain insights to fuel scalable labeling functions.

In this case, the review finds that Be Healthy’s chosen automatic evaluators work well enough for the organization to be comfortable. Had it revealed a mismatch, this would have been an opportunity to adjust them.

With explanations in hand, Be Healthy’s data scientists build a dozen labeling functions in Snorkel Flow to accept or reject historical model responses. This scales the training data set from 100 examples to about a third of the 10,000-item corpus, and most of those responses get rejected.

The team uses these roughly 3,000 labeled examples to train a quality model, which then classifies the remainder of the corpus. Finally, the team fine-tunes the LLM using only the ~2,500 accepted responses. Afterward, they see a marked improvement in performance according to their customized metrics.
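
To illustrate the idea of labeling functions, the sketch below has three toy functions vote to accept, reject, or abstain on each response, and combines the votes with a simple majority. The majority vote is a stand-in chosen for brevity, not Snorkel Flow's label model, and the function names, keywords, and example responses are all hypothetical.

```python
ACCEPT, REJECT, ABSTAIN = 1, 0, -1


def lf_mentions_competitor(response: str) -> int:
    return REJECT if "acme health" in response.lower() else ABSTAIN


def lf_too_short(response: str) -> int:
    return REJECT if len(response.split()) < 4 else ABSTAIN


def lf_cites_policy(response: str) -> int:
    return ACCEPT if "your policy" in response.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_competitor, lf_too_short, lf_cites_policy]


def majority_vote(response: str) -> int:
    """Combine labeling-function votes; abstain on ties or when no function fires."""
    votes = [lf(response) for lf in LABELING_FUNCTIONS]
    accepts, rejects = votes.count(ACCEPT), votes.count(REJECT)
    if accepts == rejects:
        return ABSTAIN
    return ACCEPT if accepts > rejects else REJECT


responses = [
    "Per your policy, annual checkups are fully covered.",
    "No.",
    "Acme Health offers a cheaper option than your policy.",
    "Please call our support line for more details today.",
]
labels = [majority_vote(r) for r in responses]

# Coverage: the share of responses that received a non-abstain label.
coverage = sum(label != ABSTAIN for label in labels) / len(labels)
print(labels, coverage)
```

Responses left unlabeled here are the kind a trained quality model would go on to classify, and only accepted responses would be kept for fine-tuning.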

By systematically repeating this workflow (each time focusing on where the model needs the most help), Be Healthy sees its chatbot achieve a performance jump from a mere 12% to an impressive 90% compliance with their chosen criteria—a significant feat signaling readiness for production deployment.

Visualizing progress in Snorkel Flow

A vital aspect of evaluating and fine-tuning LLM systems is the ability to effectively visualize your progress and data insights. Snorkel Flow offers a robust platform to do just that, ensuring all stakeholders can easily interpret results and improvements.

Model performance plots

In Snorkel Flow, we use performance plots to provide a clear, illustrative view of model iterations over time.

These plots show model iterations on the horizontal axis and evaluation metrics on the vertical axis. They also show a trend line for each slice, enabling an immediate understanding of performance trends. In Be Healthy’s case, you can see how each iteration of LLM fine-tuning pushes towards the project’s goals, such as more complete responses or fewer competitor mentions.

Comparing models

Snorkel Flow also offers comparative visualization as another powerful tool. By placing baseline models side by side against refined versions, users can directly observe the impact of their iterative adjustments.

Imagine toggling between initial and fine-tuned experiment outcomes, witnessing an increase from 12% to 90% in compliance with targeted criteria. This assists in assessing whether specific model adjustments meet expectations and also aids in justifying model choices to business leaders.

Slice coverage and evaluator matrices

Snorkel Flow’s evaluator matrix acts as a dashboard, offering a granular view of your project’s evaluation data. Here, data slices—representing different subsets of interest—are aligned with evaluator metrics in a tabular form.

This visualization highlights coverage gaps or overly rigid evaluators, prompts enhancements, and guides discussions with subject matter experts.

Iteration overviews

Our iteration overview succinctly captures the project’s evolution. This visual timeline reveals your journey—from defining criteria and initial benchmarks to incorporating new data sources and refining evaluations—guiding your team effortlessly from inception to deployment readiness.

Incorporating these visual elements into your workflow improves team alignment and decision-making. It also demonstrates progress to leadership, fostering confidence along the path to deployment. Ultimately, leveraging these visualization tools in Snorkel Flow transforms complex data into comprehensible, strategic insights that drive better machine learning outcomes.

Snorkel Flow: powering better enterprise LLM systems

At Snorkel AI, we understand that mastery over LLM evaluation is more than a technological challenge; it’s a strategic business advantage. By following this structured workflow within Snorkel Flow, any enterprise can adeptly navigate LLM evaluations, driving both performance improvements and business value.
