


























Evaluating large language model (LLM) systems can be a labyrinthine process. At Snorkel AI, we’ve refined a methodical workflow that streamlines this task. Our process helps organizations spot where their customized LLM falls short and enables them to boost model performance toward organizational goals 10-100 times faster than standard manual approaches.
Below, I’ll walk you through our structured approach using Snorkel Flow.
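Our workflow consists of four critical steps: onboarding, running an initial benchmark, refining the evaluation artifacts, and developing the LLM system.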
Each phase serves a unique purpose in establishing a robust evaluation process for large language model systems.
The onboarding step is all about foundation-laying. Here, you’ll define four essential artifacts that will guide the rest of your evaluation journey:
Using smart defaults available in Snorkel Flow accelerates setup, but the platform empowers users to fully customize their approach for specific use cases.

Once users complete onboarding, they run an initial benchmark. This process generates a detailed evaluation report.
In Snorkel Flow, you’ll visualize data slices as rows and evaluators as columns. This matrix setup offers a comprehensive view of initial model performance.
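To make the matrix idea concrete, here is a minimal sketch in plain Python. It does not use the Snorkel Flow API; the record format, the slice names, and the evaluator names are assumptions chosen for illustration. It computes a pass rate for each slice and evaluator pair and marks cells with no data, which is how a coverage gap would surface.

```python
# Hypothetical evaluation records: one entry per (response, evaluator) check.
# Slice and evaluator names are illustrative, not Snorkel Flow identifiers.
RECORDS = [
    {"slice": "english", "evaluator": "completeness", "passed": True},
    {"slice": "english", "evaluator": "completeness", "passed": False},
    {"slice": "english", "evaluator": "no_competitor_mention", "passed": True},
    {"slice": "english", "evaluator": "no_competitor_mention", "passed": True},
]
SLICES = ["english", "spanish"]
EVALUATORS = ["completeness", "no_competitor_mention"]


def build_matrix(records, slices, evaluators):
    """Return {slice: {evaluator: pass rate}}; None marks a cell with no data."""
    counts = {(s, e): [0, 0] for s in slices for e in evaluators}  # [passes, total]
    for record in records:
        cell = counts[(record["slice"], record["evaluator"])]
        cell[0] += int(record["passed"])
        cell[1] += 1
    matrix = {}
    for s in slices:
        matrix[s] = {}
        for e in evaluators:
            passes, total = counts[(s, e)]
            matrix[s][e] = passes / total if total else None
    return matrix


def render(matrix, evaluators):
    """Format the matrix as text: slices as rows, evaluators as columns."""
    lines = ["slice".ljust(10) + "".join(e.rjust(24) for e in evaluators)]
    for s, row in matrix.items():
        cells = ["n/a" if row[e] is None else f"{row[e]:.0%}" for e in evaluators]
        lines.append(s.ljust(10) + "".join(c.rjust(24) for c in cells))
    return "\n".join(lines)


matrix = build_matrix(RECORDS, SLICES, EVALUATORS)
print(render(matrix, EVALUATORS))
```

A slice that has no examples at all shows up as a row of `n/a` cells rather than silently disappearing from the report.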
Establishing the initial benchmark is only the beginning. In step 3, users refine their artifacts to ensure they reflect real-world expectations and use cases.
In this phase, users may do some or all of the following:
Through these “mini” iterations, you improve your artifact definitions and ensure that everyone involved in the project trusts the artifacts to guide the alignment of your larger model.
With the “mini iteration” on refining artifacts out of the way, Snorkel Flow users can turn their attention to the “big iteration” of developing the LLM system.
The slice and evaluator matrix directs users’ attention to the areas where their system needs the most work. Users then either adjust their prompt template or craft an updated training set for LLM fine-tuning. In the case of fine-tuning, users might employ a quality model to filter synthetically generated examples or return to their team to collect additional “golden” responses.
After refining the system, the user generates another round of LLM evaluation. The resulting report will show where the system improved and by how much. More importantly, it will show where the system can improve further, directing the next round of iteration.
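As a rough sketch of how two evaluation reports could be compared, the plain-Python example below takes a before and after matrix of pass rates and lists the weakest cells first. The nested-dictionary report format and all numbers are hypothetical; this is not Snorkel Flow's report structure.

```python
def compare_reports(before, after):
    """List each (slice, evaluator) cell with its change, weakest cells first."""
    rows = []
    for slice_name, row in after.items():
        for evaluator, new_rate in row.items():
            old_rate = before.get(slice_name, {}).get(evaluator)
            if old_rate is None or new_rate is None:
                delta = None  # no comparison possible without both measurements
            else:
                delta = new_rate - old_rate
            rows.append({
                "slice": slice_name,
                "evaluator": evaluator,
                "before": old_rate,
                "after": new_rate,
                "delta": delta,
            })
    # Cells with no data sort first, then ascending by the new pass rate.
    rows.sort(key=lambda r: (r["after"] is not None, r["after"] or 0.0))
    return rows


# Hypothetical pass rates from two evaluation rounds.
baseline = {
    "english": {"completeness": 0.50, "no_competitor_mention": 0.90},
    "spanish": {"completeness": None, "no_competitor_mention": None},
}
refined = {
    "english": {"completeness": 0.75, "no_competitor_mention": 0.95},
    "spanish": {"completeness": 0.40, "no_competitor_mention": 0.80},
}

rows = compare_reports(baseline, refined)
for r in rows:
    print(r)
```

The first rows of the result point at where the next iteration should focus.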
In just a few iterations, your system should be ready for production deployment according to your enterprise-specific evaluations applied to your most important tasks.
To bring these steps to life, imagine a healthcare insurance company striving to enhance its customer support chatbot. We’ll call this company “Be Healthy.”
Be Healthy’s initial setup looks like this:
The Be Healthy team runs its baseline evaluation for the open-source model they plan to customize. This initial run spots a big gap in their suite of LLM prompts: their corpus includes zero questions written in Spanish.
This finding makes sense. The company conducts business exclusively in English, but it needs questions in Spanish to adapt its LLM to its forward-looking plans.
Short on time and resources, the team at Be Healthy selects prompts written in English and uses an LLM to translate them into Spanish. The team then uploads the newly translated prompts to the active corpus in Snorkel Flow, where the Spanish language slice classifier automatically identifies and integrates them.
After asking subject matter experts (SMEs) to annotate a small amount of ground truth, the team spots another problem: the LLM-as-judge in the workflow accepted answers at a much higher rate than the SMEs did. The data scientists and SMEs collaborate to iterate on the LLM-as-judge prompt template until the SMEs and the LLM judge agree roughly 80% of the time.
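One simple way to track this kind of calibration is to compare accept rates and raw agreement on the examples the SMEs have labeled. The sketch below is plain Python with made-up labels; the function names are ours, and raw agreement is only one possible measure.

```python
def accept_rate(labels):
    """Fraction of responses marked as accepted (True)."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(labels) / len(labels)


def agreement_rate(sme_labels, judge_labels):
    """Fraction of examples where the SME and the LLM judge gave the same verdict."""
    if len(sme_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not sme_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(s == j for s, j in zip(sme_labels, judge_labels))
    return matches / len(sme_labels)


# Hypothetical verdicts on the same ten responses (True = accept).
sme = [True, False, False, True, False, False, False, True, False, False]
judge = [True, True, False, True, True, False, True, True, False, True]

print(f"SME accept rate:   {accept_rate(sme):.0%}")
print(f"Judge accept rate: {accept_rate(judge):.0%}")
print(f"Agreement:         {agreement_rate(sme, judge):.0%}")
```

In this toy data the judge accepts far more often than the SMEs, which is the pattern that would prompt another pass on the judge prompt template.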
Feeling more comfortable with their task coverage and their LLM judge, the Be Healthy team can move on to improving the LLM itself by iterating on its fine-tuning data.
With a baseline established, Be Healthy’s data scientists work with the organization’s subject matter experts (SMEs) to hand-label 100 model responses from a 10,000-item uploaded dataset. For each example, the data scientists ask the SMEs to not only approve or reject the response but also supply an explanation for their decision.
This project serves two purposes:
In this case, the team finds that Be Healthy’s chosen automatic evaluators work well enough for the organization to be comfortable. Had they found a mismatch, this would have been an opportunity to adjust the evaluators.
With explanations in hand, Be Healthy’s data scientists build a dozen labeling functions in Snorkel Flow to accept or reject historical model responses. This scales the labeled training set from 100 examples to about a third of the corpus, most of which are rejected.
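To illustrate the idea of labeling functions, here is a small plain-Python sketch. The heuristics, the competitor names, and the majority vote are all assumptions made for illustration: real labeling functions would encode the SMEs’ explanations, and the vote is a simple stand-in for however the platform combines their outputs.

```python
ACCEPT, REJECT, ABSTAIN = 1, 0, -1

# Hypothetical competitor names, for illustration only.
COMPETITORS = ("acme health", "otherco")


def lf_mentions_competitor(response):
    """Reject responses that name a competitor; otherwise abstain."""
    text = response.lower()
    return REJECT if any(name in text for name in COMPETITORS) else ABSTAIN


def lf_too_short(response):
    """Reject responses with fewer than five words; otherwise abstain."""
    return REJECT if len(response.split()) < 5 else ABSTAIN


def lf_refers_to_plan(response):
    """Accept responses that address the member's plan; otherwise abstain."""
    return ACCEPT if "your plan" in response.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_competitor, lf_too_short, lf_refers_to_plan]


def label(response):
    """Combine labeling-function votes by simple majority; ties and no votes abstain."""
    votes = [lf(response) for lf in LABELING_FUNCTIONS]
    accepts, rejects = votes.count(ACCEPT), votes.count(REJECT)
    if accepts == rejects:
        return ABSTAIN
    return ACCEPT if accepts > rejects else REJECT


responses = [
    "Your plan covers annual checkups at no extra cost.",
    "Try Acme Health instead.",
    "Please call us.",
    "I am not sure how to answer that question today.",
]
labels = [label(r) for r in responses]
coverage = sum(l != ABSTAIN for l in labels) / len(labels)
print(labels, f"coverage={coverage:.0%}")
```

Responses that no function covers stay unlabeled, which is why a set of labeling functions typically labels only part of a corpus.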
The team uses these roughly 3,000 labeled examples to train a quality model, which classifies the remainder of the corpus. They then fine-tune the LLM using only the ~2,500 accepted responses. Afterward, they see a marked improvement in performance according to their customized metrics.
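The filtering step can be sketched as follows. This is a plain-Python illustration, not Snorkel Flow code: `toy_quality_score` is a hypothetical stand-in for a trained quality model, and the threshold and competitor name are arbitrary assumptions.

```python
def select_for_finetuning(pairs, quality_score, threshold=0.5):
    """Keep (prompt, response) pairs whose quality score meets the threshold."""
    return [(p, r) for p, r in pairs if quality_score(p, r) >= threshold]


def toy_quality_score(prompt, response):
    """Stand-in for a trained quality model: 1.0 for plausible responses, else 0.0."""
    long_enough = len(response.split()) >= 5
    names_competitor = "acme health" in response.lower()  # hypothetical competitor
    return 1.0 if long_enough and not names_competitor else 0.0


# Hypothetical prompt and response pairs.
pairs = [
    ("Does my plan cover checkups?", "Your plan covers annual checkups at no extra cost."),
    ("What is my deductible?", "Ask Acme Health."),
]
accepted = select_for_finetuning(pairs, toy_quality_score)
print(f"kept {len(accepted)} of {len(pairs)} pairs for fine-tuning")
```

Passing the scorer in as a function keeps the filter independent of whichever quality model the team trains.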
By systematically repeating this workflow (each time focusing on where the model needs the most help), Be Healthy sees its chatbot achieve a performance jump from a mere 12% to an impressive 90% compliance with their chosen criteria—a significant feat signaling readiness for production deployment.
A vital aspect of evaluating and fine-tuning LLM systems is the ability to effectively visualize your progress and data insights. Snorkel Flow offers a robust platform to do just that, ensuring all stakeholders can easily interpret results and improvements.
In Snorkel Flow, we use performance plots to provide a clear, illustrative view of model iterations over time.
These plots show model iterations on the horizontal axis and evaluation metrics on the vertical axis, with a trend line for each slice so that performance trends are immediately visible. In Be Healthy’s case, you can see how each iteration of LLM fine-tuning pushes toward the project’s goals, such as more complete responses or fewer competitor mentions.
Snorkel Flow also offers comparative visualization as another powerful tool. By placing baseline models side by side against refined versions, users can directly observe the impact of their iterative adjustments.
Imagine toggling between initial and fine-tuned experiment outcomes, witnessing an increase from 12% to 90% in compliance with targeted criteria. This assists in assessing whether specific model adjustments meet expectations and also aids in justifying model choices to business leaders.
Snorkel Flow’s evaluator matrix acts as a dashboard, offering a granular view of your project’s evaluation data. Here, data slices—representing different subsets of interest—are aligned with evaluator metrics in a tabular form.
This visualization highlights coverage gaps and overly rigid evaluators, suggests improvements, and guides discussions with subject matter experts.
Our iteration overview succinctly captures the project’s evolution. This visual timeline reveals your journey—from defining criteria and initial benchmarks to incorporating new data sources and refining evaluations—guiding your team effortlessly from inception to deployment readiness.
Incorporating these visual elements into your workflow improves team alignment and decision-making. It also demonstrates progress to leadership, fostering confidence along the path to deployment. Ultimately, leveraging these visualization tools in Snorkel Flow transforms complex data into comprehensible, strategic insights that drive better machine learning outcomes.
At Snorkel AI, we understand that mastery over LLM evaluation is more than a technological challenge; it’s a strategic business advantage. By following this structured workflow within Snorkel Flow, any enterprise can adeptly navigate LLM evaluations, driving both performance improvements and business value.