


























In this post, we will show you a specialized benchmark dataset we developed with our expert network of Chartered Property Casualty Underwriters (CPCUs). The benchmark uncovers several model-specific and actionable error modes, including basic tool use errors and a surprising number of insidious hallucinations from one provider. This is part of an ongoing series of benchmarks we are releasing across verticals and domains.
Want to see how this benchmark was built?
Go behind the scenes of our dataset design in Building the Benchmark: Inside Our Agentic Insurance Underwriting Dataset
Our agent-based insurance benchmark was motivated by observations we have made working with customers and in the field of AI more generally: The past 9-12 months have witnessed an explosion in agents and models capable of interacting with larger ecosystems via tool use. The value proposition in enterprise settings is strong, promising that models, even teams of models, can solve more complex tasks with more autonomy.
However, AI agents in enterprise settings are often inaccurate and inefficient, consuming more test-time compute than necessary. These agents tackle business problems with the awkwardness of a fresh college graduate naive to the business world, generating answers that sound informed by textbooks but fall apart as soon as you dig beneath the surface.
This happens because they are not tuned to the critical details of the enterprise problem. Research and development in the AI field have largely focused on easily verifiable settings like coding and math, and simple, generic use cases where off-the-shelf checks suffice. This does not easily translate to enterprise settings.
The Snorkel Research team has been addressing this gap based on observations from internal experiments and real customer use cases. Effective agent-based solutions must account for:
This turns out to be more challenging for models than one would expect based on reasoning benchmarks. To enable actionable insights, we have been developing a series of benchmarks that include:
Our goal is similar to open source initiatives such as Tau-Bench, but with intensive engagement of our expert network to ensure realistic, high-quality samples.
Commercial property and casualty insurance underwriters routinely make highly nuanced decisions about risk that require access to deep information about businesses and adherence to byzantine business rules and government regulations. Their role offers a compelling example of the challenges AI agents will face in enterprise settings.
To this end, we designed an underwriting dataset that offers all of the properties above and can be used to evaluate agents powered by state-of-the-art models. To do so, we leveraged our Data-as-a-Service Expert Network, working with experienced Chartered Property Casualty Underwriters (CPCUs).
Our goal was to create a distilled representation of the initial stages of underwriting for small businesses in North America: initial applications for insurance at a fictional company, All National Insurance. Importantly, All National Insurance prefers to sell directly to customers, so no insurance agents or brokers are involved, putting the onus on the system to gather basic information. The data capture interactions in which a junior underwriter, whose information about the applicant is occasionally incomplete, needs help with a task from our AI copilot. The AI copilot, in turn, has access to resources that include databases and underwriting guidelines in long-form documents.

We developed the agentic system in LangGraph with Model Context Protocol (MCP). We leveraged the flexibility of those frameworks to work with a wide variety of AI models, both open and closed source. We wrapped up each AI model we benchmarked as a ReAct agent.
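For readers unfamiliar with the ReAct pattern, the loop below is a minimal, framework-free sketch of the idea: the model alternates between a thought, a tool call, and an observation until it commits to a final answer. It does not use LangGraph or MCP, and it is not the benchmark's implementation. The tool `lookup_guideline`, the stand-in `scripted_model`, the guideline text, and the `max_steps` limit are all hypothetical names and values chosen for illustration.

```python
from typing import Callable

def lookup_guideline(topic: str) -> str:
    # Hypothetical tool: stand-in for a real document or database lookup.
    guidelines = {"deductible": "default deductible applies unless an exception matches"}
    return guidelines.get(topic, "no guideline found")

# Hypothetical tool registry: tool name -> callable taking one string argument.
TOOLS: dict[str, Callable[[str], str]] = {"lookup_guideline": lookup_guideline}

def scripted_model(transcript: list[str]) -> dict:
    """Stand-in for an LLM call: picks the next step from the transcript so far."""
    if not any(line.startswith("Observation:") for line in transcript):
        return {"thought": "I need the guideline first.",
                "action": "lookup_guideline", "input": "deductible"}
    # The most recent transcript line is the observation from the tool call.
    return {"thought": "I have what I need.",
            "final": transcript[-1].removeprefix("Observation: ")}

def react_loop(question: str, model, tools, max_steps: int = 5) -> str:
    """Alternate thought -> action -> observation until the model answers."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model(transcript)
        transcript.append(f"Thought: {step['thought']}")
        if "final" in step:
            return step["final"]
        observation = tools[step["action"]](step["input"])
        transcript.append(f"Action: {step['action']}[{step['input']}]")
        transcript.append(f"Observation: {observation}")
    # A step cap keeps a confused agent from looping forever.
    raise RuntimeError("step limit reached without a final answer")

answer = react_loop("What deductible applies?", scripted_model, TOOLS)
```

In a real system the scripted function would be replaced by a model call, and the step cap plays a role similar to the recursion limits that agent frameworks enforce.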
Future work on these datasets will explore alternative frameworks and methodologies, but for benchmarking purposes we wanted to begin where we see many practitioners start developing when they prototype.
We developed six basic types of tasks:
We synthesized thousands of fictional companies to represent the broad spectrum of small businesses in North America. A key aspect was feedback from CPCUs on the realism of those business profiles (e.g., whether a given company would ever conceivably apply for small business insurance). Specifically, we sampled NAICS codes and business statistics and worked with a frontier model to generate structured profiles that we could then use in each sampled task.
We gave underwriters limited information about the applicants, challenging the AI assistant to ask the right questions to solve the task. Specifically, underwriters had information they might receive in the real world if an applicant were briefly filling out an online form or sending an email:
Our distilled representation included a database with several tables and free text guidelines, with the overall challenge to the AI copilot to reason with respect to:
Importantly, our fictional system included useful metadata about resources, so the AI copilots theoretically had everything they needed to solve these problems. However, the correct sequence of tool use behavior was sometimes very complex.
Example 1: To determine whether an applicant even qualifies as a small business, AI copilots had to:
The AI copilots had direct access only to information about 2022 NAICS codes, plus mappings to the 2012 version via two other tables. So, they had to issue a chain of SQL queries to determine the correct criteria and thresholds, interacting with underwriters to obtain the information they needed in the process.
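The chained lookup described above can be sketched with an in-memory SQLite database. Everything concrete here is an assumption made for illustration: the table and column names, the intermediate 2017 mapping step, the placeholder codes (`A22`, `A17`, `A12`), and the threshold of 100 employees. None of it is taken from the benchmark's actual schema or guidelines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: two mapping tables plus thresholds keyed by 2012 codes.
conn.executescript("""
CREATE TABLE naics_2022_to_2017 (code_2022 TEXT, code_2017 TEXT);
CREATE TABLE naics_2017_to_2012 (code_2017 TEXT, code_2012 TEXT);
CREATE TABLE size_thresholds (code_2012 TEXT, metric TEXT, max_value REAL);
INSERT INTO naics_2022_to_2017 VALUES ('A22', 'A17');
INSERT INTO naics_2017_to_2012 VALUES ('A17', 'A12');
INSERT INTO size_thresholds VALUES ('A12', 'employees', 100);
""")

def lookup(conn, sql, value):
    """Run a single-parameter query and return the first row, or None."""
    return conn.execute(sql, (value,)).fetchone()

def qualifies(conn, code_2022, applicant_value):
    """Walk 2022 -> 2017 -> 2012, then compare against the size threshold."""
    row = lookup(conn, "SELECT code_2017 FROM naics_2022_to_2017 WHERE code_2022 = ?", code_2022)
    if row is None:
        return None  # no mapping: the agent would have to investigate further
    row = lookup(conn, "SELECT code_2012 FROM naics_2017_to_2012 WHERE code_2017 = ?", row[0])
    if row is None:
        return None
    row = lookup(conn, "SELECT metric, max_value FROM size_thresholds WHERE code_2012 = ?", row[0])
    if row is None:
        return None
    metric, max_value = row
    # The applicant's value for `metric` is what the copilot must ask the underwriter for.
    return metric, applicant_value <= max_value

result = qualifies(conn, "A22", 42)
```

The point of the sketch is the dependency structure: each query needs the previous query's result, and the final comparison needs a number that only the underwriter can supply.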
Example 2: To determine whether an applicant was in appetite for property insurance, AI copilots had to:
These are just two examples. The dataset contained many others related to appetite, limits, and deductibles (and we are only scratching the surface here in our distilled representation!).
Our network of CPCUs was vital for developing the data. One of our key focus areas at Snorkel is challenging models with realistic, enterprise-relevant “gotchas.” So, we worked hand in hand with CPCUs to ensure our distilled scenarios resembled the real world closely enough to constitute a useful benchmark in the insurance vertical. To that end, our CPCU network worked with us over several iterations on both individual samples of data and the overall guidelines and data tables. They also helped develop business rules, realistic company profiles, and appropriate underwriting responses.

Good benchmark data isn’t just about contests amongst frontier models. A useful dataset is actionable.
To that end, we are evaluating models over a number of criteria and slices of data that represent the perspectives of practitioners and business stakeholders. The efficacy of a complex AI system is not simply about academic correctness. It involves measures of efficiency (cost), the ability to interact with users and forage for information appropriately, the ability to make decisions under uncertainty, and the ability to solve for incompletely defined business objectives.
As part of this series, we will provide granular insights into model performance and failure modes. Here we will highlight some basics as of the date of this post.
Using Snorkel’s evaluation suite, we have developed several scalable measures:
Our task solution correctness criterion is the most important. Our leaderboard highlights a wide range of accuracies across frontier models, from the single digits up to ~80%.
Importantly, we see a tradeoff between test-time compute and accuracy, with the highest performing model showing the largest consumption of output tokens by far:

How much of this is driven by models taking more turns to find the information they need to answer questions, versus consuming more tokens? We took a close look at test-time compute and found significant correlations between accuracy and the number of turns the AI copilots had with the underwriter (probing for information), as well as the number of tools used. There were some notable exceptions, however. For example, one model took an average of 7 turns with the underwriter, only to achieve a task accuracy score of about 55%. Inspection of the traces indicated that this model simply struggled to ask the right questions, even when it could use the tools correctly. These findings led us to look more closely at efficiency, which we will briefly expand on below.
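As a rough illustration of the kind of check involved, the sketch below computes a Pearson correlation between average turns per conversation and accuracy across models. The numbers in `avg_turns` and `accuracy` are toy values invented for this example, not benchmark results, and Pearson's r is only one of several correlation measures that could be used here.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Toy per-model values (NOT benchmark results): average turns and accuracy.
avg_turns = [2.0, 3.0, 4.0, 5.0]
accuracy = [0.30, 0.45, 0.50, 0.70]
r = pearson(avg_turns, accuracy)
```

A strong positive r across models would still leave room for outliers like the one described above, where many turns did not translate into accuracy.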
We also see interesting patterns across tasks:
| Task | Accuracy |
|---|---|
| Deductibles | 0.784 |
| Business Classification | 0.772 |
| Policy Limits | 0.762 |
| Appetite Check | 0.615 |
| Product Recommendations | 0.377 |
Accuracy by task, averaged across models. Accuracy here is computed after removing conversations with basic errors such as recursion errors in LangGraph.
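The spread in the table can be summarized with a few lines of arithmetic. The sketch below uses only the five values shown above. The mean it computes is an unweighted average over tasks, which is an assumption: the per-task sample counts are not given here, so it may differ from an overall accuracy figure.

```python
# Per-task accuracies copied from the table above.
task_accuracy = {
    "Deductibles": 0.784,
    "Business Classification": 0.772,
    "Policy Limits": 0.762,
    "Appetite Check": 0.615,
    "Product Recommendations": 0.377,
}

# Unweighted mean across tasks (not weighted by number of samples per task).
mean_accuracy = sum(task_accuracy.values()) / len(task_accuracy)

easiest = max(task_accuracy, key=task_accuracy.get)
hardest = min(task_accuracy, key=task_accuracy.get)

# Gap between the easiest and hardest task, in accuracy points.
spread = task_accuracy[easiest] - task_accuracy[hardest]

print(f"mean={mean_accuracy:.3f}, easiest={easiest}, hardest={hardest}, spread={spread:.3f}")
```

The gap of roughly 40 points between the easiest and hardest task is larger than the gap among the top three tasks combined, which is what motivates the closer look at multi-tool tasks.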
Unsurprisingly, business classification (using 2022 NAICS codes) was one of the easiest tasks across models. This is because it is a basic task required for most of the others. If the agent gets that one wrong, it likely fails at many others. Policy limits and deductibles were also easy because underwriting guidelines contained the defaults applicable a large percentage of the time. So, if the AI copilot could read the guidelines, it had a good shot here.
The most challenging tasks forced models to use at least 3 or 4 tools and compose them correctly (using the results from one tool to use the next, etc.), probing the underwriter for information along the way. We see this in appetite checks and product recommendations, in which some undetermined number of additional products need to be suggested (more on that below). But we also see this in important subsets of the policy limits and deductible tasks, which require more nuanced underwriting logic. For example, even though models overall performed best on tasks involving deductibles, they were 25 points less accurate on average on this task for auto policies because of important exceptions in the underwriting guidelines. These exceptions required proper business classification and information from the underwriter about the number of vehicles owned and operated.
Beyond these basic stats, we leveraged our evaluation criteria to uncover interesting behaviors that suggest a need to develop models along many different axes.
Two examples:


These error modes illustrate a basic point. Contrary to one recent claim about the increasing irrelevance of proprietary knowledge, we continually see its relevance in real-world customer problems, and this benchmark dataset captures that. Like our customers, All National Insurance has its “secret sauce” of underwriting that you won’t find online. Models that hallucinate generic knowledge within a vertical work against that, inserting subtle but potentially catastrophic factual inaccuracies.
This dataset and the initial findings across models illustrate the key point: even the most performant frontier models struggle in surprising ways to solve tasks they have never seen before.
From one vantage point, the tasks are indeed complex, requiring multiple tools in the correct order and nuanced, proprietary reasoning that can only be found in the associated documents. But from another, more basic vantage point, these tasks are still nowhere near the complexity we see in other challenging academic benchmarks. They theoretically require no more than 4 or 5 steps to solve with the right information from the user.
There is something interesting at play here related to user interaction and tool use. Just as a theoretical physicist may not be your best bet to build a bridge, a frontier reasoning model taken off the shelf is not your best bet to solve your business problem. It requires careful evaluation and development with benchmark data that contain the skills relevant to your domain of expertise. We’ve shown here how Snorkel’s Expert Data service can be leveraged to that end.