Building the benchmark: inside our agentic insurance underwriting dataset
2025-07-11 · via Snorkel AI

Introduction

This post describes our specialized benchmark dataset developed through our Data-as-a-Service (DaaS) expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark uncovers a number of model-specific and actionable error modes that include basic tool use errors and a surprising number of insidious hallucinations from one provider in particular. Let’s get into the gory details! 

To see how top AI models performed on these tasks, read our companion post: Evaluating AI Agents for Insurance Underwriting.

Motivation

The past 9-12 months have witnessed an explosion in agents and models that are capable of interacting with larger—often specialized—ecosystems via tool use. The value proposition in enterprise scenarios is strong—but AI agents in these settings are often inaccurate and inefficient. These agents tackle business problems with the awkwardness of a fresh college graduate who is naive to the business world, with generated answers that sound informed but fall apart as soon as you dig beneath the surface.

The cause? Research and development in the AI field have largely focused on easily verifiable settings like coding and math, and on simple generic use cases where off-the-shelf checks suffice and data is plentiful. However, in specialized settings, off-the-shelf approaches fail and expert data is critical. To illustrate this idea, Snorkel Research is developing a series of benchmarks that include challenging features, such as:

  • Reasoning over specific and proprietary knowledge that models have never seen before
  • Tool use in complex, noisy, and specialized ecosystems
  • Multi-turn interaction with users with uncertainty

The Dataset

Commercial property and casualty insurance is a compelling example of all three challenges. Underwriters routinely make highly nuanced decisions about risk that require access to deep information about businesses and adherence to byzantine business rules and government regulations. We designed an underwriting dataset that offers all of the properties above and can be used to evaluate agents powered by state-of-the-art models. To do so, we leveraged our Data-as-a-Service expert network, working with experienced CPCUs. 

Our benchmark starts by simulating a company, All National Insurance, that sells directly to customers: thousands of fictional companies we created. The data capture interactions in which a junior underwriter has information about an applicant that is occasionally incomplete, along with tasks they need help with from our AI copilot. The AI copilot, in turn, has access to resources that include databases and underwriting guidelines in long-form documents. The tasks include determining which other types of insurance, if any, the underwriter should offer the applicant based on their characteristics, whether the applicant qualifies as a “small business”, and so on. We gave underwriters limited information about the applicants, challenging the AI assistant to ask the right questions to solve the task. Specifically, underwriters had the information they might receive in the real world if an applicant were briefly filling out an online form or sending an email, e.g., company name, number of employees, annual revenue.

We developed the agentic system in LangGraph with Model Context Protocol (MCP). We leveraged the flexibility of those frameworks to work with a wide variety of AI models, both open and closed source. We wrapped up each AI model we benchmarked as a ReAct agent. A high-level block diagram for the system is shown below:

Expert Data & Verification

Our expert network of CPCUs was vital for developing the data. Our goal is to challenge models with realistic, enterprise-relevant settings. We used CPCUs to ensure our scenarios are realistic, meaningful, and constitute a useful benchmark in the insurance vertical. Experts iterated on individual samples of data as well as the overall guidelines and data tables. They also helped develop business rules, realistic company profiles, and appropriate responses. For example, we repeatedly employed feedback from CPCUs on the realism of those business profiles (e.g., whether a given company would ever conceivably apply for small business insurance).

We note that expert data and verification are key, as without these, the resulting benchmark is highly unrealistic and unlikely to match real-life use cases. Indeed, a major part of the difficulty in our benchmark comes from how it corresponds to meaningful real-world settings—rather than toy scenarios that are so often used to benchmark agentic systems. 

Evaluating Agent Capabilities

Our distilled representation included a database with several tables and free text guidelines, with the overall challenge to the AI copilot to reason with respect to:

  • Asking the right questions of the underwriter
  • Determining which tools and resources were relevant to a given task
  • Using those tools and resources to solve the problem

Importantly, our fictional system included useful metadata about resources, so the AI copilots theoretically had everything they needed to solve these problems. However, the correct sequence of tool use behavior was sometimes very complex. 

Example: to determine whether an applicant even qualifies as a small business, AI copilots had to:

  1. Find the proper NAICS classification code from the 2012 version of the schema (the schema is revised about every 5 years)
  2. Use this code to query a table from the US Small Business Administration on:
    1. Which feature of the business to use for qualification (number of employees vs annual revenue) 
    2. The value of that threshold


The AI copilots only had direct access to information about 2022 NAICS codes, plus mappings to the 2012 version via two other tables. So, they had to run a chain of SQL queries to determine the correct criteria and thresholds, interacting with underwriters to obtain the information they needed in the process.
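The chain of lookups described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names, column names, codes, and thresholds are all hypothetical, the 2022 → 2017 → 2012 path is one guess at what "two other tables" means, and treating the threshold as inclusive is also an assumption. None of it is the benchmark's actual schema or real SBA data.

```python
import sqlite3

# Hypothetical schema and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE naics_2022_to_2017 (code_2022 TEXT, code_2017 TEXT);
    CREATE TABLE naics_2017_to_2012 (code_2017 TEXT, code_2012 TEXT);
    CREATE TABLE sba_size_standards (
        code_2012 TEXT PRIMARY KEY,
        basis     TEXT,   -- 'employees' or 'annual_revenue'
        threshold REAL
    );
    INSERT INTO naics_2022_to_2017 VALUES ('X22-A', 'X17-A'), ('X22-B', 'X17-B');
    INSERT INTO naics_2017_to_2012 VALUES ('X17-A', 'X12-A'), ('X17-B', 'X12-B');
    INSERT INTO sba_size_standards VALUES
        ('X12-A', 'employees', 500),
        ('X12-B', 'annual_revenue', 7500000);
""")


def lookup_size_standard(conn, code_2022):
    """Walk 2022 -> 2017 -> 2012, then fetch the qualification basis and threshold."""
    row = conn.execute(
        "SELECT code_2017 FROM naics_2022_to_2017 WHERE code_2022 = ?", (code_2022,)
    ).fetchone()
    if row is None:
        return None
    row = conn.execute(
        "SELECT code_2012 FROM naics_2017_to_2012 WHERE code_2017 = ?", (row[0],)
    ).fetchone()
    if row is None:
        return None
    row = conn.execute(
        "SELECT basis, threshold FROM sba_size_standards WHERE code_2012 = ?", (row[0],)
    ).fetchone()
    return None if row is None else (row[0], row[1])


def small_business_check(conn, code_2022, facts):
    """Return a verdict, or say which fact must be requested from the underwriter."""
    standard = lookup_size_standard(conn, code_2022)
    if standard is None:
        return "unknown_code"
    basis, threshold = standard
    if basis not in facts:
        # The form data lacks the feature this industry is judged on.
        return f"ask_underwriter:{basis}"
    # Assumption: at or below the threshold counts as small.
    return "qualifies" if facts[basis] <= threshold else "does_not_qualify"


print(small_business_check(conn, "X22-A", {"annual_revenue": 1_000_000}))
print(small_business_check(conn, "X22-A", {"employees": 120}))
```

The first call shows why the multi-turn part matters: revenue is known, but this hypothetical industry is judged on headcount, so the right move is to ask the underwriter rather than answer.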

The dataset contained many other tasks related to appetite, limits, and deductibles (and we are only scratching the surface here in our distilled representation!).

AI performance: evaluation and insights

Good benchmark data isn’t just about contests amongst frontier models—it is also actionable. To that end, we evaluate models over a number of criteria and slices of data that represent the perspectives of practitioners and business stakeholders. We provide granular insights into model performance and failure modes. Next we highlight some basics as of the date of this post.

Evaluation criteria

Using Snorkel’s evaluation suite, we developed several scalable measures, including task solution correctness, task solution conciseness, tool use correctness, and tool use efficiency.

Results

Task solution correctness is our most important criterion. Our leaderboard highlights a wide range of accuracies across frontier models, from the single digits up to ~80%. Importantly, we see a tradeoff between test-time compute and accuracy, with the highest performing model consuming by far the most output tokens:

Task accuracy by two different components of test-time compute. Each datapoint is a single model average across conversations. 

How much of this is driven by models taking more turns to find the information they need to answer questions, vs. consuming more tokens? We found significant correlations between accuracy and the number of turns the AI copilots had with the underwriter (probing for information) as well as the number of tools used, with some notable exceptions. For example, one model took an average of 7 turns with the underwriter, only to achieve a task accuracy score of about 55%. This model struggled to ask the right questions of the underwriter, even when it could use the tools correctly.

We also see interesting patterns across tasks:

Task                       Accuracy
Deductibles                0.784
Business Classification    0.772
Policy Limits              0.762
Appetite Check             0.615
Product Recommendations    0.377

Accuracy by task, averaged across models. Accuracy here is computed after removing conversations with basic errors such as recursion errors in LangGraph.
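As a rough sketch of how a table like this could be computed, the snippet below drops conversations flagged with a basic error, scores each model per task, and then averages those per-model scores. The record fields (`task`, `model`, `correct`, `basic_error`) are hypothetical, and averaging per model first (rather than pooling all conversations) is an assumption; the post does not spell out which was used.

```python
from collections import defaultdict
from statistics import mean


def accuracy_by_task(conversations):
    """Per-task accuracy, averaged across models, excluding errored conversations."""
    per_model = defaultdict(list)
    for conv in conversations:
        if conv["basic_error"]:
            # e.g. a recursion error in the agent framework; excluded from accuracy
            continue
        per_model[(conv["task"], conv["model"])].append(1.0 if conv["correct"] else 0.0)

    per_task = defaultdict(list)
    for (task, _model), scores in per_model.items():
        per_task[task].append(mean(scores))  # one accuracy per (task, model)

    return {task: mean(model_scores) for task, model_scores in per_task.items()}


# Made-up records, for illustration only.
records = [
    {"task": "Deductibles", "model": "model-a", "correct": True, "basic_error": False},
    {"task": "Deductibles", "model": "model-a", "correct": False, "basic_error": False},
    {"task": "Deductibles", "model": "model-b", "correct": True, "basic_error": False},
    {"task": "Deductibles", "model": "model-b", "correct": False, "basic_error": True},
]
print(accuracy_by_task(records))
```

Averaging per model first keeps a model with many conversations from dominating the task average.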

Unsurprisingly, business classification (using 2022 NAICS codes) was one of the easiest tasks across models: it is a basic task required for most of the others. If the agent gets that one wrong, it likely fails at many others. Policy limits and deductibles were also easy because the underwriting guidelines contained defaults that apply a large percentage of the time. So, if the AI copilot could read the guidelines, it had a good shot here.

The most challenging tasks forced models to use at least 3 or 4 tools and compose them correctly (using the results from one tool to call the next, etc.), probing the underwriter for information along the way. We see this in appetite checks and product recommendations, in which some undetermined number of additional products need to be suggested (more on that below). But we also see this in important subsets of the policy limits and deductible tasks where more nuanced underwriting logic is required.

Error modes

Beyond basic stats, we leveraged our evaluation criteria to uncover interesting behaviors that suggest a need to develop models along many different axes. 

Two examples: 

  • Tool use errors: Across models, including top performers, agents made at least one tool call error in 36% of the conversations despite access to the metadata required to use tools properly. Surprisingly, we found no correlation with overall performance. In fact, even the three most accurate models made tool call errors in 30-50% of the conversations, often going back after basic tool call errors to retrieve metadata and redo the tool call correctly. We could have engineered prompts to direct models towards this behavior (and our experiments show that this works), but we aimed to simulate a realistic enterprise setting. Real, deployed copilot systems include dozens of possible tools and a combinatorially large, unwieldy number of combinations, making it impractical to simply engineer system prompts for every edge case. In addition, there is no reason why a frontier model like o4-mini, Gemini 2.5 Pro or Claude 4 Sonnet should require that kind of hand-holding. If they can answer questions about quantum field theory, why can’t they figure out what columns are in a table before writing a SQL query?

Example trace with o4-mini showing a tool use error followed by a correction after looking up the table schema. 

  • Hallucinations based on pretrained domain knowledge: Some of the highest performing models have clearly been trained on insurance data, but almost to a fault. They occasionally hallucinated guidelines that might appear online but were not actually contained in the provided documentation. For example, in the task involving insurance product recommendations, the top performing models from OpenAI hallucinated several products not in the guidelines 15-45% of the time depending on the model. These hallucinations not only led to misleading answers, but also to misleading questions to the underwriter, probing for information that was ultimately irrelevant to the task.

Example trace ending showing hallucination from the top performing model overall. None of the listed insurance products were in the provided guidelines even though they are typical in other commercial insurance contexts.
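One simple way to flag this kind of hallucination is to compare each recommended product against the set of products that actually appear in the provided guidelines. The sketch below is an assumption about how such a check might look, not the evaluation used in the benchmark. The product names are placeholders, and exact matching after whitespace and case normalization is a deliberately crude choice.

```python
def _normalize(name):
    """Lowercase and collapse whitespace so trivial formatting differences match."""
    return " ".join(name.lower().split())


def ungrounded_products(recommended, allowed):
    """Return the recommended products that do not appear in the guidelines."""
    allowed_norm = {_normalize(p) for p in allowed}
    return [p for p in recommended if _normalize(p) not in allowed_norm]


def hallucination_rate(conversations, allowed):
    """Fraction of conversations recommending at least one ungrounded product."""
    if not conversations:
        return 0.0
    flagged = sum(1 for recs in conversations if ungrounded_products(recs, allowed))
    return flagged / len(conversations)


# Placeholder product names, for illustration only.
guideline_products = ["Product A", "Product B"]
model_recommendations = [
    ["Product A"],               # grounded
    ["product  b", "Product Z"], # "Product Z" is not in the guidelines
]
print(ungrounded_products(model_recommendations[1], guideline_products))
print(hallucination_rate(model_recommendations, guideline_products))
```

A real check would likely need fuzzier matching (synonyms, abbreviations), which is where expert review of the flagged cases comes in.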

These error modes illustrate a basic point: contrary to one recent claim about the increasing irrelevance of proprietary knowledge, we continually see its relevance in real-world customer problems, and our dataset captures that. Like our customers, All National Insurance has its “secret sauce” of underwriting that you won’t find online. Models that hallucinate generic knowledge within a vertical work against that, inserting subtle but potentially catastrophic factual inaccuracies.

Conclusion

Our dataset and findings across models illustrate the key point: even the most performant frontier models struggle in surprising ways to solve tasks they have never seen before. From one vantage point, the tasks are indeed complex, requiring multiple tools in the correct order and nuanced, proprietary reasoning that can only be found in the associated documents. But from another, more basic vantage point, these tasks are still nowhere near the complexity we see in other challenging academic benchmarks. They theoretically require no more than 4 or 5 steps to solve with the right information from the user.

There is something interesting at play here related to user interaction and tool use. Just as a theoretical physicist may not be your best bet to build a bridge, a frontier reasoning model taken off the shelf is not your best bet to solve your business problem. It requires careful evaluation and development with benchmark data that contain the skills relevant to your domain of expertise. We’ve shown here how Snorkel’s Expert Data service can be leveraged to that end.