Building the data development platform for specialized AI
2025-05-30 · via Snorkel AI

Today we’re launching two new products on our AI Data Development Platform that together create a complete solution for enterprises to specialize AI systems with expert data at scale. We are also announcing our $100M Series D financing, led by Addition.

The first wave of generative AI gave us powerful chatbots and co-pilots, built on the vast amount of data on the public internet and aligned using feedback from generalist annotator crowds.

Now, we’re approaching a new frontier of agentic AI systems that reason, use tools, and act autonomously in specialized, high-impact settings.

However, agentic systems will not be deployed unless they become as trusted as human experts—and learning only from public internet data can’t possibly get us to this level of accuracy and trust, no matter how advanced the underlying LLMs become.

Building expert-level agentic AI requires expert-level data, and the next frontier of AI will be driven by using expertise to develop this kind of data for evaluating and tuning specialized AI at scale.

At Snorkel, our mission is exactly this—to enable every enterprise to turn expert knowledge into specialized AI at scale. Today, we’re excited to announce the general availability of two new products within Snorkel’s AI Data Development Platform, creating a complete solution to help enterprises actually deploy agentic AI to production in mission-critical settings:

  1. Snorkel Evaluate – which enables enterprises to evaluate their AI systems with trust and accuracy at scale using our programmatic data development technology.
  2. Snorkel Expert Data-as-a-Service – which is already powering leading LLM developers with frontier AI datasets.

We built these new offerings on what we learned working side by side with our customers to get real AI systems into production, including 7 of the top 10 US banks, Fortune 500 companies, federal agencies, and leading LLM providers.

Increasingly, we see enterprise leaders in AI using the differentiated edge of their internal expertise and data, and the accelerant of proprietary datasets built with scaled external expertise. We’re excited to now support both of these approaches, together in one unified platform for bringing sophisticated AI models and agentic systems to production.

We’re also excited to release a new enterprise-inspired agentic AI benchmark dataset and walkthrough to showcase the power of what can be done with Snorkel Evaluate and Expert Data-as-a-Service together. We believe new, domain-specific benchmarks and evaluation tools are critical to drive agentic system development in a safe and successful way, and will be releasing new industry-leading benchmarks regularly—several more to come soon!

Finally, we’re incredibly excited to announce Snorkel’s $100M Series D at a $1.3B valuation, led by Addition with participation from Prosperity7 Ventures and QBE Ventures, existing investors Greylock and Lightspeed, and others. The funding supports our continued research and innovation at the frontier of specialized AI.

To learn more, join Snorkel AI and innovators from Accenture, BNY, Comcast, Stanford University, QBE, University of Wisconsin-Madison, and more on June 26 for an exclusive virtual live event, Developing Specialized Enterprise AI Agents.

Now, let’s dive in and briefly explore the two newly launched products!

Snorkel Evaluate

First: we’re incredibly excited to announce Snorkel Evaluate, our AI evaluation platform for specialized data development and labeling in enterprise settings where vibe checks and out-of-the-box metrics driven by simple LLM prompts are just not enough.

Evaluation is the new entrypoint to the AI development cycle—but there’s a major gap between what’s available on the market today and what it actually takes to develop specialized, trustworthy evaluations for real enterprise applications.

Imagine (and this is barely a metaphor) that you were running a standardized test for students, like the SAT, but asked the students to write their own exam questions, grade their own tests, and figure out on their own where to improve. This is largely the state of AI evaluation today.

In Snorkel Evaluate, we’re bringing our unique programmatic data development and labeling technology to close these gaps around defining what your AI system needs to know, how its performance is graded, and where to go next in highly specialized, aligned ways, with workflows for:

  • Scalably generating and curating benchmark evaluation datasets—the collection of prompts and expected responses or actions that define what your AI system is supposed to do.
  • Developing specialized evaluators that label or grade an AI system’s output and actions, defining how your AI system is supposed to perform, and aligning them to unique enterprise objectives and standards—going beyond off-the-shelf LLM-as-a-judge approaches that fail to be accurate enough for specialized tasks.
  • Labeling the fine-grained slices of your benchmark dataset that correspond to meaningful subtasks or error modes, and that give actionable guidance on where an AI system needs to be improved.
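The three workflows above (a benchmark dataset, an evaluator, and slice labels) can be illustrated with a minimal sketch. This is not Snorkel Evaluate's API; every name here (`Example`, `exact_match`, `score_by_slice`, `toy_system`) is hypothetical, the data is made up, and the evaluator is a deliberately trivial exact-match check standing in for a specialized one.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Example:
    """One benchmark item: a prompt, the expected response, and a slice tag."""
    prompt: str
    expected: str
    slice_name: str  # hypothetical label for a subtask or error mode


def exact_match(response: str, expected: str) -> bool:
    """Toy evaluator (an assumption): case-insensitive exact match."""
    return response.strip().lower() == expected.strip().lower()


def score_by_slice(
    examples: List[Example],
    system: Callable[[str], str],
    evaluator: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Run the system on every example and return the pass rate per slice."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for ex in examples:
        totals[ex.slice_name] += 1
        if evaluator(system(ex.prompt), ex.expected):
            passes[ex.slice_name] += 1
    return {name: passes[name] / totals[name] for name in totals}


# Made-up benchmark and system, for illustration only.
benchmark = [
    Example("2 + 2", "4", "arithmetic"),
    Example("3 * 3", "9", "arithmetic"),
    Example("capital of France", "Paris", "geography"),
    Example("capital of Japan", "Tokyo", "geography"),
]


def toy_system(prompt: str) -> str:
    """Stand-in for an AI system; it does not know the capital of Japan."""
    answers = {"2 + 2": "4", "3 * 3": "9", "capital of France": "paris"}
    return answers.get(prompt, "unknown")


scores = score_by_slice(benchmark, toy_system, exact_match)
print(scores)  # {'arithmetic': 1.0, 'geography': 0.5}
```

Reporting a score per slice rather than one aggregate number is what makes the result actionable: in this toy run the overall pass rate of 75% hides the fact that every failure sits in the `geography` slice.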

With Snorkel Evaluate, enterprises can rapidly build specialized, scalable evaluations that are tightly aligned to their unique use cases, settings, and standards—leading to real production value. Our early design partners are seeing high-impact results:

  • Rox, a leading agentic AI startup for revenue organizations, built specialized evaluators with 99%+ accuracy, up from 75% accuracy with a basic LLM-as-a-judge approach, enabling sufficient trust to ship a critical email outbound feature.
  • A top-5 telecommunications company built, in under a week, specialized evaluators for their agentic CSR system that averaged 88% accuracy, up from an average of 55% accuracy using basic LLM-as-a-judge approaches.
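One common way to read "evaluator accuracy" in results like these is as the agreement rate between an automated evaluator's verdicts and expert verdicts on the same outputs. The post does not say how the figures above were computed, so the sketch below is an assumption about the metric, with a hypothetical function name and made-up labels that are unrelated to the customer results.

```python
from typing import Sequence


def evaluator_accuracy(
    evaluator_labels: Sequence[bool], expert_labels: Sequence[bool]
) -> float:
    """Fraction of items where the evaluator's verdict matches the expert's."""
    if len(evaluator_labels) != len(expert_labels):
        raise ValueError("label sequences must have the same length")
    if not expert_labels:
        raise ValueError("need at least one labeled item")
    agreements = sum(e == x for e, x in zip(evaluator_labels, expert_labels))
    return agreements / len(expert_labels)


# Made-up pass/fail verdicts on eight system outputs (illustration only).
expert = [True, True, False, True, False, False, True, True]
basic_judge = [True, False, False, True, True, False, True, False]
specialized = [True, True, False, True, False, False, True, True]

print(evaluator_accuracy(basic_judge, expert))  # 0.625
print(evaluator_accuracy(specialized, expert))  # 1.0
```

Measured this way, an evaluator is itself evaluated against a trusted reference set before its grades are used to decide whether a feature is ready to ship.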

“We’re at an inflection point where AI agents must deliver real enterprise value. To unlock Claude’s full potential, we need new evaluation approaches with domain expertise and human feedback. Anthropic is committed to working with innovators like Snorkel to ensure AI systems are refined, reliable, and aligned to enterprise needs.”

—Kate Jensen, Head of Revenue, Anthropic

Snorkel Expert Data-as-a-Service

Next: we’re incredibly excited to announce Snorkel Expert Data-as-a-Service—a white-glove service for AI evaluation and post-training datasets, built from the ground up for specialized, expert-level data.

As we move to the current wave of specialized, high-impact agentic AI, LLM developers have realized that the key to success is not getting more data but getting the right high-quality expert datasets, with the right distributions for specific domains and use cases.

Sometimes, this expert data lives in your organization, or in the minds of your own experts; we’ve long been focused on this setting at Snorkel. But often, scaled expertise outside of your organization is a critical and complementary accelerant for achieving the breadth, depth, and speed needed in AI today.

With Snorkel Expert Data-as-a-Service, we’re excited to now support this latter approach as well, through a global network of experts across thousands of domains in STEM, vertical and professional, and consumer and lifestyle areas, building specialized datasets for leading LLM developers in frontier areas like:

  • Agentic – including multi-turn with users, multi-step with reasoning, and multi-tool across various domain-specific and consumer settings
  • Expert knowledge and reasoning – across thousands of subdomains in STEM/academic, vertical/professional, and consumer domains
  • Coding – across a variety of languages, frameworks, and tasks, and including unit tests and more nuanced rubrics for evaluation, complex long-sequence and multi-turn software engineering tasks, and more
  • Multi-modal – including text, PDF, image, video, and code

With Snorkel Expert Data-as-a-Service, we’re able to deliver custom, expert-level datasets with higher quality, more precise distributional control, and greater delivery speed by leveraging the same programmatic data development and labeling technology in our AI Data Development Platform, pioneered over our past decade of research and development.

Scaled, expert data represents the new rocket fuel for specialized AI systems and agents. Increasingly, enterprises will mix their own unique, in-house expertise and data with proprietary datasets they develop using outsourced expertise, in order to achieve the acceleration they need in today’s AI market. In both cases, it’s about the right data, not just more data, and we’re excited to support this with our unique technology platform and new Expert Data-as-a-Service.

Expert Data as the Key to Durable Differentiation in AI

As models and infrastructure in the AI space continue to standardize, developing unique proprietary expert data for evaluation and tuning will become the centerpoint of AI development—and the key to a differentiated edge.

We believe the leaders in enterprise agentic AI will combine their unique internal expertise and data with the accelerant of scaled external expert data, in a rapid, iterative cycle of evaluation and tuning—and now, you can do this all in one unified, enterprise-grade platform.

If this seems relevant to something you’re building with AI models or agentic systems, let us know; we’re excited to talk! For more detail on what we’ve just launched:

  • Check out the open source benchmark dataset preview and walkthrough we just released around building an enterprise AI agent.
  • Mark your calendar for June 26 to join our event on developing specialized enterprise AI agents.

We’re excited to see what we can build together!