Research spotlight: is long chain-of-thought structure all that matters when it comes to LLM reasoning distillation?
2025-03-20 · via Snorkel AI

We’re taking a look at the research paper, LLMs can easily learn to reason from demonstration (Li et al., 2025), in this week’s community research spotlight. It focuses on how the structure of reasoning traces impacts distillation from models such as DeepSeek R1.

What’s the big idea regarding LLM reasoning distillation?

The reasoning capabilities of powerful models such as DeepSeek R1 and QwQ-32B-Preview can be efficiently and easily distilled into smaller, open models – and when doing so, the structure of a long chain-of-thought (CoT) is more important than the details within its individual steps. In fact, the details don’t even have to be correct.

In the reasoning distillation process the paper describes, a large reasoning model generates a response for each problem in a dataset, and each response includes a long CoT whose reasoning steps lead to the correct solution. Then, a smaller, open model is fine-tuned on these problem-solution pairs, effectively transferring the reasoning capabilities of a large reasoning model such as DeepSeek R1 (the teacher) to a smaller one (the student).
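The data-collection step above can be sketched as a simple filter: ask the teacher for a response and keep it only when its final answer matches the reference. This is a minimal illustration, not the authors' pipeline. `Problem`, `build_distillation_set`, `toy_teacher` and `toy_extract` are hypothetical names, and the toy teacher stands in for a real call to a model such as DeepSeek R1.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Problem:
    question: str
    reference_answer: str


def build_distillation_set(
    problems: Iterable[Problem],
    teacher_generate: Callable[[str], str],
    extract_answer: Callable[[str], str],
    attempts: int = 1,
) -> List[Dict[str, str]]:
    """Keep only teacher responses whose final answer matches the reference."""
    examples = []
    for problem in problems:
        for _ in range(attempts):
            response = teacher_generate(problem.question)
            if extract_answer(response) == problem.reference_answer:
                # The full response (long CoT + solution) becomes the SFT target.
                examples.append({"prompt": problem.question, "completion": response})
                break
    return examples


def toy_teacher(question: str) -> str:
    # Hypothetical stand-in for a large reasoning model; the second answer is wrong on purpose.
    canned = {
        "2+2?": "Let me check. 2+2 is 4. Answer: 4",
        "3*3?": "Hmm. 3*3 is 6. Answer: 6",
    }
    return canned[question]


def toy_extract(response: str) -> str:
    # Assumes the response ends with "Answer: <value>".
    return response.rsplit("Answer:", 1)[-1].strip()


problems = [Problem("2+2?", "4"), Problem("3*3?", "9")]
dataset = build_distillation_set(problems, toy_teacher, toy_extract)
print(len(dataset))  # only the correctly answered problem is kept
```

The resulting list of prompt/completion pairs is what the student model would be fine-tuned on.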

It’s worth noting that the authors performed reasoning distillation with just 17,000 examples, applied via supervised fine-tuning (SFT) and LoRA, with the latter being both data and parameter efficient.
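To give a rough sense of what “parameter efficient” means here: LoRA freezes a weight matrix of shape d × k and trains two low-rank factors of shapes d × r and r × k instead, so the trainable count drops from d·k to r·(d + k). The sizes below are illustrative assumptions, not values from the paper.

```python
def full_params(d: int, k: int) -> int:
    """Trainable weights when the whole d x k matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable weights for LoRA factors B (d x r) and A (r x k)."""
    return r * (d + k)


# Illustrative sizes only (assumed, not taken from the paper).
d, k, r = 4096, 4096, 16
full = full_params(d, k)
lora = lora_params(d, k, r)
print(f"LoRA trains {lora:,} of {full:,} weights ({lora / full:.2%})")
# prints: LoRA trains 131,072 of 16,777,216 weights (0.78%)
```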

We’ll dive into the two main experiments soon, but I think this chart says it all. Distilling reasoning capabilities from DeepSeek R1 to Qwen2.5-32B-Instruct resulted in it being more or less on par with OpenAI o1-preview, and sometimes noticeably better (e.g., on math).

However, the key takeaway of this research paper is that the structure of a long CoT was more important than the content of its individual steps.

Long Chain-of-Thought concepts

Large reasoning models generate a long CoT by incorporating reflection, backtracking and self-validation. This long CoT helps them come to the correct conclusion, or in this research paper, to generate a correct answer via multi-step reasoning.

One little tidbit I found interesting was a list of words and phrases which are frequent indicators of reflection, backtracking and self-validation: 

  • “Alternatively”
  • “Wait”
  • “Just to be thorough”
  • “Just to make sure”
  • “Let me just double-check”
  • “Let me try another”
  • “Let me verify”
  • “Let me check”
  • “Hmm”
  • “But”
  • “Maybe I should consider”
  • “Maybe I can consider”
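As a small illustration of how these indicators might be used, the sketch below counts how often each phrase appears in a reasoning trace. The phrase list is the one above. The function name and the sample trace are made up for the example, and the whole-word, case-insensitive matching is an assumption rather than the paper's method.

```python
import re
from collections import Counter

REFLECTION_PHRASES = [
    "Alternatively", "Wait", "Just to be thorough", "Just to make sure",
    "Let me just double-check", "Let me try another", "Let me verify",
    "Let me check", "Hmm", "But", "Maybe I should consider",
    "Maybe I can consider",
]


def count_reflection_phrases(trace: str) -> Counter:
    """Count case-insensitive, whole-word occurrences of each phrase."""
    lowered = trace.lower()
    counts = Counter()
    for phrase in REFLECTION_PHRASES:
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            counts[phrase] = hits
    return counts


# Made-up trace for demonstration.
trace = (
    "Wait, that sum looks off. Let me verify the carry. "
    "But maybe I should consider the other case. Wait, now it matches."
)
counts = count_reflection_phrases(trace)
print(counts.most_common())
```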

And for reference, here is the prompt they used.

“Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracking, and iteration to develop a well-considered thinking process.

Please structure your response into two main sections: Thought and Solution.

In the Thought section, detail your reasoning process using the specified format: <|begin of thought|> thought with steps separated with \n\n <|end of thought|> Each step should include detailed considerations such as analyzing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps.

In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The solution should remain a logical, accurate, concise expression style and detail necessary step needed to reach the conclusion, formatted as follows: <|begin of solution|> final formatted, precise, and clear solution <|end of solution|>

Now, try to solve the following question through the above guidelines:”
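A response that follows this format can be split back into reasoning steps and a final solution. The sketch below is one way to do that, assuming the delimiters are spelled exactly as printed in the prompt above. The real tokens may be spelled differently (for example, with underscores), so treat the patterns as placeholders. `parse_response` is a hypothetical helper name.

```python
import re
from typing import Dict, List, Optional, Union

# Delimiters copied from the prompt as printed above (assumed spelling).
THOUGHT_RE = re.compile(r"<\|begin of thought\|>(.*?)<\|end of thought\|>", re.DOTALL)
SOLUTION_RE = re.compile(r"<\|begin of solution\|>(.*?)<\|end of solution\|>", re.DOTALL)


def parse_response(text: str) -> Optional[Dict[str, Union[List[str], str]]]:
    """Return reasoning steps and the solution, or None if a section is missing."""
    thought = THOUGHT_RE.search(text)
    solution = SOLUTION_RE.search(text)
    if thought is None or solution is None:
        return None
    # The prompt asks for steps separated by a blank line.
    steps = [s.strip() for s in thought.group(1).split("\n\n") if s.strip()]
    return {"steps": steps, "solution": solution.group(1).strip()}


response = (
    "<|begin of thought|>First, add 2 and 2.\n\n"
    "Wait, let me verify: 2 + 2 = 4.<|end of thought|> "
    "<|begin of solution|>4<|end of solution|>"
)
parsed = parse_response(response)
print(parsed)
```

Splitting the Thought section into a list of steps is also what makes step-level edits (deleting, inserting or shuffling steps) straightforward.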

LLM reasoning distillation experiments

DeepSeek-R1 → Qwen2.5-32B-Instruct

In this experiment, the authors fine-tuned Qwen2.5-32B-Instruct with the Bespoke-Stratos-17k reasoning dataset.

It contains coding questions from APPS and TACO, math questions from NuminaMATH and science/puzzle questions from STILL-2. These questions were paired with reasoning traces and correct solutions generated by DeepSeek R1 in order to create the training data.

The result is a 15.2% average improvement in accuracy with just 17k samples. Notably, their fine-tuned Qwen2.5-32B-Instruct model approaches DeepSeek R1 accuracy on the AMC 2023 benchmark in particular.

QwQ-32B-Preview → Qwen2.5-32B-Instruct

In the second experiment, the authors curated a similar dataset, but the reasoning traces and correct solutions were generated by QwQ-32B-Preview.

This time they experimented with both SFT and LoRA, as well as two different dataset sizes (7k and 17k). Interestingly enough, fine-tuning with LoRA produced a model on par with the one trained via SFT, and with OpenAI’s o1-preview. This demonstrates that reasoning distillation can be quite efficient in terms of both data and parameters. Further, this is where the authors began to realize that long CoT reasoning may not rely on knowledge, but rather on the structure of reasoning patterns.

Incorrect reasoning traces

The most interesting finding in this research is that modifying the content of reasoning traces to introduce errors has little impact on reasoning transfer via distillation and fine-tuning. The authors introduced errors in different places in order to measure the impact of a reasoning trace’s overall structure vs. the content within individual steps.

Changes within reasoning steps:

  • Modified examples so the answers were wrong
  • Modified digits within reasoning steps (e.g. replaced with random numbers)
  • Removed common reasoning keywords such as “wait”
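Two of these content-level changes are easy to picture in code. The sketch below randomizes a chosen fraction of the digits in a step and strips reflection keywords. It is an illustration under assumed details (function names, the exact sampling and the regex are mine), not the authors' implementation.

```python
import random
import re
from typing import List


def randomize_digits(step: str, fraction: float, rng: random.Random) -> str:
    """Replace a fraction of the digit characters in a step with random digits."""
    chars = list(step)
    digit_positions = [i for i, c in enumerate(chars) if c in "0123456789"]
    n_changes = round(len(digit_positions) * fraction)
    for i in rng.sample(digit_positions, n_changes):
        chars[i] = str(rng.randrange(10))
    return "".join(chars)


def remove_keywords(step: str, keywords: List[str]) -> str:
    """Delete reasoning keywords (plus a trailing comma or period and spaces)."""
    pattern = "|".join(r"\b" + re.escape(k) + r"\b[,.]?\s*" for k in keywords)
    return re.sub(pattern, "", step, flags=re.IGNORECASE).strip()


rng = random.Random(0)
original = "12 + 30 = 42"
corrupted = randomize_digits(original, 1.0, rng)
cleaned = remove_keywords(
    "Wait, 12 + 30 = 42. Let me verify that.", ["wait", "let me verify"]
)
print(corrupted)
print(cleaned)
```

Both functions leave the number and order of steps untouched, which is what separates them from the structural changes listed next.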

Changes to the reasoning structure:

  • Deleted random reasoning steps
  • Inserted random reasoning steps from other examples
  • Shuffled reasoning steps randomly (i.e., changed the order)
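The three structural changes can be sketched as operations on a list of reasoning steps. As before, this is a minimal illustration with hypothetical function names. How the paper selects which steps to delete, insert or shuffle may differ from the uniform sampling assumed here.

```python
import random
from typing import List


def delete_steps(steps: List[str], fraction: float, rng: random.Random) -> List[str]:
    """Drop a random fraction of the steps, keeping the rest in order."""
    drop = set(rng.sample(range(len(steps)), round(len(steps) * fraction)))
    return [s for i, s in enumerate(steps) if i not in drop]


def insert_steps(steps: List[str], donor_steps: List[str], fraction: float,
                 rng: random.Random) -> List[str]:
    """Insert steps taken from other examples at random positions."""
    out = list(steps)
    for _ in range(round(len(steps) * fraction)):
        out.insert(rng.randrange(len(out) + 1), rng.choice(donor_steps))
    return out


def shuffle_steps(steps: List[str], fraction: float, rng: random.Random) -> List[str]:
    """Permute a random fraction of the steps among their own positions."""
    positions = sorted(rng.sample(range(len(steps)), round(len(steps) * fraction)))
    picked = [steps[i] for i in positions]
    rng.shuffle(picked)
    out = list(steps)
    for pos, step in zip(positions, picked):
        out[pos] = step
    return out


rng = random.Random(0)
steps = ["s1", "s2", "s3", "s4", "s5", "s6"]
donors = ["x1", "x2", "x3"]
deleted = delete_steps(steps, 0.67, rng)
inserted = insert_steps(steps, donors, 0.67, rng)
shuffled = shuffle_steps(steps, 0.67, rng)
print(deleted, inserted, shuffled, sep="\n")
```

Unlike the content-level edits, each of these changes the logical flow from one step to the next while leaving the text inside every surviving step intact.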

100% of examples with wrong answers? Just 3.2% lower accuracy.
67% of reasoning step digits randomized? Just 4.3% lower accuracy.
100% of reasoning keywords removed? Just 3.3% lower accuracy.

All in all, even with the content of the reasoning steps corrupted, reasoning distillation was effective, producing a model that was within a few percentage points of baseline accuracy. Now, what happens when the reasoning structure is corrupted?

67% of reasoning steps deleted? 12.8% lower accuracy.
67% of reasoning steps randomly added? 14.3% lower accuracy.
67% of reasoning steps shuffled? 

In short, the overall structure of a long CoT is the most critical aspect when fine-tuning models to improve their reasoning capabilities.

Final thoughts on LLM reasoning distillation

I found this research paper to be particularly insightful in light of the DeepSeek R1 release. We continue to see that models with exceptional reasoning capabilities are well within reach for everyone, including enterprises that may want or need to deploy open models. There are 270+ fine-tuned versions of DeepSeek R1 on Hugging Face now, and hundreds of datasets derived from it. We continue to believe that distillation is a particularly effective method for enterprises where specialized models are needed to support AI applications which have specific domain, business and use case requirements.
If you want to learn more about LLM distillation, register for our webinar tomorrow where I will go into a lot more detail. If you can’t attend, don’t worry, you can always watch the on-demand recording.