Apriel-H1: The Surprising Key to Distilling Efficient Reasoning Models

Hugging Face - Blog

Torsten Scholak, Oleksiy Ostapenko, Raymond Li, Luke Kumar, Joel · 2025-11-19 · via Hugging Face - Blog

We converted our 15B reasoning model to a Mamba hybrid, achieving 2.1x throughput with minimal quality loss. The key? A non-obvious insight about what data to distill on, and why intuition fails here.

When MiniMax published their M2 post-mortem in October explaining why they abandoned efficient attention at 230B scale, the narrative briefly became "efficient attention is dead." Within days, Kimi Linear proved otherwise. The real lesson: it depends on your constraints.

Our constraint was simple: we had a strong 15B reasoning model and needed to make it efficient without starting over. No infinite compute for 20T-token pretraining. No luxury of architectural co-design from day one. Just a practical question: can you retrofit efficiency into an existing model through distillation?

Spoiler: yes, but only if you ignore your intuition about what data to use.

What We Built

The Apriel-H1 family: seven checkpoints spanning 25-40 Mamba layers (out of 50 total), showing the complete efficiency-quality frontier. Our flagship Apriel-H1-15b-Thinker-SFT achieves 2.1x throughput with minimal quality loss: MATH500 and MTBench improve a few points (0.90 → 0.92 and 8.30 → 8.58, respectively), while GSM8k (0.97 → 0.95), GPQA (0.59 → 0.55), and AIME24 (0.70 → 0.65) regress slightly. Total training: 76.8B tokens.

Apriel-H1-15b-Thinker-SFT (green) vs full-attention teacher (blue). Reasoning quality stays nearly flat across benchmarks while throughput increases 1.89-2.09x depending on context length.

The full details are in our Apriel-H1 paper. Here, we focus on the key insight that made it work.

The Non-Obvious Insight

Here's what we initially thought would work: just distill on pretraining data and round it out with some SFT.

The reasoning seemed solid. We're inserting completely new Mamba layers that have never seen data. These linear SSMs need to learn general-purpose token mixing from scratch. How can they become effective mixers unless they get exposure to the same broad distribution the original attention layers saw?

So we tried it. Then we tried mixing pretraining and SFT data. It didn't work. The distilled hybrids lost reasoning quality, sometimes dramatically.

What actually worked: high-quality reasoning traces from the teacher's SFT dataset.

Distilling a reasoning model isn't about transferring general next-token prediction. The base model already has that, and we started from a strong 15B foundation. What we're preserving is specific and fragile: the teacher's multi-step reasoning patterns.

Those patterns emerge from intricate attention mechanisms. Retrieval heads pulling context from thousands of tokens back. Induction heads recognizing and continuing logical chains. Long-range dependencies connecting premises to conclusions many steps later. When you replace attention wholesale with Mamba's linear recurrence, these computational mechanisms are disrupted. The hybrid must discover new paths to the same reasoning outcomes.

That discovery requires explicit examples where reasoning structure is visible and correct:

Multi-step math proofs where each thought follows from the previous
Coding tasks with clear logical dependencies
Scientific analysis with detailed explanatory chains

Pretraining data, on the other hand, is too noisy and too diffuse. The reasoning signal gets lost. You need concentrated examples of the specific capability you're trying to preserve.

Once we understood the data choice, our distillation method became clear too. We used reverse KL divergence (temperature 1) rather than forward KL. Reverse won consistently. Why? We're training on problems where the teacher has high confidence and clear structure. Reverse KL's mode-seeking behavior encourages the student to commit to those high-confidence predictions. When your teacher is confident and correct, you want your student to be confident too.
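To make the difference between the two objectives concrete, here is a minimal sketch in plain Python. It is not Fast-LLM's implementation; the function names and toy logits are assumptions chosen for illustration. Forward KL is KL(teacher || student), and reverse KL is KL(student || teacher), computed per token over the vocabulary.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities, subtracting the max for stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def forward_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student): the expectation is taken under the teacher."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return kl(p, q)

def reverse_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(student || teacher): the expectation is taken under the student,
    so mass the student puts on tokens the teacher finds unlikely is costly."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return kl(q, p)

# Toy example (made-up logits): a confident teacher and a hedging student.
teacher = [4.0, 1.0, 0.0]
student = [2.0, 2.0, 0.0]
print("forward KL:", forward_kl(teacher, student))
print("reverse KL:", reverse_kl(teacher, student))
```

Both quantities are zero when the student matches the teacher exactly; they differ in which mismatches they punish most, which is where reverse KL's mode-seeking behavior comes from.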

This insight is the key to the whole approach: match your distillation data to the capability you're preserving, not the capability you're building.

How to Apply It: Staged Distillation

You can't just swap 40 attention layers for Mamba and hope. We learned this the hard way, and eventually developed a staged distillation procedure to get there reliably.

Stage 1: Identify least-important layers. We used a Leave-One-Out (LOO) analysis on MMLU: remove each layer, replace with identity, then measure the drop. Sort by importance, replace the bottom 25 with Mamba-in-Llama (MIL) initialized mixers. Distill end-to-end. This worked for our H-25 checkpoint.
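The LOO ranking can be sketched as follows. This is an illustrative outline only: `score_without` is a hypothetical callback standing in for "replace layer i with identity and evaluate on the benchmark", and the toy scores are invented.

```python
from typing import Callable, List, Tuple

def rank_layers_by_loo(
    num_layers: int,
    score_without: Callable[[int], float],
    baseline: float,
) -> List[Tuple[int, float]]:
    """Return (layer_index, score_drop) pairs, least important first.

    score_drop is the baseline score minus the score obtained with that
    single layer replaced by identity. A small drop means low importance.
    """
    drops = [(i, baseline - score_without(i)) for i in range(num_layers)]
    return sorted(drops, key=lambda item: item[1])

def least_important_layers(
    num_layers: int,
    score_without: Callable[[int], float],
    baseline: float,
    k: int,
) -> List[int]:
    """Indices of the k layers whose removal hurts the score the least."""
    ranked = rank_layers_by_loo(num_layers, score_without, baseline)
    return [index for index, _ in ranked[:k]]

# Toy usage with invented scores for a 6-layer model (baseline 0.80).
toy_scores = {0: 0.50, 1: 0.79, 2: 0.70, 3: 0.78, 4: 0.60, 5: 0.795}
print(least_important_layers(6, toy_scores.__getitem__, 0.80, k=3))
```

Each layer is scored in isolation, so the ranking says nothing about what happens when several low-ranked layers are removed together.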

Stage 2: Progressive conversion beyond 25 layers. LOO broke down past 25 layers because layers unimportant in isolation became critical in combination. To address this, we developed a dynamic heuristic we call MIL-Mamba-Replacement (MMR). For each remaining attention layer, we initialize a Mamba mixer with MIL, run 100 training steps, and record the distillation loss. Layers converging to lower loss are "easier" to replace. This captures training dynamics rather than static importance.

We progressed incrementally: 25 → 27 → 30 → 34 → 37 → 40 Mamba layers, grouping replacements by MMR scores. Each checkpoint distills from the previous.
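A rough sketch of the progressive schedule, under stated assumptions: `probe_loss` is a hypothetical callback standing in for "initialize a Mamba mixer with MIL in this layer, run 100 training steps, return the distillation loss", and re-probing the remaining layers at every stage is an assumption of this sketch rather than a detail confirmed above.

```python
from typing import Callable, FrozenSet, Iterable, List, Set

ProbeFn = Callable[[int, FrozenSet[int]], float]

def pick_easiest_layers(candidates: Iterable[int], score: Callable[[int], float], k: int) -> List[int]:
    """Return the k candidate layers with the lowest probe loss."""
    return sorted(candidates, key=score)[:k]

def progressive_schedule(
    start_mamba: Set[int],
    num_layers: int,
    targets: List[int],
    probe_loss: ProbeFn,
) -> List[Set[int]]:
    """Grow the set of Mamba layers stage by stage until each target count is hit."""
    mamba = set(start_mamba)
    stages = []
    for target in targets:
        current = frozenset(mamba)
        remaining = [i for i in range(num_layers) if i not in mamba]
        chosen = pick_easiest_layers(
            remaining, lambda i: probe_loss(i, current), target - len(mamba)
        )
        mamba.update(chosen)
        # In the real procedure the hybrid would be distilled here,
        # starting from the previous checkpoint, before the next stage.
        stages.append(set(mamba))
    return stages

# Toy usage: a fake probe that pretends lower-indexed layers are easier.
stages = progressive_schedule(
    start_mamba=set(range(25)),
    num_layers=50,
    targets=[27, 30, 34, 37, 40],
    probe_loss=lambda layer, mamba: float(layer),
)
print([len(s) for s in stages])
```

The point of probing with a short training run, rather than a static score, is that the ranking reflects how the layer behaves once it is actually being trained as a Mamba mixer alongside the layers already converted.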

Stage 3: End-to-end training on SFT data. After reaching the target Mamba layer count, we did a final SFT pass until reasoning performance stabilized. After 55.9B distillation tokens and 20.9B SFT tokens, this produced our final Apriel-H1-15b-Thinker-SFT model.

The complete efficiency frontier. Each checkpoint shows cumulative training tokens. Our flagship H-30-SFT (released as Apriel-H1-15b-Thinker-SFT) used 76.8B total for 2.1x throughput at 0.76 average score. The aggressively converted H-40 variant used 136.5B tokens for 3.4x throughput. For reference: NVIDIA's Nemotron-Nano-9B-v2 achieves 4.6x at 0.77 score but required training from scratch with orders of magnitude more compute.

Making It Reproducible: Fast-LLM

We built all this on Fast-LLM, our open-source training framework. The core architectural principle: large language model transformers should be modular. Attention and Mamba are different implementations of the same "mixing" interface, and can be swapped freely.

Here's a hybrid architecture in Fast-LLM's config format:

decoder:
  type: "pattern"
  blocks:
    attention_block:
      mixer:
        type: "attention"
        heads: 32
        head_groups: 8
        head_size: 128
      mlp:
        type: "gated"
        activation: "silu"
    mamba_block:
      mixer:
        type: "mamba"
        d_inner: 4096
        state_size: 16
        dt_rank: 16
      mlp:
        type: "gated"
        activation: "silu"
  num_blocks: 50
  pattern: ["attention_block", "attention_block", "mamba_block", ...]

The pattern field specifies layer order. For Apriel-H1-15b-Thinker-SFT: 30 mamba_block, 20 attention_block, placed by importance. That's it.
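For illustration, a pattern list like the one above could be generated from a set of Mamba layer indices. This helper is hypothetical and not part of Fast-LLM, and the indices used below are placeholders, not the released model's placement.

```python
from typing import Iterable, List

def build_pattern(num_blocks: int, mamba_layers: Iterable[int]) -> List[str]:
    """Return one block name per layer, in order, for a hybrid decoder."""
    chosen = set(mamba_layers)
    out_of_range = sorted(i for i in chosen if not 0 <= i < num_blocks)
    if out_of_range:
        raise ValueError(f"layer indices out of range: {out_of_range}")
    return [
        "mamba_block" if i in chosen else "attention_block"
        for i in range(num_blocks)
    ]

# Placeholder placement: 30 Mamba layers in the middle of a 50-layer stack.
pattern = build_pattern(50, range(10, 40))
print(pattern.count("mamba_block"), pattern.count("attention_block"))
```

The resulting list has the same shape as the `pattern` field: one entry per block, naming either `attention_block` or `mamba_block`.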

Distillation is configuration too:

model:
  base_model:
    head:
      distillation_model: teacher
      distillation_loss_implementation: reverse_kl
reference_models:
  teacher:
    pretrained:
      format: mistral
      path: path/to/Apriel-Nemotron-15b-Thinker

Fast-LLM handles gradient accumulation, distributed training, tensor parallelism, checkpointing, and everything else you need for large-scale experimentation. It's open source and licensed under Apache 2.0. You can reproduce this work because we designed the infrastructure to make it reproducible.

FAQs

Why release all checkpoints? Because optimal depends on your constraints. H-30 offers the best balance. H-40 maximizes throughput for latency-critical workloads. The intermediate checkpoints let you choose your exact trade-off.

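Why do you get different speedups at different context lengths? Mamba's cost grows linearly with sequence length, while attention's cost grows quadratically. The longer the context, the more each replaced attention layer saves, so the speedup increases with sequence length.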
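A toy cost model shows the shape of this effect. Everything here is an assumption for illustration: the per-layer constants are made up, MLP cost is ignored, and the output is not meant to reproduce the measured throughput numbers. It models per-token decoding, where an attention layer's cost grows with the cached context and a Mamba layer's cost stays constant.

```python
def decode_speedup(
    context_len: int,
    num_layers: int = 50,
    num_mamba: int = 30,
    attn_fixed: float = 1.0,        # made-up constant cost per attention layer
    attn_per_token: float = 0.001,  # made-up cost per cached token attended to
    mamba_fixed: float = 1.0,       # made-up constant cost per Mamba layer
) -> float:
    """Ratio of full-attention cost to hybrid cost for generating one token."""
    attn_cost = attn_fixed + attn_per_token * context_len
    teacher = num_layers * attn_cost
    hybrid = (num_layers - num_mamba) * attn_cost + num_mamba * mamba_fixed
    return teacher / hybrid

for n in (1_000, 4_000, 16_000, 64_000):
    print(n, round(decode_speedup(n), 2))
```

In this toy model the speedup rises with context length and approaches num_layers / (num_layers - num_mamba), because the remaining attention layers eventually dominate the hybrid's cost.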

Why did you only try Mamba? We used Mamba-1 for three reasons: it has a proven distillation track record, has shown strong empirical performance, and was simple to implement in our framework. It let us focus on the data question first.

What were the Mamba hyperparameters? State size 16, DT rank 16, inner dimension 4096. For our GQA setup in Apriel we expanded B (input projection) and x (state) to match total attention heads following M1.

Why didn't you try more advanced conversion methods? We used Mamba-in-Llama initialization and knowledge distillation rather than MOHAWK's multi-stage procedure because the latter didn't show significant advantages in preliminary experiments.

Why did you only SFT the H-30 model? We only applied SFT to H-30 to validate that distilled hybrids can be improved through standard post-training. The other checkpoints are pure distillation but can be fine-tuned similarly.

Why didn't you explore RL? This was a scoping decision to isolate the distillation question: can you transfer reasoning via knowledge distillation alone? Answer: yes. But RL should close remaining quality gaps further. We are exploring RL for future iterations.

Did you really show that Apriel-H1 matches full-attention reasoning at similar compute budgets? We didn't do an apples-to-apples comparison between full-attention Apriel and a hybrid trained identically from pretraining forward. That would require repeating all mid-training and post-training of the teacher with the Apriel-H1 architecture, which was beyond our compute budget. What we can claim, though, is that retrofitting efficiency via distillation is practical and effective, and that the resulting hybrids can be fine-tuned to match or exceed the teacher's reasoning quality.

The Production Reality

We've implemented Apriel-H1 in Hugging Face Transformers and vLLM. The Transformers integration is straightforward: we ship a new model class with interchangeable attention and Mamba layers. The vLLM integration uses vLLM's recent Mamba cache operations for continuous batching, prefix caching, and chunked prefill. The vLLM plugin is ready, and we are currently waiting for final legal approval to open-source it.

Honest assessment: Deploying hybrids today means rough edges. The tooling is maturing fast but isn't turnkey. You will write custom code, validate numerical behavior carefully, and work around framework limitations. For teams that can absorb that cost, throughput gains are worth it. For those that can't, waiting might be the right call.

Takeaway

Most teams don't have infinite compute for 20T-token pretraining. If you've invested in a strong base model and need efficiency gains, this work shows a practical path: distill into hybrids using high-quality task-specific data that matches the capability you're preserving.

The surprising finding is simple: use reasoning data to distill reasoning. It seems obvious in retrospect but contradicts initial intuition. We validated it, explained why it works, and built the infrastructure to make it reproducible.

Try It

Models: Apriel-H1 Collection on Hugging Face
Training framework: Fast-LLM on GitHub
Teacher model: Apriel-Nemotron-15B-Thinker
Paper: Apriel-H1: Towards Efficient Enterprise Reasoning Models

Found something broken? File an issue. Discovered a better layer placement heuristic? Tell us. Built something interesting on Apriel-H1? We'd love to see it.

Citation:

@article{apriel-h1-2025,
  title={Apriel-H1: Towards Efficient Enterprise Reasoning Models},
  author={SLAM Lab, ServiceNow},
  journal={arXiv preprint arXiv:2511.02651},
  year={2025}
}

Core contributors: Oleksiy Ostapenko, Luke Kumar, Raymond Li, Denis Kocetkov, Joel Lamy-Poirier, Torsten Scholak
Contributors: Shruthan Radhakrishna, Soham Parikh, Shambhavi Mishra
Technical co-leads: Torsten Scholak, Sathwik Tejaswi Madhusudhan
