惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

Recent Commits to openclaw:main
Recent Commits to openclaw:main
博客园 - 叶小钗
Stack Overflow Blog
Stack Overflow Blog
S
SegmentFault 最新的问题
D
DataBreaches.Net
S
Securelist
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
T
Threatpost
C
Cyber Attacks, Cyber Crime and Cyber Security
The Hacker News
The Hacker News
Jina AI
Jina AI
T
Threat Research - Cisco Blogs
GbyAI
GbyAI
Microsoft Azure Blog
Microsoft Azure Blog
WordPress大学
WordPress大学
Engineering at Meta
Engineering at Meta
T
The Exploit Database - CXSecurity.com
A
Arctic Wolf
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
C
Cisco Blogs
PCI Perspectives
PCI Perspectives
Project Zero
Project Zero
G
Google Developers Blog
宝玉的分享
宝玉的分享
H
Heimdal Security Blog
美团技术团队
Schneier on Security
Schneier on Security
C
CERT Recently Published Vulnerability Notes
Martin Fowler
Martin Fowler
博客园 - 司徒正美
博客园 - 三生石上(FineUI控件)
Help Net Security
Help Net Security
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
Google DeepMind News
Google DeepMind News
C
Check Point Blog
Hacker News: Ask HN
Hacker News: Ask HN
L
LINUX DO - 最新话题
O
OpenAI News
Hacker News - Newest:
Hacker News - Newest: "LLM"
N
Netflix TechBlog - Medium
S
Security Affairs
小众软件
小众软件
MongoDB | Blog
MongoDB | Blog
Blog — PlanetScale
Blog — PlanetScale
V
V2EX - 技术
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
F
Fortinet All Blogs
G
GRAHAM CLULEY
云风的 BLOG
云风的 BLOG
S
Secure Thoughts

Hugging Face - Blog

Waypoint-1.5: Higher-Fidelity Interactive Worlds for Everyday GPUs ALTK‑Evolve: On‑the‑Job Learning for AI Agents Safetensors is Joining the PyTorch Foundation Holo3: Breaking the Computer Use Frontier Any Custom Frontend with Gradio's Backend A New Framework for Evaluating Voice Agents (EVA) Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations One-Shot Any Web App with Gradio's gr.HTML CUGA on Hugging Face: Democratizing Configurable AI Agents New in llama.cpp: Model Management Building Deep Research: How we Achieved State of the Art OVHcloud on Hugging Face Inference Providers 🔥 20x Faster TRL Fine-tuning with RapidFire AI Building for an Open Future - our new partnership with Google Cloud Building a Healthcare Robot from Simulation to Deployment with NVIDIA Isaac Sentence Transformers is joining Hugging Face! Unlock the power of images with AI Sheets Supercharge your OCR Pipelines with Open Models Google Cloud C4 Brings a 70% TCO improvement on GPT OSS with Intel and Hugging Face Get your VLM running in 3 simple steps on Intel CPUs Nemotron-Personas-India: Synthesized Data for Sovereign AI Introducing RTEB: A New Standard for Retrieval Evaluation Accelerating Qwen3-8B Agent on Intel® Core™ Ultra with Depth-Pruned Draft Models VibeGame: Exploring Vibe Coding Games Nemotron-Personas-Japan: ソブリン AI のための合成データセット Swift Transformers Reaches 1.0 – and Looks to the Future Smol2Operator: Post-Training GUI Agents for Computer Use SyGra: The One-Stop Framework for Building Data for LLMs and SLMs Gaia2 and ARE: Empowering the community to study agents Scaleway on Hugging Face Inference Providers 🔥 Democratizing AI Safety with RiskRubric.ai Public AI on Hugging Face Inference Providers 🔥 `LeRobotDataset:v3.0`: Bringing large-scale datasets to `lerobot` Visible Watermarking with Gradio Introducing the Palmyra-mini family: Powerful, lightweight, and ready to reason! Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers Fine-tune Any LLM from the Hugging Face Hub with Together AI Jupyter Agents: training LLMs to reason with notebooks mmBERT: ModernBERT goes Multilingual Welcome EmbeddingGemma, Google's new efficient embedding model SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence Make your ZeroGPU Spaces go brrr with ahead-of-time compilation NVIDIA Releases 6 Million Multi-Lingual Reasoning Dataset Generate Images with Claude and Hugging Face From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels MCP for Research: How to Connect AI to Research Tools Kimina-Prover-RL Arm & ExecuTorch 0.7: Bringing Generative AI to the masses Neural Super Sampling is here! TextQuests: How Good are LLMs at Text-Based Video Games? 🇵🇭 FilBench - Can LLMs Understand and Generate Filipino? Introducing AI Sheets: a tool to work with datasets using open AI models! Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training Vision Language Model Alignment in TRL ⚡️ Welcome GPT OSS, the new open-source model family from OpenAI! Measuring Open-Source Llama Nemotron Models on DeepResearch Bench 📚 3LM: A Benchmark for Arabic LLMs in STEM and Code Implementing MCP Servers in Python: An AI Shopping Assistant with Gradio Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face Say hello to `hf`: a faster, friendlier Hugging Face CLI ✨ Parquet Content-Defined Chunking TimeScope: How Long Can Your Video Large Multimodal Model Go? Fast LoRA inference for Flux with Diffusers and PEFT Accelerate a World of LLMs on Hugging Face with NVIDIA NIM Arc Virtual Cell Challenge: A Primer Consilium: When Multiple LLMs Collaborate Back to The Future: Evaluating AI Agents on Predicting Future Events Five Big Improvements to Gradio MCP Servers Ettin Suite: SoTA Paired Encoders and Decoders Migrating the Hub from Git LFS to Xet Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models Asynchronous Robot Inference: Decoupling Action Prediction and Execution ScreenEnv: Deploy your full stack Desktop Agent Building the Hugging Face MCP Server Reachy Mini - The Open-Source Robot for Today's and Tomorrow's AI Builders Creating custom kernels for the AMD MI300 Upskill your LLMs With Gradio MCP Servers SmolLM3: smol, multilingual, long-context reasoner Three Mighty Alerts Supporting Hugging Face’s Production Infrastructure Efficient MultiModal Data Pipeline Announcing NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models Training and Finetuning Sparse Embedding Models with Sentence Transformers Welcome the NVIDIA Llama Nemotron Nano VLM to Hugging Face Hub Gemma 3n fully available in the open-source ecosystem! Transformers backend integration in SGLang (LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware Groq on Hugging Face Inference Providers 🔥 How Long Prompts Block Other Requests - Optimizing LLM Performance Learn the Hugging Face Kernel Hub in 5 Minutes Featherless AI on Hugging Face Inference Providers 🔥 Convert Transformers to ONNX with Hugging Face Optimum Intel and Hugging Face Partner to Democratize Machine Learning Hardware Acceleration Director of Machine Learning Insights [Part 3: Finance Edition] The Annotated Diffusion Model Deep Q-Learning with Space Invaders Graphcore and Hugging Face Launch New Lineup of IPU-Ready Transformers Introducing Pull Requests and Discussions 🥳 Efficient Table Pre-training without Real Data: An Introduction to TAPEX An Introduction to Q-Learning Part 2/2 How Sempre Health is leveraging the Expert Acceleration Program to accelerate their ML roadmap
Aligning to What? Rethinking Agent Generalization in MiniMax M2
MiniMax · 2025-10-30 · via Hugging Face - Blog

Back to Articles

MiniMax's avatar

It's been fantastic to see the community dive into our new MiniMax M2, with many highlighting its impressive skills in complex agentic tasks. This is particularly exciting for me, as my work was centered on the agent alignment part of its post-training. In this post, I'd like to share some of the key insights and lessons we learned during that process.

The Real Agent Alignment Problem: Benchmarks or Reality?

If you've worked with LLM Agents, you've felt this pain: the same model can feel brilliant in one framework and useless in another. An agent might crush a tool-use leaderboard but fail spectacularly at a simple, real-world task. This gap between benchmark performance and practical usability is one of the biggest challenges in the field.

When we designed M2, we knew we had to tackle this problem head-on. This led us to two core, and sometimes conflicting, objectives:

  1. Excel on Open-Source Benchmarks. Benchmarks are essential for measuring "pure" capabilities. A benchmark like BrowseComp, for instance, tests for sophisticated search skills. While users will rarely ask a question as contrived as, "Find the paper where the third letter of the nth author's name is 'x'," a model that can solve it proves it has strong foundational abilities.
  2. Generalize Robustly to the Real World. This is the harder, more important part. A great agent must perform reliably across unfamiliar tools, IDEs/CLIs, agent scaffolding, and user setups. It can't be a one-trick pony; it needs to generalize.

So, who do we align with? The answer is both. We align with benchmarks to build skill, but we must ultimately align with the user by ensuring those skills work everywhere.

While the methods for acing benchmarks are a deep topic for another day, I want to focus on that second, trickier objective: How do we train an agent for the wild?

The Need for Interleaved Thinking

Early in the project, we hit a frustrating wall. Agent performance was inconsistent, and we struggled to diagnose why. After many discussions, especially with Professor @Junxian He and @Wenhu Chen, we arrived at our first major conclusion: Agents require Interleaved Thinking.

This means that an agent's internal monologue—its "thinking"—can and should happen at any point during a task, not just once at the beginning like a standard reasoning model. This design is critical for two reasons:

  1. Maintaining Focus on Long-Horizon Tasks. Complex agent tasks have extremely long contexts. A single thought process at the start isn't enough to maintain instruction-following and coherence.
  2. Adapting to External Perturbations. This is the crucial difference. Agent tasks introduce constant, unpredictable perturbations from the outside world (i.e., tool outputs). The model must be robust enough to handle these perturbations, diagnose errors, and extract useful information. The "thinking" process allows the model to constantly re-evaluate and adapt to new information from the environment.

This principle became a cornerstone of M2's effectiveness.

Pro Tip for M2 Users: Because M2 relies on Interleaved Thinking, its context is its memory. For best performance, you must retain the full session history, including the thinking steps. We've noticed that much of the community feedback about performance gaps stems from accidentally discarding this vital context, which is a common practice with simpler reasoning models.

True Generalization is About Perturbation

Our initial theory was simple: tool scaling is agent generalization.

We started with a minimal set of tools (a Python interpreter, search engine, a browser) to build a baseline of tool-calling capability. The roadmap was clear: scale up the number and variety of tools, and the agent's ability to generalize to unseen tools would naturally follow.

At first, this worked. Our benchmark scores climbed to respectable levels. But as we dug deeper, we realized we were solving the wrong problem. The model aced the tests, but if we changed the environment even slightly—like swapping to a different scaffolding framework—its performance would plummet. We were still far from our goal of a "practically useful" model.

This led to our second, more profound realization: Agent generalization is not just about adapting to new tools; it's about adapting to perturbations across the model's entire operational space.

Clipboard_Screenshot_1761817571

This sounds abstract, so let's break it down. Think about everything that can change in a single agent task:

  • The Tool Info and available toolset.
  • The System Prompt defining the agent's persona and rules.
  • The User Prompt and its specific goal.
  • The Environment itself (files, codebases, APIs).
  • The Tool Responses returned at each step. Our old "tool scaling" approach only addressed the first item. It ignored perturbations in all the other parts of the process. Armed with this new understanding, our team built a comprehensive data pipeline designed for full-trajectory generalization. The data it generates trains the model to be stable against perturbations at every step. The results have been incredibly encouraging. In internal tests, we threw obscure, "cold-start" scaffolding at M2—frameworks we'd barely considered—and its performance exceeded our expectations. Both its tool-calling and instruction-following abilities generalized beautifully.

What's Next?

Our work on M2 taught us an immense amount about agents, generalization, and data, but it has opened up more questions than it answered. Many of our ideas are still on the whiteboard. In the coming months, we will be exploring these frontiers even more deeply, and we can't wait to bring you the next generation of powerful and genuinely useful models.

Getting Involved

  • Use the Model: We sincerely hope you'll put M2 to the test. You can access it through our official channels or find the open-sourced version to conduct your own research.
  • Join Our Team: If these are the kinds of challenges that excite you, we're hiring. We are always looking for passionate people to join us in the mission to build AGI. Please send us your resume!