Gaia2 and ARE: Empowering the community to study agents

Clémentine Fourrier, Grégoire Mialon, Maxime Lecanu, Pierre Andr · 2025-09-22 · via Hugging Face - Blog

In an ideal world, AI agents would be reliable assistants. When given a query, they would easily manage ambiguity in instructions, construct step-by-step plans, correctly identify necessary resources, execute those plans without getting sidetracked, and adapt to unexpected events, all while maintaining accuracy and avoiding hallucinations. However, developing agents and testing these behaviors is no small feat: if you have ever tried to debug your own agent, you’ve probably observed how tedious and frustrating this can be. Existing evaluation environments are tightly coupled with the tasks they evaluate, lack real-world flexibility, and do not reflect the messy reality of open-world agents: simulated pages never fail to load, events don’t spontaneously emerge, and asynchronous chaos is absent.

That’s why we’re very happy to introduce Gaia2, the follow-up to the agentic benchmark GAIA, allowing analysis of considerably more complex behaviors. Gaia2 is released with the open Meta Agents Research Environments (ARE) framework to run, debug and evaluate agents. ARE simulates complex, real-world-like conditions and can be customized to further study agent behaviors. The Gaia2 dataset is released under the CC BY 4.0 license, and ARE under the MIT license.

Gaia2: Agentic Evaluation on Real Life Assistant Tasks

GAIA is an agentic benchmark published in 2023, with 3 levels of information retrieval questions requiring tools, web browsing, and reasoning to solve. In 2 years, the easiest levels have become too easy for models, and the community is coming close to solving the hardest questions, so it was time for an entirely new and harder agent benchmark!

Here comes Gaia2, a follow-up to GAIA that goes way beyond it in terms of the capabilities studied!

Where GAIA was read-only, Gaia2 is now a read-and-write benchmark, focusing on interactive behavior and complexity management. Agents are now evaluated not only on search and retrieval, but also on instruction following over ambiguous or time-sensitive queries, in a noisy environment with controlled failures - reflecting real-world conditions more than any other simulated environment. We want to test how agents manage tools or APIs that sometimes do not work, plan successions of actions with very specific time frames, and adapt to new events - a whole new range of complexity!

To do this, we use the following task groups (thanks to 1000 brand new human-created scenarios):

Execution: Multi-step instruction following and tool-use (e.g., contact updates)
Search: Cross-source information gathering (e.g., friend cities from WhatsApp)
Ambiguity Handling: Clarification of conflicting requests (e.g., scheduling conflicts)
Adaptability: Response to changes in the simulation (e.g., updating an email using follow up information)
Time/temporal Reasoning: Time-sensitive actions (e.g., cab orders after 3-minute delays)
Agent-to-Agent Collaboration: Communication between agents without direct API access
Noise Tolerance: Robustness to API failures and environmental instability
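To make the Noise Tolerance group more concrete, here is a minimal sketch of how controlled, reproducible failures could be injected into a tool. It is not ARE's mechanism: `make_flaky`, `call_with_retry`, `ToolFailure` and the `get_contact` tool are all hypothetical names invented for this illustration.

```python
import random
from typing import Any, Callable


class ToolFailure(RuntimeError):
    """Raised when the simulated tool call fails."""


def make_flaky(tool: Callable[..., Any], failure_rate: float, seed: int = 0) -> Callable[..., Any]:
    """Wrap `tool` so that roughly `failure_rate` of calls raise ToolFailure."""
    rng = random.Random(seed)  # private, seeded RNG so failures are repeatable

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if rng.random() < failure_rate:
            raise ToolFailure(f"{tool.__name__} is temporarily unavailable")
        return tool(*args, **kwargs)

    return wrapped


def call_with_retry(tool: Callable[..., Any], *args: Any, max_attempts: int = 3, **kwargs: Any) -> Any:
    """Retry a failing tool a bounded number of times, then re-raise the last error."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return tool(*args, **kwargs)
        except ToolFailure as err:
            last_error = err
    raise last_error


def get_contact(name: str) -> dict:
    # Hypothetical tool used only for this example; not part of ARE.
    return {"name": name, "city": "Paris"}


if __name__ == "__main__":
    flaky_get_contact = make_flaky(get_contact, failure_rate=0.3, seed=42)
    print(call_with_retry(flaky_get_contact, "Anna", max_attempts=10))
```

A robust agent would need something like the retry policy above, while a brittle one would give up or hallucinate a result after the first failure.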

In the spirit of GAIA, scenarios do not require specialized knowledge: humans should in principle be able to get 100%, which allows easy debugging for model developers.

Want to explore the benchmark? Check out our dataset, which is easier to browse in our demo here.

How does Gaia2 run?

Gaia2 runs with ARE, an execution environment, where an agent of your choice has access to a combination of applications and associated pre-populated data.

For Gaia2, we created a smartphone mock-up environment, simulating what a human would use in their daily life. It contains real-world applications such as messaging (Email), utilities (Calendar, Contacts, Shopping, a FileSystem, …), and a chat interface to talk to the agent. All applications are also accessible to the agents through tool calling. Last but not least, the demo also contains a simulated persona’s history of conversations and app interactions.

All agent interactions are automatically recorded as structured traces during execution for deep dives and analysis: they include tool calls, API responses, model thoughts, timing metrics (e.g., response latency), user interactions, and so forth - and can all be exported as JSON.
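As an illustration of what offline analysis of an exported trace might look like, the sketch below counts tool calls and sums latency. The JSON layout and field names (`type`, `tool`, `latency_s`) are assumptions made up for this example, not ARE's actual export schema, so adapt them to the files you actually get.

```python
import json
from collections import Counter

# Hypothetical trace export; the field names are assumptions, not ARE's schema.
SAMPLE_TRACE = """
[
  {"type": "thought", "content": "I need the family contacts.", "latency_s": 1.2},
  {"type": "tool_call", "tool": "Contacts.search", "latency_s": 0.3},
  {"type": "tool_call", "tool": "Messages.send", "latency_s": 0.4},
  {"type": "tool_call", "tool": "Messages.send", "latency_s": 0.5}
]
"""


def summarize_trace(raw: str) -> dict:
    """Return the event count, per-tool call counts and total latency of a trace."""
    events = json.loads(raw)
    tool_calls = [e for e in events if e["type"] == "tool_call"]
    return {
        "n_events": len(events),
        "tool_counts": Counter(e["tool"] for e in tool_calls),
        "total_latency_s": round(sum(e["latency_s"] for e in events), 3),
    }


if __name__ == "__main__":
    print(summarize_trace(SAMPLE_TRACE))
```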

Results

For reference, we compare a range of large open and closed source models: Llama 3.3-70B Instruct, Llama-4-Maverick, GPT-4o, Qwen3-235B-MoE, Grok-4, Kimi K2, Gemini 2.5 Pro, Claude 4 Sonnet, and GPT-5 in all reasoning modes.

All models are evaluated using the same setup (a uniform ReAct loop for consistency, temperature of 0.5, generation limit of 16K tokens), with a combination of model-as-a-judge (Llama 3.3 Instruct 70B) and exact-match evaluation depending on the particular task. All 101 tools (and the general environment description) are provided in the system prompt.
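For readers unfamiliar with the pattern, a ReAct loop alternates between a model step (reason and pick an action) and a tool step (execute and observe). The sketch below is a generic toy version, not the scaffold used for Gaia2: the `Step` tuple format, `scripted_model` and the `lookup_city` tool are hypothetical stand-ins for an LLM and real apps.

```python
from typing import Callable, Dict, List, Tuple

# Each step is ("call", tool_name, argument) or ("final", answer, "").
Step = Tuple[str, str, str]


def react_loop(model: Callable[[List[str]], Step],
               tools: Dict[str, Callable[[str], str]],
               task: str,
               max_steps: int = 10) -> str:
    """Alternate model decisions and tool observations until a final answer."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        kind, name, arg = model(transcript)
        if kind == "final":
            return name
        tool = tools.get(name)
        # Feed the observation back so the next model step can reason on it.
        observation = tool(arg) if tool is not None else f"Unknown tool: {name}"
        transcript.append(f"Action: {name}({arg})")
        transcript.append(f"Observation: {observation}")
    return "No answer within step budget"


def scripted_model(transcript: List[str]) -> Step:
    # Stand-in for an LLM: call the tool once, then answer with what it returned.
    if not any(line.startswith("Observation:") for line in transcript):
        return ("call", "lookup_city", "Anna")
    return ("final", transcript[-1].split(": ", 1)[1], "")


# Hypothetical tool registry used only for this example.
TOOLS = {"lookup_city": lambda name: {"Anna": "Lyon"}.get(name, "unknown")}

if __name__ == "__main__":
    print(react_loop(scripted_model, TOOLS, "Where does Anna live?"))
```

Using the same loop for every model keeps the scaffold constant, so score differences can be attributed to the model rather than to the agent wrapper.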

Among the evaluated models, the highest-scoring model overall as of September 2025 is GPT-5 with high reasoning, and the best open source model is Kimi K2.

Some capabilities appear to be already close to solved by the best models: execution of simple tool calls and instruction following (execution), and overall search (as we could have guessed from current results on GAIA). The ambiguity, adaptability, and noise splits remain challenging for now for all models, and it’s interesting to see that performance on what were considered complex agentic tasks (instruction following and search) is not a good proxy for performance on closer-to-real-world tasks. Last but not least, the hardest split for all models at the moment is the time one: it’s very hard at this moment for models to correctly handle time-sensitive actions (though this could likely be mitigated by the use of specialised tools and better temporal reasoning). Detailed analysis of these results can be found in the paper.

However, we believe it’s important to push reporting beyond raw scores: if the model is correct but took several thousand tokens to reach the correct solution, or ran for several hours, it is “not as good” as a model which succeeded orders of magnitude faster. We therefore also normalize scores for cost, quantified as the average number of LLM calls and output tokens (which both define a cost-performance Pareto frontier). In the paper you’ll find score vs monetary cost and time.
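A cost-performance Pareto frontier keeps only the runs that no other run beats on both cost and score. The sketch below shows one simple way to compute it; the `Run` type and the four example entries are invented for illustration and are not results from the paper.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    name: str
    cost: float   # e.g. average output tokens or LLM calls per scenario
    score: float  # e.g. pass rate in percent


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Keep runs whose score is strictly higher than every run of lower or equal cost."""
    frontier = []
    best_score = float("-inf")
    # Sort by ascending cost; on equal cost, consider the higher score first.
    for run in sorted(runs, key=lambda r: (r.cost, -r.score)):
        if run.score > best_score:
            frontier.append(run)
            best_score = run.score
    return frontier


# Made-up numbers, purely to exercise the function.
RUNS = [
    Run("model-a", 1000, 30.0),
    Run("model-b", 4000, 42.0),
    Run("model-c", 5000, 40.0),  # dominated by model-b: more costly, lower score
    Run("model-d", 9000, 55.0),
]

if __name__ == "__main__":
    for run in pareto_frontier(RUNS):
        print(run)
```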

Compare with your favorite models! Evaluating on Gaia2

If you want to evaluate your model on Gaia2, you can follow these steps:

First, install Meta's Agents Research Environments in your Python environment of choice (uv, conda, virtualenv, ...):

pip install meta-agents-research-environments

Then, run the benchmark once for each configuration: execution, search, adaptability, time and ambiguity. Don't forget to upload all results to the Hub with the hf_upload kwarg!

are-benchmark run --hf meta-agents-research-environments/Gaia2 \
    --split validation --config CONFIGURATION \
    --model YOUR_MODEL --model_provider YOUR_PROVIDER \
    --agent default \
    --max_concurrent_scenarios 2 \
    --scenario_timeout 300 \
    --output_dir ./monitored_test_results \
    --hf_upload YOUR_HUB_DATASET_TO_SAVE_RESULTS
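Since the command has to be repeated once per configuration, a small script can generate the five invocations. The sketch below only assembles the flags shown in the command above. It assumes the configuration identifiers are exactly the names listed in the previous step, and the helper name `build_run_command` is made up for this example.

```python
import shlex
from typing import List

# Assumed to match the configuration names listed above.
CONFIGS = ["execution", "search", "adaptability", "time", "ambiguity"]


def build_run_command(config: str, model: str, provider: str, hub_dataset: str) -> List[str]:
    """Assemble the argument list for one benchmark run (flags copied from the command above)."""
    return [
        "are-benchmark", "run",
        "--hf", "meta-agents-research-environments/Gaia2",
        "--split", "validation",
        "--config", config,
        "--model", model,
        "--model_provider", provider,
        "--agent", "default",
        "--max_concurrent_scenarios", "2",
        "--scenario_timeout", "300",
        "--output_dir", "./monitored_test_results",
        "--hf_upload", hub_dataset,
    ]


if __name__ == "__main__":
    for config in CONFIGS:
        cmd = build_run_command(
            config, "YOUR_MODEL", "YOUR_PROVIDER", "YOUR_HUB_DATASET_TO_SAVE_RESULTS"
        )
        # Printing is a dry run; replace with subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```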

Run the oracle to get your aggregated score file

are-benchmark judge --hf meta-agents-research-environments/Gaia2 \
    --split validation --config CONFIGURATION \
    --agent default \
    --max_concurrent_scenarios 2 \
    --scenario_timeout 300 \
    --output_dir ./monitored_test_results \
    --hf_upload YOUR_HUB_DATASET_TO_SAVE_RESULTS

Finally, add all the relevant information about your model in the README, and share it on the leaderboard to centralize Gaia2 traces here!

Beyond Gaia2: study your agents with ARE

Beyond benchmark scenarios, you can use Gaia2 apps and content in ARE to see if the model is able to correctly solve less verifiable tasks such as loading emails, writing follow-ups, adding events to the calendar or booking meetings - in sum, providing the perfect setup to evaluate your AI assistants through interaction!

You can also easily customise the environment by 1) connecting your tools (via MCP or directly) to test your agents on it; 2) implementing your own scenarios, including defining triggered or timed events (e.g., after 2 minutes, the Mail app will receive a new email from Contact), to see how the agent is able to adapt to an evolving environment.
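To give an intuition for timed events, here is a toy simulated clock with a priority queue of scheduled actions, reproducing the "new email after 2 minutes" example. This is a conceptual sketch only: `Simulation`, `TimedEvent`, `schedule` and `advance` are hypothetical names and do not reflect ARE's scenario API.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(order=True)
class TimedEvent:
    fire_at: float  # simulated seconds since scenario start
    action: Callable[[], None] = field(compare=False)


class Simulation:
    def __init__(self) -> None:
        self.now = 0.0
        self._queue: List[TimedEvent] = []
        self.inbox: List[str] = []

    def schedule(self, delay: float, action: Callable[[], None]) -> None:
        """Register an action to fire `delay` simulated seconds from now."""
        heapq.heappush(self._queue, TimedEvent(self.now + delay, action))

    def advance(self, seconds: float) -> None:
        """Move the simulated clock forward and fire every event that is due."""
        self.now += seconds
        while self._queue and self._queue[0].fire_at <= self.now:
            heapq.heappop(self._queue).action()


if __name__ == "__main__":
    sim = Simulation()
    # After 2 simulated minutes, the mailbox receives a new email.
    sim.schedule(120, lambda: sim.inbox.append("New email from Contact"))
    sim.advance(60)
    print(sim.inbox)  # nothing has arrived yet
    sim.advance(60)
    print(sim.inbox)  # the email has now arrived
```

Driving the clock explicitly, rather than sleeping in real time, keeps scenarios fast and reproducible.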

(As the agents are JSON agents by default, they can’t mess up your machine, unless of course you connect them to external apps with unsafe permissions. So, operate with caution when adding your own apps or using untrusted MCPs.)

Here are several use cases that we’ve used ARE for:

Vibe-check any agent on real or simulated data, to study a variety of setups, with their own rules, tools, content, and verifications
Test agent tool calling and orchestration capabilities, either with local apps or MCP tools
Generate your own tool-calling trace to fine-tune tool calling models
Easily gather and reproduce existing agentic benchmarks in a unified framework
Debug and study agent-to-agent interactions on the fly within the user interface
Study model limitations in noisy environments (with API timeouts and ambiguity)

We recorded 3 videos so you can check some of these use cases (but of course, we hope the community gets creative with ARE 🤗). For these videos, we use the default demo described above, which contains the simulated life of Linda Renne, PhD student in machine learning.

1) Testing an agent on a simple task: event organisation

To test how good the default model is at event organisation, let’s plan a birthday party!

We first ask the agent to text everyone in the Renne family about the user’s 30th birthday party on November 7. The default universe has 21 contacts in the list, including 5 Renne family members: Linda, the simulation “owner”; George and Stephie, her parents; Anna, her sister; and Morgan, her grandfather. The agent successfully goes through the contact list, finds the four other family members, and texts them.

Next, we ask the agent to create a calendar invite and add them as invitees. The agent remembers the above context! It creates a calendar invite on the correct date and correctly adds the family members to it.

2) Understanding agents: deep diving the traces

ARE also allows us to check the traces behind the actions taken by the agent. Upon opening the Agent logs tool on the left, we can see the system prompt, the chain of thought, the multi-step actions taken with the tools called, and the outcomes, all as neatly organised logs. Everything can be exported as JSON if you want to consult things offline!

3) Playing around and extending the demo: Connecting the agent to your own MCPs

In this last example, we connect ARE to a remote robot arm via MCP, so it can gesture things to us, then ask the agent to answer our yes or no questions by waving the robot arm! Here’s what it looks like.

But these examples are only very simple starting points, and we’re really looking forward to what you’ll build! (For more advanced users, you can even directly install and edit the Meta-ARE code here.)

Conclusion

Gaia2 and ARE are new research tools that we hope will empower anyone to easily build more reliable and adaptable AI agents - by allowing easy experiments, making real-world evaluation accessible to anyone, as well as improving trust through transparent, reproducible benchmarks and debuggable traces.

We’d love to see what you will do with this project!
