The Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator

Hugging Face - Blog

Waypoint-1.5: Higher-Fidelity Interactive Worlds for Everyday GPUs ALTK‑Evolve: On‑the‑Job Learning for AI Agents Safetensors is Joining the PyTorch Foundation Holo3: Breaking the Computer Use Frontier Any Custom Frontend with Gradio's Backend A New Framework for Evaluating Voice Agents (EVA) Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations One-Shot Any Web App with Gradio's gr.HTML CUGA on Hugging Face: Democratizing Configurable AI Agents New in llama.cpp: Model Management Building Deep Research: How we Achieved State of the Art OVHcloud on Hugging Face Inference Providers 🔥 20x Faster TRL Fine-tuning with RapidFire AI Building for an Open Future - our new partnership with Google Cloud Aligning to What? Rethinking Agent Generalization in MiniMax M2 Building a Healthcare Robot from Simulation to Deployment with NVIDIA Isaac Sentence Transformers is joining Hugging Face! Unlock the power of images with AI Sheets Supercharge your OCR Pipelines with Open Models Google Cloud C4 Brings a 70% TCO improvement on GPT OSS with Intel and Hugging Face Get your VLM running in 3 simple steps on Intel CPUs Nemotron-Personas-India: Synthesized Data for Sovereign AI Introducing RTEB: A New Standard for Retrieval Evaluation Accelerating Qwen3-8B Agent on Intel® Core™ Ultra with Depth-Pruned Draft Models VibeGame: Exploring Vibe Coding Games Nemotron-Personas-Japan: ソブリン AI のための合成データセット Swift Transformers Reaches 1.0 – and Looks to the Future Smol2Operator: Post-Training GUI Agents for Computer Use SyGra: The One-Stop Framework for Building Data for LLMs and SLMs Gaia2 and ARE: Empowering the community to study agents Scaleway on Hugging Face Inference Providers 🔥 Democratizing AI Safety with RiskRubric.ai Public AI on Hugging Face Inference Providers 🔥 `LeRobotDataset:v3.0`: Bringing large-scale datasets to `lerobot` Visible Watermarking with Gradio Introducing the Palmyra-mini family: Powerful, lightweight, and ready to reason! Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers Fine-tune Any LLM from the Hugging Face Hub with Together AI Jupyter Agents: training LLMs to reason with notebooks mmBERT: ModernBERT goes Multilingual Welcome EmbeddingGemma, Google's new efficient embedding model SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence Make your ZeroGPU Spaces go brrr with ahead-of-time compilation NVIDIA Releases 6 Million Multi-Lingual Reasoning Dataset Generate Images with Claude and Hugging Face From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels MCP for Research: How to Connect AI to Research Tools Kimina-Prover-RL Arm & ExecuTorch 0.7: Bringing Generative AI to the masses Neural Super Sampling is here! TextQuests: How Good are LLMs at Text-Based Video Games? 🇵🇭 FilBench - Can LLMs Understand and Generate Filipino? Introducing AI Sheets: a tool to work with datasets using open AI models! Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training Vision Language Model Alignment in TRL ⚡️ Welcome GPT OSS, the new open-source model family from OpenAI! Measuring Open-Source Llama Nemotron Models on DeepResearch Bench 📚 3LM: A Benchmark for Arabic LLMs in STEM and Code Implementing MCP Servers in Python: An AI Shopping Assistant with Gradio Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face Say hello to `hf`: a faster, friendlier Hugging Face CLI ✨ Parquet Content-Defined Chunking TimeScope: How Long Can Your Video Large Multimodal Model Go? Fast LoRA inference for Flux with Diffusers and PEFT Accelerate a World of LLMs on Hugging Face with NVIDIA NIM Arc Virtual Cell Challenge: A Primer Consilium: When Multiple LLMs Collaborate Back to The Future: Evaluating AI Agents on Predicting Future Events Five Big Improvements to Gradio MCP Servers Ettin Suite: SoTA Paired Encoders and Decoders Migrating the Hub from Git LFS to Xet Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models Asynchronous Robot Inference: Decoupling Action Prediction and Execution ScreenEnv: Deploy your full stack Desktop Agent Building the Hugging Face MCP Server Reachy Mini - The Open-Source Robot for Today's and Tomorrow's AI Builders Creating custom kernels for the AMD MI300 Upskill your LLMs With Gradio MCP Servers SmolLM3: smol, multilingual, long-context reasoner Three Mighty Alerts Supporting Hugging Face’s Production Infrastructure Efficient MultiModal Data Pipeline Announcing NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models Training and Finetuning Sparse Embedding Models with Sentence Transformers Welcome the NVIDIA Llama Nemotron Nano VLM to Hugging Face Hub Gemma 3n fully available in the open-source ecosystem! Transformers backend integration in SGLang (LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware Groq on Hugging Face Inference Providers 🔥 How Long Prompts Block Other Requests - Optimizing LLM Performance Learn the Hugging Face Kernel Hub in 5 Minutes Convert Transformers to ONNX with Hugging Face Optimum Intel and Hugging Face Partner to Democratize Machine Learning Hardware Acceleration Director of Machine Learning Insights [Part 3: Finance Edition] The Annotated Diffusion Model Deep Q-Learning with Space Invaders Graphcore and Hugging Face Launch New Lineup of IPU-Ready Transformers Introducing Pull Requests and Discussions 🥳 Efficient Table Pre-training without Real Data: An Introduction to TAPEX An Introduction to Q-Learning Part 2/2 How Sempre Health is leveraging the Expert Acceleration Program to accelerate their ML roadmap

Seph Mard, Isabel Hulseman, Besmira Nushi, Piotr Januszewski, Gr · 2025-12-17 · via Hugging Face - Blog

Back to Articles

It has become increasingly challenging to assess whether a model’s reported improvements reflect genuine advances or variations in evaluation conditions, dataset composition, or training data that mirrors benchmark tasks. The NVIDIA Nemotron approach to openness addresses this by publishing transparent and reproducible evaluation recipes that make results independently verifiable.

NVIDIA released Nemotron 3 Nano 30B A3B with an explicitly open evaluation approach to make that distinction clear. Alongside the model card, we are publishing the complete evaluation recipe used to generate the results, built with the NVIDIA NeMo Evaluator library, so anyone can rerun the evaluation pipeline, inspect the artifacts, and analyze the outcomes independently.

We believe that open innovation is the foundation of AI progress. This level of transparency matters because most model evaluations omit critical details. Configs, prompts, harness versions, runtime settings, and logs are often missing or underspecified, and even small differences in these parameters can materially change results. Without a complete recipe, it’s nearly impossible to tell whether a model is genuinely more intelligent or simply optimized for a benchmark.

This blog shows developers exactly how to reproduce the evaluation behind Nemotron 3 Nano 30B A3B using fully open tools, configurations, and artifacts. You’ll learn how the evaluation was run, why the methodology matters, and how to execute the same end-to-end workflow using the NeMo Evaluator library so you can verify results, compare models consistently, and build transparent evaluation pipelines of your own.

Building a consistent and transparent evaluation workflow with NeMo Evaluator

A single, consistent evaluation system

Developers and researchers need evaluation workflows they can rely on, not one-off scripts that behave differently from model to model. NeMo Evaluator provides a unified way to define benchmarks, prompts, configuration, and runtime behavior once, then reuse that methodology across models and releases. This avoids the common scenario where the evaluation setup quietly changes between runs, making comparisons over time difficult or misleading.

Methodology independent of inference setup

Model outputs can vary by inference backend and configuration, so evaluation tools should never be tied to a single inference solution. Locking an evaluation tool to one inference solution would limit its usefulness. NeMo Evaluator avoids this by separating the evaluation pipeline from the inference backend, allowing the same configuration to run against hosted endpoints, local deployments, or third-party providers. This separation enables meaningful comparisons even when you change infrastructure or inference engines.

Built to scale beyond one-off experiments

Many evaluation pipelines work once and then break down as the scope expands. NeMo Evaluator is designed to scale from quick, single-benchmark validation to full model card suites and repeated evaluations across multiple models. The launcher, artifact layout, and configuration model support ongoing workflows, not just isolated experiments, so teams can maintain consistent evaluation practices over time.

Auditability with structured artifacts and logs

Transparent evaluation requires more than final scores. Each evaluation run produces structured results and logs by default, making it easy to inspect how scores were computed, understand score calculations, debug unexpected behavior, and conduct deeper analysis. Each component of the evaluation is captured and reproducible.

A shared evaluation standard

By releasing Nemotron 3 Nano 30B A3B with its full evaluation recipe, NVIDIA is providing a reference methodology that the community can run, inspect, and build upon. Using the same configuration and tools brings consistency to how benchmarks are selected, executed, and interpreted, enabling more reliable comparisons across models, providers, and releases.

Open evaluation for Nemotron 3 Nano

Open evaluation means publishing not just the final results, but the full methodology behind them, so benchmarks are run consistently, and results can be compared meaningfully over time. For Nemotron 3 Nano 30B A3B, this includes open‑source tooling, transparent configurations, and reproducible artifacts that anyone can run end‑to‑end.

Open-source model evaluation tooling

NeMo Evaluator is an open-source library designed for robust, reproducible, and scalable evaluation of generative models. Instead of introducing yet another standalone benchmark runner, it acts as a unifying orchestration layer that brings multiple evaluation harnesses under a single, consistent interface.

Under this architecture, NeMo Evaluator integrates and coordinates hundreds of benchmarks from many widely used evaluation harnesses, including NeMo Skills for Nemotron instruction-following, tool use, and agentic evaluations, as well as the LM Evaluation Harness for base model and pre-training benchmarks, and many more (full benchmark catalog). Each harness retains its native logic, datasets, and scoring semantics, while NeMo Evaluator standardizes how they are configured, executed, and logged.

This provides two practical advantages: teams can run diverse benchmark categories using a single configuration without rewriting custom evaluation scripts, and results from different harnesses are stored and inspected in a consistent, predictable way, even when the underlying tasks differ. The same orchestration framework used internally by NVIDIA’s Nemotron research and model‑evaluation teams is now available to the community, enabling developers to run heterogeneous, multi‑harness evaluations through a shared, auditable workflow.

Open configurations

We published the exact YAML configuration used for the Nemotron 3 Nano 30B A3B model card evaluation with NeMo Evaluator. This includes:

model inference and deployment settings
benchmark and task selection
benchmark-specific parameters such as sampling, repeats, and prompt templates
runtime controls including parallelism, timeouts, and retries
output paths and artifact layout

Using the same configuration means running the same evaluation methodology.

Open logs and artifacts

Each evaluation run produces structured, inspectable outputs, including per‑task results.json files, execution logs for debugging and auditability, and artifacts organized by task for easy comparison. This structure makes it possible to understand not only the final scores, but also how those scores were produced and to perform deeper analysis of model behavior.

The reproducibility workflow

Reproducing Nemotron 3 Nano 30B A3B model card results follows a simple loop:

Start from the released model checkpoint or hosted endpoint
Use the published NeMo Evaluator config
Execute the evaluation with a single CLI command
Inspect logs and artifacts, and compare results to the model card

The same workflow applies to any model you evaluate using NeMo Evaluator. You can point the evaluation at a hosted endpoint or a local deployment, including common inference providers such as HuggingFace, build.nvidia.com, and OpenRouter. The key requirement is access to the model, either as weights you can serve or as an endpoint you can call. For this tutorial, we use the hosted endpoint on build.nvidia.com.

Reproducing Nemotron 3 Nano benchmark results

This tutorial reproduces the evaluation results for NVIDIA Nemotron 3 Nano 30B A3B using NeMo Evaluator. The step-by-step tutorial, including the published configs used for the model card evaluation, is available on GitHub. Although we have focused this tutorial on the Nemotron 3 Nano 30B A3B, we also published recipes for the base model evaluation.

This walkthrough runs a comprehensive evaluation suite of the published configs used for the model card evaluation for NVIDIA Nemotron 3 Nano 30B A3B using the following benchmarks:

Benchmark	Accuracy	Category	Description
BFCL v4	53.8	Function Calling	Berkeley Function Calling Leaderboard v4
LiveCodeBench (v6 2025-08–2025-05)	68.3	Coding	Real-world coding problems evaluation
MMLU-Pro	78.3	Knowledge	Multi-task language understanding (10-choice)
GPQA	73.0	Science	Graduate-level science questions
AIME 2025	89.1	Mathematics	American Invitational Mathematics Exam
SciCode	33.3	Scientific Coding	Scientific programming challenges
IFBench	71.5	Instruction Following	Instruction following benchmark
HLE	10.6	Humanity's Last Exam	Expert-level questions across domains

For Model Card details, see the NVIDIA Nemotron 3 Nano 30B A3B Model Card. For a deep dive into the architecture, datasets, and benchmarks, read the full Nemotron 3 Nano Technical Report.

1. Install NeMo Evaluator Launcher

pip install nemo-evaluator-launcher

2. Set required environment variables

# NVIDIA endpoint access
export NGC_API_KEY="your-ngc-api-key"

# Hugging Face access
export HF_TOKEN="your-huggingface-token"

# Required only for judge-based benchmarks such as HLE
export JUDGE_API_KEY="your-judge-api-key"

Optional but recommended for faster reruns: export HF_HOME="/path/to/your/huggingface/cache"

3. Model endpoint

The evaluation uses the NVIDIA API endpoint hosted on build.nvidia.com:

target:
  api_endpoint:
    model_id: nvidia/nemotron-nano-3-30b-a3b
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

Evaluations can be run against common inference providers such as HuggingFace, build.nvidia.com, or OpenRouter, or anywhere that the model has an available endpoint.

If you're hosting the model locally or using a different endpoint:

nemo-evaluator-launcher run \
  --config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
  -o target.api_endpoint.url=http://localhost:8000/v1/chat/completions

4. Run the full evaluation suite

Preview the run without executing using --dry-run:

nemo-evaluator-launcher run \
  --config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
  --dry-run

From the examples directory, run the evaluation using the YAML configuration provided:

nemo-evaluator-launcher run \
  --config /path/to/examples/nemotron/local_nvidia_nemotron_3_nano_30b_a3b.yaml

Note that for quick testing, you can limit the number of samples by setting limit_samples:

nemo-evaluator-launcher run \
  --config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
  -o evaluation.nemo_evaluator_config.config.params.limit_samples=10

5. Running an individual benchmark

You can run specific benchmarks using the -t flag (from the examples/nemotron directory):

# Run only MMLU-Pro
nemo-evaluator-launcher run --config local_nvidia_nemotron_3_nano_30b_a3b.yaml -t ns_mmlu_pro

# Run only coding benchmarks
nemo-evaluator-launcher run --config local_nvidia_nemotron_3_nano_30b_a3b.yaml -t ns_livecodebench

# Run multiple specific benchmarks
nemo-evaluator-launcher run --config local_nvidia_nemotron_3_nano_30b_a3b.yaml -t ns_gpqa -t ns_aime2025

6. Monitor execution and inspect results

# Check status of a specific job
nemo-evaluator-launcher status

# Stream logs for a specific job
nemo-evaluator-launcher logs <job-id>

Results are written to the defined output directory:

results_nvidia_nemotron_3_nano_30b_a3b/
├── artifacts/
│   └── <task_name>/
│       └── results.json
└── logs/
    └── stdout.log

Interpreting results

When reproducing evaluations, you may observe small differences in final scores across runs. This variance reflects the probabilistic nature of LLMs rather than an issue with the evaluation pipeline. Modern evaluation introduces several sources of non‑determinism: decoding settings, repeated trials, judge‑based scoring, parallel execution, and differences in serving infrastructure. All of which can lead to slight fluctuations.

The purpose of open evaluation is not to force bit-wise identical outputs, but to deliver methodological consistency with clear provenance of evaluation results. To ensure your evaluation aligns with the reference standard, verify the following:

Configuration: use the published NeMo Evaluator YAML without modification, or document any changes explicitly
Benchmark selection: run the intended tasks, task versions, and prompt templates
Inference target: verify you are evaluating the intended model and endpoint, including chat template behavior and reasoning settings when relevant
Execution settings: keep runtime parameters consistent, including repeats, parallelism, timeouts, and retry behavior
Outputs: confirm artifacts and logs are complete and follow the expected structure for each task

When these elements are consistent, your results represent a valid reproduction of the methodology, even if individual runs differ slightly. NeMo Evaluator simplifies this process, tying benchmark definitions, prompts, runtime settings, and inference configuration into a single auditable workflow to minimize inconsistencies.

Conclusion: A more transparent standard for open models

The evaluation recipe released alongside Nemotron 3 Nano represents a meaningful step toward a more transparent and reliable approach to open-model evaluation. We are moving away from evaluation as a collection of bespoke, "black box" scripts, and towards a defined system where benchmark selection, prompts, and execution semantics are encoded into a transparent workflow.

For developers and researchers, this transparency changes what it means to share results. A score is only as trustworthy as the methodology behind it and making that methodology public is what enables the community to verify claims, compare models fairly, and continue building on shared foundations. With open evaluation configurations, open artifacts, and open tooling, Nemotron 3 Nano demonstrates what that commitment to openness looks like in practice.

NeMo Evaluator supports this shift by providing a consistent benchmarking methodology across models, releases, and inference environments. The objective isn’t identical numbers on every run; it’s confidence in an evaluation methodology that is explicit, inspectable, and repeatable. And for organizations that need automated or large‑scale evaluation pipelines, a separate microservice offering provides an enterprise‑ready NeMo Evaluator microservice built on the same evaluation principles.

Use the published NeMo Evaluator evaluation configuration for an end-to-end walkthrough of the evaluation recipe.

Join the Community!

NeMo Evaluator is fully open source, and community input is essential to shaping the future of open evaluation. If there’s a benchmark you’d like us to support or an improvement you want to propose, open an issue, or contribute directly on GitHub. Your contributions help strengthen the ecosystem and advance a shared, transparent standard for evaluating generative models.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

Hugging Face - Blog

Building a consistent and transparent evaluation workflow with NeMo Evaluator

A single, consistent evaluation system

Methodology independent of inference setup

Built to scale beyond one-off experiments

Auditability with structured artifacts and logs

A shared evaluation standard

Open evaluation for Nemotron 3 Nano

Open-source model evaluation tooling

Open configurations

Open logs and artifacts

The reproducibility workflow

Reproducing Nemotron 3 Nano benchmark results

1. Install NeMo Evaluator Launcher

2. Set required environment variables

3. Model endpoint

4. Run the full evaluation suite

5. Running an individual benchmark

6. Monitor execution and inspect results

Interpreting results

Conclusion: A more transparent standard for open models