Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models
2026-05-23 · via Hugging Face - Blog

By Mehran Maghoumi, Yonggan Fu, Pavlo Molchanov, and Khadkevich

Large language models (LLMs) have become the default interface for code generation, math problem solving, summarization, document understanding, and many other developer workflows. Under the hood, though, many LLMs still generate text the same way: one token at a time, with each token conditioned on the tokens that came before it. Such models are called autoregressive because they consume their own outputs as inputs.

That autoregressive (AR) approach has been remarkably successful. It is stable to train, simple to serve, and responsible for much of the progress in modern language modeling. But it also creates a hard limit: every new token requires a full forward pass through the model, and every weight has to be loaded from memory before computation can start. For developers building latency-sensitive applications, running small batch sizes, or trying to make better use of modern GPUs, token-by-token generation leaves performance on the table, because most of the GPU's time is spent on memory operations rather than computation.
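The memory-bound argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that, at batch size 1, each forward pass is limited purely by streaming every weight from memory once; the parameter count, the 16-bit weight size, and the 4 TB/s bandwidth are hypothetical placeholders, not measurements of any particular model or GPU.

```python
def passes_per_second(num_params: float, bytes_per_param: float,
                      bandwidth_bytes_per_s: float) -> float:
    """Upper bound on forward passes per second when weight loading dominates."""
    weight_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


def tokens_per_second(num_params: float, bytes_per_param: float,
                      bandwidth_bytes_per_s: float,
                      tokens_per_pass: float = 1.0) -> float:
    """Decoding ceiling: tokens committed per pass times passes per second."""
    return tokens_per_pass * passes_per_second(
        num_params, bytes_per_param, bandwidth_bytes_per_s)


# Hypothetical numbers, for illustration only.
PARAMS = 8e9           # an 8B-parameter model
BYTES_PER_PARAM = 2    # assumes 16-bit weights
BANDWIDTH = 4e12       # assumes 4 TB/s of memory bandwidth

ar_ceiling = tokens_per_second(PARAMS, BYTES_PER_PARAM, BANDWIDTH, 1.0)
parallel_ceiling = tokens_per_second(PARAMS, BYTES_PER_PARAM, BANDWIDTH, 4.0)

print(f"one token per pass:   {ar_ceiling:.0f} tok/s ceiling")
print(f"four tokens per pass: {parallel_ceiling:.0f} tok/s ceiling")
```

Under these assumptions the ceiling scales linearly with the number of tokens committed per pass, which is the lever that parallel generation pulls. Real systems also pay for KV-cache reads and for the extra compute of wider passes, so this is an upper bound rather than a prediction.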

Additionally, once an autoregressive model generates a token, that token is final: the model has no inherent ability to revise earlier tokens. Consequently, mistakes can propagate through the rest of the generation.

Nemotron-Labs Diffusion introduces a new path forward: diffusion language models (DLMs) that generate multiple tokens in parallel, then iteratively refine them over multiple steps. These models not only map better onto the parallel compute of modern GPUs and offer significant runtime performance benefits, they can also revise generated tokens, which makes them better suited to editing existing text and to fill-in-the-middle objectives. This generate-and-refine property also offers a built-in way to control the inference budget: reducing the number of refinement steps reduces the compute these models require at runtime.

Quick Links to the Models, Training Recipe and Technical Report

The Nemotron-Labs Diffusion family includes text models at 3B, 8B, and 14B scales, all available under the commercially friendly NVIDIA Nemotron Open Model License, as well as an 8B-scale vision-language model (VLM) available under the NVIDIA Source Code License, which grants broad research flexibility. Across the lineup, NVIDIA is releasing both base models and instruction-tuned chat variants. NVIDIA is also releasing the code for training these models through the NVIDIA Megatron Bridge framework.

Three Generation Modes in One Model


Nemotron-Labs Diffusion is designed around a simple idea: autoregressive and diffusion generation should not be separate model families. They should be capabilities of the same model. The model supports three generation modes:

Autoregressive mode runs like a standard left-to-right LLM. This keeps compatibility with the generation workflow developers already know.

Diffusion mode generates text block by block, filling in the tokens of each block gradually over multiple steps.

Self-speculation mode uses diffusion to draft multiple candidate tokens, then uses autoregressive decoding to verify them. This combines the speed potential of diffusion-style drafting with the reliability of AR verification.

This flexible design is the key developer-facing feature for workloads where speed and accuracy both matter, including workloads with unpredictable batch sizes or a single query (batch size = 1). Selecting the inference mode requires almost no change at the application level, since it is a deployment-time setting. As such, developers can switch seamlessly between the model they use today and Nemotron-Labs Diffusion in its various inference modes for ultra-fast generation.
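The draft-then-verify idea behind self-speculation can be sketched with toy stand-ins. Everything here is hypothetical: `ar_next` plays the role of greedy AR decoding, `noisy_draft` plays the role of the diffusion drafter (with deliberate errors injected), and committing the AR token at the first mismatch is an assumption borrowed from standard speculative decoding so that every round makes progress. It is not the model's actual implementation.

```python
from typing import List, Tuple

Token = int


def ar_next(prefix: List[Token]) -> Token:
    """Toy stand-in for greedy AR decoding: a deterministic function of the prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def noisy_draft(prefix: List[Token], block_size: int) -> List[Token]:
    """Toy stand-in for the drafter: mostly right, deliberately wrong at some positions."""
    draft: List[Token] = []
    ctx = list(prefix)
    for _ in range(block_size):
        tok = ar_next(ctx)
        if len(ctx) % 5 == 4:
            tok = (tok + 1) % 50  # inject a drafting error
        draft.append(tok)
        ctx.append(tok)
    return draft


def verify(prefix: List[Token], draft: List[Token]) -> List[Token]:
    """Commit the draft prefix that matches AR, plus the AR token at the first mismatch."""
    committed: List[Token] = []
    ctx = list(prefix)
    for tok in draft:
        expected = ar_next(ctx)  # a real model would score all positions in one causal pass
        if tok != expected:
            committed.append(expected)  # assumption: fall back to the AR token and stop
            break
        committed.append(tok)
        ctx.append(tok)
    return committed


def self_speculative_generate(prompt: List[Token], num_tokens: int,
                              block_size: int = 8) -> Tuple[List[Token], int]:
    """Generate num_tokens tokens, returning the sequence and the number of draft/verify rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < num_tokens:
        out.extend(verify(out, noisy_draft(out, block_size)))
        rounds += 1
    return out[:len(prompt) + num_tokens], rounds


def ar_generate(prompt: List[Token], num_tokens: int) -> List[Token]:
    """Plain one-token-at-a-time decoding, used as the reference."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(ar_next(out))
    return out


spec_out, spec_rounds = self_speculative_generate([1, 2, 3], 20)
print(spec_out == ar_generate([1, 2, 3], 20), spec_rounds)
```

Because only tokens that agree with the AR reference are ever committed, the toy output is identical to plain AR decoding while needing fewer rounds than tokens. How many tokens survive each round depends entirely on how often the drafter is right.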

Performance Highlights


Nemotron-Labs Diffusion 8B improves average accuracy by 1.2% over Qwen3 8B. Measured in tokens per forward pass (TPF, a hardware-agnostic measure of token decoding efficiency), diffusion mode reaches 2.6× the TPF of AR models, while self-speculation pushes that further to 6× for linear self-speculation and 6.4× for quadratic self-speculation, with comparable accuracy across the evaluated tasks.
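As a small worked example of the TPF metric, the sketch below divides tokens generated by forward passes used. The token and pass counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the evaluation.

```python
def tokens_per_forward(tokens_generated: int, forward_passes: int) -> float:
    """TPF: average number of tokens committed per model forward pass."""
    if forward_passes <= 0:
        raise ValueError("forward_passes must be positive")
    return tokens_generated / forward_passes


# Hypothetical counts. Plain AR decoding uses one pass per token, so its TPF is 1.
ar_tpf = tokens_per_forward(520, 520)
parallel_tpf = tokens_per_forward(520, 200)

print(f"AR TPF: {ar_tpf:.1f}, parallel TPF: {parallel_tpf:.1f}")
```

TPF counts passes, not time. A pass that processes a whole block can cost more than a single-token pass, so wall-clock speedups may differ from the TPF ratio.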

How we trained Nemotron-Labs Diffusion

Diffusion language models have been promising for years, but they have historically had practical barriers: lower accuracy than strong AR models, more difficult training, and limited compatibility with KV caching.

Recent work has changed that. Efficient-DLM showed that pretrained AR models can be converted into diffusion language models through continued pretraining and by switching the attention mechanism to a block-wise pattern. That design helps preserve AR model capabilities while enabling KV-cache-friendly parallel decoding.
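One common reading of block-wise attention is a block-causal mask: tokens attend in both directions inside their own block and causally to all earlier blocks. The sketch below builds such a mask in plain Python. It is an illustration under that assumption, not the exact mask used by Efficient-DLM or Nemotron-Labs Diffusion, and `block_causal_mask` is a hypothetical helper name.

```python
from typing import List


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    A position sees every position in its own block (bidirectional)
    and every position in earlier blocks (causal across blocks).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

With `block_size=1` this reduces to the ordinary causal mask. Because no position ever attends to a later block, the keys and values of a finished block do not change afterward, which is what makes the pattern compatible with KV caching.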

Nemotron-Labs Diffusion builds on the same practical insight: add diffusion capabilities to an existing AR model. The model was trained with a joint AR and diffusion objective, allowing it to retain what it had learned during its initial AR training while diffusion added parallel drafting capability. The model was pre-trained on 1.3T tokens from the NVIDIA Nemotron Pretraining datasets and underwent an additional supervised fine-tuning phase using 45B tokens from the NVIDIA Nemotron Post-training datasets.

Deployment and inference through SGLang

Deployment of Nemotron-Labs Diffusion models will soon be supported in the main branch of SGLang. At the time of this writing, support for inference is available through this issue tracker request on GitHub.

What's neat is that the integration lets you serve the same checkpoint in three different ways, picked by a single line in your algorithm config:

  • Plain autoregressive - set ar_mode=true and the model behaves like any other causal LM. Useful as the correctness reference, or if you just want a sanity check against pure AR output.

  • Diffusion mode (FastDiffuser) - the headliner for raw throughput. The model fills in a 32-token block at a time by iteratively denoising it, and a confidence threshold decides which tokens are "good enough" to commit each step.

  • Self-speculation (LinearSpec) - this one's our favorite. The same model drafts a block bidirectionally, then verifies it causally; whatever prefix matches gets committed. Output is lossless versus AR at temperature 0, and we hit ~865 tok/s on a B200 on the speedbench dataset - roughly 4× the autoregressive baseline on the same hardware.
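The confidence-threshold commit rule described for diffusion mode can be sketched as a toy loop. Here `toy_denoiser` is a hypothetical stand-in for the model that returns a (token, confidence) pair per position, the block has 8 positions instead of 32 to keep the trace short, and committing the single most confident token when nothing clears the threshold is an assumption made so the loop always terminates. None of this reflects the actual FastDiffuser code.

```python
from typing import List, Optional, Tuple

MASK = None  # marker for a position that has not been committed yet


def toy_denoiser(block: List[Optional[int]]) -> List[Tuple[int, float]]:
    """Hypothetical model stand-in: confidence rises as more of the block is committed."""
    filled = sum(tok is not MASK for tok in block)
    preds = []
    for pos in range(len(block)):
        token = 100 + pos
        confidence = min(1.0, 0.75 + 0.125 * filled - 0.03125 * pos)
        preds.append((token, confidence))
    return preds


def decode_block(block_size: int = 8,
                 threshold: float = 0.9) -> Tuple[List[Optional[int]], int]:
    """Fill one block by repeated denoising; return the block and the steps used."""
    block: List[Optional[int]] = [MASK] * block_size
    steps = 0
    while MASK in block:
        preds = toy_denoiser(block)
        masked = [p for p in range(block_size) if block[p] is MASK]
        accepted = [p for p in masked if preds[p][1] >= threshold]
        if not accepted:
            # Assumption: always commit the most confident token to guarantee progress.
            accepted = [max(masked, key=lambda p: preds[p][1])]
        for p in accepted:
            block[p] = preds[p][0]
        steps += 1
    return block, steps


strict_block, strict_steps = decode_block(threshold=0.9)
loose_block, loose_steps = decode_block(threshold=0.7)
print(strict_steps, loose_steps)
```

In this toy setup a lower threshold commits more tokens per step and finishes the block in fewer steps. With a real model, the usual trade-off would be that looser thresholds risk committing lower-quality tokens.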

Get Started Today

Nemotron-Labs Diffusion brings diffusion-style generation into a form developers can actually use: open models, familiar AR compatibility, diffusion decoding, and self-speculative acceleration in one family. With Nemotron-Labs Diffusion, developers get a new way to draft, refine, verify, and accelerate text generation, without needing to alter their applications.

To get started, explore the Nemotron-Labs Diffusion model family, read the technical report, and try the available training recipe.