Diffusion Language Models: How NVIDIA Nemotron-Labs Diffusion Shatters the Autoregressive Speed Ceiling
Manoranjan R · 2026-05-23 · via DEV Community



Published: May 23, 2026 | Focus Keyword: diffusion language models | Estimated Read Time: 14 minutes




Table of Contents

  1. The Token-by-Token Tax: Why Your LLM Is Leaving GPU Performance on the Table
  2. Background: The Autoregressive Wall
  3. What Are Diffusion Language Models? The Full Mental Model
  4. The AR-to-DLM Conversion Breakthrough
  5. Nemotron-Labs Diffusion: Architecture and Three Generation Modes
  6. Performance Deep Dive: Benchmarks and What They Actually Mean
  7. Hands-On: Loading and Running Nemotron-Labs Diffusion
  8. Practical Engineering Considerations
  9. The Bigger Picture: What DLMs Mean for the LLM Ecosystem
  10. Conclusion: A Paradigm Shift Worth Acting On

1. The Token-by-Token Tax

Imagine you hired the world's fastest typist — but forced them to pause after every single character to re-read the entire document before typing the next one. That, in essence, is what your autoregressive LLM is doing on your GPU right now.

Every token generated by a standard transformer LLM requires a full forward pass through all model weights. Every weight must be loaded from GPU HBM (high-bandwidth memory) into the compute cores before a single multiply-accumulate can happen. At batch size 1, the regime of interactive applications, code assistants, and real-time agents, decoding with a multi-billion-parameter model is almost entirely memory-bandwidth bound. The thousands of CUDA cores sitting idle while they wait on memory reads are the silent tax every LLM deployment pays.

This isn't a new observation. It has been the defining bottleneck of LLM serving since the GPT-2 era. Hardware vendors have thrown HBM3, NVLink, and ever-wider memory buses at the problem, but the constraint remains: autoregressive decoding serializes computation in a way that under-utilizes modern parallel hardware.

On May 23, 2026, NVIDIA released Nemotron-Labs Diffusion, a family of diffusion language models (DLMs) that attacks this problem at the architecture level. The models generate entire blocks of tokens in parallel, then iteratively refine them, rather than committing to one token at a time. The reported result: up to 6.4× more tokens per forward pass than an equivalent autoregressive baseline, with accuracy that exceeds comparable AR models.

This post is a deep technical dive into how diffusion language models work, what makes NVIDIA's approach different, and how you can start using them today.


2. Background: The Autoregressive Wall

[Figure: AR vs DLM token generation comparison]

To appreciate why diffusion language models matter, you need to understand precisely why autoregressive models hit a wall — and it's worth being specific, because the bottleneck is not where many engineers assume it is.

The Memory Bandwidth Problem

Modern LLMs are what inference engineers call memory-bandwidth bound at low batch sizes. Consider an 8B parameter model in BF16: that's roughly 16 GB of weight data. At batch size 1, generating a single token requires reading the vast majority of those 16 GB through the memory hierarchy. An H100 has ~3.35 TB/s of HBM bandwidth, which sounds fast — but reading 16 GB still takes roughly 4.8 ms of pure memory time. At batch size 1, you're looking at a theoretical ceiling of ~208 tokens/second purely from memory bandwidth limits, and that's before accounting for compute.
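The arithmetic above is easy to reproduce. The following is a back-of-the-envelope sketch, not a profiler: the function name is illustrative, and it assumes every weight is read exactly once per generated token while ignoring KV-cache reads, activations, and kernel overheads.

```python
def bandwidth_bound_decode(params_billion: float,
                           bytes_per_param: float,
                           hbm_tb_per_s: float) -> tuple[float, float]:
    """Return (ms per token, tokens/s ceiling) if weight reads were the only cost."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    seconds_per_token = weight_bytes / (hbm_tb_per_s * 1e12)
    return seconds_per_token * 1e3, 1.0 / seconds_per_token


# 8B parameters in BF16 (2 bytes each) on ~3.35 TB/s of HBM bandwidth
ms_per_token, tokens_per_s = bandwidth_bound_decode(8, 2, 3.35)
print(f"{ms_per_token:.2f} ms per token, ceiling of about {tokens_per_s:.0f} tokens/s")
```

The unrounded figures land within about 1% of the ~4.8 ms and ~208 tokens/s quoted above. Swapping in a different parameter count, precision, or bandwidth shows how little the ceiling moves when only the hardware improves.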

Increase the batch size and you amortize those memory reads across multiple sequences — but that trades per-request latency for throughput, which is the wrong tradeoff for interactive applications.

The Irreversibility Problem

There's a second, more subtle pathology in autoregressive generation: tokens are final once generated. If the model emits a poor token early in a sequence, all subsequent tokens are conditioned on that mistake. Mitigations such as beam search, or sampling several candidates and reranking them, add compute overhead without eliminating the root cause.

This is particularly painful in fill-in-the-middle (FIM) tasks — think code completion in the middle of a function — where the model needs to generate text that is coherent with both the preceding and following context simultaneously. Autoregressive models handle FIM by training on rearranged sequences or via special tokens, but they still decode left-to-right, never able to naturally revise a poor early commitment.

The KV Cache Ceiling

The KV cache is a standard optimization that stores the key and value tensors of prior tokens to avoid recomputing them on every step. But it introduces its own scaling constraints: KV cache size grows linearly with both sequence length and batch size. For a 70B-class model with a 32k context at batch size 8, the KV cache alone can reach roughly the full 80 GB of an A100-80GB (assuming grouped-query attention with 8 KV heads), before a single weight is loaded. That forces smaller batches, context truncation, or sharding across more GPUs.
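To see how quickly the cache grows, here is a rough sizing sketch. The layer count, KV-head count, and head dimension are assumptions chosen to resemble a Llama-2-70B-style configuration with grouped-query attention; they are not taken from this post, and other 70B models will differ.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer, token, and sequence."""
    # Factor of 2: one tensor for keys and one for values
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed config: 80 layers, 8 KV heads, head_dim 128, BF16 cache
total = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                       seq_len=32 * 1024, batch_size=8)
print(f"KV cache: {total / 2**30:.1f} GiB")  # → KV cache: 80.0 GiB
```

Doubling either the context length or the batch size doubles this figure, which is why long-context serving tends to run out of memory before it runs out of compute.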

These three problems — memory bandwidth, irreversibility, and KV cache pressure — are structural features of autoregressive decoding. Patching any one of them with engineering hacks (speculative decoding, flash attention, quantization) provides incremental relief. Diffusion language models address all three simultaneously at the architecture level.


3. What Are Diffusion Language Models? The Full Mental Model

[Figure: DLM iterative denoising process]

If you've worked with diffusion models for images (Stable Diffusion, DALL·E, Flux), you have the right mental model — with one critical adaptation for the discrete nature of text.

Image Diffusion vs. Text Diffusion

Image diffusion models work by:

  1. Forward process: Progressively add Gaussian noise to an image until it becomes pure noise
  2. Reverse process: Learn to iteratively denoise, recovering the original image step by step

For text, you can't add continuous Gaussian noise to discrete tokens. Instead, discrete diffusion models use a masking process:

  1. Forward process (masking): Progressively replace tokens with a special [MASK] token
  2. Reverse process (demasking): Learn to predict and fill in masked tokens, starting from a fully masked sequence

At inference time, you start with a fully masked target sequence. The model fills in token predictions across the entire sequence simultaneously, with low-confidence predictions remaining masked for subsequent refinement steps. After a fixed number of denoising steps (typically 10–50), the sequence has converged to a complete, coherent output.

Why This Beats AR for Throughput

The throughput gain is structural. In AR decoding:

  • N tokens = N forward passes
  • Each forward pass processes 1 new token (plus KV cache for context)

In DLM decoding with a block size of 32:

  • A 32-token block is proposed in 1 forward pass (every masked position gets a prediction simultaneously)
  • Subsequent passes refine only the positions that were not yet confident enough to commit
  • With high model confidence, a block converges in far fewer than 32 passes

The total compute is not necessarily lower, since each DLM forward pass over a 32-token block processes more tokens at once, but the parallelism maps much better to GPU hardware. Each read of the weights is amortized over a whole block instead of a single token, so arithmetic intensity rises and decoding shifts from memory-bound matrix-vector work toward the dense matrix multiplications GPUs are designed for.
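The pass-counting argument can be made concrete with a few lines of bookkeeping. This is only a sketch: the helper names are illustrative, and `steps_per_block` is a hypothetical average picked for the example, not a measured value for any model.

```python
import math


def ar_forward_passes(num_tokens: int) -> int:
    """Autoregressive decoding: one forward pass per generated token."""
    return num_tokens


def dlm_forward_passes(num_tokens: int, block_size: int, steps_per_block: int) -> int:
    """Block diffusion decoding: each block needs `steps_per_block` denoising passes."""
    return math.ceil(num_tokens / block_size) * steps_per_block


num_tokens, block_size, steps_per_block = 512, 32, 8  # steps_per_block is hypothetical
ar = ar_forward_passes(num_tokens)
dlm = dlm_forward_passes(num_tokens, block_size, steps_per_block)
print(f"AR: {ar} passes, DLM: {dlm} passes, tokens per forward pass: {num_tokens / dlm:.1f}")
```

The break-even point is visible in the formula: once a block needs as many denoising steps as it has tokens, the DLM does no fewer passes than AR, so the whole gain depends on blocks converging quickly.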

Bidirectional Attention: The Secret Sauce

AR models use causal (unidirectional) attention: each token can only attend to tokens that precede it. This enforces the left-to-right generation constraint at the architecture level.

DLMs use bidirectional attention within each generated block: every masked token can attend to every other token (masked or unmasked) in its context window simultaneously. This is what allows a DLM to generate tokens 1, 8, 15, and 27 of a 32-token block in one pass, each informed by the others — something architecturally impossible in an AR model.


4. The AR-to-DLM Conversion Breakthrough

The conceptual appeal of diffusion language models has existed for years. What stopped them from displacing autoregressive models was a hard practical barrier: training DLMs from scratch is catastrophically expensive.

An AR model learns a single conditional distribution P(token_t | token_1...t-1). A DLM must learn to denoise from any possible masking pattern — effectively learning P(token | any subset of other tokens). The number of possible masking patterns for a sequence of length N is 2^N. This combinatorial explosion means DLMs trained from scratch require orders of magnitude more data and compute to reach the same accuracy as AR models.

The NVIDIA Efficient-DLM Paper: The Key Insight

The breakthrough came from NVIDIA Research's Efficient-DLM paper (arXiv:2512.14067). The core insight:

You don't need to train DLMs from scratch. You can convert a pretrained AR model into a DLM via continued pretraining at a fraction of the original training cost.

A pretrained AR model has already learned rich representations of language structure, grammar, facts, and reasoning — all the hard semantic work. Converting it to support diffusion-style generation requires teaching it a new decoding mechanism, not new language knowledge.

The paper demonstrated this conversion requires only ~10 billion tokens of continued pretraining (versus the trillions needed from scratch) to achieve competitive accuracy. Extended training on ~100B tokens enables more aggressive parallel generation.

Block-Wise Attention: Preserving AR Weight Distributions

The first key technical contribution is the block-wise attention pattern. Rather than switching to fully bidirectional attention (which radically changes the attention structure and destroys the AR model's learned weight distributions), block-wise attention:

  • Maintains causal attention across blocks (block 2 cannot attend to tokens in block 3)
  • Enables bidirectional attention within each block (tokens within block 2 attend to each other freely)

This is a critical nuance. Fully bidirectional attention during conversion causes catastrophic forgetting — the model's pretrained weights "remember" causal attention patterns, and switching to full bidirectionality creates a mismatch that degrades accuracy. Block-wise attention preserves the causal structure across the sequence while enabling the parallel within-block generation that drives throughput.

A simplified view of the block-wise attention mask looks like this:

import torch

def block_wise_attention_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """
    Creates a block-wise attention mask for DLM conversion.
    - Causal across blocks: block i cannot attend to block j > i
    - Bidirectional within each block: all tokens in block i attend to each other

    Args:
        seq_len: Total sequence length
        block_size: Size of each attention block

    Returns:
        Boolean mask of shape (seq_len, seq_len)
        True = position is attended to, False = masked out
    """
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)

    num_blocks = (seq_len + block_size - 1) // block_size

    for block_idx in range(num_blocks):
        block_start = block_idx * block_size
        block_end = min(block_start + block_size, seq_len)

        # Each token in this block can attend to:
        # 1. All tokens in ALL previous blocks (causal cross-block)
        # 2. All tokens WITHIN this block (bidirectional intra-block)

        # Rows are the queries in this block; columns are every key up to the
        # end of the block (all previous blocks plus the block itself)
        mask[block_start:block_end, :block_end] = True

    return mask

# Example: 16-token sequence, block size 4
mask = block_wise_attention_mask(seq_len=16, block_size=4)
print(f"Mask shape: {mask.shape}")
print(f"Non-zero fraction: {mask.float().mean():.2%}")

# Visualize the mask structure
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 8))
plt.imshow(mask.numpy(), cmap='Blues', interpolation='nearest')
plt.title('Block-Wise Attention Mask (seq=16, block=4)\nBlue = attended, White = masked')
plt.xlabel('Key position')
plt.ylabel('Query position')
for i in range(0, 16, 4):
    plt.axhline(i - 0.5, color='red', linewidth=1.5)
    plt.axvline(i - 0.5, color='red', linewidth=1.5)
plt.tight_layout()
plt.savefig('/tmp/block_attn_mask.png', dpi=150)
print("Block attention mask visualization saved.")


Position-Dependent Token Masking: Closing the Train-Test Gap

The second key contribution addresses a subtle training-test distribution mismatch.

During training, masked language models typically use uniform random masking: each token is independently masked with the same fixed probability (e.g., 15% for BERT). But at inference time, a DLM uses confidence-based progressive unmasking: high-confidence tokens are committed first, and low-confidence tokens remain masked for refinement.

The problem: because language has strong left-to-right structure, confidence scores are heavily skewed toward earlier tokens in the sequence. The DLM's test-time behavior looks nothing like the uniform masking it was trained on — early tokens get committed immediately, later tokens stay masked longer.

NVIDIA's solution: position-dependent masking probability. During training, tokens at position p in a block are masked with probability:

P_mask(p) = base_prob + (p / block_size) * increase_factor


Later positions in a block get higher masking probabilities during training, better matching the left-to-right confidence distribution observed at inference. This seemingly simple change produced significant accuracy improvements across math, coding, and commonsense reasoning benchmarks.
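As an illustration of the formula above, the sketch below computes per-position masking probabilities for one block and draws a random mask from them. The `base_prob` and `increase_factor` values are hypothetical, and clamping to the [0, 1] range is an added safeguard rather than something the formula itself specifies.

```python
import random


def mask_probabilities(block_size: int, base_prob: float, increase_factor: float) -> list[float]:
    """P_mask(p) = base_prob + (p / block_size) * increase_factor, clamped to [0, 1]."""
    return [
        min(1.0, max(0.0, base_prob + (p / block_size) * increase_factor))
        for p in range(block_size)
    ]


def sample_mask(probs: list[float], rng: random.Random) -> list[bool]:
    """Draw an independent mask decision for every position (True = masked)."""
    return [rng.random() < q for q in probs]


probs = mask_probabilities(block_size=8, base_prob=0.3, increase_factor=0.4)  # hypothetical values
mask = sample_mask(probs, random.Random(0))
print([round(q, 2) for q in probs])
print(mask)
```

Averaged over many training samples, the last positions of each block are hidden more often than the first, which mirrors the order in which a confidence-based decoder tends to commit tokens.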


5. Nemotron-Labs Diffusion: Architecture and Three Generation Modes

[Figure: Three generation modes diagram]

Building on the Efficient-DLM research, NVIDIA released the Nemotron-Labs Diffusion model family today (May 23, 2026) — the first production-scale DLM family designed for real developer use.

The Model Family

Model | Parameters | Type | License | HF Downloads (launch day)
Nemotron-Labs-Diffusion-3B | 3B | Text | NVIDIA Nemotron Open | 14.2k
Nemotron-Labs-Diffusion-8B | 8B | Text | NVIDIA Nemotron Open | 19.7k
Nemotron-Labs-Diffusion-14B | 14B | Text | NVIDIA Nemotron Open | 1.99k
Nemotron-Labs-Diffusion-VLM-8B | 9B | Vision-Language | NVIDIA Source Code | 359

All text models come in both base and instruction-tuned chat variants. The VLM-8B extends diffusion generation to vision-language tasks — a first for DLMs at this scale.

Training details:

  • Pre-training: 1.3 trillion tokens on NVIDIA Nemotron Pretraining datasets
  • Supervised fine-tuning: 45 billion tokens on NVIDIA Nemotron Post-training datasets v3
  • Base model: Converted from a pretrained AR model using the Efficient-DLM methodology

Mode 1: Autoregressive (AR Mode)

# Enable AR mode via SGLang config
sampling_params = {
    "ar_mode": True,          # Plain autoregressive decoding
    "temperature": 0.7,
    "max_new_tokens": 512,
}


In AR mode, the DLM behaves identically to a standard causal LM. Every token is generated left-to-right, conditioning on all prior tokens. This mode exists primarily as a correctness baseline and for backward compatibility — if you're migrating an existing AR pipeline, you can validate the DLM produces equivalent outputs before switching to faster modes.

When to use: Regression testing, maximum output quality verification, tasks where exact AR parity is required.

Mode 2: FastDiffuser (Diffusion Mode)

# FastDiffuser: parallel block generation with confidence-threshold commitment
sampling_params = {
    "ar_mode": False,
    "diffusion_mode": "fast_diffuser",
    "block_size": 32,          # Tokens generated in parallel per block
    "confidence_threshold": 0.9,  # Commit tokens above this confidence
    "max_denoising_steps": 20,    # Maximum refinement iterations per block
    "temperature": 0.7,
    "max_new_tokens": 512,
}


FastDiffuser fills in a 32-token block by iteratively denoising it. At each step:

  1. The model scores every masked position and produces a probability distribution
  2. Tokens above the confidence threshold are "committed" (unmasked permanently)
  3. Remaining low-confidence positions stay masked for the next denoising step
  4. Repeat until all positions in the block are committed or max_denoising_steps is reached
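The four steps above can be sketched as a small loop. This is a toy under stated assumptions, not NVIDIA's implementation: `toy_model` is a hypothetical stand-in that returns random confidences instead of running a network. Two behaviours the list leaves open are also filled in by assumption: if nothing clears the threshold, the single most confident position is committed so the loop always makes progress, and on the last allowed step everything still masked is committed.

```python
import random

MASK = "[MASK]"


def toy_model(block, rng):
    """Hypothetical stand-in for one DLM forward pass over a block.

    Returns {position: (predicted_token, confidence)} for every masked position.
    """
    return {i: (f"tok{i}", rng.random()) for i, tok in enumerate(block) if tok == MASK}


def denoise_block(block_size, confidence_threshold, max_denoising_steps, rng):
    """Iteratively unmask one block (block_size >= 1); returns (tokens, steps_used)."""
    block = [MASK] * block_size
    for step in range(1, max_denoising_steps + 1):
        preds = toy_model(block, rng)            # 1. score every masked position
        last_step = step == max_denoising_steps
        commit = [i for i, (_, conf) in preds.items()
                  if conf >= confidence_threshold or last_step]
        if not commit:                           # assumption: always commit at least one
            commit = [max(preds, key=lambda i: preds[i][1])]
        for i in commit:                         # 2. commit confident tokens
            block[i] = preds[i][0]
        if MASK not in block:                    # 4. stop once the block is complete
            return block, step
        # 3. everything else stays masked for the next pass
    return block, max_denoising_steps


tokens, steps = denoise_block(block_size=32, confidence_threshold=0.9,
                              max_denoising_steps=20, rng=random.Random(0))
print(f"Filled {len(tokens)} positions in {steps} forward passes")
```

With random confidences the toy usually needs most of its step budget. A real model that is confident about many positions at once would clear the threshold for large parts of the block in the first few passes, which is where the throughput gain comes from.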

This mode achieves 2.6× higher Tokens Per Forward Pass (TPF) vs. AR baselines — a hardware-agnostic throughput metric that normalizes across GPU generations.

When to use: Batch inference, high-throughput serving, streaming completions where some latency increase is acceptable in exchange for throughput gains.

Mode 3: Self-Speculation (LinearSpec / QuadSpec)

Self-speculation is the most technically sophisticated mode and the biggest headline of the Nemotron-Labs release. It combines diffusion drafting with AR verification in a lossless hybrid:

# LinearSpec: diffusion drafts, AR verifies — lossless at temperature=0
sampling_params = {
    "ar_mode": False,
    "diffusion_mode": "linear_spec",   # or "quad_spec" for even higher TPF
    "block_size": 32,
    "temperature": 0.0,                # Lossless vs AR at temp=0
    "max_new_tokens": 512,
}


The self-speculation algorithm:

  1. Draft phase: The DLM generates a candidate block bidirectionally using diffusion mode
  2. Verify phase: The same model verifies the draft causally in a single AR forward pass
  3. Commit: The longest verified prefix that matches AR output is committed
  4. Iterate: Repeat from the first unverified token
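The draft/verify/commit loop can be demonstrated with toy stand-ins. Everything here is hypothetical: `ar_next` plays the role of the model's greedy causal prediction, `draft_block` plays the diffusion drafter and is made deliberately wrong at regular intervals, and `verify` emulates what a single causal forward pass over the draft would return. One detail is borrowed from standard speculative decoding as an assumption: at the first mismatch, the AR token from the verification pass is committed, so every round advances by at least one token.

```python
def ar_next(prefix):
    """Hypothetical greedy AR model: a deterministic function of the prefix."""
    return (31 * sum(prefix) + len(prefix)) % 50


def draft_block(prefix, block_size, error_every=5):
    """Hypothetical drafter: agrees with AR except at every `error_every`-th position."""
    ctx, draft = list(prefix), []
    for _ in range(block_size):
        tok = ar_next(ctx)
        if len(ctx) % error_every == 0:
            tok = (tok + 1) % 50          # inject a deliberate drafting mistake
        draft.append(tok)
        ctx.append(tok)
    return draft


def verify(prefix, draft):
    """AR's greedy choice at each draft position, given the draft tokens before it."""
    return [ar_next(list(prefix) + draft[:i]) for i in range(len(draft))]


def self_speculate(prompt, max_new_tokens, block_size=8):
    seq, verify_passes = list(prompt), 0
    while len(seq) - len(prompt) < max_new_tokens:
        draft = draft_block(seq, block_size)          # 1. draft
        targets = verify(seq, draft)                  # 2. verify (one causal pass)
        verify_passes += 1
        accepted = 0
        while accepted < len(draft) and draft[accepted] == targets[accepted]:
            accepted += 1
        seq.extend(draft[:accepted])                  # 3. commit the matching prefix
        if accepted < len(draft):
            seq.append(targets[accepted])             # AR's own token at the mismatch
    return seq[len(prompt):len(prompt) + max_new_tokens], verify_passes


def greedy_ar(prompt, max_new_tokens):
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(ar_next(seq))
    return seq[len(prompt):]


prompt = [1, 2, 3]
spec_tokens, passes = self_speculate(prompt, max_new_tokens=40)
assert spec_tokens == greedy_ar(prompt, 40)           # identical to plain greedy decoding
print(f"40 tokens in {passes} verification passes (plain AR needs 40)")
```

The assertion is the point of the exercise: every committed token is one the AR path would have chosen anyway, so the output cannot diverge under greedy decoding. The sketch counts only verification passes; in a real system the drafting passes also cost compute.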

At temperature=0, LinearSpec output is mathematically identical to AR output — there is no quality degradation. The speed comes entirely from the fact that the diffusion draft often predicts correctly, and the AR verification pass commits many tokens in a single pass. On NVIDIA B200 hardware running the SpeedBench dataset, LinearSpec hits ~865 tokens/second, approximately 4× the AR baseline on the same hardware.

QuadSpec takes this further with a quadratic verification strategy, achieving 6.4× TPF over AR at the cost of slightly higher compute per accepted token — optimal for maximum throughput scenarios.

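When to use: Any production deployment where you want AR-quality output at higher speed. At temperature=0, self-speculation reproduces the AR output exactly, so whenever drafts are accepted often enough to offset the cost of drafting, it is a drop-in speedup over plain AR.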


6. Performance Deep Dive

Understanding Tokens Per Forward Pass (TPF)

NVIDIA benchmarks Nemotron-Labs Diffusion using Tokens Per Forward Pass (TPF) rather than raw tokens-per-second. This is a deliberate, hardware-agnostic choice: raw tok/s varies with GPU clock speeds, batch sizes, and infrastructure — making cross-hardware comparison misleading. TPF normalizes for hardware by measuring how many output tokens are effectively generated per model forward pass.

Mode | TPF (vs AR baseline) | Tokens/sec on B200 | Quality vs AR
Autoregressive | 1× (baseline) | ~215 tok/s | Baseline
FastDiffuser | 2.6× | ~560 tok/s | Comparable
LinearSpec | ~4× | ~865 tok/s | Lossless at temp=0
QuadSpec | 6.4× | ~1,375 tok/s (estimated) | Comparable

Accuracy: Not a Tradeoff

A common assumption when optimizing inference is that speed comes at an accuracy cost. Nemotron-Labs Diffusion breaks this assumption:

  • Nemotron-Labs Diffusion 8B achieves +1.2% higher average accuracy compared to Qwen3 8B on a suite of math, coding, and reasoning benchmarks
  • Efficient-DLM 8B (the research model that Nemotron-Labs builds on) achieves +5.4% higher accuracy than Dream 7B with 4.5× higher throughput, and +2.7% accuracy over Qwen3 4B with 2.7× throughput

The accuracy improvements are attributed to: (a) the iterative refinement capability — the model can "reconsider" uncertain early tokens, (b) the bidirectional within-block context — tokens benefit from both preceding and following context when generated, and (c) the larger effective training compute on the Nemotron pretraining datasets.


7. Hands-On Guide

Getting started with Nemotron-Labs Diffusion requires either the HuggingFace transformers library (for standard inference) or SGLang (for production serving with mode switching). Here's a practical end-to-end guide:

Installation

# Core dependencies
# Quote the version specifiers so the shell does not treat ">" as a redirect
pip install "transformers>=4.45.0" "torch>=2.4.0" accelerate

# For SGLang production serving
# NOTE: DLM mode support is in active PR #25803 — check merge status before using
pip install "sglang[all]>=0.4.0"

# For visualization and benchmarking
pip install matplotlib numpy tqdm


Basic Inference with HuggingFace Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

MODEL_ID = "nvidia/Nemotron-Labs-Diffusion-8B"

# Load tokenizer and model
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

print("Loading model (this may take a few minutes)...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # BF16 for optimal performance
    device_map="auto",              # Automatically distributes across available GPUs
    trust_remote_code=True,
)
model.eval()
print(f"Model loaded on: {next(model.parameters()).device}")

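# Prepare a prompt using the tokenizer's own chat template
# (assumes an instruction-tuned checkpoint that ships one)
messages = [
    {"role": "system", "content": "You are a helpful assistant specializing in systems programming."},
    {"role": "user", "content": "Write a Python function that implements a lock-free ring buffer using atomic operations."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,   # Append the assistant turn marker
    return_dict=True,             # Return input_ids and attention_mask together
    return_tensors="pt",
).to(model.device)
input_length = inputs["input_ids"].shape[1]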

# --- Standard AR generation (baseline) ---
print("\n[AR Mode] Generating...")
start = time.perf_counter()
with torch.no_grad():
    ar_output = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,          # Greedy decoding
        temperature=1.0,
    )
ar_time = time.perf_counter() - start
ar_tokens = ar_output.shape[1] - input_length
print(f"AR: {ar_tokens} tokens in {ar_time:.2f}s ({ar_tokens/ar_time:.1f} tok/s)")
print(tokenizer.decode(ar_output[0][input_length:], skip_special_tokens=True))


SGLang Production Serving with Mode Switching

# server_launch.py: launch Nemotron-Labs Diffusion via SGLang
# Requires sglang with DLM support (PR #25803 merged). The ar_mode, diffusion_mode,
# and block_size arguments used below depend on that PR.

import sglang as sgl

# Launch the model server: a single config serves all three modes
runtime = sgl.Runtime(
    model_path="nvidia/Nemotron-Labs-Diffusion-8B",
    dtype="bfloat16",
    tp_size=1,                  # Tensor-parallel degree; increase for multi-GPU
    trust_remote_code=True,
)

@sgl.function
def generate_ar(s, prompt: str):
    """Autoregressive mode — maximum compatibility"""
    s += sgl.system("You are a helpful technical assistant.")
    s += sgl.user(prompt)
    s += sgl.assistant(
        sgl.gen(
            "response",
            max_tokens=512,         # sgl.gen takes max_tokens
            ar_mode=True,           # Key flag: enables AR mode
        )
    )

@sgl.function  
def generate_fast_diffuser(s, prompt: str):
    """FastDiffuser mode — 2.6x throughput"""
    s += sgl.system("You are a helpful technical assistant.")
    s += sgl.user(prompt)
    s += sgl.assistant(
        sgl.gen(
            "response",
            max_tokens=512,         # sgl.gen takes max_tokens, not max_new_tokens
            ar_mode=False,
            diffusion_mode="fast_diffuser",
            block_size=32,
        )
    )

@sgl.function
def generate_self_spec(s, prompt: str):
    """Self-speculation LinearSpec — ~4x throughput, lossless at temp=0"""
    s += sgl.system("You are a helpful technical assistant.")
    s += sgl.user(prompt)
    s += sgl.assistant(
        sgl.gen(
            "response",
            max_tokens=512,         # sgl.gen takes max_tokens, not max_new_tokens
            ar_mode=False,
            diffusion_mode="linear_spec",
            temperature=0.0,        # Lossless output vs AR at temp=0
        )
    )

# Benchmark all three modes
import time

test_prompt = "Explain the memory ordering semantics of std::atomic in C++ and when to use memory_order_acquire vs memory_order_seq_cst."

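# Register the runtime as the default backend so fn.run() knows where to send requests
sgl.set_default_backend(runtime)
try:
    for mode_name, fn in [("AR", generate_ar), ("FastDiffuser", generate_fast_diffuser), ("LinearSpec", generate_self_spec)]:
        start = time.perf_counter()
        state = fn.run(prompt=test_prompt)
        elapsed = time.perf_counter() - start
        response = state["response"]
        word_count = len(response.split())  # Whitespace-split words, only a rough proxy for tokens
        print(f"\n[{mode_name}] ~{word_count} words in {elapsed:.2f}s")
        print(f"Preview: {response[:200]}...")
finally:
    runtime.shutdown()  # Stop the server process and free the GPU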


Fill-in-the-Middle (FIM): Where DLMs Shine

One of the most compelling DLM use cases is fill-in-the-middle code completion — generating code that must be coherent with both preceding and following context. DLMs handle this naturally:

# FIM inference — DLMs are architecturally suited for this task
fim_prompt = """<|fim_prefix|>
def binary_search(arr: list[int], target: int) -> int:
    \"\"\"
    Search for target in a sorted array.
    Returns the index if found, -1 otherwise.
    Time complexity: O(log n)
    \"\"\"
    left, right = 0, len(arr) - 1

<|fim_suffix|>

    return -1  # Target not found
<|fim_middle|>"""

inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=False,
    )

generated = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print("FIM completion:")
print(generated)
# Expected: while left <= right: mid = (left + right) // 2 ...
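
The prompt layout above can be wrapped in a small helper so the same pattern is reusable from an editor plugin or a batch job. This is a minimal sketch: the sentinel strings are the ones used in the example above, while the function names `build_fim_prompt` and `splice_completion` are hypothetical. The example above also places newlines around the sentinels; whether the tokenizer expects them is an assumption to check against the model card.

```python
# Sentinel strings copied from the FIM example above.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt: prefix, then suffix, then the middle marker."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def splice_completion(prefix: str, middle: str, suffix: str) -> str:
    """Put the generated middle back between the code before and after the cursor."""
    return prefix + middle + suffix


# Hypothetical editor state: the cursor sits between these two fragments.
prefix = "def add(a, b):\n"
suffix = "\n    return result\n"
prompt = build_fim_prompt(prefix, suffix)

# Pretend the model returned this middle section.
completed = splice_completion(prefix, "    result = a + b", suffix)
print(completed)
```

The prompt string is what gets passed to the tokenizer; the spliced result is what gets written back into the file.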



8. Practical Engineering Considerations

Before you migrate your entire LLM serving stack to DLMs, there are real engineering tradeoffs to understand.

When to Use Which Mode

Use AR mode when:

  • You need strict output parity with an existing AR deployment during an A/B rollout
  • You're debugging unexpected DLM outputs and need a reference
  • Your application requires sampling with high temperature (>1.0) and you haven't validated DLM output quality at that temperature yet

Use FastDiffuser when:

  • You're running batch inference where throughput matters more than individual request latency
  • Your use case tolerates a small (typically <1%) quality delta vs. AR
  • You're serving code completion or summarization at scale

Use LinearSpec (Self-Speculation) when:

  • You want maximum throughput with zero quality regression
  • You're using greedy decoding (temperature=0) — LinearSpec is mathematically lossless here
  • You're building latency-sensitive interactive applications and every millisecond counts

Use QuadSpec when:

  • You're running offline batch jobs where maximum throughput is the only objective
  • You've validated the small quality delta against your specific task distribution
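
The guidance above can be encoded as a small routing function, which is handy when a gateway picks a mode per request. This is a sketch, not an official policy. The `Workload` fields and the `"ar"` and `"quad_spec"` labels are hypothetical names chosen for illustration; only `"fast_diffuser"` and `"linear_spec"` appear in the SGLang example earlier in the article.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a request class; field names are illustrative."""
    temperature: float = 0.0
    needs_ar_parity: bool = False          # A/B rollout or debugging against an AR reference
    offline_batch: bool = False            # throughput is the only objective
    quality_delta_validated: bool = False  # small quality delta checked on your own task mix


def choose_mode(w: Workload) -> str:
    """Map a workload to a decoding mode, following the rules of thumb above."""
    if w.needs_ar_parity or w.temperature > 1.0:
        return "ar"
    if w.offline_batch and w.quality_delta_validated:
        return "quad_spec"
    if w.temperature == 0.0:
        return "linear_spec"      # lossless under greedy decoding
    if w.quality_delta_validated:
        return "fast_diffuser"
    return "ar"                   # conservative default until quality is validated


print(choose_mode(Workload(temperature=0.0)))
print(choose_mode(Workload(temperature=0.7, quality_delta_validated=True)))
```

Falling back to AR when nothing has been validated keeps the default conservative; loosen it once you have task-specific quality numbers.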

Batch Size Effects

DLMs have a different batch size curve than AR models. AR models benefit significantly from batching because each weight read from GPU memory is reused across every request in the batch, which amortizes the memory-bandwidth cost of a decode step. DLMs benefit less from batching (their within-block parallelism already keeps compute units busy at batch size 1) but also degrade less at small batch sizes, which is where AR models suffer most.

In practice, if your P50 batch size in production is below 4, DLMs in self-speculation mode are likely to be strictly superior to AR models on both throughput and per-request latency.
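
A back-of-the-envelope calculation shows why small batches favor block-wise decoding. The sketch below uses the common dense-model estimate of about 2 FLOPs per weight per token and 2 bytes per weight in BF16. It ignores attention FLOPs, KV cache traffic, and the fact that not every token in a block is necessarily accepted, so treat the numbers as an illustration of the trend rather than a performance prediction. The function names are hypothetical.

```python
def tokens_per_pass(batch_size: int, block_size: int = 1) -> int:
    """Tokens processed in one forward pass: AR uses block_size=1, a DLM uses its block size."""
    return batch_size * block_size


def arithmetic_intensity(tokens: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs performed per byte of weights read in one forward pass."""
    flops_per_param = 2.0 * tokens  # one multiply and one add per weight per token
    return flops_per_param / bytes_per_param


for batch in (1, 4, 32):
    ar = arithmetic_intensity(tokens_per_pass(batch))
    dlm = arithmetic_intensity(tokens_per_pass(batch, block_size=32))
    print(f"batch={batch:>2}  AR: {ar:6.1f} FLOPs/byte   DLM (block 32): {dlm:7.1f} FLOPs/byte")
```

Under these assumptions, a DLM with a 32-token block at batch size 1 does as much work per byte of weights as an AR model at batch size 32.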

KV Cache Behavior

Block-wise attention is KV-cache compatible by design. Within each block, all positions are computed simultaneously, and their KV values are cached for use by subsequent blocks. This is a key advantage over earlier DLM architectures that required full re-computation on every denoising step — a major engineering win from the Efficient-DLM paper.
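
One way to picture this is as a block-causal attention mask: a position sees every position in its own block plus everything in earlier blocks, and nothing from later blocks. The sketch below builds such a mask in plain Python. It assumes attention is bidirectional within a block, and it illustrates the general pattern rather than the model's actual kernel.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    return [
        [(k // block_size) <= (q // block_size) for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because no position ever attends to a later block, the keys and values of a finished block never change, which is what makes them safe to cache.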

Memory usage for Nemotron-Labs Diffusion at equivalent context lengths is comparable to AR models, with a slight overhead from the block size padding. For a 32-token block size, you'll see a maximum of 31 "wasted" positions at sequence boundaries — negligible in practice.
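
The padding figure follows from rounding the sequence length up to the next multiple of the block size. A minimal sketch of that arithmetic, assuming padding works by simple round-up (the function names are hypothetical):

```python
def padded_length(seq_len: int, block_size: int = 32) -> int:
    """Round seq_len up to the next multiple of block_size."""
    return -(-seq_len // block_size) * block_size  # ceiling division, then scale back up


def wasted_positions(seq_len: int, block_size: int = 32) -> int:
    """Padding positions added at the sequence boundary."""
    return padded_length(seq_len, block_size) - seq_len


for n in (32, 33, 100, 4096):
    print(f"seq_len={n:>4}  padded={padded_length(n):>4}  wasted={wasted_positions(n)}")
```

The worst case is a sequence one token past a block boundary, which wastes `block_size - 1` positions regardless of how long the sequence is.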


9. The Bigger Picture: What DLMs Mean for the LLM Ecosystem

Nemotron-Labs Diffusion is not just an incremental performance win. It represents a fundamental bifurcation in how the industry thinks about LLM architecture and inference.

The Speculative Decoding Landscape Shifts

Speculative decoding — using a small draft model to propose tokens that a large verifier model accepts or rejects — has become a popular technique for AR acceleration. DLM self-speculation achieves similar or better speedups using only a single model for both drafting and verification. This eliminates the complexity of maintaining two model versions, managing draft/verifier alignment, and the memory overhead of running two models in tandem.

For teams currently running speculative decoding pipelines, DLM self-speculation is architecturally simpler and achieves comparable or superior throughput numbers.
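
The draft-and-verify idea, and the reason it is lossless under greedy decoding, can be shown with a toy acceptance loop. This is a generic sketch of greedy speculative verification, not NVIDIA's implementation. The `verifier_next` callable stands in for the model's greedy next-token choice, and the verifier is called sequentially here for clarity, whereas a real system scores all draft positions in one forward pass.

```python
from typing import Callable, List


def verify_draft(
    context: List[int],
    draft: List[int],
    verifier_next: Callable[[List[int]], int],
) -> List[int]:
    """Accept draft tokens while they match the verifier's greedy choice.

    On the first mismatch, emit the verifier's token instead and stop. If the
    whole draft matches, emit one extra verifier token. The result is therefore
    always a prefix of what pure greedy decoding would have produced.
    """
    accepted: List[int] = []
    ctx = list(context)
    for tok in draft:
        expected = verifier_next(ctx)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(verifier_next(ctx))
    return accepted


def toy_verifier(ctx: List[int]) -> int:
    """Stand-in for a model: the next token is the last token plus one, modulo 10."""
    return (ctx[-1] + 1) % 10


fully_accepted = verify_draft([1, 2], [3, 4, 5], toy_verifier)
partly_accepted = verify_draft([1, 2], [3, 4, 9, 6], toy_verifier)
print(fully_accepted, partly_accepted)
```

Every step emits at least one token, so a bad draft costs throughput but never changes the output. In self-speculation the draft and the verifier are the same set of weights running in different modes.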

Edge and On-Device Implications

The 3B Nemotron-Labs Diffusion model passed 14,000 downloads on launch day, suggesting significant interest from developers targeting constrained hardware. Edge deployment typically runs at batch size 1 on mid-range devices, which is exactly the regime where DLMs' memory-bandwidth efficiency advantage is largest.

The VLM-8B variant (vision-language) extends these benefits to multimodal tasks, suggesting a future where on-device vision-language assistants run at interactive speeds without dedicated NPU hardware.

The Research Frontier Ahead

The Efficient-DLM conversion methodology enables a compelling path: pretrain a powerful AR model (leverage the entire AR training ecosystem), then convert it to a DLM in a few billion tokens of continued training. This means every future large AR model — Qwen, Llama, Mistral — is a candidate for DLM conversion.

The immediate research questions the community will pursue:

  • Longer block sizes: Can blocks of 64 or 128 tokens be made reliable? This would push TPF gains even higher.
  • Speculative DLM cascades: Can you chain DLMs of different sizes for even more aggressive speculative gains?
  • Instruction fine-tuning alignment: How does DLM generation affect RLHF-trained alignment properties?
  • Stochastic generation quality: Self-speculation is currently guaranteed to be lossless only at temperature=0. Extending that guarantee to sampled generation is an open problem.

10. Conclusion

The autoregressive paradigm has dominated language model generation since the original GPT paper. It has been enormously successful — but it carries a fundamental structural tax that grows more expensive as models scale and as applications demand lower latency and higher throughput.

Diffusion language models attack this tax at the architecture level. By generating tokens in parallel blocks and refining them iteratively, DLMs unlock the full compute capacity of modern GPU hardware — delivering throughput gains that no amount of systems-level optimization can achieve on a strictly autoregressive model.

NVIDIA's Nemotron-Labs Diffusion (released today) is the clearest proof-of-concept at production scale: a family of 3B, 8B, and 14B models that beat Qwen3 8B on accuracy and deliver up to 6.4× throughput gains, all while remaining compatible with existing deployment tooling via a single flag in SGLang.

The AR-to-DLM conversion technique from the Efficient-DLM paper means this improvement is replicable across any capable pretrained model. We are likely entering a period where every frontier model has a DLM variant — and where autoregressive-only serving becomes the legacy choice.

The models are live on HuggingFace today. Here's your three-step action plan:

  1. pip install transformers and load nvidia/Nemotron-Labs-Diffusion-3B — it fits on a single consumer GPU in BF16
  2. Run your existing benchmark suite in AR mode to establish a baseline
  3. Flip to linear_spec mode (temperature=0), re-run, and measure throughput delta
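
For steps 2 and 3, a small timing harness keeps the comparison honest by measuring both modes the same way. This is a sketch under stated assumptions: `measure_tok_per_s` and `speedup` are hypothetical helpers, and `generate` is any zero-argument callable you supply that runs one request and returns the number of tokens it produced. The stub below only simulates work with `time.sleep`.

```python
import time
from typing import Callable


def measure_tok_per_s(generate: Callable[[], int], runs: int = 3) -> float:
    """Run `generate` several times and return the average tokens per second."""
    generate()  # warm-up run, excluded from timing (caches, lazy initialization)
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def speedup(baseline_tok_s: float, candidate_tok_s: float) -> float:
    """Throughput ratio of the candidate mode over the baseline."""
    return candidate_tok_s / baseline_tok_s


def fake_generate() -> int:
    """Stub standing in for a real model call; replace with your own request."""
    time.sleep(0.01)
    return 128


rate = measure_tok_per_s(fake_generate, runs=2)
print(f"{rate:.0f} tok/s")
print(f"speedup: {speedup(50.0, 200.0):.1f}x")
```

Use the same prompts, the same `max_new_tokens`, and real tokenizer counts for both runs, otherwise the ratio mostly reflects differences in output length.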

If your use case is latency-sensitive and you're still on a pure autoregressive stack, the gap between you and teams running DLMs will only widen from here.


Tags: diffusion-language-models llm-inference nvidia nemotron generative-ai machine-learning transformers mlops gpu-optimization sglang