Diffusion Language Models Are Here: Deep Dive into NVIDIA's Nemotron-Labs DLM Architecture
Manoranjan R · 2026-05-24 · via DEV Community

Meta Description: NVIDIA just open-sourced Nemotron-Labs Diffusion — a family of 3B, 8B, and 14B diffusion language models that merge autoregressive and diffusion generation for up to 6.4× faster inference. Here's the complete technical deep dive into the architecture, training methodology, three generation modes, and how to run it today with SGLang.



Table of Contents

  1. The Speed Wall Autoregressive LLMs Hit
  2. What Are Diffusion Language Models?
  3. Why DLMs Struggled — Until Now
  4. NVIDIA's AR-to-DLM Breakthrough: Efficient-DLM
  5. Nemotron-Labs Diffusion: The Model Family
  6. Three Generation Modes: AR, Diffusion, Self-Speculation
  7. Performance Deep Dive: The Numbers That Matter
  8. Under the Hood: Block-Wise Attention & KV Caching
  9. Getting Started: Running with SGLang
  10. What This Means for Production LLM Infrastructure
  11. Conclusion & The Road Ahead

1. The Speed Wall Autoregressive LLMs Hit

Every language model you've ever used — GPT-4, Claude, Llama, Qwen — generates text the same fundamental way: one token at a time, left to right, each new token conditioned on every previous one. It's called autoregressive (AR) generation, and it's been the undisputed king of language modeling since the original GPT paper in 2018.

But AR generation has a dirty secret. It's not a compute-bound problem. It's a memory-bandwidth-bound problem.

Here's why that matters: each new token requires a full model forward pass. That means streaming all of the model's weights (about 14 GB for a 7B model in FP16) from HBM (High Bandwidth Memory) into the GPU's compute cores on every single decoding step. Modern GPUs have enormous arithmetic throughput, so memory bandwidth becomes the bottleneck. This is why serving an LLM at batch size 1, a single user chatting with your model, leaves your GPU vastly underutilized.

The math is brutal. An A100 80GB GPU has ~2 TB/s of HBM bandwidth. A 7B-parameter model in FP16 takes ~14 GB, so reading every weight once takes at least ~7 ms per step. That caps single-stream decoding at roughly 140 tokens/second no matter how fast the arithmetic units are, and real deployments typically land below that ceiling once KV-cache reads and kernel overheads are added. Scale this to a production API endpoint handling thousands of concurrent users, and the economics become painful.
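The bandwidth arithmetic above can be checked in a few lines. This is a minimal sketch using the figures quoted in the paragraph (about 2 TB/s of HBM bandwidth, 7B parameters at 2 bytes each); the function name is illustrative, and the result is a lower bound that ignores KV-cache reads and kernel overheads.

```python
def min_step_time_s(num_params: float, bytes_per_param: float,
                    hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step: every weight is read from HBM once."""
    return num_params * bytes_per_param / hbm_bytes_per_s


weight_bytes = 7e9 * 2                      # 7B parameters in FP16
step_s = min_step_time_s(7e9, 2, 2e12)      # ~2 TB/s of HBM bandwidth
ceiling_tps = 1.0 / step_s                  # single-stream tokens/second ceiling

print(f"weights        : {weight_bytes / 1e9:.0f} GB")
print(f"min step time  : {step_s * 1e3:.1f} ms")
print(f"batch-1 ceiling: {ceiling_tps:.0f} tokens/s")
```

The ceiling only moves if you read fewer bytes per token (quantization) or emit more tokens per weight read, which is the lever diffusion-style decoding pulls.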

The community has attacked this problem from many angles: speculative decoding (using a small draft model to propose tokens verified by the large model), quantization (FP8, INT4 to shrink weight footprint), and FlashAttention (an IO-aware attention kernel that cuts HBM traffic during attention). These are all incremental improvements on the same fundamental loop.

NVIDIA's Nemotron-Labs Diffusion — released on HuggingFace on May 23, 2026 — is taking a fundamentally different approach. Instead of optimizing the autoregressive loop, it breaks the loop entirely.


2. What Are Diffusion Language Models?

If you've worked with image generation models (Stable Diffusion, DALL·E, Flux), you already know the concept of denoising diffusion. The idea is to start with pure noise and iteratively denoise it, guided by a conditioning signal, until you arrive at a coherent output.

Diffusion Language Models (DLMs) apply this same paradigm to text. Instead of generating tokens left-to-right, a DLM:

  1. Starts with a sequence of masked or noisy tokens (analogous to Gaussian noise in image diffusion)
  2. Runs multiple denoising refinement steps, predicting the clean token distribution at each step
  3. After several iterations, the entire sequence — or a large block of it — converges to the final output

[Figure: Autoregressive vs Diffusion Language Model Architecture]

The key theoretical advantage is parallelism. In a standard AR model, token t can only be generated after token t-1 exists. In a DLM, all positions in a block are refined simultaneously in each forward pass. This changes the computational profile dramatically: instead of being memory-bandwidth-bound by sequential weight loads, the GPU can be kept busy with dense matrix multiplications across the full block.

The conceptual roots of DLMs trace back to Masked Diffusion Language Models (MDLMs) — work like MDLM (Sahoo et al., 2024) and SEDD (Lou et al., 2023) — that framed text generation as a discrete denoising process over masked token sequences. However, these models had significant practical shortcomings when compared to the state-of-the-art AR models of the day. NVIDIA's work specifically addresses why, and more importantly, how to fix it.


3. Why DLMs Struggled — Until Now

The community has known about the theoretical appeal of diffusion language models for years. The reason they haven't taken over is a cluster of practical barriers that made them non-competitive with AR models in production:

1. Accuracy Gap: DLMs trained from scratch consistently underperformed comparably-sized AR models on standard benchmarks. The discrete, iterative denoising process is harder to optimize than the clean causal language modeling objective. Models like Dream 7B were impressive for DLMs, but still lagged behind Qwen3 4B — a smaller AR model — on reasoning and knowledge tasks.

2. Training Instability: Jointly learning to denoise across many noise levels with a bidirectional attention mask creates a different gradient landscape than causal language modeling. Loss curves are noisier, and the model is more sensitive to hyperparameter choices.

3. No KV Cache Compatibility: This was the killer for inference efficiency. KV caching — where you store key/value activations from previous tokens to avoid recomputing them — is the single most important optimization for AR inference. Standard DLMs use fully bidirectional attention across the entire sequence, which means you can't cache anything: every refinement step needs to attend over all positions with the updated token states. This essentially erased the theoretical throughput advantage.

4. Fill-in-the-Middle Mismatch: During DLM training, tokens are masked uniformly at random across the sequence. But at inference time, the model typically has a left-side prefix (the prompt) that is fully unmasked, and must fill in the right side. This creates a training-test distribution mismatch that degrades quality.

Each of these problems has a specific technical solution in NVIDIA's Efficient-DLM framework. Let's dig in.


4. NVIDIA's AR-to-DLM Breakthrough: Efficient-DLM

The foundational insight behind Nemotron-Labs Diffusion (and the academic paper it builds on, arXiv:2512.14067) is deceptively simple: don't train DLMs from scratch — convert pretrained AR models into DLMs.

This avoids the accuracy gap problem entirely. You start with a model that already has world-class knowledge and reasoning capabilities baked into its weights, then teach it to also generate diffusion-style. The result is a model that retains AR accuracy while gaining diffusion parallelism.

But there are two critical technical challenges to solve for this conversion to work.

4.1 Block-Wise Attention: Preserving Weights, Enabling KV Caching

The attention mechanism is the crux of the problem. A standard AR model uses causal (lower-triangular) attention — each token attends only to itself and all previous tokens. A standard DLM uses bidirectional (full) attention — every token attends to every other token.

The issue: if you convert an AR model and suddenly change to fully bidirectional attention, you've broken the statistical assumptions baked into all those attention weights during pretraining. The key-value projections were trained to operate in a causal setting; they "expect" not to see future tokens. Loading them into a fully bidirectional context produces degraded output and requires extensive retraining to recover.

Efficient-DLM introduces block-wise causal attention as the solution:

  • The sequence is divided into non-overlapping blocks of size B (e.g., 32 tokens)
  • Within each block: full bidirectional attention (every token attends to every other token in the block)
  • Across blocks: standard left-to-right causal attention (block i can attend to blocks 0 through i-1)

[Figure: Block-wise Attention Pattern with KV Caching]

This hybrid pattern does something clever: it's structurally similar enough to causal attention that pretrained weight distributions are preserved — the model only needs to learn bidirectionality locally within blocks, not globally across the whole sequence. The result is a much smoother conversion that requires far less compute to recover quality.

Crucially, this also re-enables KV caching. Since attention is still causal across blocks, the KV activations of completed (committed) blocks can be cached and reused exactly like in a standard AR model. Only the current block being refined needs to be recomputed each refinement step.

4.2 Position-Dependent Token Masking

The second innovation addresses the training-test distribution mismatch. Instead of masking tokens uniformly at random during training, Efficient-DLM uses a position-dependent masking strategy that assigns higher masking probabilities to tokens in later positions in the sequence.

The intuition: at inference time, when filling in a response to a prompt, earlier tokens in the response have already been decided (or are more constrained by the left-side context), while later tokens remain more uncertain. By skewing the training mask distribution to match this pattern, the model learns a denoising objective that better mirrors what it actually faces at test time.
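To make the idea concrete, here is a small sketch of a position-dependent mask sampler. The linear ramp from `p_first` to `p_last` is an illustrative assumption, not the schedule Efficient-DLM actually uses, and all names are hypothetical.

```python
import random


def position_dependent_mask(seq_len: int, p_first: float, p_last: float,
                            rng: random.Random) -> list[bool]:
    """Sample a mask where later positions are masked with higher probability.

    Assumption: masking probability ramps linearly from p_first to p_last.
    """
    mask = []
    for pos in range(seq_len):
        frac = pos / (seq_len - 1) if seq_len > 1 else 0.0
        p_mask = p_first + (p_last - p_first) * frac
        mask.append(rng.random() < p_mask)
    return mask


rng = random.Random(0)
seq_len, trials = 16, 2000
counts = [0] * seq_len
for _ in range(trials):
    sample = position_dependent_mask(seq_len, 0.1, 0.9, rng)
    for pos, masked in enumerate(sample):
        counts[pos] += masked          # True counts as 1

rates = [c / trials for c in counts]
print("empirical mask rate, first position:", round(rates[0], 2))
print("empirical mask rate, last position :", round(rates[-1], 2))
```

Uniform masking would give every position the same rate; the skew is what brings training closer to the "prefix known, suffix unknown" situation the model faces at inference.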

4.3 Joint AR + Diffusion Training Objective

Rather than optimizing purely for the diffusion objective, Nemotron-Labs Diffusion is trained with a joint AR and diffusion loss:

L_total = λ · L_AR + (1 - λ) · L_diffusion


Where L_AR is the standard cross-entropy causal language modeling loss and L_diffusion is the masked diffusion objective. This joint training ensures the model remains a first-class AR model while learning the diffusion generation capability.
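A toy calculation shows how the two terms combine. This is a sketch under simplifying assumptions: the probabilities are made up, `lam = 0.5` is a placeholder rather than the value NVIDIA used, and the diffusion term is reduced to plain cross-entropy over the masked positions (real masked-diffusion objectives usually add a noise-level weighting).

```python
import math


def cross_entropy(probs: list[list[float]], targets: list[int],
                  positions: list[int]) -> float:
    """Mean negative log-likelihood of the target token at the given positions."""
    return -sum(math.log(probs[i][targets[i]]) for i in positions) / len(positions)


def joint_loss(l_ar: float, l_diffusion: float, lam: float) -> float:
    """L_total = lam * L_AR + (1 - lam) * L_diffusion."""
    return lam * l_ar + (1.0 - lam) * l_diffusion


targets = [2, 0, 1, 2]                      # 4 tokens, vocabulary of size 3

# Predicted distributions from the causal (AR) pass: loss on every position.
ar_probs = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]
l_ar = cross_entropy(ar_probs, targets, positions=[0, 1, 2, 3])

# Predicted distributions from the masked pass: loss only where tokens were masked.
diff_probs = [[1 / 3] * 3, [0.5, 0.25, 0.25], [1 / 3] * 3, [0.2, 0.2, 0.6]]
l_diffusion = cross_entropy(diff_probs, targets, positions=[1, 3])

total = joint_loss(l_ar, l_diffusion, lam=0.5)
print(f"L_AR={l_ar:.3f}  L_diffusion={l_diffusion:.3f}  L_total={total:.3f}")
```

Setting `lam` to 1 recovers ordinary causal language modeling, and setting it to 0 recovers a pure diffusion objective.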

The pretrained base was trained on 1.3 trillion tokens from NVIDIA's Nemotron pretraining datasets, with an additional 45 billion tokens of supervised fine-tuning data for the instruct-tuned variants.


5. Nemotron-Labs Diffusion: The Model Family

NVIDIA released seven model checkpoints on HuggingFace under the NVIDIA Nemotron Open Model License (commercially friendly for text models):

| Model | Parameters | Type | Downloads (Day 1) |
| --- | --- | --- | --- |
| nvidia/Nemotron-Labs-Diffusion-3B | ~4B | Text, Instruct | 14.7K |
| nvidia/Nemotron-Labs-Diffusion-3B-Base | ~4B | Text, Base | 14.2K |
| nvidia/Nemotron-Labs-Diffusion-8B | 8B | Text, Instruct | 24.1K |
| nvidia/Nemotron-Labs-Diffusion-8B-Base | 8B | Text, Base | 228K |
| nvidia/Nemotron-Labs-Diffusion-14B | 14B | Text, Instruct | 3.28K |
| nvidia/Nemotron-Labs-Diffusion-14B-Base | 14B | Text, Base | 1.18K |
| nvidia/Nemotron-Labs-Diffusion-VLM-8B | ~9B | Vision-Language | 590 |

The 8B Base model being the most downloaded (228K in under 2 days) reflects developer interest in using it as a foundation for custom fine-tuning.


6. Three Generation Modes: AR, Diffusion, Self-Speculation

The standout design decision in Nemotron-Labs Diffusion is that all three generation modes are supported from a single checkpoint. You don't need different models — just a different deployment config in SGLang.

[Figure: Three Generation Modes Performance Comparison]

Mode 1: Autoregressive (ar_mode=true)

Standard left-to-right token generation, identical to how you'd run any other causal LM. This mode is the correctness baseline — most useful for debugging, A/B testing against existing pipelines, or when you need strict adherence to specific decoding behaviors.

Use when: Debugging, regression testing, or exact reproduction of AR outputs.

Mode 2: Diffusion / FastDiffuser (diffusion_mode=true)

The model fills in a block of 32 tokens simultaneously, running multiple denoising refinement steps per block. A confidence threshold determines which tokens are "committed" after each refinement pass — tokens whose predicted distribution is peaked enough get locked in, reducing the number of positions that need further refinement.

The process per block:

  1. Initialize block positions with mask tokens
  2. Forward pass with block-wise attention → predict token distributions over all positions
  3. Commit tokens above confidence threshold; keep others masked
  4. Repeat steps 2–3 until all positions are committed or max steps reached
  5. Move to next block, using committed block tokens in KV cache

This mode achieves 2.6× more tokens per forward pass (TPF) than AR.
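The per-block loop described above can be sketched with a stand-in for the model. Everything here is illustrative: `toy_predict` fakes a forward pass whose confidence rises as more of the block is committed, and the "commit the single best token if nothing clears the threshold" rule is an assumption added so every pass makes progress.

```python
MASK = None  # placeholder for a masked position


def toy_predict(block: list, target: list, base_conf: list[float]) -> list[tuple]:
    """Stand-in for one forward pass: (token, confidence) for every position."""
    committed = sum(tok is not MASK for tok in block)
    boost = 0.15 * committed                 # more context, more confidence
    return [(target[i], min(1.0, base_conf[i] + boost)) for i in range(len(block))]


def decode_block(target: list, base_conf: list[float],
                 threshold: float = 0.9, max_steps: int = 16) -> tuple[list, int]:
    block = [MASK] * len(target)             # step 1: start fully masked
    passes = 0
    while MASK in block and passes < max_steps:
        preds = toy_predict(block, target, base_conf)    # step 2: forward pass
        passes += 1
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        ready = [i for i in masked if preds[i][1] >= threshold]
        if not ready:
            # Assumption: commit the most confident position to guarantee progress.
            ready = [max(masked, key=lambda i: preds[i][1])]
        for i in ready:                      # step 3: commit confident tokens
            block[i] = preds[i][0]
    if MASK in block:                        # max_steps hit: fill the remainder
        preds = toy_predict(block, target, base_conf)
        passes += 1
        block = [preds[i][0] if tok is MASK else tok for i, tok in enumerate(block)]
    return block, passes


target = list("diffusion")
base_conf = [0.95, 0.5, 0.7, 0.92, 0.4, 0.6, 0.8, 0.3, 0.91]
block, passes = decode_block(target, base_conf)
print("".join(block), "| forward passes:", passes, "| TPF:", len(block) / passes)
```

In this toy run, nine tokens are committed in three passes. A higher threshold would need more passes and a lower one fewer, which is the quality-speed dial the mode exposes.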

Use when: High-throughput batch serving where speed matters more than exact AR equivalence.

Mode 3: Self-Speculation / LinearSpec (self_speculation=true)

This is the most sophisticated mode — it fuses diffusion and autoregressive decoding into a single hybrid loop:

  1. The model uses diffusion to draft a full block of k candidate tokens bidirectionally (fast, parallel)
  2. It then uses autoregressive decoding to verify the draft tokens causally from left to right
  3. Any prefix of the draft that matches what AR would have produced gets committed
  4. The process restarts from the first disagreement position

The same model plays both roles (drafter and verifier). Output is lossless vs AR at temperature=0.
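The draft-then-verify loop is easier to trust once you see why it cannot change the output. Below is a toy sketch: `ar_next` stands in for greedy AR decoding, `toy_draft` stands in for the diffusion drafter (it deliberately disagrees at every fifth position), and keeping the verifier's own token at the first disagreement is an assumption borrowed from standard speculative decoding. None of these names come from the Nemotron or SGLang code.

```python
def ar_next(prefix: list[int]) -> int:
    """Toy deterministic 'AR model': greedy next token given the prefix."""
    return (sum(prefix) * 31 + len(prefix) * 7) % 50


def toy_draft(prefix: list[int], k: int) -> list[int]:
    """Toy drafter: agrees with AR except at every 5th absolute position."""
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = ar_next(ctx)
        if len(ctx) % 5 == 4:
            tok = (tok + 1) % 50             # injected disagreement
        draft.append(tok)
        ctx.append(tok)
    return draft


def greedy_ar(prompt: list[int], n: int) -> list[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n:
        out.append(ar_next(out))
    return out[len(prompt):]


def self_speculate(prompt: list[int], n: int, k: int = 8) -> tuple[list[int], int]:
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n:
        draft = toy_draft(out, k)            # one parallel draft pass
        passes += 1
        # One causal verify pass scores all k positions at once in a real model;
        # calling ar_next per position here only emulates that.
        passes += 1
        ctx = list(out)
        for tok in draft:
            expected = ar_next(ctx)
            ctx.append(expected)             # the verifier's token is always kept
            if tok != expected:
                break                        # first disagreement ends this round
        out = ctx
    return out[len(prompt):len(prompt) + n], passes


prompt = [3, 1, 4]
spec_out, spec_passes = self_speculate(prompt, n=40)
ar_out = greedy_ar(prompt, n=40)
print("identical to greedy AR:", spec_out == ar_out)
print("forward passes:", spec_passes, "vs", len(ar_out), "for plain AR")
```

Because every committed token is one the verifier would have produced anyway, the drafter's quality affects only how many tokens each round accepts, never what the output is.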

Key numbers: LinearSpec achieves ~6× higher TPF than AR, and ~865 tokens/second on NVIDIA B200 hardware — roughly 4× the AR baseline on identical hardware.

Use when: Production serving where you need maximum speed with no quality compromise.


7. Performance Deep Dive: The Numbers That Matter

Accuracy vs Qwen3 8B:
Nemotron-Labs Diffusion 8B achieves +1.2% higher average accuracy compared to Qwen3 8B across evaluated benchmarks. The DLM conversion doesn't hurt quality — it slightly improves it, likely because the joint AR+diffusion training objective acts as an additional regularizer.

vs Dream 7B (prior DLM SOTA):
Efficient-DLM 8B achieves +5.4% higher accuracy and 4.5× higher throughput compared to Dream 7B — a decisive improvement over the previous DLM state-of-the-art.

Throughput (Tokens Per Forward Pass — TPF):

| Mode | TPF (relative to AR) | Quality vs AR |
| --- | --- | --- |
| Autoregressive | 1× (baseline) | Exact match |
| Diffusion (FastDiffuser) | 2.6× | Slightly different |
| Self-Spec Linear (LinearSpec) | ~6× | Lossless at T=0 |
| Self-Spec Quadratic (QuadSpec) | ~6.4× | Lossless at T=0 |

TPF (Tokens Per Forward Pass) is a hardware-agnostic efficiency metric — it measures how many output tokens you get per model forward pass, making it useful for comparing across different GPU generations.
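To get a feel for what those TPF figures mean, this short sketch converts them into forward-pass counts for a 512-token response. The TPF values are the ones in the table; the response length and helper name are arbitrary choices for illustration.

```python
def forward_passes(num_tokens: int, tpf: float) -> float:
    """Average forward passes needed to emit num_tokens at a given TPF."""
    return num_tokens / tpf


modes = {"AR": 1.0, "FastDiffuser": 2.6, "LinearSpec": 6.0, "QuadSpec": 6.4}
for name, tpf in modes.items():
    print(f"{name:<13} ~{forward_passes(512, tpf):.0f} passes for 512 tokens")
```

Keep in mind that a pass over a whole block does more work than a single-token AR pass, so TPF is an upper bound on wall-clock speedup. That is consistent with LinearSpec's ~6× TPF translating to roughly 4× tokens/second on B200.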


8. Under the Hood: Block-Wise Attention & KV Caching

Let's look at exactly how the block-wise attention mechanism enables KV caching in a DLM setting.

In standard AR decoding, the KV cache stores the key and value projections for every previously generated token. When generating token t, the model attends to cached KV from tokens 0...(t-1) and computes new Q, K, V for position t only — O(1) cache update per step.

In a standard bidirectional DLM, this is impossible: every token attends to every other token, and token values change with each refinement step, so each step must recompute K and V for all n positions and rerun the full O(n²) attention, with no caching benefit.

Block-wise causal attention resolves this with a two-level hierarchy:

Sequence: [Block 0 | Block 1 | Block 2 | ... | Block N]

For a token in Block i:
  - Attends to ALL tokens in blocks 0...(i-1)  → cached KV (never recomputed)
  - Attends to ALL tokens within Block i        → bidirectional, recomputed each step
  - CANNOT attend to tokens in blocks (i+1)+   → causal constraint maintained


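For a 32-token block size and a 2048-token sequence, the block being refined covers only 32 of the 2048 positions by the time the final block is reached, so about 98.4% of the K/V entries it attends to come from cache. Earlier blocks have less history, so the cached share grows from 0% for the first block toward that figure.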
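The cache share is simple to compute per block. This is a back-of-the-envelope sketch that counts K/V entries only (it says nothing about FLOPs), with a hypothetical helper name.

```python
def cached_fraction(block_index: int, block_size: int) -> float:
    """Share of attended K/V entries served from cache while refining one block."""
    cached = block_index * block_size        # all committed earlier blocks
    attended = cached + block_size           # plus the block being refined
    return cached / attended


block_size, seq_len = 32, 2048
num_blocks = seq_len // block_size           # 64 blocks
for i in (0, 1, 31, num_blocks - 1):
    print(f"block {i:>2}: {cached_fraction(i, block_size):.1%} of K/V from cache")
```

The final block sits at 63/64 ≈ 98.4%, and the share passes 90% from the eleventh block onward, so most of a long generation runs with the bulk of its K/V already cached.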

Here's how to build the attention mask in PyTorch:

import torch

def build_block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """
    Build a block-wise causal attention mask.

    Within each block: full bidirectional attention (True)
    Across blocks: causal left-to-right attention (True only for past blocks)
    Future blocks: masked out (False → -inf in softmax)

    Returns a boolean mask of shape [seq_len, seq_len],
    where True = can attend, False = masked.
    """
    # Block index of every position. Integer division also covers a trailing
    # partial block when seq_len is not a multiple of block_size.
    block_ids = torch.arange(seq_len) // block_size

    # Query i may attend to key j when j's block is not in the future:
    # same block → bidirectional, earlier block → causal across blocks.
    mask = block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

    return mask


# Example: 4 blocks of 4 tokens each = 16 token sequence
mask = build_block_causal_mask(seq_len=16, block_size=4)
print(mask.int())

# Output (each row = query token, each col = key token):
# Block 0 rows: [1111 | 0000 | 0000 | 0000]
# Block 1 rows: [1111 | 1111 | 0000 | 0000]
# Block 2 rows: [1111 | 1111 | 1111 | 0000]
# Block 3 rows: [1111 | 1111 | 1111 | 1111]


The resulting mask has fully-connected 4×4 diagonal blocks (bidirectional within blocks) with a lower-triangular structure across block boundaries (causal across blocks). It's the AR causal mask, coarsened to block granularity — which is precisely why pretrained AR weight distributions are preserved.


9. Getting Started: Running with SGLang

SGLang is the recommended serving framework for Nemotron-Labs Diffusion, with integration via PR #25803 (merging into main imminently). Here's a complete working example.

9.1 Installation

# Install SGLang with DLM support
pip install "sglang[all]>=0.4.5" --extra-index-url https://flashinfer.ai/whl/cu124/torch2.5/

# If the PR hasn't merged to main yet, install from the DLM branch directly:
# git clone https://github.com/sgl-project/sglang.git
# cd sglang && git fetch origin pull/25803/head:dlm-support
# git checkout dlm-support && pip install -e ".[all]"

# Pull the model weights
pip install huggingface-hub
huggingface-cli download nvidia/Nemotron-Labs-Diffusion-8B \
  --local-dir ./models/Nemotron-Labs-Diffusion-8B


9.2 Serving: Launch the SGLang Server

# Launch one mode at a time: all three commands below bind port 30000.
# Mode 1: Autoregressive (standard baseline)
python -m sglang.launch_server \
  --model-path ./models/Nemotron-Labs-Diffusion-8B \
  --port 30000 --tp 1 --dtype bfloat16 \
  --algorithm ar_mode

# Mode 2 — Diffusion (FastDiffuser): highest raw throughput
python -m sglang.launch_server \
  --model-path ./models/Nemotron-Labs-Diffusion-8B \
  --port 30000 --tp 1 --dtype bfloat16 \
  --algorithm diffusion \
  --block-size 32 \
  --confidence-threshold 0.9

# Mode 3 — Self-Speculation (LinearSpec): lossless 6x speedup
python -m sglang.launch_server \
  --model-path ./models/Nemotron-Labs-Diffusion-8B \
  --port 30000 --tp 1 --dtype bfloat16 \
  --algorithm linear_spec \
  --draft-block-size 32


9.3 Inference: Python Client (OpenAI-Compatible API)

import openai
import time

# SGLang exposes an OpenAI-compatible API endpoint
client = openai.OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"  # SGLang doesn't require auth by default
)

PROMPT = """You are an expert in distributed systems.
Explain the CAP theorem and its practical implications for a microservices
architecture. Be specific with concrete trade-off examples."""

def benchmark_mode(label: str, mode_hint: str = ""):
    """Run a generation and measure wall-clock tokens/second."""
    start = time.perf_counter()

    response = client.chat.completions.create(
        model="nvidia/Nemotron-Labs-Diffusion-8B",
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=512,
        temperature=0,        # T=0 → LinearSpec is lossless vs AR
        extra_body={
            "mode": mode_hint  # "ar", "diffusion", or "linear_spec"
        } if mode_hint else {}
    )

    elapsed = time.perf_counter() - start
    tokens  = response.usage.completion_tokens
    tps     = tokens / elapsed

    print(f"\n{'='*60}")
    print(f"Mode        : {label}")
    print(f"Output      : {response.choices[0].message.content[:200]}...")
    print(f"Tokens      : {tokens}")
    print(f"Time (s)    : {elapsed:.2f}")
    print(f"Throughput  : {tps:.1f} tok/s")
    print(f"{'='*60}")
    return tps

# Compare all three modes
ar_tps   = benchmark_mode("Autoregressive",           mode_hint="ar")
diff_tps = benchmark_mode("Diffusion (FastDiffuser)", mode_hint="diffusion")
spec_tps = benchmark_mode("Self-Spec (LinearSpec)",   mode_hint="linear_spec")

print(f"\n📊 Speedup Summary:")
print(f"  Diffusion vs AR   : {diff_tps/ar_tps:.2f}×")
print(f"  LinearSpec vs AR  : {spec_tps/ar_tps:.2f}×")


9.4 Quick Start via HuggingFace Transformers (AR Mode)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nvidia/Nemotron-Labs-Diffusion-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user",   "content": "Explain masked diffusion in 3 sentences."}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=256,
        do_sample=False,
        use_cache=True
    )

response = tokenizer.decode(
    output_ids[0][input_ids.shape[-1]:],
    skip_special_tokens=True
)
print(response)


Note: The transformers path gives AR mode only. For diffusion and self-speculation modes, the SGLang integration is required as it implements the custom decoding loop.


10. What This Means for Production LLM Infrastructure

Latency vs Throughput Trade-off, Revisited

The classic LLM serving dilemma is that throughput optimizations (larger batch sizes, continuous batching) increase latency, and latency optimizations (small batches, low KV cache pressure) hurt throughput. Self-speculation in DLMs partially decouples this: at batch size 1, LinearSpec delivers roughly 4× more tokens per second than AR on the same hardware (about 6× more tokens per forward pass). This is the scenario where AR models are most inefficient, and where DLMs provide the biggest relative gain.

Cost Implications

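A 4× throughput improvement at batch size 1 means a latency-bound, low-batch workload could serve the same number of users with roughly 1/4 the GPU compute, or equivalently serve about 4× more users from the same GPU fleet. The saving is likely to shrink as batch size grows, because large batches already amortize weight loads across requests. At B200/H100 rental pricing of around $4–8/hour, that's a meaningful cost reduction for any team running a production LLM API.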
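Here is the same argument as arithmetic. The sketch assumes one fully utilized single-stream GPU, takes the ~865 tokens/second LinearSpec figure quoted earlier, and derives the AR baseline from the "roughly 4×" claim. Real fleets batch requests, so treat the dollar figures as illustrative rather than as a pricing model.

```python
def cost_per_million_tokens(tokens_per_s: float, gpu_dollars_per_hour: float) -> float:
    """Dollar cost of generating one million tokens on one GPU at a steady rate."""
    tokens_per_hour = tokens_per_s * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


spec_tps = 865.0             # LinearSpec on B200, as quoted in this article
ar_tps = spec_tps / 4        # derived from "roughly 4x the AR baseline"

for price in (4.0, 8.0):     # the $4-8/hour range mentioned above
    ar_cost = cost_per_million_tokens(ar_tps, price)
    spec_cost = cost_per_million_tokens(spec_tps, price)
    print(f"${price:.0f}/h: AR ${ar_cost:.2f}/M tokens, LinearSpec ${spec_cost:.2f}/M tokens")
```

The ratio between the two columns is fixed at 4× by construction; what the calculation adds is a sense of the absolute cost per million tokens at single-stream rates.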

Fill-in-the-Middle and Code Editing

DLMs have a natural advantage for fill-in-the-middle (FIM) tasks. AR models handle FIM awkwardly, requiring special training and prompt formatting to look "backwards" at the suffix. A DLM generating a block bidirectionally can natively condition on both prefix and suffix context within the block — making Nemotron-Labs Diffusion well-suited for code editing agents and inline completions.

Inference Budget Control

In diffusion mode, you can control the number of denoising steps as a runtime knob. Fewer steps = faster but potentially lower quality. More steps = slower but higher quality. This gives you a continuous quality-speed trade-off at inference time without retraining — something AR models simply can't offer. A production system could dynamically reduce diffusion steps during traffic spikes and increase them during low-load periods.
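A load-aware step budget could look something like the sketch below. The function, the load signal, and the 4-to-16 step range are all hypothetical; how (or whether) a serving stack exposes a per-request step limit is something to confirm against its documentation.

```python
def pick_max_steps(load: float, min_steps: int = 4, max_steps: int = 16) -> int:
    """Map a load factor in [0, 1] to a denoising-step budget.

    Higher load means fewer refinement steps per block (faster, possibly
    lower quality); lower load means more steps.
    """
    load = min(1.0, max(0.0, load))          # clamp out-of-range readings
    return round(max_steps - load * (max_steps - min_steps))


for load in (0.0, 0.5, 1.0):
    print(f"load={load:.1f} -> up to {pick_max_steps(load)} steps per block")
```

A linear mapping is the simplest possible policy; in practice you would likely tune the curve against measured quality at each step count.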

When to Stick with AR

For long-context tasks (100K+ tokens) where the KV cache dominates memory, the efficiency story is less clear-cut. For streaming output where users see tokens as they're generated, block-wise generation may feel less smooth without careful rendering logic. And for tasks requiring strict constrained decoding (grammar-constrained generation, beam search), the diffusion loop needs further tooling work.


11. Conclusion & The Road Ahead

Diffusion Language Models have been a promising idea for years, perennially held back by a cluster of practical barriers: accuracy gaps, training instability, and the loss of KV caching. NVIDIA's Efficient-DLM work and Nemotron-Labs Diffusion have systematically addressed each of these barriers with concrete, principled solutions — block-wise causal attention, position-dependent masking, and joint AR+diffusion training objectives.

The result is a model family that is simultaneously:

  • A first-class AR model (backward compatible, lossless in LinearSpec mode at temperature 0)
  • An inference engine that emits 2.6–6.4× more tokens per forward pass (wall-clock gains depend on mode and hardware)
  • A better fill-in-the-middle model by architectural design
  • A tunable quality-speed dial at deployment time, with no retraining needed

With 24K+ downloads in the first 24 hours and SGLang integration landing imminently, this is one of the most practically significant open-source releases in the LLM inference space in 2026.

The next frontier: applying the same AR-to-DLM conversion recipe to frontier-scale models (70B+), exploring multimodal DLMs beyond the 8B VLM preview, and building out constrained decoding, streaming token rendering, and fine-tuning tooling for the DLM objective.

If you're building LLM-powered applications and care about inference cost and latency, it's time to start experimenting with Nemotron-Labs Diffusion. The autoregressive loop had a good run — but the next chapter of language model inference looks decidedly more parallel.




Written on May 24, 2026 — based on the HuggingFace blog post and arXiv:2512.14067 (Efficient-DLM). Performance numbers reflect published benchmarks; verify against your specific hardware and workload.