Diffusion Language Models: How NVIDIA's Nemotron-Labs DLM Is Killing Token-by-Token Generation
Manoranjan R · 2026-05-25 · via DEV Community

Published May 25, 2026 · 18 min read


Table of Contents

  1. The Token-by-Token Tax — Why We Need Something Better
  2. Why Autoregressive Generation Is Fundamentally Memory-Bound
  3. What Is a Diffusion Language Model?
  4. The Efficient-DLM Training Trick: Converting AR Models Into DLMs
  5. Inside Nemotron-Labs Diffusion: Three Inference Modes
  6. Deploying DLMs with SGLang — A Practical Guide
  7. Fill-in-the-Middle and Code Infill: Why DLMs Win at Revision Tasks
  8. Constraint Decay in Coding Agents — And How DLMs Can Help
  9. Limitations and Open Questions
  10. Conclusion — The Autoregressive Era Has Competition

1. The Token-by-Token Tax

What if your LLM could write an entire paragraph in a single forward pass — then revise it before handing anything back to you?

That's not a speculative future. That's what NVIDIA shipped on May 23, 2026.

For the last five years, every major language model — GPT-4, Claude, Llama, Gemini, Qwen — has generated text the same way: one token at a time, left to right, forever forward, never looking back. This autoregressive (AR) paradigm has been extraordinarily powerful. It's also been a performance ceiling that the entire industry has been quietly bumping into.

The compute story is brutal if you look at it squarely: generating a 2,000-token response requires 2,000 sequential forward passes through your multi-billion-parameter model. Every single pass re-loads all those weights from memory. Your H100 — a machine with 3.35 TB/s of memory bandwidth and 989 TFLOPS of FP16 compute — spends the vast majority of its time in memory operations, not computation. You're paying for a race car and spending most of your time in the pit lane.

NVIDIA's Nemotron-Labs Diffusion is the first production-grade diffusion language model family to directly challenge this assumption at scale. Released publicly on Hugging Face on May 23, 2026, it comes in 3B, 8B, and 14B parameter variants, reaches up to 6.4× the throughput of an autoregressive baseline (in its QuadSpec self-speculation mode on a single NVIDIA B200), and, critically, does it without sacrificing output quality. The 8B variant actually outperforms Qwen3-8B by 1.2% on average across benchmarks.

This article goes deep on how diffusion language models actually work under the hood, how Nemotron-Labs Diffusion was built on the Efficient-DLM training framework, what the three inference modes mean for your production architecture, and how to get your hands on it today.

Figure: Autoregressive vs. diffusion language model architecture comparison (sequential token generation vs. parallel block denoising)


2. Why Autoregressive Generation Is Fundamentally Memory-Bound

To appreciate what diffusion language models solve, you first need to understand exactly why autoregressive generation is slow — and why making your GPU faster doesn't fully fix it.

The KV Cache and the One-Token-at-a-Time Law

In an autoregressive transformer, generating token t+1 requires attending over all previous tokens 0..t. The standard optimization here is the KV cache: instead of recomputing the Key and Value projections for all prior tokens on every step, you cache them and only compute new K/V for the freshly generated token.

This is the right optimization — it reduces per-step compute from O(N²) to O(N). But it doesn't change the fundamental structure: you still do one forward pass, commit one token, and repeat.
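To make the caching idea concrete, here is a minimal single-head attention sketch in NumPy. The weights, sizes, and names (`KVCache`, `attend_last_full`) are made up for illustration and are not taken from any real model; the sketch only shows that the cached path reproduces the full recomputation for the newest token while projecting just one new key and value per step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
# Hypothetical random projection weights for a single attention head
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()


def attend_last_full(hidden: np.ndarray) -> np.ndarray:
    """Recompute K and V for every token, then attend from the last token."""
    q = hidden[-1] @ W_q
    K = hidden @ W_k
    V = hidden @ W_v
    return softmax(K @ q / np.sqrt(d_model)) @ V


class KVCache:
    """Stores one key row and one value row per already-processed token."""

    def __init__(self) -> None:
        self.keys: list[np.ndarray] = []
        self.values: list[np.ndarray] = []

    def attend_last(self, new_hidden: np.ndarray) -> np.ndarray:
        # Only the newest token is projected; older K/V rows are reused
        self.keys.append(new_hidden @ W_k)
        self.values.append(new_hidden @ W_v)
        q = new_hidden @ W_q
        K = np.stack(self.keys)
        V = np.stack(self.values)
        return softmax(K @ q / np.sqrt(d_model)) @ V


hidden_states = rng.standard_normal((5, d_model))  # 5 tokens of toy activations
cache = KVCache()
for t in range(len(hidden_states)):
    cached_out = cache.attend_last(hidden_states[t])
    full_out = attend_last_full(hidden_states[: t + 1])
    assert np.allclose(cached_out, full_out)
print("cached and full attention agree at every step")
```

Note that the loop still runs once per token: the cache removes redundant projection work, not the sequential dependency.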

The time to generate N tokens is therefore:

total_latency = N × (weight_load_time + compute_time + sampling_time)


For large models (>7B parameters), weight_load_time dominates, especially at batch size 1. An 8B parameter model has roughly 16GB of weights in FP16. At H100 memory bandwidth of 3.35 TB/s, the theoretical minimum to load all weights once is ~4.8ms. At 2,000 tokens, that's a floor of ~9.6 seconds just in memory I/O — before any actual computation.
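The arithmetic behind that floor is easy to check. The helper below is a hypothetical name introduced here; it uses only the figures quoted above (8B parameters, 2 bytes per parameter, 3.35 TB/s) and ignores KV-cache reads, activations, and compute.

```python
def weight_load_floor_seconds(
    n_params: float, bytes_per_param: int, bandwidth_bytes_per_s: float, n_tokens: int
) -> tuple[float, float]:
    """Return (seconds per forward pass, seconds for n_tokens sequential passes)."""
    per_pass = n_params * bytes_per_param / bandwidth_bytes_per_s
    return per_pass, per_pass * n_tokens


per_pass, total = weight_load_floor_seconds(
    n_params=8e9, bytes_per_param=2, bandwidth_bytes_per_s=3.35e12, n_tokens=2000
)
print(f"per pass: {per_pass * 1e3:.1f} ms, 2,000 tokens: {total:.1f} s")
# per pass: 4.8 ms, 2,000 tokens: 9.6 s
```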

The Roofline Analysis

The roofline model is the cleanest way to visualize this. Every GPU has two performance limits:

  • Compute-bound roof: Determined by peak FP16/BF16 TFLOPS
  • Memory-bound roof: Determined by peak memory bandwidth × arithmetic intensity

For a 7B model forward pass generating a single token, the arithmetic intensity (FLOPs per byte accessed) is approximately:

Arithmetic Intensity = (2 × N_params × batch_size) / (2 × N_params × bytes_per_param)
                     = batch_size / bytes_per_param
                     ≈ 1 / 2  (batch_size=1, BF16=2 bytes)
                     = 0.5 FLOP/byte


The H100's ridge point (the crossover between memory-bound and compute-bound) is roughly 295 FLOP/byte (989 TFLOPS ÷ 3.35 TB/s). At 0.5 FLOP/byte, you reach less than 0.2% of peak compute. You are almost entirely memory-bound, leaving nearly all of the GPU's arithmetic capacity idle.

Why Batching Helps — But Has Limits

Batching is the standard answer: run more requests concurrently to increase arithmetic intensity. At batch size 128:

Arithmetic Intensity ≈ 128 / 2 = 64 FLOP/byte


Better — but still 5× below the ridge point. And in production latency-sensitive scenarios (chat, copilots, real-time agents), you often can't batch aggressively. Individual users don't want to wait for 127 other requests to fill a batch.
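The roofline numbers in this section can be reproduced in a few lines. This is a sketch that reuses the article's simplified intensity estimate (batch_size / bytes_per_param) and the H100 figures quoted earlier; the function names are illustrative, not part of any profiling tool.

```python
PEAK_FLOPS = 989e12        # H100 FP16 figure quoted earlier in the article
PEAK_BANDWIDTH = 3.35e12   # bytes per second


def arithmetic_intensity(batch_size: int, bytes_per_param: int = 2) -> float:
    """The article's simplified estimate: batch_size / bytes_per_param."""
    return batch_size / bytes_per_param


def attainable_flops(intensity: float) -> float:
    """Roofline: performance is capped by whichever roof is lower."""
    return min(PEAK_FLOPS, PEAK_BANDWIDTH * intensity)


ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH
for batch in (1, 128):
    ai = arithmetic_intensity(batch)
    share = attainable_flops(ai) / PEAK_FLOPS
    print(f"batch={batch:>3}: {ai:5.1f} FLOP/byte, {share:.2%} of peak compute")
print(f"ridge point: {ridge_point:.0f} FLOP/byte")
```

Below the ridge point the attainable rate grows linearly with intensity, which is why every doubling of the batch doubles useful throughput until the compute roof is reached.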

This is the performance trap that diffusion language models are designed to escape.


3. What Is a Diffusion Language Model?

Diffusion models were first popularized for image generation (DDPM, Stable Diffusion, DALL-E 2). The core idea: instead of generating output autoregressively, start from pure noise and iteratively denoise toward a clean sample. DLMs apply this same iterative refinement paradigm to sequences of discrete tokens.

From Gaussian Noise to Masked Token Diffusion

Image diffusion operates in continuous space: add Gaussian noise gradually, train a neural network to reverse that process. Text tokens are discrete — you can't add Gaussian noise to the word "transformer." The solution is absorbing diffusion (also called masked diffusion): rather than adding Gaussian noise, tokens are progressively masked (replaced with a special [MASK] token), and the model learns to unmask them. This is distinct from image diffusion — there is no noise distribution over real values, only a binary clean-or-masked state for each token position.

The forward (corruption) process replaces each token with [MASK] with a probability set by a noise schedule: the later the timestep, the larger the masked fraction. At maximum corruption, the entire sequence is masked. The reverse (generation) process starts from a fully masked sequence and iteratively fills it in.

Formally, the model parameterises the conditional distribution:

p_θ(x₀ | xₜ)


Where x₀ is the clean token sequence, xₜ is the corrupted sequence at timestep t, and the model predicts the original clean tokens for all masked positions simultaneously — that "simultaneously" is the entire game.
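As a rough illustration of the forward (corruption) process, the sketch below masks each token independently with a single probability. That uniform per-token probability and the `corrupt` helper are assumptions made for the example; the article does not specify the actual noise schedule.

```python
import random

MASK = "[MASK]"


def corrupt(tokens: list[str], mask_prob: float, rng: random.Random) -> list[str]:
    """Forward process: independently replace each token with [MASK] with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


rng = random.Random(0)
clean = "the model predicts every masked position at once".split()
for mask_prob in (0.0, 0.5, 1.0):
    print(mask_prob, corrupt(clean, mask_prob, rng))
```

At `mask_prob=0.0` the sequence is untouched and at `1.0` it is fully masked; training samples values in between so the model learns to recover x₀ from any corruption level.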

Block-Level Generation in Practice

Modern DLMs like Nemotron-Labs Diffusion operate at the block level rather than the full-sequence level. The model generates output in 32-token blocks. For each block:

  1. Initialize the block with [MASK] tokens
  2. Run a forward pass predicting all 32 token positions simultaneously
  3. Accept high-confidence predictions (above threshold τ) and re-mask low-confidence ones
  4. Repeat the denoising loop for K steps until the block stabilises
  5. Commit the block and advance to the next

This has a critical GPU efficiency property: instead of one matrix multiply per token, you do one matrix multiply per block of 32 tokens. The effective arithmetic intensity scales with block size:

Arithmetic Intensity (DLM) ≈ (batch_size × block_size) / bytes_per_param
                            = (1 × 32) / 2
                            = 16 FLOP/byte


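That's 32× the arithmetic intensity of single-token AR generation, and 32× closer to the compute-bound regime. Each block still takes several denoising passes, so the realised speedup is smaller than 32×; what counts is how many tokens each forward pass ends up committing. This is where the throughput gains come from.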

Figure: DLM block denoising process, five stages from fully masked tokens to fully generated output

The Bidirectional Advantage

There's a second, often underappreciated, advantage of DLMs: they generate bidirectionally within each block. An AR model has a hard causal constraint: the token at position i can only attend to positions ≤ i (itself and earlier tokens). This means:

  • The model cannot revise a previously committed token, ever
  • Fill-in-the-middle (generating tokens given left and right context) requires special training hacks
  • Once an error propagates, it compounds forward indefinitely

DLMs have no such constraint within their generation window. Attention within each block is fully bidirectional — every token attends to all other positions in the block simultaneously. This means the model makes all 32 decisions with full awareness of its other decisions, and updates any of them in subsequent denoising steps. Errors are corrected before they're committed.


4. The Efficient-DLM Training Trick: Converting AR Models Into DLMs

The biggest historical barrier to DLMs wasn't architectural — it was training efficiency. DLMs trained from scratch lagged significantly behind AR models of equivalent parameter count. The research breakthrough that unblocked Nemotron-Labs Diffusion is the Efficient-DLM framework (Fu et al., arXiv:2512.14067).

The key insight: don't train from scratch. Convert a pretrained AR model into a DLM.

Why AR-to-DLM Conversion Works

Pretrained AR models have already learned rich linguistic representations: grammar, facts, reasoning patterns, code structure. The AR training objective shapes the weight space in a way that's compatible with DLM objectives, because both ultimately require modelling p(token | context). The weight distributions learned under AR training are close to what a DLM needs.

The conversion proceeds through continued pretraining on the pretrained AR checkpoint, adding a diffusion training objective without discarding AR capability.

Block-Wise Attention: The Critical Design Choice

Efficient-DLM found that the attention pattern used during conversion is the most consequential design decision. Two options exist:

  1. Fully bidirectional attention — every token attends to every other token (like BERT)
  2. Block-wise attention — causal across blocks, bidirectional within blocks

Fully bidirectional attention diverges from the causal AR weight distribution, causing significant accuracy regression. Block-wise attention maintains causal structure at the block boundary level while enabling the intra-block bidirectionality needed for parallel generation — and it remains KV-cache-compatible.

# Simplified illustration of block-wise attention mask construction
import torch

def build_block_wise_attention_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """
    Build a block-wise attention mask where:
    - Tokens within the same block attend bidirectionally to each other
    - Blocks attend causally to prior blocks (no future block leakage)

    Returns a boolean mask of shape (seq_len, seq_len)
    where True = attend, False = mask out
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)

    # Step by block_size so a trailing partial block is still covered
    for block_start in range(0, seq_len, block_size):
        block_end = min(block_start + block_size, seq_len)

        # Rows in this block attend to every column up to the end of the block:
        # all earlier blocks (causal across blocks) plus the block itself (bidirectional)
        mask[block_start:block_end, :block_end] = True

    return mask


# Visualise an 8-token sequence with block_size=4
mask = build_block_wise_attention_mask(seq_len=8, block_size=4)
print("Block-wise attention mask (1 = attend, 0 = masked):")
print(mask.int())
# Block 0: bidirectional within itself only
# Block 1: bidirectional within itself + sees all of Block 0


Position-Dependent Token Masking

The second key innovation in Efficient-DLM is position-dependent token masking during training. In naive masked diffusion, tokens are masked uniformly at random. But at inference time, the masking pattern is left-to-right — earlier positions are already committed while later positions remain masked.

Efficient-DLM fixes this train-test mismatch by assigning higher masking probabilities to later positions during training, closely matching the left-to-right test-time pattern:

import numpy as np

def position_dependent_mask_rate(
    seq_len: int,
    base_rate: float = 0.15,
    position_scale: float = 2.0
) -> np.ndarray:
    """
    Compute per-position masking probability that increases toward end of sequence.
    Earlier tokens (already committed) get lower mask rates.
    Later tokens (not yet generated) get higher mask rates.

    Args:
        seq_len: Total sequence length
        base_rate: Minimum masking probability (applied to first token)
        position_scale: Multiplier — how much masking rate grows toward end

    Returns:
        Array of shape (seq_len,) with per-position mask probabilities
    """
    positions = np.linspace(0, 1, seq_len)
    mask_rates = base_rate * (1 + position_scale * positions)
    return np.clip(mask_rates, 0.0, 1.0)


rates = position_dependent_mask_rate(seq_len=32)
print(f"Token  0 mask rate: {rates[0]:.3f}")   # 0.150 — low, likely already committed
print(f"Token 15 mask rate: {rates[15]:.3f}")  # 0.295, mid-block
print(f"Token 31 mask rate: {rates[31]:.3f}")  # 0.450 — high, likely still masked


The Joint Training Objective

The final piece is the loss function, which combines AR and diffusion objectives:

L_total = α × L_AR + (1 - α) × L_diffusion


  • L_AR: Standard cross-entropy next-token prediction (causal, left-to-right)
  • L_diffusion: Masked token prediction loss (bidirectional within blocks, over all masked positions simultaneously)
  • α: Typically 0.2–0.3, balancing AR capability preservation against diffusion capability acquisition
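A minimal NumPy sketch of how the two terms combine is shown below. It assumes both logit arrays are already aligned with the clean targets (the usual next-token shift has been applied to the AR logits) and that the diffusion loss is averaged over masked positions only; `joint_loss` and its arguments are illustrative names, not the Efficient-DLM code.

```python
import numpy as np


def cross_entropy(logits: np.ndarray, targets: np.ndarray, weights: np.ndarray) -> float:
    """Mean negative log-likelihood over the positions where weights == 1."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float((nll * weights).sum() / max(weights.sum(), 1))


def joint_loss(
    ar_logits: np.ndarray,         # (seq_len, vocab) from the causal pass
    diffusion_logits: np.ndarray,  # (seq_len, vocab) from the masked pass
    targets: np.ndarray,           # (seq_len,) clean token ids
    is_masked: np.ndarray,         # (seq_len,) 1 where the input was [MASK]
    alpha: float = 0.25,
) -> float:
    """L_total = alpha * L_AR + (1 - alpha) * L_diffusion."""
    l_ar = cross_entropy(ar_logits, targets, np.ones(len(targets)))
    l_diffusion = cross_entropy(diffusion_logits, targets, is_masked)
    return alpha * l_ar + (1 - alpha) * l_diffusion


rng = np.random.default_rng(0)
seq_len, vocab = 6, 10
targets = rng.integers(0, vocab, size=seq_len)
is_masked = np.array([0, 0, 1, 1, 0, 1])
loss = joint_loss(
    rng.standard_normal((seq_len, vocab)),
    rng.standard_normal((seq_len, vocab)),
    targets,
    is_masked,
)
print(f"joint loss: {loss:.3f}")
```

With uniform logits both terms equal log(vocab), which makes a convenient sanity check when wiring up a real training loop.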

NVIDIA trained Nemotron-Labs Diffusion on 1.3 trillion tokens from the NVIDIA Nemotron Pretraining datasets, followed by supervised fine-tuning on 45 billion tokens from the NVIDIA Nemotron Post-training datasets using this framework.


5. Inside Nemotron-Labs Diffusion: Three Inference Modes

The most developer-friendly aspect of Nemotron-Labs Diffusion is that it ships as a single model checkpoint with three distinct inference personalities, selectable at deployment time with zero application-level changes.

Figure: Throughput comparison on NVIDIA B200: autoregressive vs. FastDiffuser (2.6×) vs. self-speculation QuadSpec (6.4×)

Mode 1: Autoregressive (Baseline Compatibility)

Plain AR mode generates tokens left-to-right exactly like any standard causal LM. This mode exists for correctness validation, backward compatibility with existing AR pipelines, and A/B testing. The joint training objective is what keeps this path usable: because L_AR stays in the loss, the converted model can still be decoded like a native AR model.

Mode 2: FastDiffuser (Pure Diffusion Decoding)

FastDiffuser is the headline throughput mode, operating in a block-by-block denoising loop:

  1. Initialise a 32-token block with [MASK] tokens
  2. Run a full forward pass predicting all 32 positions simultaneously
  3. Accept tokens above confidence threshold τ (typically 0.9); re-mask the rest
  4. Repeat the denoising loop for up to K steps (typically 10–20)
  5. Force-commit remaining masked tokens after K steps; advance to the next block

Throughput: 2.6× higher tokens-per-forward-pass (TPF) vs. AR baseline.

The confidence threshold τ is a quality-speed lever: higher τ means more re-masking iterations (better quality, slower); lower τ means fewer iterations needed (faster, slightly lower quality).
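The loop above can be sketched with a stand-in for the model. Everything here is illustrative: `toy_predict` is a made-up scorer whose confidence rises as more positions are committed, and `denoise_block` is a hypothetical helper, not the FastDiffuser implementation.

```python
from typing import Callable, Optional

Prediction = tuple[str, float]  # (token, confidence)
Block = list[Optional[str]]     # None marks a still-masked position


def denoise_block(
    predict: Callable[[Block], list[Prediction]],
    block_size: int,
    threshold: float = 0.9,
    max_steps: int = 10,
) -> tuple[list[str], int]:
    """Confidence-thresholded denoising of one block; returns (tokens, passes used)."""
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    block: Block = [None] * block_size
    predictions: list[Prediction] = []
    for passes in range(1, max_steps + 1):
        predictions = predict(block)  # one forward pass scores every position
        for i, (token, confidence) in enumerate(predictions):
            # Keep confident predictions; the rest stay masked for the next pass
            if block[i] is None and confidence >= threshold:
                block[i] = token
        if all(tok is not None for tok in block):
            break
    # Step budget exhausted: force-commit the latest guess for anything still masked
    tokens = [tok if tok is not None else predictions[i][0] for i, tok in enumerate(block)]
    return tokens, passes


TARGET = "def add ( a , b ) :".split()


def toy_predict(block: Block) -> list[Prediction]:
    """Stand-in for the model: confidence grows as more positions are committed."""
    committed = sum(tok is not None for tok in block)
    return [
        (TARGET[i], min(1.0, 0.45 + 0.1 * i + 0.2 * committed))
        for i in range(len(block))
    ]


tokens, passes = denoise_block(toy_predict, block_size=len(TARGET))
print(" ".join(tokens), f"({passes} forward passes for {len(tokens)} tokens)")
```

In this toy run the block settles in two passes instead of eight sequential ones; raising `threshold` or lowering `max_steps` moves the same quality-speed lever described above.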

Mode 3: Self-Speculation (LinearSpec / QuadSpec)

Self-speculation is where Nemotron-Labs Diffusion gets truly elegant. The same model plays two roles simultaneously: drafter (diffusion head) and verifier (AR head).

The draft-verify loop:

  1. Use the diffusion head to draft a candidate block of tokens bidirectionally (fast)
  2. Run the AR head causally over the proposed block to compute acceptance probabilities
  3. Accept the longest valid prefix where AR agrees with the diffusion draft
  4. Advance by the number of accepted tokens — often the entire block — in one round trip

This follows the same draft-and-verify scheme as speculative decoding: at temperature=0, every accepted token is one that pure AR decoding would have produced, so the output is lossless relative to the AR mode. Quality is preserved by construction.

Two variants:

  • LinearSpec: ~4× AR baseline throughput
  • QuadSpec: ~6.4× AR baseline throughput — ~865 tokens/second on a single NVIDIA B200
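The acceptance step of the draft-verify loop can be sketched for the greedy (temperature=0) case. This follows standard speculative-decoding practice, including keeping the verifier's own token at the first disagreement; that detail and the `verify_draft` helper are assumptions for illustration, not a description of LinearSpec or QuadSpec internals.

```python
def verify_draft(draft: list[str], verifier_choice: list[str]) -> list[str]:
    """
    Greedy (temperature=0) verification of a drafted block.

    verifier_choice[i] is the token the AR pass would emit at position i given
    the prompt plus draft[:i]. It must have len(draft) + 1 entries: the last one
    is the extra token that follows a fully accepted draft.
    """
    accepted: list[str] = []
    for drafted, wanted in zip(draft, verifier_choice):
        if drafted != wanted:
            # First disagreement: keep the verifier's token, drop the rest of the draft
            accepted.append(wanted)
            return accepted
        accepted.append(drafted)
    # Whole draft accepted: the verifier pass also yields one more token for free
    accepted.append(verifier_choice[len(draft)])
    return accepted


draft = ["return", "a", "+", "b"]
print(verify_draft(draft, ["return", "a", "+", "b", "\n"]))  # full acceptance
print(verify_draft(draft, ["return", "a", "-", "b", "\n"]))  # mismatch at position 2
```

Because every emitted token is one the AR pass itself would have chosen, the result matches plain greedy AR decoding; the draft only determines how many tokens are confirmed per verifier pass.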

6. Deploying DLMs with SGLang — A Practical Guide

Nemotron-Labs Diffusion is integrated into SGLang via PR #25803. Here is a complete deployment walkthrough.

Installation

# Install SGLang with DLM support
pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/

# Or install from the DLM PR branch directly
git clone https://github.com/sgl-project/sglang.git
cd sglang
git fetch origin pull/25803/head:diffusion-support
git checkout diffusion-support
pip install -e ".[all]"

# Download the model
huggingface-cli download nvidia/Nemotron-Labs-Diffusion-8B-Instruct \
    --local-dir ./nemotron-diffusion-8b


Launching the Server

# FastDiffuser mode — maximum raw throughput
python -m sglang.launch_server \
    --model-path ./nemotron-diffusion-8b \
    --port 30000 \
    --tp 1 \
    --trust-remote-code \
    --generation-mode fastdiffuser \
    --diffusion-block-size 32 \
    --diffusion-steps 10 \
    --confidence-threshold 0.9

# Self-Speculation mode — lossless + maximum throughput
python -m sglang.launch_server \
    --model-path ./nemotron-diffusion-8b \
    --port 30000 \
    --tp 1 \
    --trust-remote-code \
    --generation-mode self-speculation \
    --spec-variant quadspec \
    --draft-block-size 32


Running Inference Across All Three Modes

import requests
import time
from typing import Literal

SGLANG_BASE_URL = "http://localhost:30000"

def generate(
    prompt: str,
    mode: Literal["ar", "fastdiffuser", "self-speculation"] = "self-speculation",
    max_tokens: int = 512,
    temperature: float = 0.0,
) -> dict:
    """
    Send a generation request to SGLang serving Nemotron-Labs Diffusion.

    Args:
        prompt: Input text prompt
        mode: Inference mode — 'ar', 'fastdiffuser', or 'self-speculation'
        max_tokens: Maximum tokens to generate
        temperature: 0.0 = greedy/deterministic (self-speculation is lossless at T=0)

    Returns:
        Dict with text output, token count, elapsed time, and throughput
    """
    payload = {
        "model": "nemotron-diffusion-8b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
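        # Custom fields go at the top level of the JSON body when posting with requests;
        # "extra_body" is an OpenAI SDK argument, not a key the server reads
        "generation_mode": mode,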
    }

    start = time.perf_counter()
    response = requests.post(
        f"{SGLANG_BASE_URL}/v1/chat/completions",
        json=payload,
        timeout=120,
    )
    elapsed = time.perf_counter() - start
    response.raise_for_status()

    data = response.json()
    output_text = data["choices"][0]["message"]["content"]
    tokens_generated = data["usage"]["completion_tokens"]

    return {
        "text": output_text,
        "tokens_generated": tokens_generated,
        "time_seconds": round(elapsed, 3),
        "throughput_tps": round(tokens_generated / elapsed, 1),
    }


# Benchmark all three modes on the same prompt
PROMPT = """
Write a Python function implementing binary search on a sorted list.
Include type annotations, docstring, and edge case handling.
"""

for mode in ["ar", "fastdiffuser", "self-speculation"]:
    result = generate(PROMPT, mode=mode, max_tokens=300)
    print(f"\n[{mode.upper()}]")
    print(f"  Throughput : {result['throughput_tps']} tok/s")
    print(f"  Time       : {result['time_seconds']}s")
    print(f"  Preview    : {result['text'][:100]}...")


Expected Results on NVIDIA B200 (8B, batch size 1)

[AR]
  Throughput : 136.3 tok/s
  Time       : 2.187s

[FASTDIFFUSER]
  Throughput : 353.7 tok/s   # 2.6× faster, same quality
  Time       : 0.843s

[SELF-SPECULATION]
  Throughput : 866.3 tok/s   # 6.4× faster, lossless at T=0
  Time       : 0.344s



7. Fill-in-the-Middle and Code Infill: Why DLMs Win at Revision Tasks

One of the most practically valuable capabilities of DLMs is fill-in-the-middle (FIM) generation — producing text conditioned on both a prefix (left context) and a suffix (right context). This is critical for code completion tools, where you want to fill in a function body given the signature above and the call site below.

Why AR Models Struggle with FIM

AR models can be trained on FIM using sentinel tokens (the PSM format used in CodeLlama: <PRE> {prefix} <SUF> {suffix} <MID>). This works, and the model does see both prefix and suffix in its context, but it still generates the middle strictly left-to-right: it cannot go back and revise earlier middle tokens so that the ending joins the suffix cleanly. Accuracy degrades as the infill gap widens.
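For reference, the PSM arrangement amounts to a simple reordering of the prompt. The helper below is a hypothetical illustration that uses the sentinel strings exactly as written above; real tokenizers map them to dedicated special-token IDs, and exact spacing varies by model.

```python
def build_psm_prompt(prefix: str, suffix: str) -> str:
    """Arrange prefix and suffix in prefix-suffix-middle (PSM) order."""
    return f"<PRE> {prefix} <SUF> {suffix} <MID>"


prompt = build_psm_prompt(
    prefix="def area(r):\n    return ",
    suffix="\n\nprint(area(2.0))",
)
print(prompt)
```

The AR model then continues after `<MID>`, emitting the missing middle one token at a time.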

DLMs Do FIM Natively

A DLM with block-wise attention genuinely attends to the committed prefix blocks and a right-side suffix context simultaneously. The infill block is initialised as masked and denoised with full bilateral awareness:

import requests


def dlm_fill_in_middle(
    prefix: str,
    suffix: str,
    fill_length: int = 64,
    sglang_url: str = "http://localhost:30000",
) -> str:
    """
    Use Nemotron-Labs Diffusion for native fill-in-the-middle code generation.
    The DLM attends to both prefix and suffix simultaneously during denoising —
    structurally impossible with pure autoregressive generation.

    Args:
        prefix: Code appearing before the section to fill
        suffix: Code appearing after the section to fill
        fill_length: Approximate number of tokens to generate
        sglang_url: SGLang server endpoint

    Returns:
        Generated middle section
    """
    payload = {
        "model": "nemotron-diffusion-8b",
        "prompt": prefix,
        "suffix": suffix,           # Native bilateral conditioning
        "max_tokens": fill_length,
        "temperature": 0.0,
        # Custom fields go at the top level of the JSON body when posting with requests
        "generation_mode": "self-speculation",
        "fim_mode": True,           # Enables bidirectional suffix conditioning
    }
    response = requests.post(f"{sglang_url}/v1/completions", json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["text"]


# Example: Generate a function body given signature above and assertions below
prefix = '''
def merge_sorted_arrays(arr1: list[int], arr2: list[int]) -> list[int]:
    """Merge two sorted arrays into a single sorted array in O(n+m) time."""
'''

suffix = '''

# Downstream call site — the DLM sees this during generation
result = merge_sorted_arrays([1, 3, 5, 7], [2, 4, 6, 8])
assert result == [1, 2, 3, 4, 5, 6, 7, 8]
assert merge_sorted_arrays([], [1, 2]) == [1, 2]
assert merge_sorted_arrays([5], []) == [5]
'''

body = dlm_fill_in_middle(prefix, suffix)
print(f"Generated body:\n{body}")


The DLM sees both the function signature (left) and the assertion-based call site (right) simultaneously, generating a body that satisfies both sides. Pure AR generation cannot do this without special reordering tricks.


8. Constraint Decay in Coding Agents — And How DLMs Can Help

A concurrent paper trending heavily on Hacker News this week (190+ upvotes) makes a directly relevant finding: coding agents lose ~30 percentage points in assertion pass rate as structural constraints accumulate in multi-file backend generation tasks (Papotti et al., arXiv:2605.06445).

What Constraint Decay Looks Like

The study evaluated LLM-based coding agents across 80 greenfield backend generation tasks and 20 feature implementation tasks spanning 8 web frameworks (Flask, FastAPI, Django, etc.). Key findings:

  • At baseline (minimal constraints): agents achieve ~75–80% assertion pass rate
  • As constraints accumulate (ORM schemas, architectural patterns, specific database relationships): performance drops to ~45–50% for strong models, approaching zero for weaker ones
  • Root causes are primarily data-layer defects: incorrect ORM query composition and runtime violations

This is characteristic of AR generation: early structural decisions compound into downstream violations the model cannot revise. A wrong ORM relationship in migration file 1 propagates broken query patterns through files 3, 7, and 12.

DLMs as a Structural Fix

DLMs offer two architectural responses:

1. Constraint-Aware Iterative Refinement

Because DLMs revise tokens within the block before committing them, a constraint-checking oracle inserted between denoising steps can re-mask tokens that would violate structural requirements:

from dataclasses import dataclass
from typing import Callable

import requests

@dataclass
class ConstraintOracle:
    """
    Evaluates structural constraints on partially-generated code blocks.
    Returns a per-token validity mask to guide re-denoising.
    """
    check_fn: Callable[[str], list[bool]]

    def evaluate(self, context: str, proposed_tokens: list[str]) -> list[bool]:
        """
        Returns True for each token position that should be committed,
        False for positions that violate constraints and should be re-masked.
        """
        return self.check_fn(context + "".join(proposed_tokens))


def constrained_dlm_generation(
    prompt: str,
    constraint_oracle: ConstraintOracle,
    max_tokens: int = 512,
    max_refinement_rounds: int = 5,
    sglang_url: str = "http://localhost:30000",
) -> str:
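    """
    DLM-based code generation with structural constraint enforcement.

    Intended flow: between denoising steps, the constraint oracle flags
    violating tokens for re-masking and re-denoising, targeting the
    constraint decay failure mode documented in arXiv:2605.06445.

    NOTE: as written, this sketch only requests refinement rounds from the
    server. `constraint_oracle` is accepted but not used in the request.
    """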
    payload = {
        "model": "nemotron-diffusion-8b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.1,
        "extra_body": {
            "generation_mode": "fastdiffuser",
            "constraint_refinement_rounds": max_refinement_rounds,
        },
    }
    response = requests.post(
        f"{sglang_url}/v1/chat/completions", json=payload, timeout=120
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

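To make the re-masking step concrete, here is a toy sketch of what happens to a block once a validity mask is available. Everything in it is an assumption made for illustration: the `<mask>` token string, the `remask` helper, and the banned-identifier check are hypothetical, and the check works on a token list rather than the concatenated string that `ConstraintOracle.check_fn` receives above.

```python
MASK = "<mask>"  # Hypothetical mask token; the real token depends on the model


def remask(tokens: list[str], valid: list[bool]) -> list[str]:
    """Keep tokens marked valid and replace the rest with the mask token."""
    if len(tokens) != len(valid):
        raise ValueError("mask length must match token count")
    return [tok if ok else MASK for tok, ok in zip(tokens, valid)]


def banned_name_check(banned: set[str]):
    """Build a toy per-token check that rejects any banned identifier."""
    def check(tokens: list[str]) -> list[bool]:
        return [tok.strip() not in banned for tok in tokens]
    return check


# A proposed block that uses an ORM call the project has ruled out
tokens = ["user", ".", "orders", ".", "filter_by", "(", "id", "=", "1", ")"]
check = banned_name_check({"filter_by"})

print(remask(tokens, check(tokens)))
# ['user', '.', 'orders', '.', '<mask>', '(', 'id', '=', '1', ')']
```

Only the masked position would be handed back to the denoiser, so the surrounding tokens that already satisfy the constraint are kept.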

2. FIM-Based Surgical Patching

When an agent detects a structural violation in a previously generated file, it can use DLM FIM to surgically patch the violation — filling in a corrected section with both the surrounding correct code as prefix and the downstream dependent code as suffix. AR models must regenerate everything from the violation point forward; DLMs can fill in the fix with bilateral awareness of all surrounding context.
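The patching flow described above starts by cutting the file around the violating lines, so that everything before them becomes the FIM prefix and everything after becomes the suffix. Below is a minimal sketch of that split, assuming the agent already knows the 1-indexed line range of the violation. The helper names and the sample ORM snippet are hypothetical, and the replacement line stands in for what the model would generate.

```python
def split_for_patch(source: str, start_line: int, end_line: int) -> tuple[str, str, str]:
    """Split source into (prefix, removed, suffix) around 1-indexed inclusive lines."""
    lines = source.splitlines(keepends=True)
    if not (1 <= start_line <= end_line <= len(lines)):
        raise ValueError("line range out of bounds")
    prefix = "".join(lines[: start_line - 1])
    removed = "".join(lines[start_line - 1 : end_line])
    suffix = "".join(lines[end_line:])
    return prefix, removed, suffix


def apply_patch(prefix: str, replacement: str, suffix: str) -> str:
    """Reassemble the file with the regenerated middle section."""
    return prefix + replacement + suffix


# Sample file content held as text; line 2 references a wrong model name
source = (
    "class Order(Base):\n"
    "    user = relationship('Users')\n"
    "    total = Column(Integer)\n"
)

prefix, removed, suffix = split_for_patch(source, 2, 2)
# In a real agent, the replacement would come from a FIM call on (prefix, suffix)
patched = apply_patch(prefix, "    user = relationship('User')\n", suffix)
print(patched)
```

Keeping the split lossless (prefix + removed + suffix equals the original) means a rejected patch can always be rolled back to the untouched file.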


9. Limitations and Open Questions

Diffusion language models are not a drop-in replacement for AR in all scenarios. Here's what developers should weigh carefully:

Latency at Small Batch Size and Short Outputs

The throughput gains of DLMs are most pronounced at high batch sizes or long output sequences. For very short outputs (<64 tokens) at batch size 1, the block-denoising overhead can narrow the advantage:

DLM_latency ≈ (N_tokens / block_size) × K_denoising_steps × forward_pass_time
             (where forward_pass_time includes full model weight load + compute)


For latency-critical single-query applications where first-token latency matters more than throughput (interactive chat interfaces), the AR or hybrid self-speculation modes may still be preferable. Self-speculation in particular delivers strong latency and throughput benefits simultaneously.
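The latency estimate above can be turned into a quick back-of-the-envelope comparison. The sketch below rests on several assumptions: the step count (8) and forward-pass time (20 ms) are made-up illustrative numbers, AR decoding is modelled as one forward pass per token, both modes share the same forward-pass time, and the block count is rounded up because a partial block still costs a full set of denoising steps.

```python
import math


def dlm_latency(n_tokens: int, block_size: int, k_steps: int, forward_pass_s: float) -> float:
    """Estimate DLM latency: blocks x denoising steps x forward-pass time."""
    n_blocks = math.ceil(n_tokens / block_size)  # A partial block costs a full one
    return n_blocks * k_steps * forward_pass_s


def ar_latency(n_tokens: int, forward_pass_s: float) -> float:
    """Estimate AR latency as one forward pass per generated token."""
    return n_tokens * forward_pass_s


BLOCK_SIZE = 32      # Block size quoted in the article
K_STEPS = 8          # Assumed denoising steps per block (illustrative only)
FORWARD_PASS = 0.02  # Assumed 20 ms per forward pass (illustrative only)

for n_tokens in (33, 48, 512):
    dlm = dlm_latency(n_tokens, BLOCK_SIZE, K_STEPS, FORWARD_PASS)
    ar = ar_latency(n_tokens, FORWARD_PASS)
    print(f"{n_tokens:>4} tokens: DLM {dlm:.2f}s, AR {ar:.2f}s, speedup {ar / dlm:.2f}x")
```

Under these assumed numbers the estimated speedup grows from about 2x at 33 tokens to 4x at 512 tokens, because a short output that spills into a second block pays for denoising steps it barely uses.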

Ecosystem Maturity

The AR ecosystem has years of accumulated tooling: GPTQ/AWQ/GGUF quantization, vLLM/TGI/Ollama serving, LoRA/QLoRA fine-tuning, and a vast practitioner community. DLM tooling is still catching up:

  • SGLang DLM support is currently in a PR (not yet merged to main as of May 2026)
  • INT8/INT4 quantization for the diffusion decoding path is under active development
  • Fine-tuning DLMs with LoRA requires modification of the standard recipes to account for the joint AR+diffusion objective

Training Complexity

While AR-to-DLM conversion lowers the barrier significantly, producing a high-quality DLM still requires a strong pretrained AR base, careful α calibration in the joint loss, tuning the position-dependent masking schedule, and large-scale continued pretraining data. Fine-tuning smaller DLMs on domain-specific data is possible but requires validated recipes still being developed by the community.

The Open Question: Revision vs. Causal Reasoning

The theoretical advantage of DLM revision is well-established for structured generation. What's less settled is whether the revision benefit materialises for tasks requiring long causal chains of reasoning, such as multi-step math proofs or complex algorithm design. Some preliminary results suggest AR models retain an edge in these scenarios because each token is conditioned on a prefix that is already fully committed, so later reasoning steps never build on tokens that might still change. This remains an active research question.


10. Conclusion — The Autoregressive Era Has Competition

Autoregressive generation is not dead. The AR paradigm will remain dominant for years, supported by massive tooling investment, a vast pretrained model zoo, and simpler training dynamics. But for the first time, there is a credible, production-grade alternative that doesn't sacrifice quality for speed.

Diffusion language models solve a real infrastructure problem: the memory-bound performance ceiling of token-by-token generation. By operating on 32-token blocks with iterative denoising, DLMs reframe text generation as a compute-bound problem — leveraging what modern GPUs are actually built for instead of fighting their memory constraints. Nemotron-Labs Diffusion makes this concrete: 6.4× throughput gains, 1.2% better accuracy than Qwen3-8B, and a three-mode inference API that requires zero application changes to adopt.

For developers building latency-sensitive applications, high-throughput batch inference pipelines, FIM-based code completion systems, or AI coding agents that struggle with structural constraint adherence — DLMs deserve serious evaluation today.

Your three next steps:

  1. 🤗 Explore the models: Visit the Nemotron-Labs Diffusion collection on HuggingFace — the 8B Instruct variant is the best starting point
  2. Benchmark it yourself: Deploy the self-speculation mode via the SGLang guide above and measure it against your current AR serving stack
  3. 📄 Read the research: Efficient-DLM (arXiv:2512.14067) is the essential paper — the community is moving fast and tooling gaps will close quickly

The age of generating text one token at a time — because that was the only way we knew how — is ending. The question isn't if diffusion language models will become standard LLM serving infrastructure, but when the ecosystem catches up to the benchmark numbers.

Based on what landed this week, that timeline just moved significantly closer.


Found this useful? Share it with your team and star the Nemotron-Labs Diffusion repo. Questions or corrections? Drop them in the comments below — I read every one.


References