Differential Transformer V2

Li Dong · 2026-01-20 · via Hugging Face - Blog

Tianzhu Ye, Li Dong, Yutao Sun, Furu Wei

Github Link

Notion Link (for better readability)

Code

We compare DIFF V2 with DIFF V1 below:

(For simplicity, we omit the batch dimension and assume that both the input and output of the following flash_attn_func are three-dimensional tensors (tokens, heads, head dimension). Heads belonging to the same GQA group are arranged contiguously in the output.)

Note that DIFF V2 subtracts two heads that are in the same GQA group, which means they share the same key and value. This is crucial to performance. See the Design Ablations section and the GitHub code.

def DiffAttnV1(
    layer_index, q1, q2, k1, k2, v,
    lam_q1, lam_k1, lam_q2, lam_k2,
):
    """
    q1, q2: (N, h/2, d)
    k1, k2: (N, h_kv/2, d)
    v:      (N, h_kv/2, 2d)
    lam_*:  (d,)
    """
    # Two attention calls that read the same (double-width) value tensor
    attn1 = flash_attn_func(q1, k1, v)
    attn2 = flash_attn_func(q2, k2, v)

    # Globally shared lambda, re-parameterized with a layer-dependent init
    lam_init = 0.8 - 0.6 * exp(-0.3 * layer_index)
    lam1 = exp(sum(lam_q1 * lam_k1))
    lam2 = exp(sum(lam_q2 * lam_k2))
    lam = lam1 - lam2 + lam_init
    attn = attn1 - lam * attn2

    # Per-head RMSNorm followed by a fixed rescale
    attn = rmsnorm(attn)
    attn = attn * (1 - lam_init)
    return attn

def DiffAttnV2(q, k, v, lam):
    """
    q:   (N, 2h, d)
    k:   (N, h_kv, d)
    v:   (N, h_kv, d)
    lam: (N, h, 1)
    """
    attn = flash_attn_func(q, k, v)
    attn1, attn2 = attn[:, 0::2], attn[:, 1::2]

    lam_val = sigmoid(lam)
    attn = attn1 - lam_val * attn2
    return attn

Full code at: unilm/Diff-Transformer/Diff-Transformer-V2 at master · microsoft/unilm

In the script, h represents the number of query heads, h_kv represents the number of key-value heads, and d is the head dimension. The $\lambda$ in DIFF V2 is projected from $X$ for each token and each head.

DIFF V2 doubles the number of query heads while keeping the number of key-value heads unchanged. The extra dimension is reduced back to h*d after the differential operation, so the $W_O$ projection remains the same as in the baseline Transformer.
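To make the head pairing and the reduction from 2h heads back to h heads concrete, here is a minimal pure-Python sketch. It is not the released implementation: `naive_attention` is a hypothetical stand-in for flash_attn_func, causal masking is an assumption, tensors are nested lists, and `lam` is stored as (N, h) rather than (N, h, 1).

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def naive_attention(q, k, v):
    """Causal GQA attention (assumed stand-in for flash_attn_func).

    q: [N][Hq][d]; k, v: [N][Hkv][d]; returns [N][Hq][d].
    """
    n, num_q, d = len(q), len(q[0]), len(q[0][0])
    group = num_q // len(k[0])  # query heads per KV head
    out = []
    for i in range(n):
        row = []
        for head in range(num_q):
            kv = head // group  # contiguous query heads share one KV head
            scores = [
                sum(a * b for a, b in zip(q[i][head], k[j][kv])) / math.sqrt(d)
                for j in range(i + 1)
            ]
            w = softmax(scores)
            row.append([sum(w[j] * v[j][kv][c] for j in range(i + 1))
                        for c in range(d)])
        out.append(row)
    return out

def diff_attn_v2(q, k, v, lam):
    """q: [N][2h][d]; k, v: [N][h_kv][d]; lam: [N][h]; returns [N][h][d]."""
    attn = naive_attention(q, k, v)
    out = []
    for i, heads in enumerate(attn):
        row = []
        for p in range(len(heads) // 2):
            a1, a2 = heads[2 * p], heads[2 * p + 1]  # attn[:, 0::2], attn[:, 1::2]
            s = sigmoid(lam[i][p])
            row.append([x - s * y for x, y in zip(a1, a2)])
        out.append(row)
    return out

def rand_tensor(n, heads, d):
    return [[[random.gauss(0.0, 1.0) for _ in range(d)]
             for _ in range(heads)] for _ in range(n)]

random.seed(0)
N, h, h_kv, d = 3, 2, 2, 4          # illustrative sizes only
q = rand_tensor(N, 2 * h, d)
k = rand_tensor(N, h_kv, d)
v = rand_tensor(N, h_kv, d)
lam = [[0.0] * h for _ in range(N)]  # sigmoid(0) = 0.5
out = diff_attn_v2(q, k, v, lam)
print(len(out), len(out[0]), len(out[0][0]))  # 3 2 4
```

The output has h heads of width d, so a downstream output projection sees the same h*d input width as in the baseline.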

Motivation

Faster Decoding & No Custom Kernels

DIFF V2 introduces additional query heads compared to the baseline Transformer, but does not increase the number of key-value (KV) heads. Since LLM decoding is typically memory-bound, this design allows DIFF V2 to achieve decoding speeds on par with a standard Transformer. In addition, since the head dimension is aligned between query, key, and value, DIFF V2 does not need custom attention kernels. In contrast, DIFF V1 can be slower during decoding because the value cache must be loaded twice, and it requires a custom attention kernel. DIFF V2 can also increase the arithmetic intensity of the attention module during decoding.
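The decoding argument can be illustrated with a rough counting model. This is a simplified sketch under stated assumptions, not a benchmark: it counts only KV-cache elements read per decoded token in one layer, ignores query/output traffic and kernel details, and uses hypothetical sizes. The function names are illustrative.

```python
def kv_cache_elements_read(variant, n_ctx, h_kv, d):
    """KV-cache elements read for one decoded token in one layer."""
    k_elems = n_ctx * h_kv * d  # total keys (V1 splits them into k1 and k2)
    v_elems = n_ctx * h_kv * d  # total values (V1: h_kv/2 heads of width 2d)
    if variant in ("baseline", "diff_v2"):
        return k_elems + v_elems
    if variant == "diff_v1":
        # Two attention calls each read the full value cache.
        return k_elems + 2 * v_elems
    raise ValueError(f"unknown variant: {variant}")

def attention_flops(num_q_heads, n_ctx, d):
    """Approximate FLOPs for QK^T and AV with one query token per head."""
    return num_q_heads * 4 * n_ctx * d

# Hypothetical configuration, for illustration only.
n_ctx, h, h_kv, d = 8192, 32, 8, 128

base_reads = kv_cache_elements_read("baseline", n_ctx, h_kv, d)
v1_reads = kv_cache_elements_read("diff_v1", n_ctx, h_kv, d)
v2_reads = kv_cache_elements_read("diff_v2", n_ctx, h_kv, d)

base_intensity = attention_flops(h, n_ctx, d) / base_reads
v2_intensity = attention_flops(2 * h, n_ctx, d) / v2_reads

print(v1_reads / base_reads)          # 1.5
print(v2_reads / base_reads)          # 1.0
print(v2_intensity / base_intensity)  # 2.0
```

Under this model DIFF V2 reads the same amount of cache as the baseline while doing twice the attention arithmetic, which is where the higher arithmetic intensity comes from.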

During pretraining, when using cutting-edge FlashAttention kernels on H-series and B-series GPUs, the throughput reduction introduced by DIFF V2 is negligible. For long-sequence prefilling, we recommend combining DIFF V2 with techniques such as YOCO (also used in Gemma 3n), which already reduces prefilling complexity to linear time with respect to sequence length.

An alternative perspective is to compare DIFF V2 with a Transformer that has the same query dimension 2h*d. Under this comparison, both models have the same attention kernel speed, while DIFF V2 has fewer parameters and FLOPs in the output projection.

Softmax Magnitude Constraint

In the standard Scaled Dot-Product Attention (SDPA), let $Q, K, V \in \mathbb{R}^{n \times d}$ be the queries, keys, and values. The context vector $C$ is defined as:

$$C = \text{Softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V = AV$$

where $A \in \mathbb{R}^{n \times n}$ is the attention weight matrix. Let's focus on a single row of $C$, denoted as $\mathbf{c}_i$, which is a weighted sum of value vectors $\mathbf{v}_j$:

$$\mathbf{c}_i = \sum_{j=1}^{n} a_{ij} \mathbf{v}_j$$

We define the Context RMS (Root Mean Square) to represent the magnitude of this output:

$$\text{RMS}(\mathbf{c}_i) = \sqrt{\frac{1}{d} \|\mathbf{c}_i\|^2}$$

The weights $a_{ij}$ are non-negative and sum to 1 ($\sum_{j=1}^{n} a_{ij} = 1$). Assume the value vectors $\mathbf{v}_j$ are uncorrelated and have an RMS of 1. The cross terms then vanish, so $\text{RMS}(\mathbf{c}_i)^2 \approx \sum_{j=1}^{n} a_{ij}^2$, and the Context RMS is strictly bounded in the range $[\frac{1}{\sqrt{n}}, 1)$ no matter how the attention distribution changes:

If the attention is focused entirely on one token, the Context RMS is $1$.
If the attention is spread equally across all tokens ($a_{ij} = \frac{1}{n}$), the Context RMS drops to $\frac{1}{\sqrt{n}}$.
In other situations, the Context RMS is between $\frac{1}{\sqrt{n}}$ and $1$.
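The three cases above can be checked numerically. The sketch below is illustrative only: it assumes exactly orthogonal value vectors $\mathbf{v}_j = \sqrt{n}\,\mathbf{e}_j$ (scaled standard basis vectors, each with RMS 1) as an idealized version of the "uncorrelated" assumption, and the logits are arbitrary.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rms(vec):
    return math.sqrt(sum(x * x for x in vec) / len(vec))

def context_rms(weights):
    """RMS of c = sum_j a_j v_j with v_j = sqrt(n) * e_j (assumed values)."""
    n = len(weights)
    context = [a * math.sqrt(n) for a in weights]  # coordinate j holds a_j * sqrt(n)
    return rms(context)

n = 1024
uniform = context_rms([1.0 / n] * n)                   # lower bound 1/sqrt(n)
peaked = context_rms(softmax([50.0] + [0.0] * (n - 1)))  # nearly one-hot
mixed = context_rms(softmax([math.sin(j) for j in range(n)]))

print(uniform, 1.0 / math.sqrt(n))
print(peaked)
print(mixed)
```

With these idealized values the Context RMS reduces to $\sqrt{\sum_j a_{ij}^2}$, so the uniform case sits at the lower bound, the nearly one-hot case approaches 1, and any other distribution falls in between.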

In DIFF V1 we add a per-head RMSNorm on context vectors:

$$\mathbf{\hat{c}}_i = \frac{\mathbf{c}_i}{\text{RMS}(\mathbf{c}_i)}$$

If the model learns a uniform attention distribution in a head, the Context RMS is approximately $1/\sqrt{n}$. To normalize this back to $1$, RMSNorm must multiply the vector by a scale of $\sqrt{n}$. For $n = 8192$, $\sqrt{n} \approx 90.5$, so the RMSNorm layer magnifies the output by roughly 90x, which is on the order of 100x. In large-scale pretraining, we find this leads to massive gradients and numerical instability.

A typical phenomenon is that when DIFF V1 is pre-trained at a large learning rate, the gradient norm experiences a larger increase compared to Transformer in the later stages, along with higher variance. In DIFF V2, after removing the per-head RMSNorm, the gradient norm scale becomes comparable to that of Transformer, and the gradient norm spike is reduced (will be discussed further below).

We adopted the per-head RMSNorm design in DIFF V1 primarily because of the doubled value head dimension and the globally shared $\lambda$ across all tokens. Given the modifications made to these two aspects in DIFF V2, we found that removing RMSNorm is now safe.

Beyond Softmax Constraint & Elimination of Attention Sinks

We demonstrate DIFF V2 can overcome the constraint of Softmax mentioned above. It can also help eliminate attention sinks.

In original Softmax attention:

$$a_{ij} = \text{Softmax}(z_{ij}) = \frac{\exp(z_{ij})}{\sum_{k=1}^{n} \exp(z_{ik})} \\ \mathbf{c}_i = \sum_{j=1}^{n} a_{ij} \mathbf{v}_j = \sum_{j=1}^{n} \text{Softmax}(z_{ij}) \mathbf{v}_j \\ \text{RMS}(\mathbf{c}_i) \in \left[\frac{1}{\sqrt{n}},1\right)$$

In DIFF V2 we introduce a projected $\lambda$ for each token and each head:

$$\mathbf{c}_i = \sum_{j=1}^{n} \left( \text{Softmax}(z_{ij}^\text{1}) - \text{sigmoid}(\lambda_i) \cdot \text{Softmax}(z_{ij}^\text{2}) \right) \mathbf{v}_j \\ \text{RMS}(\mathbf{c}_i) \in \left(0, \sqrt{2}\right)$$

The projected $\lambda_i$ helps to control the context RMS. We observe that lowering the lower bound of the context RMS to zero is particularly important. It can help eliminate attention sinks and improve training stability. The upper bound only needs to remain bounded.
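The two ends of the $(0, \sqrt{2})$ range can be reproduced with a small numerical sketch. It assumes orthogonal value vectors with RMS 1, in which case the context RMS equals $\sqrt{\sum_j w_j^2}$ for the combined weights $w_j$; the attention distributions and $\lambda$ values below are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def diff_context_rms(a1, a2, lam):
    """Context RMS of sum_j (a1_j - sigmoid(lam) * a2_j) v_j.

    Assumes orthogonal value vectors with RMS 1, so the result is the
    Euclidean norm of the combined weight vector.
    """
    s = sigmoid(lam)
    combined = [x - s * y for x, y in zip(a1, a2)]
    return math.sqrt(sum(w * w for w in combined))

uniform = [0.25, 0.25, 0.25, 0.25]
on_token_0 = [1.0, 0.0, 0.0, 0.0]
on_token_1 = [0.0, 1.0, 0.0, 0.0]

# Both heads attend identically and sigmoid(lam) is close to 1:
# the two maps cancel and the context RMS approaches 0.
near_zero = diff_context_rms(uniform, uniform, lam=10.0)

# The heads focus on different tokens and sigmoid(lam) is close to 1:
# the context RMS approaches sqrt(2).
near_sqrt2 = diff_context_rms(on_token_0, on_token_1, lam=10.0)

# sigmoid(lam) close to 0 recovers plain Softmax attention for head 1.
softmax_like = diff_context_rms(uniform, uniform, lam=-10.0)

print(near_zero, near_sqrt2, softmax_like)
```

The first case is the one that plain Softmax cannot express: a head can emit an output close to zero without parking its attention mass on a sink token.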

Note that our analysis here considers the RMS before the output projection $W_O$. Although the RMS can be recovered and adjusted after the output projection, the lack of freedom at Softmax still affects learning performance.

Other recent works alleviate this constraint as well:

In Attention Is Off By One:

$$a_{ij}^{\text{off}} = \frac{\exp(z_{ij})}{1 + \sum_{k=1}^{n} \exp(z_{ik})} \\ \ \\ \mathbf{c}_i = \sum_{j=1}^{n} a_{ij}^{\text{off}} \mathbf{v}_j = \frac{\sum_{k=1}^{n} \exp(z_{ik})}{1 + \sum_{k=1}^{n} \exp(z_{ik})} \sum_{j=1}^{n} \text{Softmax}(z_{ij}) \mathbf{v}_j \\ \ \\ \text{RMS}(\mathbf{c}_i) \in \left(0, 1\right)$$

In gpt-oss, a learnable scalar $s$ is introduced for each head:

$$a_{ij}^{\text{oss}} = \frac{\exp(z_{ij})}{\exp(s) + \sum_{k=1}^{n} \exp(z_{ik})} \\ \ \\ \mathbf{c}_i = \sum_{j=1}^{n} a_{ij}^{\text{oss}} \mathbf{v}_j = \frac{\sum_{k=1}^{n} \exp(z_{ik})}{\exp(s) + \sum_{k=1}^{n} \exp(z_{ik})} \sum_{j=1}^{n} \text{Softmax}(z_{ij}) \mathbf{v}_j \\ \ \\ \text{RMS}(\mathbf{c}_i) \in \left(0, 1\right)$$

In Gated Attention, a projected element-wise sigmoid gate is multiplied:

$$\mathbf{c}_i = \text{sigmoid} (\mathbf{g}_i) \odot \sum_{j=1}^{n} \text{Softmax}(z_{ij}) \mathbf{v}_j \\ \text{RMS}(\mathbf{c}_i) \in \left(0, 1\right)$$
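The first two variants differ from Softmax only by an extra term in the denominator, which lets the weights sum to less than 1. The sketch below is an illustration of the formulas above, not code from either project; `softmax_with_sink` and its `sink_logit` parameter are hypothetical names. Setting `sink_logit=0.0` gives the off-by-one form, and a learned value plays the role of $s$ in the gpt-oss form.

```python
import math

def softmax_with_sink(logits, sink_logit=0.0):
    """exp(z_j) / (exp(sink_logit) + sum_k exp(z_k)), computed stably."""
    m = max(max(logits), sink_logit)
    exps = [math.exp(z - m) for z in logits]
    denom = math.exp(sink_logit - m) + sum(exps)
    return [e / denom for e in exps]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

quiet_logits = [-8.0, -9.0, -10.0]   # no key matches the query well
strong_logits = [8.0, 1.0, 0.0]      # one key matches strongly

quiet_mass = sum(softmax_with_sink(quiet_logits))
strong_mass = sum(softmax_with_sink(strong_logits))
plain_mass = sum(softmax(quiet_logits))

print(quiet_mass)   # far below 1: the head can stay almost silent
print(strong_mass)  # close to 1: behaves like ordinary Softmax
print(plain_mass)   # 1 up to rounding: plain Softmax must spend all its mass
```

The total mass is exactly the scalar factor in front of the Softmax sum in the equations above, which is why the context RMS range opens up to $(0, 1)$.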

Experimental Observations

We conduct pretraining experiments on production-scale LLMs, including dense models and a 30A3 MoE, on trillions of tokens using large learning rates of 6e-4 to 1e-3.

The experiments are still running. What we have observed so far:

Notably lower language modeling loss compared to Transformer (a gap of 0.02 to 0.03 at 1T training tokens).
Reduced loss and gradient spikes during training, particularly under large learning rate settings where the Transformer baseline becomes unstable.
Reduced magnitude of activation outliers.

We expect to explore in later stages of training:

Learning efficiency in mid- and post-training.
Performance on downstream long-context benchmarks (alleviating context rot).

Discussions

Construction of Differential Operation

In theory, a standard Transformer with $2h$ attention heads can learn the differential operation by learning $W_O^{2i}=-W_O^{2i+1}, i=0,1,\ldots,h-1$, where $W_O^{i}$ denotes the output projection of head $i$, and heads $2i$ and $2i+1$ belong to the same GQA group.
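A tiny numerical check makes this equivalence concrete. The sketch uses made-up head outputs and a made-up projection block; it covers only the plain subtraction (a scaling factor fixed at 1), not the per-token sigmoid scaling that DIFF V2 adds.

```python
def vec_mat(vec, mat):
    """Row vector (length r) times matrix (r x c)."""
    return [sum(vec[r] * mat[r][c] for r in range(len(vec)))
            for c in range(len(mat[0]))]

def negate(mat):
    return [[-x for x in row] for row in mat]

# Outputs of two heads in the same GQA group (head dimension 2).
head_even = [1.0, 2.0]
head_odd = [0.5, -1.0]

# Output projection block of the even head (2 x 3), chosen arbitrarily.
w_even = [[1.0, 0.0, 2.0],
          [0.0, 3.0, -1.0]]
w_odd = negate(w_even)  # the learned constraint W_O^{2i+1} = -W_O^{2i}

# Standard Transformer: each head goes through its own W_O block.
standard = [a + b for a, b in zip(vec_mat(head_even, w_even),
                                  vec_mat(head_odd, w_odd))]

# Explicit construction: subtract first, then use a single W_O block.
difference = [a - b for a, b in zip(head_even, head_odd)]
explicit = vec_mat(difference, w_even)

print(standard)  # [0.5, 9.0, -2.0]
print(explicit)  # [0.5, 9.0, -2.0]
```

The explicit form needs only one of the two projection blocks, which is the source of the parameter saving discussed next.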

Assumption 1. In practice, such a solution is difficult to learn through optimization, as it requires two sets of parameters to converge to exact negatives of each other.

Assumption 2. The differential operation can be learned by the model, and the model chooses to learn it during training. Then explicitly constructing it before the output projection, as in DIFF V2, can save half of the $W_O$ parameters. The number of saved parameters is also non-trivial. Under the current GQA setting, the parameters in the attention module are dominated by $W_Q$ and $W_O$; therefore, approximately 25% of the attention-module parameters can be saved. The saved parameter budget can then be reallocated to other parts of the model.
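The 25% figure can be reproduced with a short parameter count. The sizes below are hypothetical, biases are ignored, and the $\lambda$ projection is assumed to be a single linear map from the hidden state to one scalar per head, which is a guess at its form rather than a description of the released code.

```python
def attention_params(d_model, q_heads, kv_heads, out_heads, d, lam_heads=0):
    """Weight counts for one attention module (no biases)."""
    return {
        "W_Q": d_model * q_heads * d,
        "W_K": d_model * kv_heads * d,
        "W_V": d_model * kv_heads * d,
        "W_O": out_heads * d * d_model,
        "lambda_proj": d_model * lam_heads,  # assumed form, see lead-in
    }

# Hypothetical configuration, for illustration only.
d_model, h, h_kv, d = 4096, 32, 8, 128

transformer_2h = attention_params(d_model, 2 * h, h_kv, 2 * h, d)
diff_v2 = attention_params(d_model, 2 * h, h_kv, h, d, lam_heads=h)
transformer_15h = attention_params(d_model, 3 * h // 2, h_kv, 3 * h // 2, d)

def qo(params):
    return params["W_Q"] + params["W_O"]

qo_saving = 1 - qo(diff_v2) / qo(transformer_2h)
total_saving = 1 - sum(diff_v2.values()) / sum(transformer_2h.values())

print(qo_saving)                          # 0.25
print(round(total_saving, 3))             # a little below 0.25 once K, V, lambda count
print(qo(diff_v2) == qo(transformer_15h))  # True
```

The last line shows that, counting $W_Q$ and $W_O$ only, DIFF V2 matches a Transformer with 1.5*h heads, which is the parameter-aligned comparison used in the design ablations.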

Even if DIFF V2, after reallocating parameters, does not achieve a lower loss than the baseline but merely matches it, the method is still worthwhile if it provides additional benefits such as improved training stability, better control of outliers, or higher training efficiency. This is analogous to GQA, which matches the loss of MHA while reducing KV-cache as an additional benefit. So the key question becomes empirical performance.

Design Ablations

1. Subtracting two heads that are not in the same GQA group, which means they do not share the same key and value.

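# Ablation 1
# ❌ Wrong Implementation of DIFF V2!
...
attn = flash_attn_func(q, k, v)
nh = attn.size(1)
# Pairs head i with head i + nh/2, which generally sit in different GQA groups
attn1, attn2 = attn[:, :nh//2], attn[:, nh//2:]
...

# DIFF V2
# ✅ Correct Implementation of DIFF V2
...
attn = flash_attn_func(q, k, v)
# Pairs adjacent heads 2i and 2i+1, which belong to the same GQA group
attn1, attn2 = attn[:, 0::2], attn[:, 1::2]
...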

In our large learning rate setting, ablation 1 exhibits obvious training instability (many more loss and gradient spikes) and higher loss compared to DIFF V2. The value should be shared between the two subtracted heads to construct the differential operation, as discussed in the DIFF V1 paper.
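The difference between the two slicing patterns is easiest to see by mapping each query head to its KV head. The sketch below assumes the layout stated earlier (heads of one GQA group are contiguous) and uses made-up head counts; the helper names are illustrative.

```python
def kv_head_of(query_head, num_q_heads, num_kv_heads):
    """KV head used by a query head when GQA groups are contiguous."""
    group_size = num_q_heads // num_kv_heads
    return query_head // group_size

def interleaved_pairs(num_q_heads):
    """Pairs formed by attn[:, 0::2] and attn[:, 1::2]."""
    return [(i, i + 1) for i in range(0, num_q_heads, 2)]

def split_half_pairs(num_q_heads):
    """Pairs formed by attn[:, :nh//2] and attn[:, nh//2:]."""
    half = num_q_heads // 2
    return [(i, i + half) for i in range(half)]

def share_kv(pairs, num_q_heads, num_kv_heads):
    return [kv_head_of(a, num_q_heads, num_kv_heads)
            == kv_head_of(b, num_q_heads, num_kv_heads)
            for a, b in pairs]

num_q_heads, num_kv_heads = 8, 2  # illustrative: two groups of four heads

correct = share_kv(interleaved_pairs(num_q_heads), num_q_heads, num_kv_heads)
ablation_1 = share_kv(split_half_pairs(num_q_heads), num_q_heads, num_kv_heads)

print(correct)     # [True, True, True, True]
print(ablation_1)  # [False, False, False, False]
```

With interleaved slicing every subtracted pair reads the same keys and values, while the split-half slicing pairs heads from different groups.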

2. Subtracting two attention maps without the $\lambda$ scaling factor, i.e., attn1 - attn2 instead of attn1 - lam_val * attn2. This results in an excessively small context RMS at initialization.
3. Directly using the projected $\lambda$ without applying the sigmoid operation. The context RMS is then unbounded from above.

Both ablation 2 and ablation 3 lead to higher language modeling loss than DIFF V2. Ablation 2 maintains training stability similar to DIFF V2, whereas ablation 3 is less stable (still more stable than ablation 1).

4. A Transformer with 1.5*h heads, which aligns its parameter count with DIFF V2.

Ablation 4 also has higher training loss compared to DIFF V2.

Miscellaneous

In DIFF, the outliers in qk logits can be smaller than those in the baseline. This was already analyzed in DIFF V1: DIFF can achieve attention sparsity comparable to the baseline while using smaller qk logits. We further propose that DIFF's differential mechanism, which cancels out small attention values, may help mitigate the attention rounding error issue discussed in this blog and paper.
DIFF V2 is compatible with sparse attention. In many existing sparse attention frameworks, query heads within the same GQA group are required to attend to the same key-value blocks in order to maximize speedup. A common strategy is to select key-value blocks based on the average attention logits across heads. For DIFF V2, the problem shifts to designing an effective block-selection strategy for a larger GQA group that contains pairs of differential heads. This may require handling the two types of differential heads separately during selection, or a simple average of attention logits may already be sufficient in practice. Conceptually, this does not introduce any fundamental difference compared to block sparse attention in standard Transformers.
