How to Build Memory-Efficient Transformers with xFormers Using Packed Sequences, GQA, ALiBi, SwiGLU, and Causal Attention
Sana Hassan · 2026-06-17 · via MarkTechPost

In this tutorial, we work with xFormers, a practical toolkit for building fast, memory-efficient Transformer models on GPUs. We begin by validating memory-efficient attention against a standard attention implementation, then compare their speed and memory consumption across different sequence lengths. We then examine causal masking, packed variable-length sequences, grouped-query attention, and custom ALiBi positional biases. Finally, we combine these techniques into a trainable GPT-style model that uses xFormers attention, SwiGLU feed-forward layers, and automatic mixed-precision training.

Setting Up xFormers and Validating Memory-Efficient Attention

import subprocess, sys
def _pip(*a): subprocess.run([sys.executable, "-m", "pip", "install", *a], check=False)
try:
   import xformers
except Exception:
   _pip("-q", "-U", "xformers")
import math, time
import torch, torch.nn as nn, torch.nn.functional as F
import xformers, xformers.ops as xops
from xformers.ops import fmha
ab = fmha.attn_bias
assert torch.cuda.is_available(), (
   "No GPU detected. In Colab: Runtime → Change runtime type → GPU, then re-run.")
device = "cuda"
torch.manual_seed(0)
print("torch    :", torch.__version__)
print("xformers :", xformers.__version__)
print("GPU      :", torch.cuda.get_device_name(0))
print("\n--- xformers.info (which kernels are built/available) ---")
try:
   subprocess.run([sys.executable, "-m", "xformers.info"], check=False)
except Exception as e:
   print("xformers.info unavailable:", e)
def cuda_time(fn, iters=20, warmup=5):
   for _ in range(warmup): fn()
   torch.cuda.synchronize()
   s, e = (torch.cuda.Event(enable_timing=True) for _ in range(2))
   s.record()
   for _ in range(iters): fn()
   e.record(); torch.cuda.synchronize()
   return s.elapsed_time(e) / iters
def peak_mem_mb(fn):
   torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats()
   fn(); torch.cuda.synchronize()
   return torch.cuda.max_memory_allocated() / 1e6
def vanilla_attention(q, k, v, causal=False):
   """Reference attention that MATERIALIZES the [B,H,M,M] score matrix.
      Inputs are xformers-layout [B, M, H, K]."""
   q, k, v = (t.transpose(1, 2).float() for t in (q, k, v))
   scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
   if causal:
       M = scores.shape[-1]
       m = torch.triu(torch.ones(M, M, device=q.device, dtype=torch.bool), 1)
       scores = scores.masked_fill(m, float("-inf"))
   out = scores.softmax(-1) @ v
   return out.transpose(1, 2)
print("\n" + "="*70 + "\n1. memory_efficient_attention basics + correctness\n" + "="*70)
B, M, H, K = 2, 512, 8, 64
q, k, v = (torch.randn(B, M, H, K, device=device, dtype=torch.float16) for _ in range(3))
out_xf  = xops.memory_efficient_attention(q, k, v)
out_ref = vanilla_attention(q, k, v).half()
print("output shape         :", tuple(out_xf.shape), "(layout B, M, H, K)")
print("max abs diff vs ref  : {:.2e}".format((out_xf - out_ref).abs().max().item()))
print("-> it's EXACT attention (fp16 rounding only), just computed without")
print("   ever storing the full MxM score matrix.")

We install and import xFormers, verify GPU availability, and inspect the attention kernels supported by the environment. We define helper functions for measuring CUDA execution time and peak memory consumption. We then compare memory-efficient attention with a standard fp32 reference and check that the maximum absolute difference stays at the level of fp16 rounding.
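To make the "exact attention without the full score matrix" claim concrete, here is a minimal pure-Python sketch of the online-softmax idea that memory-efficient attention algorithms generally rely on. It is an illustration under assumptions, not the xFormers kernel: the function names, the chunk size, and the toy inputs are all hypothetical, and it handles a single query row only.

```python
import math

def naive_attention_row(q, keys, values):
    """Attention for one query: every score is materialized before the softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]

def streaming_attention_row(q, keys, values, chunk=2):
    """Same result, but only `chunk` scores exist at any one time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk):
        ks = keys[start:start + chunk]
        vs = values[start:start + chunk]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in ks]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous running maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, vs):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

# Hypothetical toy data: one query, five keys/values.
q = [0.3, -1.2, 0.8, 0.5]
keys = [[0.1, 0.4, -0.7, 1.0], [1.5, -0.2, 0.3, 0.0], [-0.6, 0.9, 0.2, -1.1],
        [0.0, 0.5, 0.5, 0.5], [2.0, -1.0, 0.4, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0], [3.0, -3.0]]

full = naive_attention_row(q, keys, values)
streamed = streaming_attention_row(q, keys, values, chunk=2)
max_diff = max(abs(a - b) for a, b in zip(full, streamed))
print("max abs diff:", max_diff)
```

The two functions agree up to floating-point rounding, which is the same kind of agreement the tutorial checks between `memory_efficient_attention` and the reference implementation.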

Benchmarking Memory and Speed Against Naive Attention and Verifying Causal Masking

print("\n" + "="*70 + "\n2. Memory & speed vs naive attention (fwd+bwd)\n" + "="*70)
print(f"{'seqlen':>8} | {'naive MB':>10} | {'xformers MB':>12} | {'naive ms':>9} | {'xf ms':>7}")
print("-"*60)
for M in [512, 1024, 2048, 4096]:
   q, k, v = (torch.randn(2, M, 8, 64, device=device, dtype=torch.float16,
                          requires_grad=True) for _ in range(3))
   def run_xf():
       o = xops.memory_efficient_attention(q, k, v); o.sum().backward()
   def run_naive():
       o = vanilla_attention(q, k, v); o.sum().backward()
   try:
       nm = peak_mem_mb(run_naive); nt = cuda_time(run_naive, 8, 3)
   except RuntimeError:
       nm, nt = float("nan"), float("nan"); torch.cuda.empty_cache()
   xm = peak_mem_mb(run_xf); xt = cuda_time(run_xf, 8, 3)
   print(f"{M:>8} | {nm:>10.0f} | {xm:>12.0f} | {nt:>9.2f} | {xt:>7.2f}")
print("-> naive memory grows ~4x per doubling of M (it stores BxHxMxM);")
print("   xformers grows ~linearly and stays fast.")
print("\n" + "="*70 + "\n3. Causal attention via LowerTriangularMask\n" + "="*70)
B, M, H, K = 2, 256, 8, 64
q, k, v = (torch.randn(B, M, H, K, device=device, dtype=torch.float16) for _ in range(3))
out_causal = xops.memory_efficient_attention(q, k, v, attn_bias=ab.LowerTriangularMask())
ref_causal = vanilla_attention(q, k, v, causal=True).half()
print("causal max abs diff  : {:.2e}".format((out_causal - ref_causal).abs().max().item()))
print("-> the mask is implicit; no MxM boolean tensor is allocated.")

We benchmark naive attention and xFormers attention on forward and backward passes at sequence lengths from 512 to 4,096. Comparing execution time and peak GPU memory shows how xFormers avoids the quadratic memory growth caused by the materialized score matrix. The reference implementation upcasts its inputs to fp32, so the naive figures include that extra cost. We also apply an implicit lower-triangular mask and verify causal attention against the reference implementation.
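The quadratic growth can be estimated before running anything. The sketch below computes the size of one materialized score tensor for the benchmark's shapes (B=2, H=8), assuming 4 bytes per element because `vanilla_attention` calls `.float()`. The helper names are hypothetical, and measured peak memory will be higher than these figures, since the softmax output and the gradients are stored as well.

```python
def score_matrix_mb(batch, heads, seqlen, bytes_per_element=4):
    """Size of one materialized [B, H, M, M] score tensor, in MB (1e6 bytes)."""
    return batch * heads * seqlen * seqlen * bytes_per_element / 1e6

def qkv_tensor_mb(batch, seqlen, heads, head_dim, bytes_per_element=2):
    """Size of one fp16 [B, M, H, K] input tensor, in MB (1e6 bytes)."""
    return batch * seqlen * heads * head_dim * bytes_per_element / 1e6

for m in (512, 1024, 2048, 4096):
    print(f"M={m:>5}  scores={score_matrix_mb(2, 8, m):8.1f} MB  "
          f"one q/k/v tensor={qkv_tensor_mb(2, m, 8, 64):5.1f} MB")
```

The score tensor goes from about 16.8 MB at M=512 to about 1,074 MB at M=4096, a 4x increase per doubling, while each input tensor only doubles.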

Packing Variable-Length Sequences and Running Grouped-Query Attention

print("\n" + "="*70 + "\n4. Variable-length packed batch — no padding waste\n" + "="*70)
seqlens = [37, 120, 8, 200]
total = sum(seqlens)
H, K = 8, 64
q = torch.randn(1, total, H, K, device=device, dtype=torch.float16)
k = torch.randn(1, total, H, K, device=device, dtype=torch.float16)
v = torch.randn(1, total, H, K, device=device, dtype=torch.float16)
try:
   bias = ab.BlockDiagonalMask.from_seqlens(seqlens)
   out_packed = xops.memory_efficient_attention(q, k, v, attn_bias=bias)
   s0 = seqlens[0]
   ref0 = vanilla_attention(q[:, :s0], k[:, :s0], v[:, :s0]).half()
   print("packed shape         :", tuple(out_packed.shape), "(all", total, "tokens, no pad)")
   print("segment-0 max diff   : {:.2e}".format((out_packed[:, :s0] - ref0).abs().max().item()))
   cbias = ab.BlockDiagonalCausalMask.from_seqlens(seqlens)
   _ = xops.memory_efficient_attention(q, k, v, attn_bias=cbias)
   print("-> also did a packed CAUSAL pass. This is how vLLM-style engines")
   print("   batch requests of different lengths with zero padding overhead.")
   splits = bias.split(out_packed)
   print("recovered segments   :", [tuple(t.shape) for t in splits])
except Exception as e:
   print("BlockDiagonalMask path skipped on this version/backend:", repr(e))
print("\n" + "="*70 + "\n5. Grouped-query attention (5-D BMGHK layout)\n" + "="*70)
B, M, K = 2, 256, 64
n_q_heads, n_kv_heads = 8, 2
G, Hq = n_kv_heads, n_q_heads // n_kv_heads
try:
   qg = torch.randn(B, M, G, Hq, K, device=device, dtype=torch.float16)
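   # K/V are not broadcast automatically in the 5-D layout, so expand the singleton
   # head dim to Hq (a stride-0 view: no extra memory is allocated).
   kg = torch.randn(B, M, G, 1,  K, device=device, dtype=torch.float16).expand(B, M, G, Hq, K)
   vg = torch.randn(B, M, G, 1,  K, device=device, dtype=torch.float16).expand(B, M, G, Hq, K)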
   out_gqa = xops.memory_efficient_attention(qg, kg, vg)
   print("GQA output shape     :", tuple(out_gqa.shape), "= [B, M, G, Hq, K]")
   print(f"-> {n_q_heads} query heads, only {n_kv_heads} KV heads: smaller KV-cache,")
   print("   which is exactly what Llama-/Mistral-class models use at inference.")
except Exception as e:
   print("GQA 5-D path skipped on this version/backend:", repr(e))

We concatenate variable-length sequences and use BlockDiagonalMask to prevent attention from crossing sequence boundaries without padding. We recover the individual outputs and also perform packed causal attention for decoder-style workloads. We then demonstrate grouped-query attention, where multiple query heads share fewer key-value heads to reduce KV-cache requirements.
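The packing rule itself is simple enough to sketch without a GPU. The following pure-Python example, using the tutorial's `seqlens`, derives each segment's offsets on the packed token axis, encodes the block-diagonal rule (a query may only see keys from its own segment, optionally only earlier ones), and counts the padding that packing avoids. The helper names are hypothetical and this is not how `BlockDiagonalMask` is implemented internally.

```python
from itertools import accumulate

def segment_bounds(seqlens):
    """(start, end) offsets of each sequence on the packed token axis."""
    ends = list(accumulate(seqlens))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))

def segment_of(token, bounds):
    """Index of the sequence that owns a packed token position."""
    for idx, (start, end) in enumerate(bounds):
        if start <= token < end:
            return idx
    raise IndexError(token)

def may_attend(query_tok, key_tok, bounds, causal=False):
    """Block-diagonal rule: same segment only; if causal, key must not follow query."""
    same = segment_of(query_tok, bounds) == segment_of(key_tok, bounds)
    return same and (not causal or key_tok <= query_tok)

seqlens = [37, 120, 8, 200]
bounds = segment_bounds(seqlens)
packed_tokens = sum(seqlens)                  # tokens actually processed
padded_tokens = len(seqlens) * max(seqlens)   # tokens in a padded [B, max_len] batch
wasted = 1 - packed_tokens / padded_tokens

print("bounds        :", bounds)
print("packed/padded :", packed_tokens, "/", padded_tokens)
print(f"padding share : {wasted:.1%}")
```

For these lengths a padded batch would hold 800 token slots, of which only 365 carry real tokens, so more than half of the padded computation would be wasted.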

Adding a Custom ALiBi Additive Positional Bias

print("\n" + "="*70 + "\n6. Custom ALiBi additive bias\n" + "="*70)
B, M, H, K = 1, 128, 8, 64
q, k, v = (torch.randn(B, M, H, K, device=device, dtype=torch.float16) for _ in range(3))
try:
   slopes = (2.0 ** (-8.0 / H)) ** torch.arange(1, H + 1, device=device)
   pos = torch.arange(M, device=device)
   rel = (pos[None, :] - pos[:, None]).clamp(max=0).float()
   alibi = slopes[:, None, None] * rel[None]
   alibi = alibi[None].expand(B, H, M, M).to(torch.float16).contiguous()
   causal = torch.triu(torch.ones(M, M, device=device, dtype=torch.bool), 1)
   alibi = alibi.masked_fill(causal[None, None], float("-inf"))
   out_alibi = xops.memory_efficient_attention(q, k, v, attn_bias=alibi)
   print("ALiBi output shape   :", tuple(out_alibi.shape))
   print("-> any per-(head,query,key) additive bias works the same way.")
except Exception as e:
   print("Custom-bias path skipped (some backends restrict bias shapes):", repr(e))

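We construct a custom ALiBi tensor that applies a different linear positional penalty to each attention head. We combine this additive bias with a causal mask so that tokens attend only to valid previous positions. We pass the resulting bias directly to xFormers attention and print the shape of its output. Unlike the implicit masks used earlier, this bias is a dense [B, H, M, M] tensor, so it reintroduces an input whose size grows quadratically with sequence length.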
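The numbers inside the bias are easy to inspect on a small case. This pure-Python sketch mirrors the slope formula and the causal fill from the tutorial code for a single head. The function names and the 4-token example are hypothetical, and the slope formula is the one used above, which assumes the number of heads is a power of two.

```python
def alibi_slopes(n_heads):
    """Geometric slopes base, base^2, ..., with base = 2^(-8 / n_heads)."""
    base = 2.0 ** (-8.0 / n_heads)
    return [base ** i for i in range(1, n_heads + 1)]

def alibi_bias(slope, seqlen):
    """bias[i][j] = slope * (j - i) for past/current keys, -inf for future keys."""
    return [[slope * (j - i) if j <= i else float("-inf") for j in range(seqlen)]
            for i in range(seqlen)]

slopes = alibi_slopes(8)
bias = alibi_bias(slopes[0], 4)

print("slopes:", slopes)
for row in bias:
    print(row)
```

With 8 heads the slopes run from 0.5 down to 2^-8, so the first head penalizes distance steeply while the last head is nearly position-agnostic. Each row of the bias is zero on the diagonal and increasingly negative for older keys.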

Training a GPT Block with xFormers Attention and SwiGLU

print("\n" + "="*70 + "\n7. Train a small GPT block (xformers attn + SwiGLU)\n" + "="*70)
def make_swiglu(d, hidden):
   """Fused xformers SwiGLU if available, else a clean manual fallback."""
   try:
       m = xops.SwiGLU(in_features=d, hidden_features=hidden, out_features=d, bias=True)
       return m, "fused xops.SwiGLU"
   except Exception:
       class SwiGLU(nn.Module):
           def __init__(s):
               super().__init__()
               s.w12 = nn.Linear(d, 2 * hidden); s.w3 = nn.Linear(hidden, d)
           def forward(s, x):
               a, b = s.w12(x).chunk(2, -1)
               return s.w3(F.silu(a) * b)
       return SwiGLU(), "manual SwiGLU fallback"
class Block(nn.Module):
   def __init__(self, d, n_heads, mlp_mult=4):
       super().__init__()
       self.h, self.k = n_heads, d // n_heads
       self.n1, self.n2 = nn.LayerNorm(d), nn.LayerNorm(d)
       self.qkv, self.proj = nn.Linear(d, 3 * d), nn.Linear(d, d)
       self.ff, self.ff_kind = make_swiglu(d, mlp_mult * d)
   def forward(self, x):
       B, M, d = x.shape
       qkv = self.qkv(self.n1(x)).reshape(B, M, 3, self.h, self.k)
       q, kk, vv = qkv.unbind(2)
       a = xops.memory_efficient_attention(q, kk, vv, attn_bias=ab.LowerTriangularMask())
       x = x + self.proj(a.reshape(B, M, d))
       return x + self.ff(self.n2(x))
class TinyGPT(nn.Module):
   def __init__(self, vocab, d=128, n_layers=3, n_heads=8, maxlen=64):
       super().__init__()
       self.tok = nn.Embedding(vocab, d); self.pos = nn.Embedding(maxlen, d)
       self.blocks = nn.ModuleList(Block(d, n_heads) for _ in range(n_layers))
       self.nf, self.head = nn.LayerNorm(d), nn.Linear(d, vocab)
   def forward(self, idx):
       B, M = idx.shape
       x = self.tok(idx) + self.pos(torch.arange(M, device=idx.device))[None]
       for b in self.blocks: x = b(x)
       return self.head(self.nf(x))
VOCAB, SEQ = 64, 64
def make_batch(B):
   start = torch.randint(0, VOCAB, (B, 1), device=device)
   return (start + torch.arange(SEQ, device=device)[None]) % VOCAB
model = TinyGPT(VOCAB).to(device)
print("FFN type             :", model.blocks[0].ff_kind)
opt = torch.optim.AdamW(model.parameters(), lr=3e-3)
scaler = torch.amp.GradScaler("cuda")
for step in range(400):
   seq = make_batch(64); inp, tgt = seq[:, :-1], seq[:, 1:]
   with torch.autocast("cuda", dtype=torch.float16):
       logits = model(inp)
       loss = F.cross_entropy(logits.reshape(-1, VOCAB), tgt.reshape(-1))
   opt.zero_grad(); scaler.scale(loss).backward(); scaler.step(opt); scaler.update()
   if step % 80 == 0 or step == 399:
       acc = (logits.argmax(-1) == tgt).float().mean().item()
       print(f"step {step:4d} | loss {loss.item():.4f} | next-token acc {acc*100:5.1f}%")
print("-> a full causal transformer running on memory-efficient attention,")
print("   trained end-to-end with AMP. Swap in real data/tokenizer to scale up.")
print("\nDone. Sections 1-3 are core; 4-6 are the advanced bits worth keeping.")

We build a compact GPT-style Transformer using causal xFormers attention, residual connections, normalization, and SwiGLU feed-forward layers. We train the model with automatic mixed precision on a synthetic next-token prediction task that counts upward modulo the vocabulary size. We monitor its loss and accuracy to confirm that the complete memory-efficient Transformer learns successfully end-to-end.
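The synthetic task is worth spelling out, because it explains why near-perfect accuracy is a reasonable expectation. The sketch below rebuilds one training sequence in plain Python with the same `VOCAB` and `SEQ` values. The `make_sequence` helper and the fixed seed are illustrative assumptions rather than part of the tutorial code.

```python
import random

VOCAB, SEQ = 64, 64

def make_sequence(start):
    """Count upward from `start`, wrapping around at VOCAB."""
    return [(start + i) % VOCAB for i in range(SEQ)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
seq = make_sequence(rng.randrange(VOCAB))
inp, tgt = seq[:-1], seq[1:]  # same one-token shift as in the training loop

# Every target is fully determined by the token directly before it.
deterministic = all(t == (x + 1) % VOCAB for x, t in zip(inp, tgt))
print("input length  :", len(inp))
print("deterministic :", deterministic)
```

Because each target equals the previous token plus one modulo `VOCAB`, the task carries no irreducible uncertainty. That makes it a convenient end-to-end sanity check rather than a measure of modeling capacity.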

Conclusion

In conclusion, we developed a practical understanding of how xFormers improves Transformer efficiency without changing the fundamental attention calculation. We saw how memory-efficient kernels reduce the cost of long sequences, while causal masks, packed sequences, grouped-query attention, and additive biases support realistic training and inference workflows. We concluded by integrating these capabilities into a compact GPT model and training it end-to-end, giving us a strong foundation for applying xFormers to larger language models and more demanding datasets.



Sana Hassan

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.