Custom Kernels for All from Codex and Claude


ben burtenshaw, Sayak Paul, Aritra Roy Gosthipaty, shaun smith · 2026-02-13 · via Hugging Face - Blog


tl;dr: We built an agent skill that teaches coding agents how to write production CUDA kernels. Then we pointed Claude and Codex at two real targets: a diffusers pipeline and a transformers model. The agents produced working kernels for both, with correct PyTorch bindings and benchmarks, end to end.

Writing CUDA kernels is hard. Writing CUDA kernels that correctly integrate with transformers and diffusers is harder. There are architecture-specific memory access patterns, vectorization strategies, warp shuffle reductions, and a dozen integration pitfalls that trip up even experienced developers. It is exactly the kind of specialized, high-stakes problem where agent skills shine.

We gave coding agents the domain knowledge they need, such as which GPU architecture to target, how to structure a kernel-builder project, when to use shared memory versus registers, and how to write PyTorch bindings. The agents did the rest. If you have used the LLM training skill or read "We Got Claude to Teach Open Models", the pattern will feel familiar: package domain expertise into a skill, point the agent at a problem, and let it work.

Why a skill for kernels?

The Kernel Hub solved the distribution of custom hardware kernels. You can load pre-compiled kernels from the Hub with a single get_kernel call. No builds, no flags. However, someone still needs to write the kernels. That is the gap this skill fills.

CUDA kernel development has a brutal surface area:

Hardware: each GPU generation needs its own optimization guide. H100, A100, and T4 differ in compute capability, shared memory size, and bandwidth profile.
Libraries: diffusers and transformers have different module hierarchies, normalization conventions, and integration patterns. Custom kernels also need to be registered in PyTorch for torch.compile to recognize them.
Distribution: kernels can depend on CUDA, PyTorch, and Python versions, which creates massive environment matrices.
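To get a feel for how quickly the distribution matrix grows, here is a small sketch that enumerates build combinations. The version lists are illustrative assumptions, not the set of variants the Kernel Hub actually builds.

```python
from itertools import product

# Illustrative version lists (assumptions, not the Kernel Hub's real build set).
CUDA_VERSIONS = ["11.8", "12.4", "12.6"]
TORCH_VERSIONS = ["2.5", "2.6", "2.7"]
PYTHON_VERSIONS = ["3.10", "3.11", "3.12"]


def build_matrix(cuda=CUDA_VERSIONS, torch=TORCH_VERSIONS, python=PYTHON_VERSIONS):
    """Return every (cuda, torch, python) combination a kernel might need."""
    return list(product(cuda, torch, python))


matrix = build_matrix()
print(len(matrix))  # 27 combinations from only three versions of each
```

Every additional version on any axis multiplies the number of binaries, which is why pre-compiled distribution matters.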

This is domain knowledge that gets lost in documentation tabs and Stack Overflow answers. An agent skill packages it into context that loads on demand.

First, we show how to use the skill right away; then we dive into the details of how we benchmarked the kernels.

Installing the skill

The skill ships with the kernels library. Install it into your coding agent with a single command:

# we need to install kernels from main for this
pip install git+https://github.com/huggingface/kernels.git#subdirectory=kernels
kernels skills add cuda-kernels --claude

This drops the skill into .claude/skills/cuda-kernels/ where Claude Code and Cursor pick it up automatically. For other agents:

# Codex
kernels skills add cuda-kernels --codex

# OpenCode
kernels skills add cuda-kernels --opencode

# Custom destination
kernels skills add cuda-kernels --dest ./my-agent/skills/

# Install globally (available across all projects)
kernels skills add cuda-kernels --global

# Overwrite an existing installation
kernels skills add cuda-kernels --claude --force

Once installed, prompt your agent:

Build a vectorized RMSNorm kernel for H100 targeting the Qwen3-8B model in transformers.

Or, you can go for something more open-ended:

Build an optimized attention kernel for H100 targeting the Qwen3-8B model in transformers. Benchmark it against the PyTorch baseline and validate improvements in end-to-end performance.

The agent can read the skill, select the right architecture parameters, generate the CUDA source, write the PyTorch bindings, set up build.toml, and create a benchmark script.
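For readers unfamiliar with the operation behind the first prompt, the following is a minimal pure-Python sketch of what an RMSNorm kernel computes for one row of activations. It is a readability aid, not the generated CUDA code; the function name and the eps default are our assumptions.

```python
import math


def rmsnorm_reference(x, weight, eps=1e-6):
    """Normalize one row by its root mean square, then scale per element.

    eps=1e-6 is an assumed default; each model configures its own value.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


row = [1.0, 2.0, 3.0, 4.0]
out = rmsnorm_reference(row, weight=[1.0] * 4)
# With unit weights, the mean of the squared outputs is approximately 1.
```

A GPU kernel performs the same reduction over the hidden dimension for every token, which is where vectorized loads and warp-level reductions come in.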

If you're working on more complex kernels or architecture-specific optimizations that the skill doesn't cover, it still supplies the fundamental building blocks and patterns to get you started. We also welcome contributions to the skill itself.

What is in the skill

The skill is roughly 550 tokens of structured guidance plus reference scripts, GPU optimization guides, troubleshooting docs, and complete working examples. Agentic coding tools like Codex and Claude can read this and produce a working kernel project.

It covers:

Architecture-aware optimization for NVIDIA H100, A100, and T4 GPUs (compute capabilities, memory bandwidth, shared memory sizes, block sizing)
Integration patterns for both diffusers and transformers, including the pitfalls specific to each library
Kernel templates with vectorized memory access patterns for BF16, FP16, and FP32
Benchmarking workflows for both isolated kernel micro-benchmarks and end-to-end pipeline comparisons
Hugging Face Kernel Hub integration via get_kernel for loading community kernels

.claude/skills/cuda-kernels/
├── SKILL.md                              # Main instructions (~550 tokens)
├── scripts/
│   ├── benchmark_example.py              # End-to-end benchmark template
│   ├── benchmark_rmsnorm.py              # Isolated kernel micro-benchmark
│   ├── ltx_kernel_injection_example.py   # Diffusers integration pattern
│   ├── transformers_injection_example.py # Transformers integration pattern
│   └── huggingface_kernels_example.py    # Kernel Hub integration
└── references/
    ├── diffusers-integration.md          # Diffusers guide with pitfalls
    ├── transformers-integration.md       # Transformers guide
    ├── huggingface-kernels-integration.md
    ├── h100-optimization-guide.md
    ├── a100-optimization-guide.md
    ├── t4-optimization-guide.md
    ├── kernel-templates.md
    └── troubleshooting.md

When an agent loads this, it gets everything it needs to go from "write me an RMSNorm kernel" to a buildable, benchmarkable project. It will grep and glob the skill to find the relevant files and directories, so it is important to structure the skill so that its contents are easy to find.
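To illustrate why descriptive file names matter, here is a rough sketch of a glob-style lookup over a few of the file names from the layout above. The helper find_skill_files is hypothetical and does not reflect how any particular agent implements its search.

```python
import tempfile
from pathlib import Path


def find_skill_files(skill_root, keyword):
    """Return skill files whose file name mentions the keyword, sorted."""
    root = Path(skill_root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and keyword.lower() in p.name.lower()
    )


# Recreate a few file names from the layout above in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for rel in [
        "SKILL.md",
        "scripts/benchmark_rmsnorm.py",
        "references/h100-optimization-guide.md",
        "references/a100-optimization-guide.md",
        "references/transformers-integration.md",
    ]:
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    h100_hits = find_skill_files(tmp, "h100")
    guide_hits = find_skill_files(tmp, "optimization-guide")
```

Because the GPU name and the topic appear in each file name, a single keyword is enough to land on the right reference.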

The agent is instructed to generate kernels that conform to the templates in references/kernel-templates.md and produce a complete kernel project:

examples/your_model/
├── kernel_src/
│   └── rmsnorm.cu              # Vectorized CUDA kernel
├── torch-ext/
│   ├── your_kernels/__init__.py
│   └── torch_binding.cpp       # PyTorch C++ bindings
├── benchmark_rmsnorm.py        # Micro-benchmark script
├── build.toml                  # kernel-builder config
├── setup.py                    # pip install -e .
└── pyproject.toml

We tested this on two real targets.

Benchmarking the kernels: Diffusers (LTX-Video on H100)

The agent built RMSNorm, RoPE 3D, GEGLU, and AdaLN kernels for LTX-Video, a video generation pipeline from diffusers. The full example is at examples/ltx_video/. We optimized the RMSNorm kernel for H100. Both benchmarks were run on an H100 80GB HBM3 in BFloat16 precision.

If you want to check out the generated kernel, see the example at examples/ltx_video/.

Isolated RMSNorm benchmark

First, we compare the isolated RMSNorm kernel performance against the PyTorch baseline. This is the main speedup in the optimized pipeline.


Shape	Custom (ms)	PyTorch (ms)	Speedup
[1x1024x2048]	0.039	0.064	1.64x
[2x1024x2048]	0.040	0.073	1.82x
[4x1024x2048]	0.052	0.093	1.78x
[1x4096x2048]	0.052	0.093	1.79x
[2x4096x3072]	0.102	0.209	2.04x
[1x8192x2048]	0.083	0.150	1.81x
[4x4096x3072]	0.173	0.393	2.26x

Average speedup: 1.88x. Bandwidth efficiency: 34.7% of the H100 theoretical peak (3,350 GB/s).
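The summary figures can be sanity-checked with a little arithmetic. The sketch below assumes the kernel reads the input once, writes the output once, and reads one weight vector, all in BF16. That traffic model is our assumption, and the article does not state which shape the efficiency figure refers to.

```python
H100_PEAK_GBPS = 3350.0  # theoretical bandwidth quoted in the article
BYTES_PER_BF16 = 2


def bandwidth_efficiency(shape, time_ms, peak_gbps=H100_PEAK_GBPS):
    """Estimate achieved bandwidth as a fraction of peak for one RMSNorm call.

    Assumes one read of the input, one write of the output, and one read of
    a weight vector of the hidden size, all in BF16.
    """
    numel = 1
    for dim in shape:
        numel *= dim
    hidden = shape[-1]
    bytes_moved = (2 * numel + hidden) * BYTES_PER_BF16
    achieved_gbps = bytes_moved / (time_ms * 1e-3) / 1e9
    return achieved_gbps / peak_gbps


# Speedup column from the table above.
speedups = [1.64, 1.82, 1.78, 1.79, 2.04, 1.81, 2.26]
average_speedup = sum(speedups) / len(speedups)
efficiency = bandwidth_efficiency((4, 4096, 3072), time_ms=0.173)
print(f"{average_speedup:.2f}x, {efficiency:.1%}")  # 1.88x, 34.7%
```

Under these assumptions the largest shape works out to about 34.7% of peak, which lines up with the reported figure.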

End-to-end video generation (49 frames, 30 steps, H100 80GB)

Next, we compare the end-to-end video generation performance of the optimized kernels against the baseline (no compile) and the torch.compile baseline.


Configuration	Time (s)	it/s	Speedup
Baseline (no compile)	2.87	12.58	1.00x
Generated Optimized Kernels	2.70	13.52	1.06x
Baseline + torch.compile	2.14	19.05	1.34x
Optimized + torch.compile	2.01	18.45	1.43x
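The Speedup column follows directly from the Time column. As a quick check, this sketch divides the baseline time by each configuration's time, using only the numbers from the table:

```python
BASELINE_SECONDS = 2.87  # "Baseline (no compile)" row

# Wall-clock times from the table above, in seconds.
times = {
    "Generated Optimized Kernels": 2.70,
    "Baseline + torch.compile": 2.14,
    "Optimized + torch.compile": 2.01,
}

speedups = {name: round(BASELINE_SECONDS / t, 2) for name, t in times.items()}

# Gain from the custom kernels on top of torch.compile alone.
kernels_on_top_of_compile = round(times["Baseline + torch.compile"]
                                  / times["Optimized + torch.compile"], 2)
```

By the same arithmetic, the custom kernels add roughly 6% whether or not torch.compile is enabled (2.14 s / 2.01 s is about 1.06).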

RMSNorm accounts for ~5% of total compute in LTX-Video. The remaining time is spent in attention, linear projections, and VAE decode. The 6% end-to-end speedup from a single kernel type is consistent with that profile.

Benchmarking the kernels: Transformers (Qwen3-8B on H100)

The agent built an RMSNorm kernel for Qwen3-8B, a large language model from transformers with 65 RMSNorm modules across 32 layers. The full example is at examples/qwen3_8b/. We optimized the RMSNorm kernel for H100. Both benchmarks were run on an H100 80GB HBM3 in BFloat16 precision.

If you want to explore the kernel, see the example at examples/qwen3_8b/.

Isolated RMSNorm benchmark

Once again, we compare the isolated RMSNorm kernel performance against the PyTorch baseline.

Average speedup: 1.94x. Bandwidth efficiency: 22.3% of the H100 theoretical peak (3,350 GB/s).


Shape	Custom (ms)	PyTorch (ms)	Speedup
[1x128x4096]	0.040	0.062	1.58x
[1x512x4096]	0.038	0.064	1.69x
[1x1024x4096]	0.037	0.071	1.90x
[1x2048x4096]	0.045	0.091	2.03x
[1x4096x4096]	0.071	0.150	2.12x
[4x512x4096]	0.056	0.093	1.67x
[8x256x4096]	0.045	0.092	2.06x
[1x8192x4096]	0.109	0.269	2.47x

Speedup scales with sequence length: 1.58x at 128 tokens, 2.47x at 8192 tokens. For long-context inference, the custom kernel cuts RMSNorm latency by more than half.

Publishing your kernel to the Hub

The agent gives you a working kernel. The Kernel Hub lets you share it so anyone can load it without compilation. Here is the full path from agent output to published kernel.

1. Verify the project structure

The agent produces a project that already follows the kernel-builder layout:

your_kernel/
├── build.toml               # Build configuration
├── kernel_src/
│   └── rmsnorm.cu           # CUDA kernel source
└── torch-ext/
    ├── torch_binding.cpp    # Registers Torch ops
    └── your_kernels/
        └── __init__.py      # Python API wrapping _ops
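Before building, it can help to confirm that nothing is missing from this layout. The sketch below is a hypothetical helper (missing_files is not part of kernel-builder) that checks for the files shown above, with your_kernels standing in for your package name.

```python
import tempfile
from pathlib import Path

# Paths taken from the layout above; "your_kernels" is the placeholder package name.
REQUIRED_FILES = [
    "build.toml",
    "kernel_src/rmsnorm.cu",
    "torch-ext/torch_binding.cpp",
    "torch-ext/your_kernels/__init__.py",
]


def missing_files(project_root, required=REQUIRED_FILES):
    """Return the required paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in required if not (root / rel).is_file()]


# Demo on a throwaway directory that lacks the torch-ext files.
with tempfile.TemporaryDirectory() as tmp:
    for rel in REQUIRED_FILES[:2]:
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    missing = missing_files(tmp)
```

An empty list means the project matches the layout. In the demo, the two torch-ext paths are reported as missing.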

The build.toml tells kernel-builder what to build. The agent generates this for you, including the correct cuda-capabilities for your target GPU:

[general]
name = "your_kernels"
backends = ["cuda"]

[torch]
src = ["torch-ext/torch_binding.cpp"]

[kernel.rmsnorm]
backend = "cuda"
src = ["kernel_src/rmsnorm.cu"]
depends = ["torch"]
cuda-capabilities = ["9.0"]  # H100
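The cuda-capabilities value is the compute capability of the target GPU. For the three GPUs the skill covers, the mapping can be sketched as follows; the helper cuda_capabilities_line is hypothetical and only formats the entry.

```python
# Compute capabilities for the three GPUs the skill covers.
COMPUTE_CAPABILITY = {"T4": "7.5", "A100": "8.0", "H100": "9.0"}


def cuda_capabilities_line(gpus):
    """Render a cuda-capabilities entry for build.toml from GPU names."""
    caps = sorted({COMPUTE_CAPABILITY[g] for g in gpus}, key=float)
    quoted = ", ".join(f'"{c}"' for c in caps)
    return f"cuda-capabilities = [{quoted}]"


print(cuda_capabilities_line(["H100"]))          # cuda-capabilities = ["9.0"]
print(cuda_capabilities_line(["H100", "A100"]))  # cuda-capabilities = ["8.0", "9.0"]
```

Listing more than one capability targets several GPU generations from the same source file.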

2. Build all variants with Nix

Kernel Hub kernels must support all recent PyTorch and CUDA configurations. The kernel-builder Nix flake handles this automatically. Copy the example flake.nix into your project and run:

nix flake update
nix run .#build-and-copy -L

This builds the kernel for every required PyTorch/CUDA variant and places the results in build/. For faster builds, enable the Hugging Face Nix cache:

nix run nixpkgs#cachix -- use huggingface

3. Create a Hub repo and push

Create a model repo on the Hub and upload the built kernel:

huggingface-cli repo create your-org/your-kernel --type model
huggingface-cli upload your-org/your-kernel ./build

4. Others load it in one line

Once published, anyone can use your kernel with zero compilation:

from kernels import get_kernel

rmsnorm = get_kernel("your-org/your-kernel")

get_kernel detects the user's Python, PyTorch, and CUDA versions and downloads the matching pre-compiled binary. No builds, no flags, typically ready in seconds.
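Conceptually, this matching step is a lookup from the local environment to a pre-built variant. The sketch below illustrates the idea only: the variant table, its keys, and both helpers are hypothetical and do not reflect the kernels library's internals or naming scheme. It also leaves out the Python version for brevity.

```python
# Hypothetical table of pre-built variants. The keys and paths are made up for
# illustration and are not the naming scheme the kernels library uses.
PREBUILT_VARIANTS = {
    ("2.6", "12.4"): "build/torch26-cu124",
    ("2.7", "12.6"): "build/torch27-cu126",
}


def major_minor(version):
    """Reduce a version string such as '2.7.1+cu126' to '2.7'."""
    return ".".join(version.split("+")[0].split(".")[:2])


def select_variant(torch_version, cuda_version, variants=PREBUILT_VARIANTS):
    """Pick the pre-built variant matching the local PyTorch and CUDA versions."""
    key = (major_minor(torch_version), major_minor(cuda_version))
    if key not in variants:
        raise LookupError(f"no pre-built variant for torch {key[0]}, CUDA {key[1]}")
    return variants[key]


chosen = select_variant("2.7.1+cu126", "12.6")
```

If no variant matches, failing loudly is preferable to silently loading an incompatible binary, which is why the sketch raises instead of falling back.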

The skill and the Hub are complementary. The skill handles development. The Hub handles distribution. Build a kernel with the skill, validate it with the benchmark scripts, publish it to the Hub, and it becomes a one-liner for everyone else.

Conclusion

We built an agent skill that teaches coding agents how to write production CUDA kernels. Then we pointed Claude and Codex at two real targets: a diffusers pipeline and a transformers model. The agents produced working kernels for both, with correct PyTorch bindings and benchmarks, end to end. In our benchmarks on H100, the generated RMSNorm kernels averaged 1.88x (LTX-Video) and 1.94x (Qwen3-8B) over the PyTorch baseline in isolation, and the optimized kernels gave a 1.06x end-to-end speedup for LTX-Video video generation.
