NVIDIA CUTLASS: High-Performance CUDA Templates for AI Linear Algebra
pickuma · 2026-05-28 · via DEV Community

If you've trained a transformer in the last three years, your GPU spent most of its wall-clock time inside a matrix multiplication. The kernels doing that work probably came from cuBLAS, were generated by a compiler stack like Triton, or were hand-built on top of NVIDIA's CUTLASS templates. CUTLASS is the one most people don't see directly, but it sits underneath a surprising amount of modern AI infrastructure, from FlashAttention to vLLM to several internal kernels inside PyTorch.

What CUTLASS actually is

CUTLASS (CUDA Templates for Linear Algebra Subroutines) is a header-only C++ template library that NVIDIA publishes on GitHub under the BSD 3-Clause license. It is not a drop-in replacement for cuBLAS. cuBLAS gives you a closed-source binary with a stable API: you call cublasGemmEx and you get a tuned kernel. CUTLASS gives you the building blocks to write your own kernel, with control over tile sizes, data layouts, epilogues, and how the kernel decomposes work across the GPU's memory hierarchy.

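That control is the point. If you're building a custom inference engine and your projection layer needs to fuse a GEMM with a SiLU activation and a residual add, a standard cuBLAS GEMM call can't fuse that epilogue for you (cuBLASLt offers a fixed menu of fused epilogues, such as bias and a few activations, but not arbitrary ones). You'd launch the GEMM, then a separate elementwise kernel, so the GEMM output makes an extra round trip through global memory: written by the GEMM, then read back and rewritten by the elementwise kernel. With CUTLASS, the epilogue is a template parameter. You write the fusion once, instantiate the template, and the compiler emits a single kernel.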
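
To make the traffic argument concrete, here is a minimal sketch in plain Python. It does not use CUTLASS, and `gemm_with_epilogue`, `unfused`, `fused`, and the `traffic` counter are hypothetical names chosen for illustration. The epilogue is modeled as a function passed into the GEMM, loosely analogous to a template parameter, and the counter tracks only reads and writes of output-sized buffers (reads of A and B are the same in both paths and are left out).

```python
import math


def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def gemm_with_epilogue(A, B, epilogue, traffic):
    """C[i][j] = epilogue(sum_k A[i][k] * B[k][j], i, j).

    The epilogue runs on the accumulator before the single store,
    which is the effect of fusing it into the GEMM kernel.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i][k] * B[k][j]
            C[i][j] = epilogue(acc, i, j)
            traffic["writes"] += 1
    return C


def unfused(A, B, R, traffic):
    # Pass 1: plain GEMM, output stored to "global memory".
    T = gemm_with_epilogue(A, B, lambda acc, i, j: acc, traffic)
    # Pass 2: separate elementwise kernel reads T and R, writes the result.
    out = [[0.0] * len(T[0]) for _ in T]
    for i, row in enumerate(T):
        for j, t in enumerate(row):
            traffic["reads"] += 2  # T[i][j] and R[i][j]
            out[i][j] = silu(t) + R[i][j]
            traffic["writes"] += 1
    return out


def fused(A, B, R, traffic):
    def silu_residual(acc, i, j):
        traffic["reads"] += 1  # only the residual R[i][j]
        return silu(acc) + R[i][j]

    return gemm_with_epilogue(A, B, silu_residual, traffic)
```

Both paths return the same matrix. For a 2x2 output, the unfused path counts 8 reads and 8 writes of output-sized data, while the fused path counts 4 and 4. A real kernel would see the same ratio in global-memory traffic for the output tensor, though actual speedups depend on how memory-bound the layer is.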

This is why CUTLASS shows up wherever standard cuBLAS shapes don't fit — unusual data types like FP8, custom epilogues, sparse or grouped GEMMs, attention-shaped matrix products. Anywhere the stock library doesn't have what someone needs and the performance ceiling matters, you tend to find a CUTLASS kernel.

CUTLASS is not a productivity library. It is a kernel-author's library. If you're writing PyTorch model code, you'll never import cutlass directly — you'll consume kernels that were built on top of it. The audience here is people who write the kernels other people import.

The hierarchy that makes CUTLASS work

A modern GPU is not flat. An NVIDIA H100 SXM has 132 streaming multiprocessors (SMs), each holding warps of 32 threads, with a tiered memory system spanning registers, shared memory, L2, and HBM. A well-tuned GEMM has to decompose the same multiplication problem at every level of that hierarchy and pick tile sizes that keep the tensor cores fed without spilling.

CUTLASS encodes this hierarchy directly into its type system:

  • Device-level templates describe the full GEMM problem and dispatch to a kernel grid.
  • Kernel-level templates describe how a single grid block divides its work.
  • Threadblock-level templates describe the tile each block computes, plus the shared-memory staging pattern.
  • Warp-level templates map onto tensor core MMA instructions: mma.sync on Ampere, and on Hopper wgmma, which is issued by a warpgroup of four warps rather than a single warp.
  • Thread-level templates handle per-thread accumulation and the epilogue.
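
The decomposition in the list above can be sketched as nested loops. This is a plain-Python reference rather than CUTLASS code: `tiled_gemm` and its tile sizes are hypothetical, the tiles must divide the problem evenly, and on a GPU the threadblock and warp loops would run in parallel instead of sequentially.

```python
def tiled_gemm(A, B, tb=(4, 4, 2), warp=(2, 2)):
    """Reference GEMM structured like the hierarchy described above.

    tb   = (TB_M, TB_N, TB_K): tile one threadblock computes, and its K step.
    warp = (W_M, W_N): sub-tile one warp owns inside the threadblock tile.
    Returns the product and the number of threadblock tiles in the grid.
    """
    M, K, N = len(A), len(B), len(B[0])
    TB_M, TB_N, TB_K = tb
    W_M, W_N = warp
    assert M % TB_M == 0 and N % TB_N == 0 and K % TB_K == 0
    assert TB_M % W_M == 0 and TB_N % W_N == 0

    C = [[0] * N for _ in range(M)]
    blocks = 0
    # Device/kernel level: one iteration per threadblock in the grid.
    for bm in range(0, M, TB_M):
        for bn in range(0, N, TB_N):
            blocks += 1
            # Threadblock level: march over K, one staged tile at a time.
            for bk in range(0, K, TB_K):
                # Warp level: each warp owns a W_M x W_N piece of the tile.
                for wm in range(bm, bm + TB_M, W_M):
                    for wn in range(bn, bn + TB_N, W_N):
                        # Thread/instruction level: the multiply-accumulates.
                        for i in range(wm, wm + W_M):
                            for j in range(wn, wn + W_N):
                                for k in range(bk, bk + TB_K):
                                    C[i][j] += A[i][k] * B[k][j]
    return C, blocks
```

With the default tiles, an 8x8 output with K = 4 is covered by 4 threadblock tiles, each split into 4 warp tiles. Changing `tb` or `warp` changes how the work is carved up but not the result, which is the property that lets tile sizes be tuned per shape.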

Each layer takes the layer below it as a template parameter. The compiler instantiates the whole stack at build time, so you pay no virtual-dispatch overhead at runtime; the cost is build time and binary size. A non-trivial CUTLASS kernel can take tens of seconds to compile and produce a multi-megabyte object file. That is why teams typically ship CUTLASS-based libraries with ahead-of-time-generated kernels for the shapes they care about, rather than JIT-compiling per request.
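
The ahead-of-time pattern might look roughly like the following sketch. Everything here is hypothetical and in plain Python: `KernelTable`, `generic_gemm`, and `gemm_2x2x2` stand in for a registry of prebuilt kernels, a general-purpose kernel, and a shape-specialized kernel, and a real library would key on more than the problem shape (data type, layout, architecture).

```python
class KernelTable:
    """Maps a problem shape (M, N, K) to a kernel built ahead of time."""

    def __init__(self, fallback):
        self._kernels = {}
        self._fallback = fallback

    def register(self, shape, kernel):
        self._kernels[shape] = kernel

    def lookup(self, shape):
        # No compilation happens here: unknown shapes take the generic path.
        return self._kernels.get(shape, self._fallback)


def generic_gemm(A, B):
    # Stand-in for a general-purpose kernel.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def gemm_2x2x2(A, B):
    # Stand-in for a kernel specialized (fully unrolled) for one shape.
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    return [[a * e + b * g, a * f + b * h], [c * e + d * g, c * f + d * h]]


def run_gemm(table, A, B):
    shape = (len(A), len(B[0]), len(B))  # (M, N, K)
    return table.lookup(shape)(A, B)
```

The table is filled once at startup with the shapes the service cares about, so the request path is a dictionary lookup rather than a compile.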

The payoff is performance close to what NVIDIA's own profiler reports as the achievable peak for a given shape, with full control over how the kernel behaves. cuBLAS picks a kernel for you through internal heuristics, and for unusual shapes that pick may be a generic kernel rather than a specialized one; CUTLASS lets you write the specialized one and own the result.

CuTe and the Python DSL

CUTLASS 3.x, released around the Hopper launch, introduced CuTe — short for CUDA Tensors. CuTe is a lower-level tensor algebra library that replaces a lot of the hand-rolled layout math in earlier CUTLASS versions. Instead of writing pointer arithmetic and indexing logic by hand, you describe a layout as a composition of shapes and strides, and CuTe handles the rest.
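
The shape-and-stride idea can be illustrated with a small plain-Python model. This is not the CuTe API: `layout` is a hypothetical helper that covers only flat layouts, whereas CuTe layouts can also nest shapes and strides hierarchically and support operations such as composition. The core mapping is the same, though: an offset is the inner product of a coordinate with the strides.

```python
def layout(shape, stride):
    """Return a function mapping a coordinate tuple to a linear offset."""
    assert len(shape) == len(stride)

    def offset(coord):
        assert len(coord) == len(shape)
        assert all(0 <= c < s for c, s in zip(coord, shape))
        # Inner product of coordinate and stride.
        return sum(c * d for c, d in zip(coord, stride))

    return offset


col_major = layout((4, 2), (1, 4))   # 4x2, columns contiguous
row_major = layout((4, 2), (2, 1))   # 4x2, rows contiguous
padded = layout((4, 2), (1, 8))      # column-major with leading dimension 8

# A transposed view swaps the modes of shape and stride; no data moves.
col_major_t = layout((2, 4), (4, 1))

print(col_major((2, 1)))    # 6
print(row_major((2, 1)))    # 5
print(padded((2, 1)))       # 10
print(col_major_t((1, 2)))  # 6, same element as col_major((2, 1))
```

Row-major, column-major, padded, and transposed storage all become different (shape, stride) pairs fed to the same indexing rule, which is what removes the per-kernel pointer arithmetic.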

If you've worked with Triton's block-pointer API or with XLA's HLO layouts, CuTe will feel familiar in spirit, but it operates at a lower level — it's designed to give you the same control you'd have writing inline PTX, with composable abstractions instead of macros. Most new CUTLASS kernels targeting Hopper and Blackwell tensor cores are written using CuTe primitives rather than the older threadblock-level abstractions.

CUTLASS 4.x went further and added a Python DSL, the CuTe DSL. You write kernels in a constrained subset of Python that is JIT-compiled to GPU code and exposes the same CuTe concepts as the C++ library, without going through NVCC and C++ template instantiation. This is aimed at researchers who want to prototype a kernel shape without setting up an NVCC build environment, and at framework authors who want to generate kernels programmatically.

The Python DSL is newer and the documentation is still catching up to the C++ side. If you're building production kernels today, the C++ template library is the better-documented path. The Python DSL is worth tracking if you prototype kernels often or generate them at build time.

When to reach for CUTLASS — and when not to

CUTLASS is the right tool when three things are true at once: you need a GEMM-shaped computation, you need control cuBLAS doesn't expose, and you're willing to spend the engineering time to tune kernels. If any of those is false, reach for something else.

  • For standard matrix multiplication in standard data types, cuBLAS is faster to integrate and usually within a few percent of a hand-tuned CUTLASS kernel.
  • For experimentation with custom kernels in Python, Triton has a gentler ramp and a much faster compile loop.
  • For attention specifically, FlashAttention and similar published kernels are likely already what you'd build.
  • For non-NVIDIA hardware, CUTLASS is a non-starter — it's CUDA-only by design.

The teams that get the most out of CUTLASS are the ones building inference engines, training frameworks, or specialized kernels for novel data types — the cases where the standard library doesn't have what you need and the gap between "close to peak" and "actually at peak" shows up in the GPU bill.


Originally published at pickuma.com. Subscribe to the RSS or follow @pickuma.bsky.social for new reviews.