How to Fix CUDA Out of Memory Errors in Stable Diffusion WebUI
Alan West · 2026-05-22 · via DEV Community

You finally got the WebUI running. You queue up a 1024x1024 generation, hit Generate, and a few seconds later your terminal vomits RuntimeError: CUDA out of memory. Tried to allocate 2.50 GiB. Cool. Cool cool cool.

I've been through this dance on three different rigs now — a 6GB laptop, an 8GB desktop, and a borrowed 12GB workstation — and the fix is almost never "buy a bigger GPU." It's usually a config problem. Let me walk you through what's actually happening and how to make it stop.

What's actually going on under the hood

When you generate an image, the diffusion model loads weights into VRAM, then the U-Net runs N denoising steps where each step holds activations, attention maps, and intermediate tensors in memory. An SDXL checkpoint is roughly 6.6 GB in fp16, and the U-Net accounts for most of that (about 2.6 billion parameters, so around 5 GB). Once the VAE, both text encoders, and the per-step activations at full resolution are resident too, you can easily blow past 10 GB before you've drawn a single pixel.
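A quick way to sanity-check figures like these is to multiply the parameter count by the bytes per parameter. The sketch below assumes a U-Net of roughly 2.6 billion parameters (the figure reported for SDXL); the function name is made up for illustration, and activations are deliberately left out.

```python
# Back-of-the-envelope weight memory: parameters x bytes per parameter.
# The 2.6e9 figure is an assumed U-Net parameter count for SDXL; activations,
# the VAE, and the text encoders all come on top of this.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weights_gb(num_params: float, dtype: str = "fp16") -> float:
    """Return the approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in ("fp16", "fp32"):
        print(f"U-Net in {dtype}: ~{weights_gb(2.6e9, dtype):.1f} GB")
```

The same arithmetic shows why running anything in fp32 is expensive: the weights alone double in size.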

The really nasty part: PyTorch's allocator doesn't always release memory back to the driver between runs. So you'll have a successful generation, then the next one crashes — even though nothing changed. The fragmentation got you.

A few common root causes I've hit over and over:

  • Attention layers exploding. Naive scaled dot-product attention materializes the full attention matrix, which scales quadratically with the number of latent tokens, so doubling both width and height makes it roughly 16x larger.
  • Hires fix doubling everything. It runs a second generation at upscaled resolution. That second pass needs its own activations.
  • VAE decode at full precision. The default VAE can spike VRAM at the decode step, especially with --no-half-vae.
  • Other processes hogging VRAM. Your browser's hardware acceleration, a Discord overlay, or a stray Python kernel can easily eat 1-2 GB.
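To put a number on the attention point, here is a rough estimate of the size of one naive self-attention score matrix. It assumes an 8x VAE downsampling factor, 8 attention heads, fp16 values, and attention running at full latent resolution. That is an SD 1.5-style layout chosen for illustration, not a measurement of any specific model, and the function name is hypothetical.

```python
def attention_matrix_bytes(width: int, height: int, heads: int = 8,
                           bytes_per_value: int = 2, vae_factor: int = 8) -> int:
    """Size of one naive (tokens x tokens) attention score matrix across all heads."""
    # Each latent "pixel" becomes one token in the self-attention layer.
    tokens = (width // vae_factor) * (height // vae_factor)
    return tokens * tokens * heads * bytes_per_value


if __name__ == "__main__":
    for side in (512, 768, 1024):
        mib = attention_matrix_bytes(side, side) / 2**20
        print(f"{side}x{side}: {mib:,.0f} MiB per layer per image")
```

Under these assumptions the matrix goes from 256 MiB at 512x512 to 4 GiB at 1024x1024, which is why memory-efficient attention (which never materializes the full matrix) matters so much.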

Step 1: Check what's actually using your VRAM

Before changing any flags, see what you're working with. On Linux or WSL:

# Snapshot current VRAM usage and which processes are holding it
nvidia-smi

# Or watch it live while a generation runs
watch -n 0.5 nvidia-smi

On Windows, nvidia-smi.exe lives in C:\Windows\System32\ and works the same way. If your idle VRAM is already at 2 GB before you launch the WebUI, that's your first problem — kill the offenders. Browser hardware acceleration is usually the biggest one.
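If you'd rather log the idle number than eyeball it, the same information is available through nvidia-smi's CSV query mode. This is a minimal sketch: the helper names are my own, and it returns an empty list on machines without nvidia-smi on the PATH.

```python
import shutil
import subprocess

# Ask nvidia-smi for used and total memory per GPU, in MiB, as bare CSV.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' lines (MiB) into (used, total) tuples, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(part.strip()) for part in line.split(","))
        rows.append((used, total))
    return rows


def read_vram() -> list[tuple[int, int]]:
    """Return per-GPU (used, total) in MiB, or an empty list if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


if __name__ == "__main__":
    for index, (used, total) in enumerate(read_vram()):
        print(f"GPU {index}: {used} / {total} MiB in use before launch")
```

Run it once before starting the WebUI and once after closing the browser's hardware acceleration to see how much headroom you actually gained.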

Step 2: Set the right command-line arguments

This is where most of the wins are. The WebUI accepts flags via webui-user.bat (Windows) or webui-user.sh (Linux/Mac). Open it up and edit COMMANDLINE_ARGS. Here's a solid starting point for an 8 GB card:

# webui-user.sh
export COMMANDLINE_ARGS="--xformers --medvram --opt-split-attention --no-half-vae"

What each one does:

  • --xformers enables memory-efficient attention. This alone often cuts VRAM use by 30-40%. You may need to install it separately (more on that below).
  • --medvram splits the model so the U-Net, VAE, and text encoder aren't all resident at once. There's a small speed cost, maybe 10-15%, but it's the difference between generating and crashing.
  • --lowvram is more aggressive — use it on 4 GB cards. Slower, but it works.
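  • --opt-split-attention computes attention in chunks instead of all at once, trading some speed for lower peak memory. As far as I can tell the WebUI applies only one attention optimization at a time and prefers xformers when it loads, so treat this flag as a fallback for setups where xformers isn't available.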
  • --no-half-vae keeps the VAE in fp32. Counterintuitive, but it prevents black-image artifacts on some GPUs that come from fp16 VAE overflow.
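The flag choices above boil down to a small decision rule. The sketch below is a hypothetical helper, not part of the WebUI: the 4 GB and 8 GB cut-offs are my reading of the guidance in this post, and you should adjust them for your own card.

```python
def suggest_args(vram_gb: float, xformers_available: bool = True) -> str:
    """Build a COMMANDLINE_ARGS string from VRAM size (thresholds are assumptions)."""
    args = []
    # One attention optimization is enough; fall back if xformers won't install.
    args.append("--xformers" if xformers_available else "--opt-split-attention")
    if vram_gb <= 4:
        args.append("--lowvram")    # most aggressive offloading, slowest
    elif vram_gb <= 8:
        args.append("--medvram")    # keeps only one model component resident
    # Keep the VAE in fp32 to avoid black images from fp16 overflow.
    args.append("--no-half-vae")
    return " ".join(args)


if __name__ == "__main__":
    for size in (4, 6, 8, 12):
        print(f"{size} GB: {suggest_args(size)}")
```

Cards above 8 GB get no offloading flag here on the assumption that they can hold everything at once; if they still hit OOM, add --medvram back.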

For xformers, if it's not auto-installing, do it manually inside the venv:

# Activate the venv first (on Windows: venv\Scripts\activate)
source venv/bin/activate

# Install from the PyTorch wheel index that matches the CUDA build of your torch (cu121 here)
pip install xformers --index-url https://download.pytorch.org/whl/cu121

Check your installed torch version with pip show torch and grab the matching xformers build. Mismatched CUDA versions are a frequent source of "xformers installed but not used" complaints. The official xformers repo has a compatibility matrix worth bookmarking.
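A small script can surface a mismatch without importing either package. This sketch reads the installed version strings and pulls out the CUDA tag that PyTorch-index wheels usually append after a plus sign (for example 2.1.2+cu121). Treat that tag format as an assumption, since not every wheel carries one; the helper names are my own.

```python
from importlib import metadata
from typing import Optional


def cuda_tag(version: str) -> Optional[str]:
    """Extract the local build tag (e.g. 'cu121') from a version like '2.1.2+cu121'."""
    _, sep, local = version.partition("+")
    return local if sep and local.startswith("cu") else None


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("torch", "xformers"):
        version = installed_version(name)
        tag = cuda_tag(version) if version else None
        print(f"{name}: {version or 'not installed'} (CUDA tag: {tag or 'unknown'})")
```

Run it inside the WebUI's venv. If the two tags differ, or torch reports a tag and xformers is missing, that is the first thing to fix.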

Step 3: Tame PyTorch's memory allocator

This is the one nobody talks about and it's saved me more times than I can count. PyTorch's CUDA caching allocator can be tuned via an environment variable. Set this before launching:

# Linux/Mac — add to webui-user.sh
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512,garbage_collection_threshold:0.8"

# Windows — add to webui-user.bat
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,garbage_collection_threshold:0.8

The max_split_size_mb setting stops the allocator from splitting cached blocks larger than that size, which keeps big blocks intact for later big allocations and cuts down on fragmentation. The garbage_collection_threshold makes the allocator start reclaiming unused cached blocks once usage crosses 80% of the memory available to the process. I picked these numbers after a lot of trial and error on my 8 GB card. Your mileage may vary, but this combo handles the "second generation crashes" pattern beautifully.
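If you launch your own scripts rather than the WebUI, the same variable can be set from Python, as long as it happens before torch initializes CUDA (setting it before importing torch is the safest option). The builder function below is a hypothetical convenience; the option names and values are the ones used above.

```python
import os


def build_alloc_conf(options: dict) -> str:
    """Join allocator options into the comma-separated key:value form."""
    return ",".join(f"{key}:{value}" for key, value in options.items())


conf = build_alloc_conf({
    "max_split_size_mb": 512,
    "garbage_collection_threshold": 0.8,
})

# setdefault keeps any value already exported by webui-user.sh / webui-user.bat.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", conf)

# import torch  # import only after the variable is set
```

Using setdefault means a value you exported in the shell always wins, so the script and the launcher config can't silently disagree.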

If you're writing your own inference scripts on top of diffusers, you can also force a flush manually between runs:

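import gc

import torch

# Run after generation completes, before the next prompt
def cleanup_vram():
    gc.collect()                # drop unreachable Python objects that still hold tensors
    torch.cuda.empty_cache()    # hand unused cached blocks back to the driver
    torch.cuda.ipc_collect()    # release memory tied to CUDA IPC handles no longer in use

cleanup_vram()

# Optional: print what's still resident so you can debug leaks
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")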

Note that empty_cache() doesn't reduce memory_allocated — only memory_reserved. If allocated stays high, you've actually got tensors hanging around (probably a stray reference somewhere).

Step 4: Reduce the working set

If you've done all of the above and still hit OOM, the generation itself is just too big. Some things that actually help:

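  • Drop the base resolution to 512x512 or 768x768, then use Hires fix with a 1.5x or 2x upscaler. Keep in mind that the second pass still runs at the upscaled size and needs its own activations, so pick a modest upscale factor and confirm the peak with nvidia-smi rather than assuming it's lower than generating at native 1024x1024.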
  • Lower the batch size to 1. Batching is a VRAM multiplier with no quality benefit for stills.
  • Switch to a smaller model. SD 1.5 fine-tunes are 4 GB; SDXL is 6.6 GB. If you don't need SDXL's specific aesthetic, save yourself the headache.
  • Use a tiled VAE extension. It decodes the latent in chunks instead of all at once, which avoids the spike at the end of generation.
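The tiled decode idea in the last bullet is easy to picture as a list of overlapping boxes over the latent. The sketch below only computes the tile coordinates; it is not how any particular extension does it, and the 64-cell tile size and 8-cell overlap are assumed values.

```python
def tile_boxes(height: int, width: int, tile: int = 64, overlap: int = 8):
    """Return (top, left, bottom, right) boxes that together cover a height x width grid."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than tile")
    stride = tile - overlap

    def starts(size: int) -> list[int]:
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, stride))
        positions.append(size - tile)  # final tile sits flush with the edge
        return positions

    return [(top, left, min(top + tile, height), min(left + tile, width))
            for top in starts(height) for left in starts(width)]


if __name__ == "__main__":
    boxes = tile_boxes(128, 128)  # latent grid for a 1024x1024 image at 8x downsampling
    print(f"{len(boxes)} tiles, each at most 64x64 latent cells")
```

Each tile is decoded separately, so the decoder only ever sees a quarter of a 128x128 latent at once; the overlap gives the extension room to blend seams between neighbouring tiles.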

How to keep it from happening again

A few habits I've picked up:

  • Keep a known-good COMMANDLINE_ARGS in version control. I have a tiny git repo of just my WebUI configs.
  • After updating the WebUI or a major extension, do a clean run with a simple prompt before queuing up big batches. New code paths can change VRAM behavior in surprising ways.
  • Don't run a browser-based image viewer in the same session — it adds VRAM pressure you'll forget about.
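  • Watch your memory numbers between runs. If allocated memory (torch.cuda.memory_allocated) keeps creeping up, you've got a leak, usually from an extension that holds references. Growth in reserved memory alone points to caching or fragmentation instead.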
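One way to catch that creeping pattern is to record a memory reading after each run (for example torch.cuda.memory_allocated() taken after cleanup) and check whether it only ever goes up. The function below is a hypothetical heuristic, and the 0.05 GB growth threshold is an arbitrary assumption.

```python
def looks_like_leak(readings_gb: list[float], min_growth_gb: float = 0.05) -> bool:
    """True if memory rose by at least min_growth_gb after every single run."""
    if len(readings_gb) < 3:
        return False  # too few samples to call it a trend
    steps = [later - earlier
             for earlier, later in zip(readings_gb, readings_gb[1:])]
    return all(step >= min_growth_gb for step in steps)


if __name__ == "__main__":
    history = [3.1, 3.4, 3.7, 4.0]  # made-up readings after four generations
    print("possible leak" if looks_like_leak(history) else "looks stable")
```

A single dip resets the verdict, which keeps ordinary run-to-run noise from raising false alarms.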

The annoying truth is that VRAM management in local diffusion is mostly fiddly config, not raw hardware. A well-tuned 8 GB card will out-generate a poorly-tuned 12 GB one all day. Spend the hour up front getting your flags right and you'll save yourself dozens of crash-recoveries later.