
How to Detect GPU Waste in a Kubernetes Cluster
Sam Hosseini · 2026-05-26 · via DEV Community

GPU waste in Kubernetes does not announce itself. Your cluster shows healthy utilization. Your dashboards are green. Yet 20–40% of your GPU capacity can be doing nothing useful, quietly burning money in the background.

This post covers what GPU waste actually looks like in Kubernetes, which signals surface it, and how to go from suspicion to a concrete dollar figure.


Why Standard Kubernetes Monitoring Misses GPU Waste

Kubernetes was designed for CPU and memory workloads. Its built-in metrics — kubectl top, kube-state-metrics, node allocations — see resources at the pod level. They tell you a GPU is allocated. They do not tell you whether anything useful is running on it.

The most common forms of GPU waste in Kubernetes are invisible to standard tooling:

  • Idle allocation — a pod holds a GPU resource but runs no active inference or training. The GPU reports non-zero utilization from background processes, masking the waste.
  • Tier misplacement: a model that fits comfortably on an A10G is deployed on an H100, occupying hardware with 3–4x the memory capacity and bandwidth it needs. The GPU looks busy. The spend is unjustified.
  • CPU-bound stall — the GPU is waiting on CPU preprocessing, tokenization, or data loading. GPU utilization shows 70%. Actual compute throughput is a fraction of that.
  • KV cache pressure — context window growth causes KV cache evictions, degrading throughput without reducing the utilization number.
  • Orphaned workloads — experiments, notebooks, and test deployments left running. They hold GPU allocations indefinitely with no traffic.

Each of these looks fine from the Kubernetes scheduler's perspective. All of them cost real money.


The Metrics That Actually Surface Waste

Standard nvidia-smi and Kubernetes node metrics are not enough. You need GPU-level telemetry from NVIDIA DCGM.

Deploy dcgm-exporter as a DaemonSet on your GPU nodes:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter

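This exposes per-GPU metrics for Prometheus to scrape. Resolution depends on the exporter's collection interval (30 seconds by default, configurable) and on your Prometheus scrape interval, so do not assume 1-second data unless you have configured both. The metrics that matter for waste detection: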

| Metric | What it tells you |
| --- | --- |
| DCGM_FI_DEV_GPU_UTIL | Coarse SM utilization (percent of time at least one kernel was running): is the GPU doing compute work? |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization (percent): is data moving efficiently? |
| DCGM_FI_DEV_FB_USED | Framebuffer memory in use (MiB): how much VRAM is occupied? |
| DCGM_FI_DEV_POWER_USAGE | Power draw (W): a GPU drawing full power at low SM util is a clear waste signal |

Waste thresholds to alert on for inference workloads:

| Metric | Waste signal |
| --- | --- |
| SM Utilization (10-min avg) | < 20% |
| Memory bandwidth | < 30% |
| Power draw | > 80% of TDP with SM util < 20% |
| Allocated GPU with zero requests | Any duration > 15 minutes |

A GPU sitting at 5% SM utilization while drawing 400W on an H100 is a $4–8/hour waste signal. Multiply across a fleet and it becomes a budget problem.
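The thresholds in the table above can be folded into one check. The following is a minimal sketch rather than a definitive detector: the function name `waste_signals`, its parameters, and the signal labels are all hypothetical, and the inputs are assumed to be 10-minute averages already pulled from your metrics store.

```python
def waste_signals(sm_util_pct, mem_bw_util_pct, power_w, tdp_w, minutes_without_requests):
    """Return the waste signals from the threshold table that a GPU trips.

    All names here are illustrative. Values are assumed to be
    10-minute averages for an inference workload.
    """
    signals = []
    if sm_util_pct < 20:
        signals.append("low_sm_utilization")
    if mem_bw_util_pct < 30:
        signals.append("low_memory_bandwidth")
    # High power draw only counts as waste when compute is also low.
    if power_w > 0.8 * tdp_w and sm_util_pct < 20:
        signals.append("high_power_low_compute")
    if minutes_without_requests > 15:
        signals.append("allocated_without_requests")
    return signals


# The H100 from the text: 5% SM utilization at 400 W against a 700 W TDP.
# The memory bandwidth and idle-time values are made up for illustration.
print(waste_signals(5, 10, 400, 700, 30))
```

Note that 400 W is below 80% of a 700 W TDP, so this example trips the utilization and idle-time signals but not the power one. Treat the thresholds as starting points and tune them per workload.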


Detecting Idle Allocation

The clearest waste signal is a pod holding a GPU resource with no active compute. You can surface this with a simple Prometheus query:

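# Pods that request a GPU but whose GPU averaged 5% utilization or less over 10 minutes.
# kube-state-metrics sanitizes the resource name to nvidia_com_gpu.
# Assumes the DCGM series carry pod/namespace labels (honorLabels: true);
# otherwise match on exported_pod/exported_namespace instead.
(
  kube_pod_container_resource_requests{resource="nvidia_com_gpu"} > 0
) unless on(pod, namespace) (
  avg_over_time(DCGM_FI_DEV_GPU_UTIL[10m]) > 5
)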

This returns every pod that has requested a GPU but whose GPU is below 5% utilization. These are your idle allocations. In most clusters this query returns more pods than expected.
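To make the matching logic of the query above explicit, here is a small sketch of the same set difference on in-memory data. It is an illustration under assumptions, not part of any tool: `idle_gpu_pods` and the sample pod names are hypothetical, and in practice the two inputs would come from kube-state-metrics and DCGM.

```python
def idle_gpu_pods(gpu_requests, gpu_util, threshold_pct=5.0):
    """Mirror the PromQL `unless` on plain dictionaries.

    gpu_requests: {(namespace, pod): number of GPUs requested}
    gpu_util:     {(namespace, pod): GPU utilization percent}
    A pod with no utilization sample is reported as idle, which is
    also how `unless` behaves when the right-hand side has no series.
    """
    return sorted(
        key
        for key, requested in gpu_requests.items()
        if requested > 0 and gpu_util.get(key, 0.0) <= threshold_pct
    )


# Hypothetical sample data.
requests = {
    ("inference", "llm-server-0"): 1,
    ("research", "notebook-alice"): 1,
    ("research", "old-experiment"): 2,
    ("web", "frontend"): 0,
}
util = {
    ("inference", "llm-server-0"): 62.0,
    ("research", "notebook-alice"): 1.5,
}

for namespace, pod in idle_gpu_pods(requests, util):
    print(f"{namespace}/{pod} holds a GPU but is idle")
```

The missing-sample case matters: a pod whose GPU metrics are not being scraped shows up as idle, so check exporter coverage before acting on the list.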

For a quick scan without Prometheus, piqc — the open-source GPU waste scanner — runs this kind of detection against your live cluster in under a minute:

curl -sSL https://get.piqc.dev | bash
piqc scan

It identifies idle GPUs, misplaced workloads, and dark capacity across namespaces and surfaces a waste estimate in dollars per day.


Detecting Tier Misplacement

Tier misplacement is harder to catch because the GPU looks busy. The signal is not utilization — it is the relationship between what the workload needs and what it has.

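A 7B-parameter model at FP16 needs roughly 14GB of VRAM for its weights (7 billion parameters × 2 bytes), plus headroom for KV cache and activations. An A10G provides 24GB at ~250W TDP and costs roughly $1.10/hr on most clouds. An H100 provides 80GB at 700W TDP and costs roughly $3.50–$4.50/hr. Deploying the 7B model on an H100 therefore costs an extra $2.40–$3.40/hr per GPU, and that money is wasted unless the workload actually needs the H100's extra throughput.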
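The sizing arithmetic can be sketched in a few lines. This is a rough illustration, not how any particular tool computes tiers: the function names are hypothetical, the tier figures are the ones quoted in this section, the H100 price is taken as the midpoint of the quoted range, and the 1.2x headroom factor for KV cache and activations is a placeholder assumption.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Tier figures from the text. The H100 price is the midpoint of the
# quoted $3.50-$4.50/hr range (an assumption for illustration).
TIERS = [
    {"name": "A10G", "vram_gb": 24, "usd_per_hr": 1.10},
    {"name": "H100", "vram_gb": 80, "usd_per_hr": 4.00},
]


def weights_gb(params_billion, dtype="fp16"):
    """VRAM needed for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def cheapest_fitting_tier(params_billion, dtype="fp16", headroom=1.2):
    """Cheapest tier whose VRAM covers weights times a headroom factor.

    Returns None when the model fits on no single GPU in TIERS.
    """
    needed = weights_gb(params_billion, dtype) * headroom
    fitting = [t for t in TIERS if t["vram_gb"] >= needed]
    return min(fitting, key=lambda t: t["usd_per_hr"]) if fitting else None


def misplacement_delta(params_billion, deployed_tier, dtype="fp16"):
    """Hourly overspend of the deployed tier versus the cheapest fit."""
    best = cheapest_fitting_tier(params_billion, dtype)
    if best is None:
        return None
    deployed = next(t for t in TIERS if t["name"] == deployed_tier)
    return round(deployed["usd_per_hr"] - best["usd_per_hr"], 2)


print(weights_gb(7), "GB of weights for a 7B model at FP16")
print(cheapest_fitting_tier(7)["name"], "is the cheapest tier that fits")
print(misplacement_delta(7, "H100"), "USD/hr overspend on an H100")
```

A memory-only fit is deliberately simplistic. Latency targets, batch size, and context length can justify a larger tier, so treat the delta as a prompt for review rather than a verdict.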

To detect this you need to know what is running on each GPU — not just which pod holds the allocation, but which model, what its memory footprint is, and which tier it belongs on. Standard Kubernetes monitoring cannot answer this. It does not know what a model is.

This is where model-aware tooling matters. Paralleliq's Introspect maps each workload to its model, calculates the correct tier, and surfaces misplacement as a cost delta — not as an abstract utilization number.


Detecting CPU-Bound Stall

If your GPU utilization is moderate (40–70%) but throughput is lower than expected, the GPU is probably waiting on something upstream. Add CPU metrics to the same dashboard:

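# CPU usage as a fraction of CPU requests, per pod.
# Both sides are summed per pod so the division matches one-to-one.
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{namespace="inference", container!=""}[5m]))
  / sum by (namespace, pod) (kube_pod_container_resource_requests{namespace="inference", resource="cpu"})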

CPU usage above 90% of the CPU request in the same pods where GPU SM utilization is below 60% points to a CPU bottleneck: the GPU sits idle part of the time because it has nothing to process.

Common causes: tokenization happening on CPU, single-threaded data loading, synchronous preprocessing before batching. Fix: move tokenization to GPU, increase CPU allocation, or add async preprocessing.
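The rule of thumb from this section, CPU above 90% of its request while the GPU sits below 60%, can be written as a small predicate. This is a sketch under assumptions: `is_cpu_bound` and its parameter names are hypothetical, and the two thresholds are simply the ones quoted above.

```python
def is_cpu_bound(cpu_usage_cores, cpu_request_cores, gpu_util_pct,
                 cpu_saturation=0.90, gpu_ceiling_pct=60.0):
    """Flag a pod whose CPU is saturated while its GPU has headroom."""
    if cpu_request_cores <= 0:
        return False  # no CPU request set, so saturation is undefined
    saturation = cpu_usage_cores / cpu_request_cores
    return saturation > cpu_saturation and gpu_util_pct < gpu_ceiling_pct


# A pod using 3.8 of its 4 requested cores while the GPU sits at 55%.
print(is_cpu_bound(3.8, 4.0, 55.0))
```

Pods without a CPU request are skipped rather than flagged, because the ratio has no meaning for them. They are worth listing separately.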


Putting a Dollar Figure on It

Waste without a dollar figure stays invisible in engineering conversations. With one, it becomes a budget line item.

Basic formula:

waste_cost_per_day = idle_gpus × gpu_cost_per_hour × 24
                   + misplaced_gpus × cost_delta_per_hour × 24

For a cluster with:

  • 20 idle GPUs on A10G at $1.10/hr: $528/day
  • 10 H100s running models that belong on A10G (delta $2.50/hr): $600/day

Total: $1,128/day, or roughly $412k/year

Teams running 100+ GPUs often find waste on this scale on their first scan.
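The formula and the example cluster above translate directly into code. This is a minimal sketch: the function name is taken from the formula, and the inputs are the example figures from this section.

```python
def waste_cost_per_day(idle_gpus, gpu_cost_per_hour, misplaced_gpus, cost_delta_per_hour):
    """Daily waste in dollars: idle GPUs plus the overspend on misplaced ones."""
    idle_cost = idle_gpus * gpu_cost_per_hour * 24
    misplaced_cost = misplaced_gpus * cost_delta_per_hour * 24
    return idle_cost + misplaced_cost


# 20 idle A10Gs at $1.10/hr, and 10 H100s that belong on A10G (delta $2.50/hr).
daily = waste_cost_per_day(20, 1.10, 10, 2.50)
print(f"${daily:,.0f}/day, ${daily * 365:,.0f}/year")
```

Keeping the two terms separate makes it easy to report idle and misplaced spend as distinct line items.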


The Limit of Metric-by-Metric Detection

The approach above works. It surfaces waste. But it has a ceiling: you are looking at infrastructure signals without knowing what the infrastructure is running.

A GPU at 25% SM utilization might be:

  • An idle development deployment (waste)
  • A low-traffic production endpoint that is correctly sized (not waste)
  • A model waiting on a healthy request queue (not waste)

Distinguishing these requires workload context — which model is running, what traffic pattern it serves, what its expected utilization range is. Infrastructure metrics alone cannot answer this.
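As a rough illustration of what workload context adds, the sketch below labels the same 25% utilization reading differently depending on who owns the GPU. Everything here is hypothetical: the function, the environment labels, and the expected utilization ranges stand in for metadata you would need to collect yourself.

```python
def classify_low_utilization(environment, requests_per_min, gpu_util_pct, expected_util_range):
    """Label a GPU workload as "waste", "ok", or "review".

    environment:         hypothetical label such as "dev" or "prod"
    requests_per_min:    recent traffic served by the workload
    expected_util_range: (low, high) percent its owner considers normal
    """
    low, high = expected_util_range
    if requests_per_min == 0 and environment != "prod":
        return "waste"   # idle non-production deployment
    if low <= gpu_util_pct <= high:
        return "ok"      # behaving as its owner expects
    return "review"      # outside the expected range, needs a human look


# The same 25% reading, three different answers.
print(classify_low_utilization("dev", 0, 25, (40, 80)))
print(classify_low_utilization("prod", 12, 25, (15, 40)))
print(classify_low_utilization("prod", 12, 25, (50, 80)))
```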

That is the difference between GPU monitoring and GPU fleet optimization. Monitoring tells you something is wrong. Optimization tells you what, why, and what to do about it — at the model level, not just the resource level.


Quick Start

To scan your cluster for GPU waste right now:

curl -sSL https://get.piqc.dev | bash
piqc scan

For a fleet-level view with model-aware waste detection, tier misplacement analysis, and human-in-the-loop remediation, explore Paralleliq Introspect or book a free scan.