Why a deleted backup Lambda kept billing 9,400 EBS snapshots
Muhammad Hassaan Javed · 2026-05-31 · via DEV Community

The EBS Snapshot line on the monthly bill was $1,830. There was no active EBS snapshot policy on the account. The backup Lambda that had produced these snapshots had been deleted thirteen months earlier, replaced by AWS Backup, and forgotten. Nobody had deleted what it created. Two volumes' worth of daily snapshots times 400 days came to 9,408 orphans sitting on 14 TB of storage, billed at the EBS Snapshot rate every month since.

Problem signals:

  • EBS Snapshot line is several hundred dollars a month and no active EBS snapshot pipeline is running on the account
  • describe-snapshots --owner-ids self returns thousands of entries when you expect dozens
  • Sampling a few snapshot IDs shows VolumeId values (the source volume) that no longer resolve in describe-volumes
  • A backup Lambda or custom snapshot script was deprecated in the last 12 to 24 months
  • AWS Backup is the active tool and its dashboard shows normal counts, but the cost line tells a different story

$1,830 a month on a backup product the account no longer used

The line item that should have been zero

The EBS Snapshot line had been climbing slowly for thirteen months. Nobody had flagged it. The quarterly cost review surfaced it because the line item ranked sixth on the account, and the team's mental model said it should have ranked nowhere. There was no EBS snapshot policy running. AWS Backup had taken over RDS and EBS backups a year earlier, with the old Lambda plus EventBridge pipeline retired the same week.

The first instinct in the room was to pull AWS Backup's plan and see if a retention window had widened. The plan was clean. Snapshot counts there were in the low dozens, exactly what the new policy specified. So the snapshots driving the bill were coming from somewhere else.

$ aws ec2 describe-snapshots --owner-ids self \
    --query 'length(Snapshots)' --output text
9408

The number that turned a routine cost review into an incident.

The room got quiet when that number came back. AWS Backup writes maybe forty snapshots a month on this account. Nine thousand was a different category of problem.

AWS Backup was clean, so who made these 9,408 snapshots

Ruling out the obvious suspect

With AWS Backup ruled out and no other named pipeline running, the question became: who created these 9,408 snapshots, and is anything still creating more? We pulled the StartTime field on the most recent hundred. The newest one was thirteen months old. Whatever pipeline made them had stopped, which meant we were looking at a stable population, not a leak that was still growing. That mattered because it meant the cleanup had a known size.
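The recency check can be sketched as below, assuming the snapshot list has already been dumped to a tab-separated file with SnapshotId, VolumeId and StartTime columns. The function name `newest_snapshots` is illustrative and not part of any AWS tooling, and the sort relies on every StartTime being an ISO 8601 timestamp in the same UTC format.

```shell
# newest_snapshots FILE [N]
# Prints the N most recent rows (default 100) of a tab-separated dump with
# columns SnapshotId, VolumeId, StartTime.
# Assumption: every StartTime is ISO 8601 in the same UTC format, so the
# timestamps sort correctly as plain strings and need no date parsing.
newest_snapshots() {
  sort -t "$(printf '\t')" -k3,3r "$1" | head -n "${2:-100}"
}
```

Calling `newest_snapshots all-snapshots.tsv 1` prints the single newest row. If its StartTime is many months old, nothing is still writing snapshots.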

The next question was whether the source volumes were still around. We sampled twenty random snapshots and ran describe-volumes against each one's VolumeId, the field that records the source volume. All twenty came back InvalidVolume.NotFound. The pattern was clear: the snapshots were referencing two specific volume IDs (the daily Lambda backed up two production EBS volumes), both of which had been deleted along with the EC2 instances they served when the application moved to a managed service.

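# Dump every snapshot the account owns: ID, source volume, creation time.
aws ec2 describe-snapshots --owner-ids self \
    --query 'Snapshots[*].[SnapshotId,VolumeId,StartTime]' \
    --output text > all-snapshots.tsv

# Check each distinct source volume once. Only InvalidVolume.NotFound counts
# as an orphan; throttling or auth errors are reported instead of miscounted.
cut -f2 all-snapshots.tsv | sort -u \
  | while read -r vid; do
      err=$(aws ec2 describe-volumes --volume-ids "$vid" 2>&1 >/dev/null) \
        && continue
      case "$err" in
        *InvalidVolume.NotFound*) echo "$vid orphan" ;;
        *) echo "$vid check failed: $err" >&2 ;;
      esac
    done > orphan-source-volumes.txt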

Group snapshots by their source volume, then check which source volumes still exist.

Only two volume IDs appeared in the orphan list. Two volumes, 400 days of daily snapshots each, give or take retries, gave 9,408. The arithmetic lined up. The Lambda that had taken them was gone, but AWS does not garbage-collect snapshots when their creator disappears. Snapshots are first-class objects with their own lifecycle, and that lifecycle is whatever you set when you create them. The Lambda set nothing.
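Turning the orphan volume list into a list of snapshot IDs to review is a small join. This is a minimal sketch, assuming the two files have the shapes described in the comments. The function name `orphan_snapshot_ids` is hypothetical.

```shell
# orphan_snapshot_ids ORPHAN_VOLUMES SNAPSHOT_TSV
#   ORPHAN_VOLUMES: lines like "vol-0abc orphan" (first field is the volume ID).
#   SNAPSHOT_TSV:   tab-separated SnapshotId, VolumeId, StartTime.
# Prints the SnapshotId of every snapshot whose source volume is an orphan.
# Assumption: no field contains spaces, so awk's default whitespace splitting
# handles both the space-separated and the tab-separated file.
orphan_snapshot_ids() {
  awk '
    NR == FNR { orphan[$1] = 1; next }   # first file: remember orphan volumes
    ($2 in orphan) { print $1 }          # second file: emit matching snapshots
  ' "$1" "$2"
}
```

Usage would look like `orphan_snapshot_ids orphan-source-volumes.txt all-snapshots.tsv > orphan-snapshot-ids.txt`. Running `wc -l` on the result gives the size of the cleanup before anything is deleted.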

Why we sampled twenty before touching the other 9,388

What we did before running delete-snapshot in a loop

The temptation at this point is to write a one-line loop and delete everything. The cost was real: $1,830 a month for storage of data that referenced infrastructure that no longer existed. But delete-snapshot is irreversible, and we held off on the loop for two reasons.

First, orphan is sometimes a transient state. A volume gets deleted on Tuesday during a planned migration. On Wednesday the orphan-finder runs. A snapshot taken two hours before the volume's deletion looks orphaned but is actually the most recent backup of a service that was just migrated. Deleting it would destroy the only remaining copy of that data. We checked the StartTime on every snapshot in our sample against the deletion date of its source volume. Every one was older than the deletion by at least nine months. The cohort was uniformly historical. No active workflow could be depending on any of them.
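The transient-orphan check can be expressed as a filter. This sketch assumes you know roughly when the source volume was deleted (from a change ticket or a CloudTrail DeleteVolume event, for example) and pick a cutoff some safety margin before that. The function name and the cutoff are illustrative, and the comparison assumes all timestamps share one ISO 8601 UTC format.

```shell
# snapshots_needing_review SNAPSHOT_TSV CUTOFF
#   SNAPSHOT_TSV: tab-separated SnapshotId, VolumeId, StartTime.
#   CUTOFF:       ISO 8601 timestamp, e.g. the volume's deletion time minus
#                 a safety margin chosen by the operator (an assumption here).
# Prints every row whose StartTime is at or after CUTOFF. Those snapshots may
# be the last backup taken before a migration and should not be bulk-deleted.
snapshots_needing_review() {
  # Appending "" forces awk to compare the values as strings.
  awk -F '\t' -v cutoff="$2" '($3 "") >= (cutoff "")' "$1"
}
```

Empty output means the whole cohort predates the cutoff, which is the "uniformly historical" condition described above.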

Second, we needed to be sure these snapshots were not being referenced as the base for any AMI or any live AWS Backup recovery point. We ran describe-images with a block-device-mapping.snapshot-id filter on the sample, expecting nothing, and got nothing. We checked the AWS Backup recovery point inventory. None of the orphan snapshot IDs appeared there. The deletion was safe.
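To cover the full delete list and not just a sample, one option is to dump every snapshot ID referenced by the account's AMIs once and intersect it with the candidates. This takes a different route from the per-snapshot filter described above. The function name and file names are assumptions.

```shell
# Build the in-use list once (snapshot IDs backing AMIs owned by the account):
#   aws ec2 describe-images --owners self \
#     --query 'Images[].BlockDeviceMappings[].Ebs.SnapshotId' \
#     --output text | tr '\t' '\n' > ami-snapshot-ids.txt
#
# still_referenced CANDIDATES IN_USE
#   CANDIDATES: one snapshot ID per line (the proposed delete list).
#   IN_USE:     one snapshot ID per line (IDs something still depends on).
# Prints every candidate that is still referenced. Empty output means no overlap.
still_referenced() {
  awk '
    NR == FNR { used[$1] = 1; next }   # first file: IDs still in use
    ($1 in used)                       # second file: print overlapping IDs
  ' "$2" "$1"
}
```

Recovery point IDs exported from AWS Backup could be appended to the same in-use file so that a single pass covers both checks.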

The actual delete loop took three calendar days. delete-snapshot is rate-limited at roughly 5 requests per second per account with bursts. At 9,400 deletes with retries on the occasional 503, the math (9,400 / 5 = 1,880 seconds) runs to about 30 wall-clock minutes of perfect throughput. We never get perfect throughput. We wrote the loop with a 250ms sleep and an append-only deleted.log that serves as the checkpoint file, so we could resume after any interruption without retrying deletes that had already succeeded.

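touch deleted.log   # so the first grep has a file to read
while read -r sid; do
  # Skip IDs that a previous run already deleted.
  if grep -qxF "$sid" deleted.log; then continue; fi
  if aws ec2 delete-snapshot --snapshot-id "$sid"; then
    echo "$sid" >> deleted.log
  else
    echo "$sid" >> failed.log   # retry these in a later pass
  fi
  sleep 0.25   # stay under the API rate limit
done < orphan-snapshot-ids.txt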

Resumable, rate-limited delete loop. The checkpoint file is the load-bearing part.
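For the IDs that land in failed.log, a retry wrapper with exponential backoff is one way to handle throttling without hammering the API. This is a sketch and not the loop the team ran. The function name `retry_with_backoff` and its parameters are hypothetical.

```shell
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay (whole seconds) after each failure. Exits with the last status.
# The body is a subshell, so its variables do not leak into the caller.
retry_with_backoff() (
  max=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      exit 0
    else
      status=$?          # status of the failed command
    fi
    if [ "$attempt" -ge "$max" ]; then
      exit "$status"
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
)
```

A second pass over the failures could then call `retry_with_backoff 5 1 aws ec2 delete-snapshot --snapshot-id "$sid"` for each ID in failed.log, appending to deleted.log on success as before.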

After three days the EBS Snapshot line on the next monthly forecast dropped to under $20. The fourteen terabytes of orphan storage was gone.

Tag at creation, schedule the cleanup, watch the lines that should be zero

The rule that meant the next deprecated pipeline could not do this

The deletion fixed the symptom. The interesting part of this engagement was the cause. AWS does not couple a snapshot's lifecycle to the lifecycle of whatever process created it. A Lambda gets deleted, an EventBridge rule gets removed, the IAM role goes with them, and the snapshots they made keep existing and keep being billed, forever, until something explicitly deletes them. There is no warning email. There is no dashboard widget. The only signal is the monthly bill, and the bill takes a year to be loud enough to investigate.

Two changes went in after the cleanup. The first was a tag-at-creation rule. Every snapshot the account creates now carries three tags applied at creation time: Owner (a team or service name), Retention (an ISO date past which the snapshot is safe to delete), and CreatedBy (the pipeline that made it). AWS Backup applies these automatically through its backup plan. The handful of custom Lambdas that survived the migration were rewritten to apply them. A weekly cleanup Lambda walks the account, deletes anything past its Retention date, and flags anything older than 90 days with no Retention tag. For the first 60 days the Lambda posted a Slack message and waited for a thumbs-up before deleting. After that it ran automatically.
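The decision the weekly cleanup makes can be sketched as a pure classification step, separate from any AWS calls. This is not the team's Lambda. The function name, the input layout, and the tag values in the comment are assumptions for illustration.

```shell
# Tagging at creation might look like this (values are placeholders):
#   aws ec2 create-snapshot --volume-id "$vol" --tag-specifications \
#     'ResourceType=snapshot,Tags=[{Key=Owner,Value=payments},{Key=Retention,Value=2026-08-31},{Key=CreatedBy,Value=nightly-backup}]'
#
# classify_snapshots TSV TODAY UNTAGGED_CUTOFF
#   TSV columns: SnapshotId, StartTime, Retention ("None" if the tag is absent).
#   TODAY:           ISO date, e.g. 2026-05-31.
#   UNTAGGED_CUTOFF: ISO date 90 days before TODAY, computed by the caller.
# Prints "<action><TAB><SnapshotId>" where action is delete, flag, or keep.
classify_snapshots() {
  awk -F '\t' -v today="$2" -v cutoff="$3" '
    {
      created = substr($2, 1, 10)                 # date part of StartTime
      if ($3 == "None" || $3 == "") {
        # No Retention tag: flag once the snapshot is older than the cutoff.
        action = (created < (cutoff "")) ? "flag" : "keep"
      } else {
        # Tagged: delete once the Retention date has passed.
        action = (($3 "") < (today "")) ? "delete" : "keep"
      }
      print action "\t" $1
    }' "$1"
}
```

Keeping the classification separate from the delete call makes the approval phase straightforward: the same output can feed a Slack message for the first 60 days and the delete loop afterwards.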

The second change was to the quarterly cost review process. It now starts with the line items that should be zero or near zero, not the ones that are already big. The big lines get watched constantly by capacity planners. The lines that should be zero are where deleted infrastructure leaves footprints, and they are the ones least likely to be on anybody's dashboard. EBS Snapshot on a no-EBS-snapshot account. Lambda invocations on a service that was migrated to ECS six months ago. NAT Gateway hours on a workload that should not need cross-AZ egress. These are the lines where deprecated pipelines keep paying rent.
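The "start with the lines that should be zero" review can be partly mechanised. The sketch below assumes a two-column tab-separated cost export (line item name, monthly amount) and a hand-maintained list of line items the team expects to be zero. Both file layouts and the function name are hypothetical.

```shell
# unexpected_spend COSTS_TSV EXPECTED_ZERO THRESHOLD
#   COSTS_TSV:     tab-separated "line item<TAB>monthly amount".
#   EXPECTED_ZERO: one line item name per line that should cost nothing.
#   THRESHOLD:     amount above which an expected-zero line is reported.
# Prints the rows of COSTS_TSV that should be near zero but are not.
unexpected_spend() {
  awk -F '\t' -v limit="$3" '
    NR == FNR { zero[$1] = 1; next }               # first file: expected-zero names
    ($1 in zero) && ($2 + 0 > limit + 0) { print } # second file: offending rows
  ' "$2" "$1"
}
```

With "EBS Snapshot" in the expected-zero list, a $1,830 line would surface in the first month it crossed the threshold, not after thirteen.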

The lifecycle every snapshot now goes through. Untagged snapshots cannot live past 90 days without an explicit decision.

Cost archaeology on accounts where a deprecated pipeline is still paying rent

When the bill is the only thing telling you what you forgot

The shape of this incident is common. A pipeline gets shipped, the engineer who wrote it leaves, the policy gets replaced but the outputs survive, and the bill slowly bends upward. EBS snapshots are the most common shape we see. Detached EIPs are close behind. Idle NAT gateways and orphaned ElastiCache clusters round out the top four. None of these line items alarm on a CloudWatch dashboard because nothing is actively misbehaving. The deprecated pipeline is the misbehavior, and the pipeline no longer exists.

We run these cost-archaeology engagements regularly. In the last quarter we walked through three accounts where a single deprecated backup pipeline accounted for more than half of the account's EBS Snapshot line. We have an inventory script that finds orphan snapshots, detached volumes, unused EIPs, and idle NAT gateways across an account in about 20 minutes, plus a sample-then-delete workflow we walk the team through live so nothing irreversible happens on autopilot. The deletion is always the easy half. The work is figuring out which orphans are safe and writing the tag-at-creation policy that stops the next one.

If your bill has a line that does not match anything that should be running, the orphan audit is usually the fastest way to find out where it is going. Request an infrastructure review and we will run the audit with your team on a 30-minute diagnostic call this week. You can also see the broader pattern in our services overview for cloud cost spike work.


Originally published at https://infraforge.agency/insights/orphan-ebs-snapshots-deleted-backup-pipeline-cost-spike/.

If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements; see /review.