The 2026 FinOps Roadmap: From Cost-Blind Engineer to Cloud Financial Manager
Ayobami Adejumo · 2026-06-16 · via freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More

My first AWS bill was $23,000. I had been working at the company for three weeks.

Nobody told me. The bill just grew quietly in the background while I was proud of the feature I shipped. A Lambda function that called an external enrichment API on every user event. Clean code. Solid tests. Thirty-two million events that month. At $0.0007 per API call.

My engineering manager forwarded the invoice with two words: "Please explain."

That was the moment I discovered FinOps — not from a conference talk or a certification course, but from the specific shame of having written expensive code and not knowing it until the damage was done.

This roadmap is what I needed that day. A complete, honest guide to transforming from an engineer who builds things that work into an engineer who builds things that work and cost what they should. By the end of this guide, you'll have the skills, the scripts, and the vocabulary to talk about cloud spend the way a CFO and a CTO both want to hear.

Table of Contents

  • What You'll Learn

  • Prerequisites

  • The Four Stages Overview

  • Stage 1: The Cost-Aware Engineer — Months 1 to 3

  • Stage 2: The Optimisation Specialist — Months 4 to 8

  • Stage 3: The Automation Architect — Months 9 to 15

  • Stage 4: The Cloud Financial Manager — Months 16 to 24

  • Essential Tools and Certifications

  • Your 90-Day Action Plan

  • Best Practices Summary

  • Resources

What You'll Learn

  • How to read your AWS bill as an engineer, not as a passive observer

  • The exact tagging strategy that makes cost attribution possible

  • How to right-size EC2 and RDS instances using CloudWatch data you already have

  • The correct sequence for purchasing Savings Plans — and why sequence matters more than the discount percentage

  • How to build automated cleanup systems for orphaned resources

  • How to present cloud cost findings to engineering leadership with data that drives decisions

  • The chargeback and showback models that make cost accountability stick

Let's begin.

Prerequisites

Before following this roadmap, you should have some skills and tools ready to go.

Knowledge:

  • You can deploy an application to AWS (EC2, Lambda, or containers)

  • You understand basic AWS services: S3, RDS, EC2, VPC, IAM

  • You're comfortable reading Python and writing simple bash scripts

  • You know what a pull request is and have gone through at least one code review

Access:

  • Read-only access to your AWS billing console and Cost Explorer

  • AWS CLI v2 configured with at least ReadOnlyAccess policy attached

  • Python 3.9 or later for running the audit scripts in this guide

Mindset: You don't need to be a finance expert. But you do need to be willing to look at numbers that might be uncomfortable. Every engineer I've worked with who became excellent at FinOps had one thing in common: they were willing to be the person who asked "but what does this cost?" in a room where nobody else wanted to.

Estimated time: This roadmap covers 24 months of deliberate skill-building. You can absorb the reading in a few evenings. The practice is the 24 months.

The Four Stages Overview

Before going deep, here's the complete picture of where you're going:

Stage 1 — Cost-Aware Engineer (Months 1–3)
├── Read your cloud bill and understand it
├── Tag every resource with meaningful metadata
├── Identify your top 5 cost drivers
└── Block your first expensive PR with cost justification

Stage 2 — Optimisation Specialist (Months 4–8)
├── Right-size every over-provisioned resource
├── Implement storage lifecycle policies
├── Move non-production to Spot instances
└── Purchase your first Savings Plan in the right order

Stage 3 — Automation Architect (Months 9–15)
├── Build automated cleanup for orphaned resources
├── Add cost estimation to your CI/CD pipeline
├── Create cost-aware auto-scaling triggers
└── Deploy a self-service FinOps dashboard

Stage 4 — Cloud Financial Manager (Months 16–24)
├── Lead monthly FinOps reviews with engineering leadership
├── Build chargeback models for departments
├── Negotiate enterprise agreements with AWS
└── Forecast cloud spend within 5% variance

The reason this is a 24-month journey and not a weekend project: each stage builds on the previous one. Engineers who jump straight to Savings Plans without rightsizing first end up paying discounted prices for waste. Engineers who build dashboards before tagging get beautiful charts with no actionable data. The sequence isn't arbitrary.

Stage 1: The Cost-Aware Engineer — Months 1 to 3

1.1 Reading the Bill Like an Engineer, Not an Accountant

The default AWS Cost Explorer view shows you service-level totals. That's accounting. What you need is engineering-level decomposition: which specific resources cost money, what business function they serve, and whether each dollar is justified.

Start by pulling a proper breakdown:

# Pull last month's cost breakdown grouped by service
# Run this before touching any optimisation — this is your baseline
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --group-by Type=DIMENSION,Key=SERVICE \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Groups[*].[Metrics.UnblendedCost.Amount,Keys[0]]' \
  --output text | sort -rn

Save the output. Name the file aws-baseline-YYYY-MM.txt. You'll compare every future month against this number. Without a baseline, you can't measure progress — and without measurable progress, you can't make the case to leadership that the work is worth engineering time.

Three questions for every service in your top 5:

Most engineers stop at "what is this service?" and never reach the useful question. Here's the framework I use when I first audit an account:

The first question is whether you know what specific business function this service is performing. Not the product name, the function. "S3" isn't an answer. "Storing unprocessed video uploads that sit for 90 days before anyone watches them" is an answer.

The second question is whether the cost is growing, stable, or shrinking when you look at the past three months. A stable $12,000/month is a different problem from a $12,000/month line that was $4,000 six months ago.

The third question is what percentage of your total bill this service represents. Optimising a 1% line item while a 40% line item runs unchecked is a common time-wasting trap.
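
The second and third questions reduce to simple arithmetic once you have monthly totals per service. Here is a minimal sketch, assuming you have already exported three months of costs into a plain dictionary. The function name, the input shape, the sample figures, and the 10% band used to call a line "stable" are all illustrative assumptions, not part of any AWS API:

```python
def summarise_services(monthly_costs, stable_band=0.10):
    """
    monthly_costs: {service_name: [oldest_month, ..., latest_month]} in USD.
    Returns one row per service with its trend and its share of the latest bill.
    """
    latest_total = sum(costs[-1] for costs in monthly_costs.values())
    rows = []
    for service, costs in monthly_costs.items():
        first, last = costs[0], costs[-1]
        if first > 0:
            change = (last - first) / first
        else:
            # A line that did not exist at the start counts as growing if it costs anything now
            change = float('inf') if last > 0 else 0.0

        if change > stable_band:
            trend = 'growing'
        elif change < -stable_band:
            trend = 'shrinking'
        else:
            trend = 'stable'

        share = 100 * last / latest_total if latest_total else 0.0
        rows.append({
            'service': service,
            'latest_cost': last,
            'trend': trend,
            'share_pct': round(share, 1),
        })
    # Biggest line items first, so the 40% line is reviewed before the 1% line
    return sorted(rows, key=lambda r: r['latest_cost'], reverse=True)


# Hypothetical figures for illustration only
example = {
    'EC2':    [11800, 12100, 12000],
    'S3':     [4000, 8000, 12000],
    'Lambda': [900, 700, 500],
}
for row in summarise_services(example):
    print(row)
```

The first question, what business function the service performs, can't be computed. That one still needs a human answer.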

1.2 The Tagging Strategy That Actually Survives

Here's the honest truth about tagging: most tagging strategies die within six months because they're designed for reporting rather than for engineers. Engineers don't tag things well when they're moving fast. The solution isn't to demand more discipline. Instead, it's to make tagging enforced at the infrastructure layer.

Here's the minimal viable tag set (the six tags that cover 90% of attribution needs):

# These six tags enable cost attribution, accountability, and automated remediation
# Add these to every resource in your AWS account — EC2, RDS, S3, Lambda, everything

Environment: "production" | "staging" | "dev"
Team: "platform" | "backend" | "data" | "ml"
Service: "payment-api" | "fraud-detection" | "user-service"
Owner: "ayo@cloudfrugal.com"     # Person responsible for this resource
CostCenter: "engineering"         # For chargeback reporting
AutoShutdown: "true" | "false"    # Enables automated remediation

Enforce tags at the Terraform level so they can't be skipped:

# variables.tf
# Add this to your Terraform root module
# Any plan that creates a resource without these tags will fail validation

variable "required_tags" {
  description = "Tags required on every resource in this account"
  type = map(string)
  
  validation {
    condition = (
      contains(keys(var.required_tags), "Environment") &&
      contains(keys(var.required_tags), "Team") &&
      contains(keys(var.required_tags), "Owner")
    )
    error_message = "required_tags must include Environment, Team, and Owner."
  }
}

# Apply in every resource
resource "aws_instance" "app_server" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"

  tags = merge(var.required_tags, {
    Name    = "app-server-${var.environment}"
    Service = "payment-api"
  })
}

Find everything that's currently untagged:

# List EC2 instances missing the Team tag
# Run this weekly until you hit zero results
aws ec2 describe-instances \
  --query "Reservations[].Instances[?!not_null(Tags[?Key=='Team'].Value | [0])].[InstanceId, InstanceType, State.Name]" \
  --output table

Once you start finding untagged resources, you'll discover a pattern: the oldest resources in the account are the least tagged, and they're often the most expensive. An EC2 instance from 2021 that predates your tagging policy is exactly the kind of thing that generates a $3,000/month line item nobody can explain.
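
The same check can be applied to any resource type once you have its tags in hand. Here is a minimal sketch of a helper that reports which of the six tags are missing, assuming the list-of-dicts tag shape that boto3 returns for EC2 resources. The helper name is hypothetical, and treating an empty tag value as missing is my assumption:

```python
REQUIRED_TAGS = ('Environment', 'Team', 'Service', 'Owner', 'CostCenter', 'AutoShutdown')


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """
    resource_tags: [{'Key': 'Team', 'Value': 'platform'}, ...] or None.
    Returns the required tag names that are absent or have an empty value.
    """
    present = {
        tag['Key']
        for tag in (resource_tags or [])
        if str(tag.get('Value', '')).strip()
    }
    return [name for name in required if name not in present]


# Hypothetical instance with only two usable tags
tags = [
    {'Key': 'Team', 'Value': 'platform'},
    {'Key': 'Environment', 'Value': 'production'},
    {'Key': 'Owner', 'Value': ''},
]
print(missing_tags(tags))
```

Feeding each instance's `Tags` list through a helper like this gives you a per-resource gap list rather than a single yes/no on the Team tag.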

1.3 The Cost-Aware Code Review

The most underused FinOps practice in engineering teams is reviewing code changes for cost implications before they merge. It takes thirty seconds per PR once you build the habit, and it prevents the kind of problem that opened this guide: the expensive feature that nobody priced before shipping.

Add this section to your PR template:

## Cost Impact (required for infrastructure and data changes)

- [ ] This change does not affect cloud resource usage
- [ ] New API calls introduced: estimated cost per call $______, calls/month ______
- [ ] New data storage: estimated monthly delta $______
- [ ] Cross-region data transfer introduced: yes / no
- [ ] New external service dependency with per-call pricing: yes / no

If any box other than the first is checked, add a cost estimate before requesting review.

The discipline is in making cost estimation a first-class review concern, not an afterthought that gets caught by the finance team on the 15th of the month.
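
The estimate the template asks for is usually one multiplication. As a rough sketch (the function name is hypothetical), here is the arithmetic applied to the enrichment API from the opening story:

```python
def estimate_monthly_api_cost(calls_per_month, cost_per_call_usd):
    """Return the estimated monthly and annual cost of a per-call priced dependency."""
    monthly = round(calls_per_month * cost_per_call_usd, 2)
    return {'monthly_usd': monthly, 'annual_usd': round(monthly * 12, 2)}


# 32 million events in a month at $0.0007 per API call
estimate = estimate_monthly_api_cost(32_000_000, 0.0007)
print(f"${estimate['monthly_usd']:,.2f}/month, ${estimate['annual_usd']:,.2f}/year")
```

That works out to about $22,400 a month, which accounts for nearly all of the $23,000 bill. Thirty seconds of multiplication in the PR would have surfaced it before the merge.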

Stage 1 Outcomes

By the end of month 3, you should have a baseline cost breakdown on file, 100% tag coverage on active resources, identified your top 5 cost drivers with specific reduction targets, and blocked at least one expensive PR with a cost justification that held up in review.

Stage 2: The Optimisation Specialist — Months 4 to 8

2.1 Right-Sizing: The 80/20 of Cloud Savings

The single most reliable source of cloud waste I find in every account I audit is over-provisioned compute.

The pattern is consistent: an engineer provisions an instance at a size that handles their anticipated peak load, the peak never quite materialises at the expected scale, and nobody revisits the instance size because there's no automatic signal that says "this machine is 75% empty."

Make sure you verify actual utilisation before changing anything:

# rightsize_analyzer.py
# Finds EC2 instances running below 20% average CPU for 14 days
# These are right-sizing candidates — not automatic deletions

import boto3
from datetime import datetime, timedelta

def find_oversized_instances(region='us-east-1'):
    """
    Returns instances with average CPU below 20% for the last 14 days.
    Low CPU alone doesn't mean right-size — check memory too if CW agent installed.
    """
    ec2 = boto3.client('ec2', region_name=region)
    cw  = boto3.client('cloudwatch', region_name=region)

    reservations = ec2.describe_instances(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )['Reservations']

    candidates = []

    for r in reservations:
        for inst in r['Instances']:
            iid  = inst['InstanceId']
            itype = inst['InstanceType']
            tags = {t['Key']: t['Value'] for t in inst.get('Tags', [])}

            # Pull 14-day average CPU from CloudWatch
            stats = cw.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': iid}],
                StartTime=datetime.utcnow() - timedelta(days=14),
                EndTime=datetime.utcnow(),
                Period=1209600,   # One 14-day period
                Statistics=['Average']
            )['Datapoints']

            # No datapoints means "no data", not 0% CPU, so skip rather than flag
            if not stats:
                continue
            avg_cpu = stats[0]['Average']

            if avg_cpu < 20.0:
                candidates.append({
                    'instance_id':  iid,
                    'instance_type': itype,
                    'avg_cpu_pct':  round(avg_cpu, 1),
                    'environment':  tags.get('Environment', 'unknown'),
                    'owner':        tags.get('Owner', 'unknown'),
                    'team':         tags.get('Team', 'unknown'),
                })

    return sorted(candidates, key=lambda x: x['avg_cpu_pct'])

if __name__ == '__main__':
    results = find_oversized_instances()
    print(f"\nFound {len(results)} right-sizing candidates:\n")
    for r in results:
        print(f"  {r['instance_id']} ({r['instance_type']}) — "
              f"{r['avg_cpu_pct']}% avg CPU — "
              f"owner: {r['owner']}")

A word of caution: CPU utilisation below 20% is a signal, not a verdict. Some workloads are memory-intensive or I/O-bound and will show low CPU while being correctly sized. Before acting on any right-sizing recommendation, check memory utilisation (requires the CloudWatch agent) and network I/O patterns alongside CPU.
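
That caution can be written down as a small decision rule. This is only a sketch: the 20% CPU threshold comes from the script above, but the 40% memory threshold and the verdict labels are assumptions of mine, and network I/O is left out entirely:

```python
def rightsizing_verdict(avg_cpu_pct, avg_mem_pct=None,
                        cpu_threshold=20.0, mem_threshold=40.0):
    """
    Combine CPU and (optional) memory utilisation into a right-sizing verdict.
    avg_mem_pct is None when the CloudWatch agent is not installed.
    """
    if avg_cpu_pct >= cpu_threshold:
        return 'keep'                   # CPU alone justifies the current size
    if avg_mem_pct is None:
        return 'needs-memory-data'      # Low CPU, but memory is unknown: do not act yet
    if avg_mem_pct >= mem_threshold:
        return 'keep-memory-bound'      # Low CPU, but memory is doing the work
    return 'downsize-candidate'         # Both low: worth a human review


print(rightsizing_verdict(8.5))          # no memory metrics available
print(rightsizing_verdict(8.5, 72.0))    # memory-heavy workload
print(rightsizing_verdict(8.5, 15.0))    # idle on both axes
```

The point of the middle two verdicts is to keep low-CPU instances out of the downsize queue until there is evidence beyond CPU.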

2.2 Storage Tiering: Stop Paying Retail for Cold Data

S3 Standard costs $0.023 per GB per month. S3 Glacier Deep Archive costs $0.00099 per GB per month. The difference is a factor of 23. If you have data that you last accessed six months ago and you're keeping it in S3 Standard because nobody set up lifecycle policies, you're paying 23x more than necessary.
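
To put that ratio in dollar terms, here is a minimal sketch using the two per-GB prices quoted above. The 10,000 GB figure is a made-up example, and the calculation deliberately ignores retrieval fees, request charges, and minimum storage durations, all of which matter for a real decision:

```python
# Per-GB monthly prices as quoted in this guide; check current pricing for your region
PRICE_PER_GB_MONTH = {
    'STANDARD': 0.023,
    'DEEP_ARCHIVE': 0.00099,
}


def monthly_storage_cost(size_gb, storage_class):
    """Storage-only monthly cost in USD for size_gb held in the given class."""
    return size_gb * PRICE_PER_GB_MONTH[storage_class]


size_gb = 10_000  # hypothetical: roughly 10 TB of cold logs
standard = monthly_storage_cost(size_gb, 'STANDARD')
archive = monthly_storage_cost(size_gb, 'DEEP_ARCHIVE')
print(f"Standard:     ${standard:,.2f}/month")
print(f"Deep Archive: ${archive:,.2f}/month")
print(f"Ratio:        {standard / archive:.1f}x")
```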

The complete S3 lifecycle policy for engineering teams:

{
  "Rules": [
    {
      "ID": "application-logs-lifecycle",
      "Status": "Enabled",
      "Filter": {"Prefix": "logs/"},
      "Transitions": [
        {"Days": 30,  "StorageClass": "STANDARD_IA"},
        {"Days": 90,  "StorageClass": "GLACIER_IR"},
        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
      ],
      "Expiration": {"Days": 2555},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
    },
    {
      "ID": "training-checkpoints-lifecycle",
      "Status": "Enabled",
      "Filter": {"Prefix": "ml-checkpoints/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "GLACIER_IR"}
      ],
      "Expiration": {"Days": 90}
    }
  ]
}

Save the policy as lifecycle.json, then apply and verify it:

# Apply the lifecycle policy to a bucket
aws s3api put-bucket-lifecycle-configuration \
  --bucket your-logs-bucket \
  --lifecycle-configuration file://lifecycle.json

# Verify it applied correctly
aws s3api get-bucket-lifecycle-configuration \
  --bucket your-logs-bucket

2.3 Savings Plans: The Sequence Is Everything

A Savings Plan is a commitment to spend a minimum dollar amount per hour on AWS compute for one or three years, in exchange for discounts of 30–70% off On-Demand rates. The discount is real. The trap is buying before optimising.

The wrong order: You have a $50,000/month EC2 bill. You buy a Savings Plan covering $35,000/month of that spend. Then you implement right-sizing and Spot instances, and your actual spend drops to $22,000/month. You've committed to paying $35,000/month for 12 months against a need of $22,000. You're paying $13,000/month for compute you don't use, at a 30% discount. Congratulations on your discounted waste.

The right order:

Month 1-2: Right-size all instances using VPA and CloudWatch data
Month 3:   Move staging and development to Spot instances
Month 4:   Migrate compatible workloads to Graviton (20% cheaper)
Month 5:   Add VPC endpoints to eliminate NAT Gateway charges
Month 6:   THEN look at your steady-state On-Demand spend
Month 6+:  Purchase Savings Plans covering 70% of that optimised baseline

Calculate what to commit to:

# Get your On-Demand EC2 spend for the last 30 days
# This is your rightsized baseline — the number to commit against
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --filter '{
    "And": [
      {"Dimensions": {"Key": "SERVICE",       "Values": ["Amazon Elastic Compute Cloud - Compute"]}},
      {"Dimensions": {"Key": "PURCHASE_TYPE", "Values": ["On-Demand"]}}
    ]
  }' \
  --metrics UnblendedCost \
  --query 'ResultsByTime[*].{Date:TimePeriod.Start,Cost:Total.UnblendedCost.Amount}' \
  --output table

# Get AWS's own recommendation for what to commit
aws savingsplans get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS
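
Savings Plans are purchased as an hourly commitment, so the monthly figures need converting. The sketch below turns an optimised monthly baseline into a 70% hourly commitment and reproduces the waste from the wrong-order example. The function names and the 730-hour month are my assumptions. It also simplifies by treating On-Demand dollars as the commitment basis, whereas the real commitment is measured at discounted rates, so treat AWS's own recommendation as the more precise number:

```python
HOURS_PER_MONTH = 730  # average month: 8,760 hours / 12


def hourly_commitment(optimised_monthly_usd, coverage=0.70):
    """Hourly commitment covering `coverage` of an already-optimised monthly baseline."""
    return round(optimised_monthly_usd * coverage / HOURS_PER_MONTH, 2)


def unused_commitment(committed_monthly_usd, actual_monthly_usd):
    """Monthly spend committed to but not needed (never negative)."""
    return max(committed_monthly_usd - actual_monthly_usd, 0)


# Right order: commit against the $22,000/month optimised baseline
print(f"Commit about ${hourly_commitment(22_000):.2f}/hour")

# Wrong order: $35,000/month committed before spend dropped to $22,000/month
print(f"Unused commitment: ${unused_commitment(35_000, 22_000):,}/month")
```

Covering only 70% leaves headroom: if usage drops further, the uncovered 30% absorbs the change before the commitment starts going unused.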

Stage 3: The Automation Architect — Months 9 to 15

3.1 The Orphaned Resource Problem — And Why It Never Fixes Itself

Orphaned resources are the cloud equivalent of a gym membership you forgot to cancel. They exist, they charge you, but nobody notices until the annual audit.

The root cause isn't laziness. It's the absence of lifecycle management at the infrastructure layer. When an engineer spins up an EC2 instance for a one-week experiment and then leaves the company, there's no automatic signal that the instance is now orphaned. It sits there, billing $140/month, until someone hunts it down.

The fix is a weekly automated audit that surfaces candidates for deletion and notifies the registered owner, not a process change that depends on engineers remembering to clean up.

# orphan_reporter.py
# Runs every Sunday via EventBridge → Lambda
# Posts a Slack report of orphaned resources for human review
# DOES NOT auto-delete — deletion requires a human decision

import boto3
import json
import urllib.request
from datetime import datetime, timedelta, timezone

SLACK_WEBHOOK = 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
UNATTACHED_VOLUME_AGE_DAYS = 14
SNAPSHOT_AGE_DAYS = 90


def find_orphaned_resources():
    ec2 = boto3.client('ec2')
    report = {'monthly_waste_usd': 0, 'items': []}

    # Unattached EBS volumes
    for vol in ec2.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )['Volumes']:
        age = (datetime.now(timezone.utc) - vol['CreateTime']).days
        if age >= UNATTACHED_VOLUME_AGE_DAYS:
            cost = round(vol['Size'] * 0.08, 2)  # gp3 rate
            tags = {t['Key']: t['Value'] for t in vol.get('Tags', [])}
            report['items'].append({
                'type':  'Unattached EBS Volume',
                'id':    vol['VolumeId'],
                'detail': f"{vol['Size']}GB {vol['VolumeType']} — {age} days old",
                'owner': tags.get('Owner', 'unknown'),
                'monthly_cost_usd': cost,
            })
            report['monthly_waste_usd'] += cost

    # Unassociated Elastic IPs
    for addr in ec2.describe_addresses()['Addresses']:
        if 'AssociationId' not in addr:
            report['items'].append({
                'type':  'Unassociated Elastic IP',
                'id':    addr.get('AllocationId', addr['PublicIp']),
                'detail': addr['PublicIp'],
                'owner': 'unknown',
                'monthly_cost_usd': 3.60,
            })
            report['monthly_waste_usd'] += 3.60

    # Old snapshots
    # Compare timezone-aware datetimes directly rather than their string forms
    cutoff = datetime.now(timezone.utc) - timedelta(days=SNAPSHOT_AGE_DAYS)
    for snap in ec2.describe_snapshots(OwnerIds=['self'])['Snapshots']:
        if snap['StartTime'] < cutoff:
            cost = round(snap.get('VolumeSize', 0) * 0.05, 2)
            report['items'].append({
                'type':  f'Snapshot ({SNAPSHOT_AGE_DAYS}+ days old)',
                'id':    snap['SnapshotId'],
                'detail': f"Created {snap['StartTime'].strftime('%Y-%m-%d')}",
                'owner': 'unknown',
                'monthly_cost_usd': cost,
            })
            report['monthly_waste_usd'] += cost

    return report


def post_to_slack(report):
    lines = [
        f":money_with_wings: *Weekly Orphaned Resource Report*",
        f"Found *{len(report['items'])} orphaned resources* "
        f"costing *${report['monthly_waste_usd']:.2f}/month*\n",
    ]
    for item in report['items'][:20]:  # Cap at 20 lines to stay readable
        lines.append(
            f"• `{item['type']}` {item['id']} — {item['detail']} "
            f"— *${item['monthly_cost_usd']:.2f}/mo* — owner: {item['owner']}"
        )
    lines.append("\nReview and delete anything no longer needed.")

    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps({'text': '\n'.join(lines)}).encode(),
        headers={'Content-Type': 'application/json'}
    )
    urllib.request.urlopen(req)


def lambda_handler(event, context):
    report = find_orphaned_resources()
    post_to_slack(report)
    return {
        'items_found': len(report['items']),
        'monthly_waste': report['monthly_waste_usd'],
    }
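
The script above posts one combined report. If you want to notify each registered owner separately, the items need grouping first. Here is a minimal sketch of that step, assuming the same item dictionaries that `find_orphaned_resources()` builds. The `group_by_owner` helper and the sample data are hypothetical and not part of the original script:

```python
from collections import defaultdict


def group_by_owner(items):
    """Group report items by owner and total each owner's monthly cost."""
    totals = defaultdict(float)
    grouped = defaultdict(list)
    for item in items:
        owner = item.get('owner') or 'unknown'
        grouped[owner].append(item)
        totals[owner] += item['monthly_cost_usd']
    # Largest monthly waste first, so the biggest follow-ups surface at the top
    return [
        {
            'owner': owner,
            'monthly_cost_usd': round(totals[owner], 2),
            'items': grouped[owner],
        }
        for owner in sorted(totals, key=totals.get, reverse=True)
    ]


# Hypothetical report items for illustration
sample = [
    {'id': 'vol-1', 'owner': 'dev@example.com', 'monthly_cost_usd': 8.00},
    {'id': 'eip-1', 'owner': 'unknown', 'monthly_cost_usd': 3.60},
    {'id': 'vol-2', 'owner': 'dev@example.com', 'monthly_cost_usd': 40.00},
]
for group in group_by_owner(sample):
    print(group['owner'], group['monthly_cost_usd'], len(group['items']))
```

Anything that lands under `unknown` is a tagging gap as much as a cost problem, which is another reason Stage 1 comes first.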

3.2 Cost Estimation in Your CI/CD Pipeline

The goal is to catch expensive infrastructure changes at the PR stage — before they deploy and before they generate a billing surprise.

# .github/workflows/cost-check.yml
# Runs on any PR that touches infrastructure files
# Uses Infracost to estimate the monthly cost delta

name: Infrastructure Cost Check

on:
  pull_request:
    paths:
      - 'terraform/**'
      - 'infrastructure/**'
      - '*.tf'

jobs:
  cost-estimate:
    name: Estimate monthly cost change
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      - name: Generate cost estimate
        run: |
          infracost breakdown \
            --path terraform/ \
            --format json \
            --out-file /tmp/infracost.json

      - name: Post cost diff to PR
        uses: infracost/actions/comment@v3
        with:
          path: /tmp/infracost.json
          behavior: update

      - name: Block if monthly increase exceeds threshold
        run: |
          MONTHLY_DELTA=$(cat /tmp/infracost.json | \
            jq '.projects[0].diff.totalMonthlyCost' | tr -d '"')

          echo "Estimated monthly cost change: \$$MONTHLY_DELTA"

          # Fail the PR if this change adds more than $500/month
          python3 -c "
          import sys
          delta = float('$MONTHLY_DELTA')
          if delta > 500:
              print(f'PR blocked: estimated +\${delta:.2f}/month exceeds \$500 threshold')
              sys.exit(1)
          else:
              print(f'Cost check passed: estimated +\${delta:.2f}/month')
          "

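The threshold step in the workflow reads only the first project in the Infracost output. If a repository holds several Terraform projects, you may want to sum the delta across all of them. The sketch below does that as a standalone function, assuming the JSON shape the workflow's jq expression relies on (`projects[].diff.totalMonthlyCost` as a numeric string). That field layout is taken from the workflow above and hasn't been checked against Infracost's schema here, and the function names are hypothetical:

```python
import json


def total_monthly_delta(infracost_json_text):
    """Sum diff.totalMonthlyCost across every project in an Infracost JSON report."""
    data = json.loads(infracost_json_text)
    total = 0.0
    for project in data.get('projects', []):
        diff = project.get('diff') or {}       # tolerate a missing or null diff
        value = diff.get('totalMonthlyCost')
        if value is not None:
            total += float(value)              # values are assumed to be numeric strings
    return total


def exceeds_threshold(delta_usd, threshold_usd=500.0):
    """True when the estimated monthly increase should block the PR."""
    return delta_usd > threshold_usd


# Hypothetical report with two projects
report = json.dumps({
    'projects': [
        {'name': 'network', 'diff': {'totalMonthlyCost': '320.50'}},
        {'name': 'compute', 'diff': {'totalMonthlyCost': '250'}},
    ]
})
delta = total_monthly_delta(report)
print(f"Estimated change: ${delta:.2f}/month, blocked: {exceeds_threshold(delta)}")
```

Neither project crosses $500 on its own in this example, but together they do, which is the case a first-project-only check would miss.
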
Stage 4: The Cloud Financial Manager — Months 16 to 24

4.1 Leading FinOps Reviews with Executives

By month 16, you have the data. What changes at Stage 4 is the audience. You're no longer presenting to engineers who understand instance types and NAT Gateway pricing. You're presenting to a CTO who wants to know if the infrastructure investment is proportional to the business value it produces, and a CFO who wants to know when the line will stop going up.

The vocabulary shift is simple but important. You stop saying "we right-sized our EC2 instances" and start saying "we reduced our infrastructure unit cost by 28% while maintaining the same request throughput." You stop saying "we eliminated NAT Gateway charges" and start saying "we closed a $6,400/month gap between what we were paying and what was necessary."
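
As a rough illustration of where a statement like "unit cost fell 28% at the same throughput" comes from, the sketch below derives it from two monthly bills and two request counts. The function name `unit_cost_change` and all sample figures are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def unit_cost_change(cost_before, cost_after, requests_before, requests_after):
    """Compare cost per request across two periods and return the percentage change."""
    if requests_before <= 0 or requests_after <= 0:
        raise ValueError("request counts must be positive")
    unit_before = cost_before / requests_before
    unit_after = cost_after / requests_after
    change_pct = (unit_after - unit_before) / unit_before * 100
    return {
        "unit_cost_before_usd": unit_before,
        "unit_cost_after_usd": unit_after,
        "change_pct": change_pct,
    }


# Hypothetical figures: the bill drops from $50,000 to $36,000
# while throughput stays at 120M requests per month.
result = unit_cost_change(50_000, 36_000, 120_000_000, 120_000_000)
print(f"Unit cost before: ${result['unit_cost_before_usd']:.6f} per request")
print(f"Unit cost after:  ${result['unit_cost_after_usd']:.6f} per request")
print(f"Change: {result['change_pct']:.1f}%")
```

Because the request count is part of the calculation, the same function also covers the case where the bill rises but throughput rises faster.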

The metric that anchors every executive FinOps conversation is cost per business unit: not the total bill, but cost per API call, cost per user, cost per transaction, or cost per model inference. That ratio tells the story of whether your infrastructure efficiency is improving as the business scales.

# unit_economics.py
# Calculate cost per transaction — the metric that matters to leadership

import boto3
from datetime import datetime, timedelta

def calculate_cost_per_transaction(service_name, transaction_count, days_back=30):
    """
    Returns cost per transaction for a given service over the last N days.
    transaction_count: total transactions for the same period (from your metrics)
    """
    ce = boto3.client('ce')

    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': (datetime.now() - timedelta(days=days_back)).strftime('%Y-%m-%d'),
            'End':   datetime.now().strftime('%Y-%m-%d'),
        },
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        Filter={
            'Tags': {
                'Key':    'Service',
                'Values': [service_name]
            }
        }
    )

    total_cost = sum(
        float(period['Total']['UnblendedCost']['Amount'])
        for period in response['ResultsByTime']
    )

    cost_per_txn = total_cost / transaction_count if transaction_count > 0 else 0

    return {
        'service':           service_name,
        'period_days':       days_back,
        'total_cost_usd':    round(total_cost, 2),
        'transactions':      transaction_count,
        'cost_per_txn_usd':  round(cost_per_txn, 6),
    }


# Example: payment service processed 4.2M transactions this month
result = calculate_cost_per_transaction('payment-api', 4_200_000)
print(f"Cost per transaction: ${result['cost_per_txn_usd']:.6f}")
print(f"Total infrastructure cost: ${result['total_cost_usd']:,.2f}")
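
A single cost-per-transaction figure is more useful to leadership as a trend. The sketch below is one possible way to line up several months of results, such as the output of `calculate_cost_per_transaction`, and show the month-over-month change. The helper name `cost_per_txn_trend` and the sample numbers are hypothetical.

```python
def cost_per_txn_trend(monthly):
    """
    monthly: list of (label, total_cost_usd, transactions) in chronological order.
    Returns one row per month with cost per transaction and % change vs the prior month.
    """
    rows = []
    previous = None
    for label, cost, txns in monthly:
        unit = cost / txns if txns > 0 else None
        change = None
        if unit is not None and previous is not None and previous > 0:
            change = (unit - previous) / previous * 100
        rows.append({"month": label, "cost_per_txn_usd": unit, "change_pct": change})
        if unit is not None:
            previous = unit
    return rows


# Hypothetical data: the total bill rises while the unit cost falls.
history = [
    ("Month 1", 8_400, 4_000_000),
    ("Month 2", 8_580, 6_600_000),
]
for row in cost_per_txn_trend(history):
    change = "n/a" if row["change_pct"] is None else f"{row['change_pct']:+.1f}%"
    print(f"{row['month']}: ${row['cost_per_txn_usd']:.4f} per transaction ({change})")
```

In this sample the bill grows by $180 while cost per transaction drops, which is the kind of result a total-bill chart would hide.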

4.2 The Chargeback and Showback Models

Chargeback means actually billing departments for their cloud usage. Showback means showing departments their usage costs without the internal billing transfer. Both create the same outcome: engineers start caring about what they consume because someone they work with is paying attention to it.

# showback_report.py
# Generates monthly cost-by-team report for distribution to engineering leads

import boto3
from datetime import datetime

def generate_team_showback():
    ce = boto3.client('ce')

    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': datetime.now().replace(day=1).strftime('%Y-%m-%d'),
            'End':   datetime.now().strftime('%Y-%m-%d'),
        },
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        GroupBy=[
            {'Type': 'TAG',       'Key': 'Team'},
            {'Type': 'DIMENSION', 'Key': 'SERVICE'},
        ]
    )

    by_team = {}
    for group in response['ResultsByTime'][0].get('Groups', []):
        team    = group['Keys'][0].replace('Team$', '') or 'untagged'
        service = group['Keys'][1]
        cost    = float(group['Metrics']['UnblendedCost']['Amount'])

        if team not in by_team:
            by_team[team] = {'total': 0, 'services': {}}
        by_team[team]['total'] += cost
        by_team[team]['services'][service] = round(cost, 2)

    # Print sorted by total cost descending
    print(f"\n{'='*52}")
    print(f"  Month-to-Date Cloud Spend by Team")
    print(f"  Generated: {datetime.now().strftime('%Y-%m-%d')}")
    print(f"{'='*52}\n")

    for team, data in sorted(by_team.items(), key=lambda x: x[1]['total'], reverse=True):
        print(f"  {team:<20} ${data['total']:>10,.2f} MTD")
        top_services = sorted(data['services'].items(), key=lambda x: x[1], reverse=True)[:3]
        for svc, cost in top_services:
            print(f"    └─ {svc:<30} ${cost:>8,.2f}")
    print()

generate_team_showback()
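
Moving from showback to chargeback means deciding what to do with spend that has no team tag, which the report above lists under `untagged`. One option, offered here as an assumption rather than as the article's method, is to spread that amount across teams in proportion to their directly attributed spend. The function name `allocate_shared_cost` and the sample figures are hypothetical.

```python
def allocate_shared_cost(team_costs, shared_key="untagged"):
    """
    Distribute the shared/untagged amount across teams in proportion
    to each team's directly attributed cost.
    """
    shared = team_costs.get(shared_key, 0.0)
    direct = {team: cost for team, cost in team_costs.items() if team != shared_key}
    direct_total = sum(direct.values())
    if direct_total <= 0:
        # Nothing to allocate against; return the input unchanged.
        return dict(team_costs)
    # Rounding to cents can leave the grand total off by a cent or two.
    return {
        team: round(cost + shared * cost / direct_total, 2)
        for team, cost in direct.items()
    }


# Hypothetical month-to-date totals per team
totals = {"payments": 6000.0, "search": 3000.0, "data": 1000.0, "untagged": 2000.0}
for team, cost in allocate_shared_cost(totals).items():
    print(f"{team:<12} ${cost:>10,.2f}")
```

Proportional allocation is simple to explain, but it charges teams for untagged resources they may not own, so it works best as a temporary measure while tagging coverage improves.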

The tools that matter at each stage of this roadmap:

| Stage | Tool | Why It Matters |
| --- | --- | --- |
| 1 | AWS Cost Explorer | Free, built-in, the starting point for all cost analysis |
| 1 | AWS CLI ce commands | Scriptable cost queries; dashboards can't be automated |
| 2 | AWS Compute Optimizer | ML-powered rightsizing recommendations for EC2 and RDS |
| 2 | VPA (Kubernetes) | Pod-level rightsizing recommendations using actual usage |
| 3 | Infracost | PR-level cost estimation for Terraform changes |
| 3 | AWS Budgets | Proactive alerts that catch problems before the monthly invoice |
| 4 | AWS Cost and Usage Report + Athena | SQL-level billing analysis at any granularity |
| 4 | CloudHealth or Vantage | Multi-account, multi-cloud cost management |

The one certification worth your time: FinOps Certified Practitioner from the FinOps Foundation. It takes 20 hours to prepare and $300 to sit. It signals to hiring managers and clients that you understand the discipline formally — which matters when you're the person leading FinOps conversations at the executive level.

Your 90-Day Action Plan

Month 1 — Foundation:

Enable Cost Explorer if it isn't already on. Run the baseline command from Section 1.1 and save the output. Run the untagged resource query from Section 1.2 and document how many resources are missing tags. Find your top three cost drivers. Present the findings to your engineering manager, not as a problem but as an opportunity with a dollar figure attached.

Month 2 — Quick Wins:

Run the rightsizing analyser from Section 2.1 on your EC2 fleet. Downsize the three highest-confidence candidates. Apply S3 lifecycle policies to your two largest buckets. Create VPC endpoints for S3, ECR, and DynamoDB. Estimate the savings from each action and document them against your baseline.

Month 3 — Automation and Habits:

Deploy the orphan reporter Lambda on a Sunday schedule. Add the cost check GitHub Action to your infrastructure repository. Start a monthly FinOps review meeting — even if it's just you and one other engineer. Build the habit before you need the audience.

Best Practices Summary

Do: Establish a cost baseline before any optimisation. The number is meaningless without a comparison point.

Do: Right-size before buying Savings Plans. Always. The sequence changes the outcome.

Do: Enforce tagging at the infrastructure layer — Terraform or CloudFormation — not as a process reminder.

Do: Move staging and development to Spot instances. Interruptions are manageable in non-production environments, and the roughly 70% cost difference is too large to ignore.

Do: Add VPC endpoints for S3, ECR, and DynamoDB before reviewing data transfer costs. It's a 30-minute fix for a multi-thousand-dollar line item.

Do: Present cost findings as cost-per-business-metric, not as total bill. "We reduced cost per transaction from $0.0021 to $0.0013" is a business result. "$38,000/month reduction" is an accounting result.

Don't: Buy Savings Plans on an unoptimised baseline. You'll lock in discounted waste.

Don't: Build FinOps dashboards before tagging is complete. Beautiful charts with no attribution data answer no questions.

Don't: Run orphaned resource cleanup without human review first. Run in report-only mode for two weeks, verify the candidates are genuinely orphaned, then add deletion logic.
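
The "right-size before buying Savings Plans" rule above is easier to see with numbers. The sketch below uses a deliberately simplified model in which committed usage is paid at a flat discount whether or not it is used, and anything above the commitment is paid at the on-demand rate. The function name `hourly_cost`, the 30% discount, and the 30% right-sizing reduction are all hypothetical values, not AWS pricing.

```python
def hourly_cost(on_demand_usage, committed_usage, discount):
    """
    All amounts are on-demand-equivalent dollars per hour.
    Committed usage is billed at the discounted rate even if unused;
    usage above the commitment is billed at the on-demand rate.
    """
    commitment_cost = committed_usage * (1 - discount)
    overflow = max(on_demand_usage - committed_usage, 0)
    return commitment_cost + overflow


baseline = 100.0    # hypothetical unoptimised usage, $/hour
rightsized = 70.0   # hypothetical usage after right-sizing, $/hour
discount = 0.30     # hypothetical commitment discount

# Commit on the unoptimised baseline, then right-size afterwards.
commit_first = hourly_cost(rightsized, committed_usage=baseline, discount=discount)

# Right-size first, then commit to what is actually needed.
rightsize_first = hourly_cost(rightsized, committed_usage=rightsized, discount=discount)

print(f"Commit first:     ${commit_first:.2f}/hour")
print(f"Right-size first: ${rightsize_first:.2f}/hour")
print(f"Locked-in waste:  ${commit_first - rightsize_first:.2f}/hour")
```

Under these assumptions the commit-first path keeps paying for 30 units of usage that no longer exist, which is the "discounted waste" the list warns about.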

Ayobami Adejumo is a senior platform engineer and FinOps consultant. He has audited AWS infrastructure for 20+ Series A and Series B companies. He is an active FinOps Foundation Supporter.


