I Inherited 47,000 Lines of Terraform Spaghetti — Here's How I Untangled It Without Burning Production
Sanjay · 2026-05-22 · via DEV Community

The Slack Message That Ruined My Monday

"Hey, the previous platform team left. Here's the repo. Good luck 🫡"

I stared at the Git repository. 47,000 lines of Terraform. One state file. Zero modules. Variables named x, temp2, and my personal favorite — DO_NOT_TOUCH_ask_raj. Raj had left the company two years ago.

If you've been a Senior DevOps Engineer for more than a year, you've inherited something like this. Maybe not 47K lines, but you've opened a main.tf that made you question your career choices.

This isn't a "Terraform best practices" article. Those are written by people who've never had to run terraform plan on a 3,000-resource state file at 2 AM while the VP of Engineering watches.

This is a survival guide.


Anti-Pattern #1: The Monolith State File (aka "The Single Point of Career Failure")

What I Found

# main.tf — 8,400 lines
# "Managed" networking, compute, databases, DNS, IAM, monitoring,
# and somehow... a CloudFront distribution for a marketing site
# that was decommissioned in 2023.

resource "aws_vpc" "main" { ... }
resource "aws_instance" "api_server_1" { ... }
resource "aws_instance" "api_server_2" { ... }
# ... 200 more instances ...
resource "aws_db_instance" "prod_db" { ... }
resource "aws_iam_role" "god_mode" { ... }  # yes, really


A single terraform apply touched everything. Networking, databases, compute, DNS — all entangled like Christmas lights in January. One typo in a security group rule? Congratulations, your plan just showed 847 resources to evaluate, and Terraform decided your RDS instance needs replacing.

The Real Danger

This isn't just messy — it's operationally catastrophic. Here's what happens:

  • terraform plan takes 14 minutes. Developers stop running it.
  • State file locking means only one person can work at a time.
  • Blast radius of any mistake = the entire infrastructure.
  • New team members are terrified to touch anything (rightfully so).

How I Fixed It (Without Downtime)

Step 1: State Surgery with terraform state mv

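# First, I mapped resource dependencies visually
terraform graph | dot -Tsvg > infra-dependency-map.svg

# Then, split by domain boundaries.
# Syntax: terraform state mv [options] SOURCE DESTINATION (options come first).
# -state and -state-out are legacy flags documented for local state files, so
# work on a pulled copy of the state and keep a backup.
terraform state pull > monolith.tfstate
cp monolith.tfstate monolith.tfstate.bak
terraform state mv -state=monolith.tfstate -state-out=networking/terraform.tfstate 'aws_vpc.main' 'aws_vpc.main'
terraform state mv -state=monolith.tfstate -state-out=networking/terraform.tfstate 'aws_subnet.public[0]' 'aws_subnet.public[0]'
terraform state mv -state=monolith.tfstate -state-out=networking/terraform.tfstate 'aws_subnet.public[1]' 'aws_subnet.public[1]'
# Afterwards, upload each file to its own backend (terraform state push) from the matching directory.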
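
On newer Terraform versions, the same hand-over can also be expressed declaratively, which makes each move reviewable in a PR. This is a swapped-in alternative to `terraform state mv`, not what the steps above used: a `removed` block (Terraform 1.7+) tells the monolith to forget a resource without destroying it, and an `import` block (Terraform 1.5+) adopts it in the new root module. A minimal sketch; the VPC ID is hypothetical:

```hcl
# --- In the monolith root module (Terraform 1.7+) ---
# Stop managing the VPC here without destroying the real resource.
# The matching resource "aws_vpc" "main" block must be deleted from this configuration.
removed {
  from = aws_vpc.main

  lifecycle {
    destroy = false
  }
}

# --- In networking/ (Terraform 1.5+) ---
# Adopt the existing VPC into the new state. The resource "aws_vpc" "main"
# block itself moves into this directory unchanged.
import {
  to = aws_vpc.main
  id = "vpc-0123456789abcdef0" # hypothetical VPC ID
}
```

After both sides are applied, a plan in each directory should report no infrastructure changes.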


Step 2: Introduce State Boundaries by Blast Radius

I split into five state files based on change frequency and blast radius:

| Layer | Contents | Change Frequency | Blast Radius |
| --- | --- | --- | --- |
| foundation | VPC, Subnets, Route Tables | Monthly | Critical |
| security | IAM, KMS, Security Groups | Weekly | Critical |
| data | RDS, ElastiCache, S3 | Rare | Catastrophic |
| compute | ECS/EKS, ASGs, ALBs | Daily | High |
| edge | CloudFront, Route53, WAF | Weekly | Medium |
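
For the layers marked Critical or Catastrophic, one extra guard worth considering is Terraform's `prevent_destroy` lifecycle argument, which makes any plan that would destroy the resource fail with an error. A minimal sketch, using a hypothetical S3 bucket in the data layer:

```hcl
resource "aws_s3_bucket" "audit_logs" {
  bucket = "company-audit-logs" # hypothetical bucket name

  lifecycle {
    # Any plan that would destroy (or replace) this bucket is rejected.
    prevent_destroy = true
  }
}
```

Note that this guard only works while the resource block stays in the configuration; deleting the whole block removes the protection along with it.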

Step 3: Wire Them Together with Remote State Data Sources

# In compute/main.tf
# (networking resources live in the foundation layer, hence the state key below)
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "company-terraform-state"
    key    = "foundation/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_ecs_service" "api" {
  # Reference networking outputs safely
  network_configuration {
    subnets = data.terraform_remote_state.networking.outputs.private_subnet_ids
  }
}
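
For that reference to resolve, the foundation layer has to publish `private_subnet_ids` as a root-level output, since `terraform_remote_state` only exposes root module outputs. A minimal sketch, assuming the private subnets are declared with `count` as `aws_subnet.private` (a hypothetical name; only `aws_subnet.public` appears above):

```hcl
# foundation/outputs.tf (assumed file name)
output "private_subnet_ids" {
  description = "IDs of the private subnets consumed by the compute layer"
  value       = aws_subnet.private[*].id
}

output "vpc_id" {
  description = "ID of the main VPC"
  value       = aws_vpc.main.id
}
```

Keeping this output list small is a deliberate choice: it becomes the contract between layers, so anything not listed here cannot be depended on by accident.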


Result: terraform plan went from 14 minutes to 45 seconds. Team velocity tripled. I stopped getting 2 AM pages about state locks.


Anti-Pattern #2: The Copy-Paste Empire (aka "Modules at Home")

What I Found

environments/
├── dev/
│   └── main.tf      # 1,200 lines
├── staging/
│   └── main.tf      # 1,200 lines (95% identical to dev)
├── prod/
│   └── main.tf      # 1,200 lines (90% identical... with 47 "hotfixes")
└── dr/
    └── main.tf      # 1,200 lines (copied from prod 8 months ago, never updated)


Four copies of the same infrastructure with subtle drift. Staging had a security group rule that prod didn't. DR was missing three services entirely. Nobody knew which differences were intentional.

Why This Kills Senior Engineers

You can't diff your way out of this. The files have diverged in ways that are both intentional (prod has larger instances) and accidental (someone fixed a bug in dev but forgot to propagate it). You have no source of truth.

The Refactoring Strategy That Actually Works

Don't try to unify everything at once. I learned this the hard way after a failed "big bang" refactor that took 3 sprints and broke staging for a week.

Instead, use the Strangler Fig pattern:

# modules/api-platform/main.tf
variable "environment" {
  type = string
  validation {
    condition     = contains(["dev", "staging", "prod", "dr"], var.environment)
    error_message = "Environment must be dev, staging, prod, or dr."
  }
}

variable "config" {
  type = object({
    instance_type    = string
    min_capacity     = number
    max_capacity     = number
    enable_waf       = bool
    multi_az         = bool
    backup_retention = number
  })
}

locals {
  # Environment-specific defaults that document WHY they differ
  env_config = {
    dev = {
      instance_type    = "t3.medium"
      min_capacity     = 1
      max_capacity     = 2
      enable_waf       = false
      multi_az         = false
      backup_retention = 1
    }
    prod = {
      instance_type    = "m5.xlarge"
      min_capacity     = 3
      max_capacity     = 20
      enable_waf       = true
      multi_az         = true
      backup_retention = 35
    }
  }
}


The key insight: Every environment difference should be documented in code as a conscious decision, not hidden in a 1,200-line file as an accidental divergence.
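
As a sketch of how an environment would consume this module, the block below calls it from a prod root configuration. The `environments/prod/main.tf` location and the relative `source` path are assumptions about the repository layout; the values mirror the `prod` entry in `env_config`.

```hcl
# environments/prod/main.tf (assumed location)
module "api_platform" {
  source = "../../modules/api-platform" # assumed relative path

  environment = "prod"

  # Every attribute of the object type must be set, or Terraform rejects the call.
  config = {
    instance_type    = "m5.xlarge"
    min_capacity     = 3
    max_capacity     = 20
    enable_waf       = true
    multi_az         = true
    backup_retention = 35
  }
}
```

One caveat: `local.env_config` as shown only defines `dev` and `prod`, so `staging` and `dr` need their own entries before the module can look them up by `var.environment`.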


Anti-Pattern #3: The terraform apply -auto-approve YOLO Pipeline

What I Found in .gitlab-ci.yml

deploy_prod:
  stage: deploy
  script:
    - terraform init
    - terraform apply -auto-approve  # 🚨 WHAT
  only:
    - main


No plan artifact. No approval gate. No diff review. Push to main → infrastructure changes in production. The commit history told the horror story:

fix: revert the revert of the fix
fix: actually fix prod this time
fix: ok THIS one fixes it
revert: revert everything from today


What Senior Engineers Actually Need

# .github/workflows/terraform.yml
name: "Terraform"

on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

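      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false  # keep Terraform's raw exit codes

      - name: Terraform Plan
        id: plan
        run: |
          terraform init
          # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
          # Read terraform's status from PIPESTATUS so tee cannot mask a failure.
          set +e
          terraform plan -no-color -out=tfplan \
            -detailed-exitcode 2>&1 | tee plan_output.txt
          code=${PIPESTATUS[0]}
          set -e
          echo "exitcode=$code" >> "$GITHUB_OUTPUT"
          if [ "$code" -eq 1 ]; then exit 1; fi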

      - name: Comment Plan on PR
        uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('plan_output.txt', 'utf8');
            const truncated = plan.length > 60000 
              ? plan.substring(0, 60000) + '\n\n... truncated ...' 
              : plan;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Terraform Plan Output\n\`\`\`\n${truncated}\n\`\`\``
            });

      - name: Upload Plan Artifact
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: tfplan

  apply:
    needs: plan
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: production  # Requires manual approval
    steps:
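      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3

      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan

      - name: Terraform Apply
        run: |
          terraform init          # a fresh runner needs the backend and providers first
          terraform apply tfplan  # Apply ONLY the reviewed plan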


The non-negotiable rules:

  1. Plans are generated on PR and attached as artifacts.
  2. Humans review the diff before any production apply.
  3. Apply uses the exact plan that was reviewed (not a new plan).
  4. The production environment requires manual approval from a senior engineer.
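
The `environment: production` line in the workflow only pauses for approval if the GitHub environment is configured with required reviewers. That setting can itself be kept in Terraform. A minimal sketch, assuming the `integrations/github` provider is configured for the organization; the repository name and team slug are hypothetical:

```hcl
terraform {
  required_providers {
    github = {
      source = "integrations/github"
    }
  }
}

data "github_team" "platform_seniors" {
  slug = "platform-seniors" # hypothetical team slug
}

resource "github_repository_environment" "production" {
  repository  = "infrastructure" # hypothetical repository name
  environment = "production"

  # Deployments to this environment wait until a member of this team approves.
  reviewers {
    teams = [data.github_team.platform_seniors.id]
  }

  # Only protected branches (for example main) may deploy here.
  deployment_branch_policy {
    protected_branches     = true
    custom_branch_policies = false
  }
}
```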

Anti-Pattern #4: Secrets in State (The Ticking Compliance Bomb)

What I Found

resource "aws_db_instance" "prod" {
  engine               = "postgres"
  instance_class       = "db.r5.2xlarge"
  username             = "admin"
  password             = "Pr0d_P@ssw0rd_2022!"  # I wish I was joking
  publicly_accessible  = true                    # I really wish I was joking
}


The password was in the .tf file, the state file, the plan output, and the Git history. Four places to leak from. And publicly_accessible = true was the cherry on this dumpster fire sundae.

The Fix (That Also Passes Audit)

# Use a data source to pull secrets at plan/apply time
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/rds/master-password"
}

resource "aws_db_instance" "prod" {
  engine              = "postgres"
  instance_class      = "db.r5.2xlarge"
  username            = "admin"
  password            = data.aws_secretsmanager_secret_version.db_password.secret_string
  publicly_accessible = false

  # Avoid resetting the DB password when the secret value changes (for example, after rotation)
  lifecycle {
    ignore_changes = [password]
  }
}


But that's not enough. The state file still contains sensitive values. The complete solution:

# backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/data/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                          # SSE-KMS encryption
    kms_key_id     = "arn:aws:kms:us-east-1:xxx:key/yyy"
    dynamodb_table = "terraform-state-lock"        # state locking; Terraform 1.10+ also supports use_lockfile = true
  }
}


Plus strict S3 bucket policies, access logging, and never giving developers direct state file access. Use terraform output instead.
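
If the goal is to keep the password out of state altogether, recent versions of the AWS provider offer another route: `manage_master_user_password` lets RDS create and rotate the master password in Secrets Manager, so Terraform never handles the plaintext. This is an alternative to the data-source approach above, not what the original setup used. A minimal sketch; the storage size and username are hypothetical:

```hcl
resource "aws_db_instance" "prod" {
  engine              = "postgres"
  instance_class      = "db.r5.2xlarge"
  allocated_storage   = 100       # hypothetical size
  username            = "dbadmin" # hypothetical username
  publicly_accessible = false

  # RDS generates the master password and stores it in Secrets Manager.
  # This argument cannot be combined with `password`.
  manage_master_user_password = true
}

# Applications look the secret up by ARN instead of reading Terraform state.
output "db_master_secret_arn" {
  value = aws_db_instance.prod.master_user_secret[0].secret_arn
}
```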


Anti-Pattern #5: The "God Resource" With 200 Lines of Nested Blocks

What I Found

resource "aws_ecs_task_definition" "api" {
  family                   = "api"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = 1024
  memory                   = 2048
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([
    {
      name  = "api"
      image = "company/api:latest"  # 🚨 LATEST TAG IN PROD
      portMappings = [{ containerPort = 8080 }]
      environment = [
        { name = "DB_HOST", value = "prod-db.cluster-xxx.us-east-1.rds.amazonaws.com" },
        { name = "DB_NAME", value = "production" },
        { name = "REDIS_URL", value = "prod-redis.xxx.cache.amazonaws.com:6379" },
        # ... 45 more environment variables hardcoded here ...
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = "/ecs/api"
          "awslogs-region"        = "us-east-1"
          "awslogs-stream-prefix" = "api"
        }
      }
      # ... 80 more lines of health checks, mount points, ulimits ...
    }
  ])
}


The problems compound:

  • Environment variables are hardcoded (not sourced from SSM/Secrets Manager).
  • latest tag means deployments are non-reproducible.
  • The jsonencode blob is untestable and un-diffable in PR reviews.
  • One change to any env var triggers a full task definition replacement.

The Refactored Version

# Use templatefile for complex JSON — it's testable and readable
resource "aws_ecs_task_definition" "api" {
  family                   = "api-${var.environment}"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = var.task_cpu
  memory                   = var.task_memory
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = templatefile("${path.module}/templates/api-container.json.tpl", {
    image_tag     = var.image_tag  # Pinned, passed from CI/CD
    environment   = var.environment
    db_host       = data.aws_ssm_parameter.db_host.value
    redis_url     = data.aws_ssm_parameter.redis_url.value
    log_group     = aws_cloudwatch_log_group.api.name
    aws_region    = data.aws_region.current.name
  })
}
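
The refactored resource leans on a few declarations that are not shown. A minimal sketch of what they might look like, with the `latest` tag rejected at plan time; the SSM parameter paths are hypothetical, and `var.environment` is assumed to be declared as in the module example earlier:

```hcl
variable "image_tag" {
  type        = string
  description = "Immutable image tag (for example a git SHA) passed in by CI/CD"

  validation {
    condition     = var.image_tag != "latest" && length(var.image_tag) > 0
    error_message = "The image_tag value must be a pinned tag such as a git SHA, not \"latest\"."
  }
}

data "aws_region" "current" {}

data "aws_ssm_parameter" "db_host" {
  name = "/${var.environment}/api/db_host" # hypothetical parameter path
}

data "aws_ssm_parameter" "redis_url" {
  name = "/${var.environment}/api/redis_url" # hypothetical parameter path
}
```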



The Refactoring Playbook (Do This Monday)

After untangling this mess across three months, here's the sequence that works:

Week 1: Triage and Protect

# 1. Enable state file encryption and locking NOW
# 2. Add branch protection — no direct pushes to main
# 3. Run terraform plan and SAVE the output as your baseline
terraform plan -no-color > baseline_plan_$(date +%Y%m%d).txt

# 4. Enable detailed audit logging on your state bucket


Week 2-4: Split the Monolith

# Use terraform state list to inventory everything
terraform state list > all_resources.txt
wc -l all_resources.txt  # Mine had 2,847 resources

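# Group by service domain. Patterns match the resource type prefix, with or
# without a module path in front; tune them to the types you actually have.
grep -E '(^|\.)aws_(vpc|subnet|route_table|route\.)' all_resources.txt > networking.txt
grep -E '(^|\.)aws_(iam|kms|security_group)' all_resources.txt > security.txt
grep -E '(^|\.)aws_(db_|rds_|elasticache|s3_)' all_resources.txt > data.txt
grep -E '(^|\.)aws_(ecs|eks|lb|alb|autoscaling)' all_resources.txt > compute.txt
grep -E '(^|\.)aws_(cloudfront|route53|waf)' all_resources.txt > edge.txt

# Anything not matched above still needs a home
comm -23 <(sort all_resources.txt) <(sort networking.txt security.txt data.txt compute.txt edge.txt) > unassigned.txt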


Week 5-8: Modularize Incrementally

Move one service at a time into a module. After each move:

  1. Run terraform plan — it should show zero changes.
  2. If plan shows changes, you have a bug. Fix it before moving on.
  3. Get a PR review from another senior engineer.
  4. Apply and monitor for 24 hours.
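
One way to get that zero-change plan when a resource moves into a module is a `moved` block (Terraform 1.1+), which records the address change in code so Terraform updates the state instead of planning a destroy and recreate. A minimal sketch, assuming the ECS service is being moved into a module named `api_platform` (a hypothetical name):

```hcl
# Before: resource "aws_ecs_service" "api" lived in the root module.
# After:  the same resource block lives inside module "api_platform".
moved {
  from = aws_ecs_service.api
  to   = module.api_platform.aws_ecs_service.api
}
```

The plan then reports the resource as moved rather than changed. Because the block lives in the configuration, it also goes through the same PR review as any other change.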

Week 9-12: Harden the Pipeline

  • Add terraform validate and tflint to CI.
  • Add checkov or tfsec for security scanning.
  • Implement drift detection (scheduled plan that alerts on differences).
  • Add cost estimation with infracost.

The Drift Detection Cron That Saved Us

This is the thing nobody talks about. Even after a perfect refactor, drift happens. Someone clicks in the console. An auto-remediation tool makes changes. A Lambda modifies a security group.

# .github/workflows/drift-detection.yml
name: "Drift Detection"

on:
  schedule:
    - cron: '0 6 * * 1-5'  # Every weekday at 06:00 UTC (GitHub cron schedules run in UTC)

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        layer: [foundation, security, data, compute, edge]
    steps:
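      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false  # keep Terraform's raw exit codes

      - name: Terraform Plan (Drift Check)
        id: plan
        working-directory: infrastructure/${{ matrix.layer }}
        run: |
          terraform init
          # The default shell runs with -e, so a non-zero plan exit would abort
          # the script before the exit code is recorded. Capture it explicitly.
          code=0
          terraform plan -detailed-exitcode -no-color > plan.txt 2>&1 || code=$?
          echo "exitcode=$code" >> "$GITHUB_OUTPUT"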

      - name: Alert on Drift
        if: steps.plan.outputs.exitcode == '2'
        run: |
          # Exit code 2 = changes detected (drift!)
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-type: application/json' \
            -d "{\"text\":\"🚨 Drift detected in *${{ matrix.layer }}* layer. Check the plan output.\"}"


We caught 3 unauthorized console changes in the first week alone.
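
One source of noisy alerts is drift that is expected, such as autoscaling adjusting a service's task count. A sketch of how to exclude that from the scheduled plan, assuming the service is scaled by autoscaling and using hypothetical input variables:

```hcl
variable "cluster_arn" {
  type = string # hypothetical input
}

variable "task_definition_arn" {
  type = string # hypothetical input
}

resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 3 # initial value only; autoscaling adjusts it afterwards

  lifecycle {
    # Autoscaling changes desired_count at runtime; ignoring it keeps the
    # scheduled plan from reporting that as drift.
    ignore_changes = [desired_count]
  }
}
```

Use this sparingly: every ignored attribute is one that the drift check can no longer see.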


Parting Wisdom for the Senior Engineer Who Just Inherited a Mess

  1. Don't refactor everything at once. You'll break things and lose credibility.

  2. Document what you find before you fix it. Screenshot the horrors. You'll need them for the post-mortem and for your performance review.

  3. Get buy-in from leadership BEFORE you start. "I need 3 sprints for tech debt" is a hard sell. "Our current setup means any infrastructure change has a 40% chance of causing an incident" gets budget approved.

  4. Every terraform state mv should be a separate, reviewed PR. Not because it's technically necessary, but because when something breaks at step 37 of 50, you want a clean git history to bisect.

  5. The goal isn't perfect Terraform. The goal is Terraform that your team can safely operate at 2 AM. If a junior engineer can't run terraform plan without fear, your refactor isn't done.


TL;DR for the Scrollers

| Anti-Pattern | Fix | Priority |
| --- | --- | --- |
| Monolith state file | Split by blast radius and change frequency | P0 |
| Copy-paste environments | Modules + environment configs | P1 |
| -auto-approve in CI | Plan artifacts + manual approval gates | P0 |
| Secrets in state/code | Secrets Manager + encrypted state + ignore_changes | P0 |
| God resources with inline JSON | templatefile + SSM parameters | P2 |
| No drift detection | Scheduled plan with alerting | P1 |

If you've ever stared at a Terraform codebase and whispered "who did this?!" into the void — you're not alone. We've all been there. The good news? It's fixable. One state move at a time.


Found this useful? Follow me for more battle-tested DevOps content. I write about the stuff that actually happens in production — not the happy path from the docs.