Terraform State In A Team: The Setup That Stops Two Engineers From Corrupting Prod

The Practical Developer

The Practica · 2024-01-19 · via The Practical Developer

The team has a terraform.tfstate file in the repo. Engineer A runs terraform apply. So does engineer B, on a slightly different version of the code. Both push their state files. The merged state file is now inconsistent with what’s actually in AWS: some resources tracked, some not, some duplicated. The next apply proposes deleting half the production infrastructure.

Local state is fine for one person. For a team, you need remote state with locking: state stored in a shared backend (S3, Terraform Cloud, GCS), with a lock that prevents two applies from running simultaneously. About 10 lines of HCL, and it eliminates an entire category of disaster.

This post is the working setup, the workspaces-vs-directories debate, and the four habits that keep multi-engineer Terraform sane.

Remote state with locking

The S3 + DynamoDB pattern is the AWS-native standard:

terraform {
  backend "s3" {
    bucket         = "company-tf-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "tf-state-lock"
  }
}

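Set this up once, in a separate project. That sidesteps the chicken-and-egg problem: the bucket and lock table must exist before any config can use them as a backend.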

resource "aws_s3_bucket" "tf_state" {
  bucket = "company-tf-state"
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_dynamodb_table" "tf_state_lock" {
  name         = "tf-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
  attribute {
    name = "LockID"
    type = "S"
  }
}

The S3 bucket stores the state; versioning protects you from accidental overwrites. The DynamoDB table is the lock: at any moment, only one apply can hold the lock. The second engineer running apply gets:

Error: Error acquiring the state lock

…and fails fast (or keeps retrying for a while, if you pass -lock-timeout=5m), instead of corrupting state.

For GCP: a GCS bucket. For Azure: a Storage Account. Both backends lock natively, so there is no separate lock table to create, but the pattern is the same.
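Since the state bucket ends up holding every secret Terraform touches, the bootstrap project can also lock it down. A minimal sketch, assuming AWS provider v4 or later (where these settings are separate resources) and reusing the aws_s3_bucket.tf_state resource from the setup block; the choice of KMS over the S3-managed default is an assumption, not something the setup above requires:

```hcl
# Encrypt every object in the state bucket by default.
resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms" # assumption: the AWS-managed KMS key is acceptable
    }
  }
}

# Block every form of public access to the state bucket.
resource "aws_s3_bucket_public_access_block" "tf_state" {
  bucket                  = aws_s3_bucket.tf_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Access for engineers and CI is then granted through IAM policies scoped to this bucket and the lock table.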

Terraform Cloud as a managed alternative

Terraform Cloud (free tier covers small teams) gives you remote state, locking, run logs, and access controls without the S3+DynamoDB setup:

terraform {
  cloud {
    organization = "your-org"
    workspaces { name = "prod" }
  }
}

Pros: no infrastructure to maintain, run history is visible to the team, role-based access. Cons: external dependency, free tier is genuinely small.

For most teams the S3+DynamoDB approach is fine. For teams with multiple environments and many engineers, Terraform Cloud is worth the price.

State files per environment, not per resource

A common mistake: one giant state file with everything (prod, staging, dev, all services). The first time you need to refactor a single service, the state file is monolithic and terraform plan takes 10 minutes.

The right structure:

infra/
├── prod/
│   ├── network/      ← own state file
│   ├── eks/          ← own state file
│   ├── rds/          ← own state file
│   └── apps/
│       ├── api/      ← own state file
│       └── worker/   ← own state file
└── staging/
    └── ...

Each directory has its own backend config and state. Changes to one component don’t risk another. Plans are fast.
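As an illustration, each component’s backend block can be identical except for its key. The key layout and file name below are assumptions that mirror the directory tree:

```hcl
# infra/prod/network/backend.tf (hypothetical file name)
terraform {
  backend "s3" {
    bucket         = "company-tf-state"
    key            = "prod/network/terraform.tfstate" # one key per component
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "tf-state-lock" # one lock table can serve every component
  }
}
```

The lock entry is keyed by the bucket and state path, so components sharing a table do not block each other.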

The trade-off: cross-component references require explicit terraform_remote_state data sources or output sharing. That’s fine; it makes dependencies explicit.
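A sketch of what that looks like, assuming the network component stores its state under prod/network/terraform.tfstate and manages a counted aws_subnet.private resource (both names are hypothetical):

```hcl
# --- In infra/prod/network: publish what other components may depend on. ---
output "private_subnet_ids" {
  value = aws_subnet.private[*].id # hypothetical resource in the network component
}

# --- In infra/prod/eks: read the network component's outputs (read-only). ---
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "company-tf-state"
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  # Only root-module outputs of the other state are visible here.
  private_subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}
```

The consumer needs read access to the producer’s whole state file, which is one more reason to treat the bucket as sensitive.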

Workspaces vs directories

Terraform’s “workspaces” feature lets one config target multiple environments (terraform workspace select staging). Tempting, controversial.

The case for workspaces: less duplication, single source.

The case against: it is easy to apply staging changes to prod by accident (you think you are in the staging workspace, but you forgot to switch and are still in prod). Conditional logic on the workspace name (terraform.workspace == "prod") gets ugly.

For most teams, directories (one folder per environment) are safer. Workspaces are useful for ephemeral environments (per-PR previews) where the lifecycle is short and the safety risk is low.
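For the ephemeral case, a preview config might derive its names from the workspace and refuse to run in the default one. A minimal sketch, assuming Terraform 1.4 or later (for terraform_data) and a pr-NUMBER workspace naming scheme; the bucket and its naming are hypothetical:

```hcl
locals {
  # terraform.workspace is the built-in name of the selected workspace,
  # e.g. "pr-1234" after `terraform workspace new pr-1234`.
  name_prefix = "preview-${terraform.workspace}"
}

# Guard: fail the plan if nobody selected a per-PR workspace.
resource "terraform_data" "workspace_guard" {
  lifecycle {
    precondition {
      condition     = terraform.workspace != "default"
      error_message = "Select a per-PR workspace first, e.g. terraform workspace new pr-NUMBER."
    }
  }
}

# Every preview gets its own uniquely named resources.
resource "aws_s3_bucket" "preview_assets" {
  bucket = "company-${local.name_prefix}-assets"
}
```

Tearing the preview down is then terraform destroy followed by terraform workspace delete for that workspace.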

The four habits that keep IaC sane

1. Always run plan before apply. Read the plan output. Confirm only the resources you expect are being modified. CI should make this automatic: plan on PRs, apply only on main after review.

2. Pin provider versions.

terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.30.0"
    }
  }
}

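Without a version constraint (and a committed .terraform.lock.hcl), terraform init happily picks up a new major version that changes resource schemas. The ~> 5.30.0 constraint above allows only 5.30.x patch releases. Pin tightly; bump deliberately.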

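3. Don’t manually edit state. terraform state rm and terraform state mv exist but are dangerous. If a resource was created outside Terraform, prefer terraform import to bring it into state, not state edits. If a tracked resource drifted (someone changed it in the cloud console), either update the config to match or let the next apply revert it.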

4. Use modules for repetition. A reusable module is one place to fix bugs. Inline copy-paste is many places to fix bugs.

module "api_service" {
  source = "../modules/ecs-service"
  name   = "api"
  image  = "ghcr.io/company/api:abc123"
  cpu    = 1024
  memory = 2048
}

Build a small library of internal modules (ECS service, RDS, Lambda) so adding a new service is 5 lines.
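For habit 3, newer Terraform versions let you express renames and imports as reviewed config instead of one-off state commands. A sketch, assuming Terraform 1.5 or later (moved blocks need 1.1, import blocks need 1.5); the resource names and the bucket are hypothetical:

```hcl
# Rename a resource without terraform state mv: the next plan shows a move,
# not a destroy-and-recreate. The new address (aws_instance.api_server)
# must already be declared in the config.
moved {
  from = aws_instance.api
  to   = aws_instance.api_server
}

# Adopt a bucket that was created by hand: the import shows up in the plan
# and goes through the same PR review as any other change.
import {
  to = aws_s3_bucket.legacy_uploads
  id = "company-legacy-uploads" # for S3 buckets the import ID is the bucket name
}

resource "aws_s3_bucket" "legacy_uploads" {
  bucket = "company-legacy-uploads"
}
```

Both blocks can be deleted once every environment has applied them.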

CI/CD for Terraform

A working pipeline:

# .github/workflows/tf.yml
on:
  pull_request:
    paths: ['infra/**']
  push:
    branches: [main]
    paths: ['infra/**']

jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write   # needed to post the plan as a PR comment
    steps:
    - uses: actions/checkout@v4
    - uses: hashicorp/setup-terraform@v3
    - name: terraform plan
      working-directory: infra/prod/api
      run: |
        set -o pipefail   # otherwise tee's exit code hides a failed plan
        terraform init
        terraform plan -out=tfplan -no-color | tee plan.txt
    - name: comment on PR
      uses: actions/github-script@v7
      with:
        script: |
          const fs = require('fs');
          const plan = fs.readFileSync('infra/prod/api/plan.txt', 'utf8');
          await github.rest.issues.createComment({
            issue_number: context.issue.number,
            owner: context.repo.owner,
            repo: context.repo.repo,
            body: '```\n' + plan.slice(0, 60000) + '\n```',
          });

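  apply:
    # No "needs: plan" here: the plan job is skipped on push events, and a
    # job that needs a skipped job is skipped as well.
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    environment: prod    # require manual approval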
    steps:
    - uses: actions/checkout@v4
    - uses: hashicorp/setup-terraform@v3
    - name: terraform apply
      working-directory: infra/prod/api
      run: |
        terraform init
        terraform apply -auto-approve

PRs get a plan posted to the PR. Merging triggers an apply that requires manual approval (via GitHub Environments). No engineer runs apply from a laptop.

Atlantis and Terraform Cloud automate this pattern further. For small teams, the GitHub Actions version above is enough.

Drift detection

Resources change outside Terraform: manual fixes during incidents, AWS console edits, automatic policy adjustments. After enough drift, terraform plan proposes “fixing” things that were intentionally changed.

A scheduled drift-detection run catches this:

# Run weekly: plan-only, alert if there is anything to change.
- run: terraform plan -detailed-exitcode
  # exit 0 = no changes, 1 = error, 2 = changes proposed

Integrate the alert with your incident system. The team can decide: import the change into state, or revert it.

Don’t put secrets in tfvars

# DON'T
variable "db_password" {
  default = "actual-password"
}

The password is now committed to the repo, and the state file also contains it, in plaintext, in the S3 bucket. Anyone with repo or bucket access has the password.

Instead:

# DO: pull from a secrets manager at runtime.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db/password"
}

resource "aws_db_instance" "main" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}

The state file still records the value: both the data source result and the aws_db_instance password attribute are stored in state. Treat the state file as sensitive: encrypt at rest, restrict access.
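One way to keep the password out of state altogether is to let RDS generate and store it. A sketch, assuming a recent AWS provider that supports manage_master_user_password on aws_db_instance; the identifier, engine, sizing, and username are placeholders, not values from this setup:

```hcl
resource "aws_db_instance" "main" {
  identifier        = "prod-main"      # hypothetical
  engine            = "postgres"       # hypothetical
  instance_class    = "db.t4g.medium"  # hypothetical
  allocated_storage = 50               # GiB, hypothetical
  username          = "app_admin"      # hypothetical

  # RDS creates the master password and keeps it in Secrets Manager.
  # Terraform never sees the value, so it never lands in state.
  manage_master_user_password = true
}

# Only the secret's ARN is exposed; applications fetch the value at runtime.
output "db_master_secret_arn" {
  value = aws_db_instance.main.master_user_secret[0].secret_arn
}
```

With this approach the password argument is left out entirely.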

Modules vs Terragrunt

Terragrunt wraps Terraform with conventions for DRY environment configs and easier remote state setup. It is excellent for teams managing many environments. For a small team starting out, plain Terraform with directories is enough.

If you adopt Terragrunt later, the migration is straightforward. Terragrunt is a layer on top, not a replacement.

The takeaway

Local Terraform state is for solo work. For a team: remote state with locking (S3 + DynamoDB or Terraform Cloud), separate state files per environment / component, plan-on-PR / apply-on-merge in CI, modules for reuse, drift detection on a cron. Pin provider versions, don’t edit state by hand, keep secrets out of tfvars.

The setup takes a day. It pays for itself the first week somebody else on the team tries to apply infrastructure changes simultaneously.

A note from Yojji

The kind of infrastructure-as-code discipline that scales from one engineer to twenty (remote state, locking, modules, CI gates) is the kind of long-haul DevOps engineering Yojji’s teams put into the cloud platforms they ship for clients.

Yojji is an international custom software development company founded in 2016, with teams across Europe, the US, and the UK. They specialize in the JavaScript ecosystem, cloud platforms (AWS, Azure, GCP), and infrastructure operations, including the Terraform structure and process that decides whether your infra stays manageable as the team grows.
