Secrets Management For Real Teams: Vault, SOPS, And The .env File You Should Burn

The Practical Developer


The Practica · 2023-10-27 · via The Practical Developer

The new engineer joins the team. Onboarding instructions: “ask Slack for the .env file.” A senior engineer DMs them a 4 KB file with database passwords, API keys, and the Stripe live secret. They paste it into 1Password “for safekeeping.” Six months later the engineer has left, the file is still in their personal vault, and nobody knows where else copies of those secrets live.

This is how every team’s secrets management starts and how most stay. The fix is not “be more careful.” It is to put secrets in a system where the answer to “who has access” is auditable, and where rotation is a button-press, not a Slack thread.

This post covers the three credible options for production secrets: HashiCorp Vault, SOPS-encrypted-in-git, and cloud-native (AWS/GCP/Azure), with the trade-offs and the rotation policy that holds up.

What “managing secrets” actually means

Five operations that any system worth using must support:

Store a secret with a name and access policy.
Read it from production at runtime.
Audit who read it and when.
Rotate it without redeploying.
Revoke access when somebody leaves.

A .env file in a chat tool fails 3, 4, and 5. A “secrets manager” that requires a redeploy to rotate fails 4. A system without per-secret ACLs fails 5.
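To make the five operations concrete, here is a toy in-memory model. It is a sketch for reasoning about the checklist, not a real secrets manager: the class name `ToySecretStore` and its method names are invented for this illustration, and a real system would persist and encrypt everything it holds.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ToySecretStore:
    """Hypothetical in-memory model of the five operations (illustration only)."""

    _values: dict = field(default_factory=dict)     # secret name -> current value
    _readers: dict = field(default_factory=dict)    # secret name -> set of principals
    audit_log: list = field(default_factory=list)   # (timestamp, principal, secret name)

    def store(self, name, value, readers):
        # 1. Store a secret with a name and an access policy.
        self._values[name] = value
        self._readers[name] = set(readers)

    def read(self, principal, name):
        # 2. Read at runtime, and 3. record who read it and when.
        if principal not in self._readers.get(name, set()):
            raise PermissionError(f"{principal} may not read {name}")
        self.audit_log.append((datetime.now(timezone.utc), principal, name))
        return self._values[name]

    def rotate(self, name, new_value):
        # 4. Rotate: readers see the new value on their next read, no redeploy.
        self._values[name] = new_value

    def revoke(self, principal):
        # 5. Revoke: remove the principal from every secret's policy.
        for readers in self._readers.values():
            readers.discard(principal)
```

A .env file has `store` and `read` and nothing else, which is exactly why it fails operations 3, 4, and 5.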

The three options below all support these operations to different degrees.

Option 1: HashiCorp Vault

Vault is the heaviest, most flexible option. Run it as a service. Apps authenticate to Vault and request secrets at runtime.

# App authenticates with a Kubernetes service account.
$ vault read database/creds/app-readonly
Key                Value
---                -----
lease_id           database/creds/app-readonly/abc123
lease_duration     1h
lease_renewable    true
password           A1b-2C3d-...
username           v-token-app-...-abc

Strengths:

Dynamic secrets. Vault can generate per-app DB credentials with a TTL of one hour. The credentials are auto-revoked when the lease expires. There is no static “DB_PASSWORD” anywhere.
PKI. Vault can issue and rotate TLS certs.
Transit engine. Encrypt arbitrary data with Vault as the key manager.
Audit log. Every read is logged.
Policy-based access. Fine-grained ACLs.
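On the application side, dynamic secrets mean the app has to track a lease instead of reading a constant. A minimal sketch using Vault's HTTP API and only the standard library follows. It assumes the database secrets engine is mounted at `database/` (as in the CLI example above) and that `VAULT_ADDR` and `VAULT_TOKEN` are set in the environment. The helper names and the five-minute refresh margin are choices made for this example.

```python
import json
import os
import time
import urllib.request


def parse_lease(body, now=None):
    """Extract credentials and an absolute expiry time from a lease response."""
    now = time.time() if now is None else now
    return {
        "username": body["data"]["username"],
        "password": body["data"]["password"],
        "lease_id": body["lease_id"],
        # The HTTP API reports lease_duration in seconds.
        "expires_at": now + body["lease_duration"],
    }


def needs_refresh(creds, now=None, refresh_before_s=300):
    """True once the lease is within refresh_before_s seconds of expiring."""
    now = time.time() if now is None else now
    return now >= creds["expires_at"] - refresh_before_s


def fetch_db_creds(role="app-readonly"):
    """Request a fresh set of short-lived database credentials from Vault."""
    addr = os.environ["VAULT_ADDR"]    # assumed to be set by the platform
    token = os.environ["VAULT_TOKEN"]  # assumed to come from an auth method
    request = urllib.request.Request(
        f"{addr}/v1/database/creds/{role}",
        headers={"X-Vault-Token": token},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return parse_lease(json.load(response))
```

The connection pool would call `needs_refresh` before handing out a connection and call `fetch_db_creds` again when it returns true, so an expired lease never surprises a live request.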

Costs:

You run a service. Vault is a stateful, replicated, hard-to-operate distributed system. If Vault is down, any app that fetches its secrets at startup cannot start.
Learning curve. Auth methods, policies, secret engines; it is a real product to learn.
Total cost. HashiCorp’s commercial pricing is non-trivial; the open-source version is free but requires real ops investment.

Vault is the right answer for organizations large enough to have a dedicated security or platform team. It is overkill for a 5-engineer startup.

Option 2: SOPS-encrypted in Git

SOPS (Secrets OPerationS) encrypts the values of a YAML or JSON file using a KMS key, leaving the keys in plaintext. The encrypted file lives in Git like any other config:

# secrets.enc.yaml
database:
  host: db.example.com
  password: ENC[AES256_GCM,data:abc123,iv:...,tag:...,type:str]
stripe:
  api_key: ENC[AES256_GCM,data:def456,...,type:str]
sops:
  kms:
  - arn: arn:aws:kms:us-east-1:...:key/abc-123

Anyone with access to the KMS key can decrypt; the file is otherwise gibberish. Use AWS KMS, GCP KMS, Azure Key Vault, or age for a key-pair-based approach.

Strengths:

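Lives in Git. No separate service. The git history is the audit log for changes, and diffs are reviewable; decrypt events are logged by the KMS provider (CloudTrail for AWS KMS), not by Git.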
Cheap. KMS keys cost cents per month.
Simple. Decryption happens at deploy time; runtime is a plain env var.
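The deploy-time step can be sketched as a small script that shells out to the `sops` CLI and flattens the result into env-style names. This assumes `sops` is on the PATH and the deploy role can use the KMS key; the `DATABASE_PASSWORD`-style naming convention and the function names are choices made for this example, not something SOPS prescribes.

```python
import json
import subprocess


def flatten(tree, prefix=""):
    """Flatten nested config into env names: database.password -> DATABASE_PASSWORD."""
    env = {}
    for key, value in tree.items():
        name = f"{prefix}{key}".upper()
        if isinstance(value, dict):
            env.update(flatten(value, prefix=f"{name}_"))
        else:
            env[name] = str(value)
    return env


def decrypt_to_env(path="secrets.enc.yaml"):
    """Decrypt a SOPS file and return its values as a flat dict of env vars."""
    # Decryption only succeeds if the caller can use the key named in the file.
    result = subprocess.run(
        ["sops", "--decrypt", "--output-type", "json", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return flatten(json.loads(result.stdout))
```

Passing the returned dict straight to the process that launches the app, rather than printing it or writing it to disk, keeps the plaintext out of build artifacts and CI logs.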

Costs:

No dynamic secrets. Rotation requires a commit + deploy.
Encrypted-at-rest only. Once decrypted, the secret is in your build artifacts, your CI logs (if you’re not careful), or your container’s env.
No per-app ACLs within a file. Everyone with the KMS key can decrypt everything that key encrypts; splitting access means splitting files and keys.

For most teams under ~50 engineers, SOPS hits the sweet spot. The “secrets are in Git” psychology takes getting used to but the encryption is genuinely strong.

Option 3: Cloud-native (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault)

Each cloud has a managed secrets service. They look similar:

aws secretsmanager get-secret-value --secret-id prod/db/password

Strengths:

Managed. No service to run.
Native integration. EKS pods can mount secrets via CSI drivers; Lambda has built-in support; ECS tasks read at startup.
Per-secret ACLs. IAM policies scope access.
Audit logs in CloudTrail / Cloud Audit / Activity Logs.

Costs:

Per-secret pricing. AWS Secrets Manager is $0.40/secret/month, which adds up to real money at scale.
Cloud lock-in. A multi-cloud or hybrid setup needs a different approach.
Rotation is supported but uneven. AWS RDS rotation works well; rotation for arbitrary third-party APIs is on you.

For teams already deep in one cloud, the cloud-native option is usually the lowest-effort credible answer.
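One way to get "rotate without redeploying" out of a managed service is to read the secret at runtime through a short cache, so a rotated value is picked up within a few minutes and per-call API charges stay low. The sketch below wraps the same CLI call shown above. The class name `CachedSecret`, the five-minute TTL, and the injectable `fetch` and `clock` parameters are assumptions made for illustration; a production service would more likely use the cloud SDK.

```python
import subprocess
import time


def fetch_from_aws(secret_id):
    """Fetch the current secret string using the AWS CLI."""
    result = subprocess.run(
        [
            "aws", "secretsmanager", "get-secret-value",
            "--secret-id", secret_id,
            "--query", "SecretString",
            "--output", "text",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


class CachedSecret:
    """Re-reads a secret at most once per ttl_s seconds (hypothetical helper)."""

    def __init__(self, secret_id, fetch=fetch_from_aws, ttl_s=300, clock=time.monotonic):
        self._secret_id = secret_id
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._value = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            # Cache miss or expired entry: go back to the secrets service.
            self._value = self._fetch(self._secret_id)
            self._expires_at = now + self._ttl_s
        return self._value
```

The trade-off is the TTL itself: a shorter one picks up rotations faster and costs more API calls, so the old credential needs to stay valid for at least one TTL after rotation.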

The decision tree

A realistic flowchart:

Do you need dynamic per-app database credentials with TTL? Yes → Vault.
Are you single-cloud and already using KMS / IAM heavily? Yes → cloud-native.
Are you a small team that wants minimum operational surface? Yes → SOPS in Git.
Some combination? SOPS for static long-lived secrets (third-party API keys), cloud-native or Vault for high-rotation database credentials.

Most teams start with SOPS and migrate to a cloud-native manager or Vault when scale demands it. Migration is real work; pick something you can live with for a couple of years.
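Written as code, the flowchart is a chain of early returns, checked in the order the questions are asked. This is one reading of the tree rather than a rule: the function and parameter names are invented here, and treating "some combination" as the fall-through case is an interpretation.

```python
def pick_secrets_tool(needs_dynamic_db_creds, single_cloud_heavy_iam, small_team_min_ops):
    """Walk the decision tree top to bottom and return the first match."""
    if needs_dynamic_db_creds:
        return "vault"
    if single_cloud_heavy_iam:
        return "cloud-native"
    if small_team_min_ops:
        return "sops"
    # No single answer: split static and high-rotation secrets across tools.
    return "mixed: sops for static keys, cloud-native or vault for database credentials"
```

The order matters: a small team that truly needs dynamic database credentials still lands on Vault, because that question is asked first.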

The rotation policy that holds up

Whatever tool you pick, you need a rotation policy. The realistic version:

Database passwords: rotate quarterly. With Vault dynamic credentials, this is automatic. With static creds, schedule a maintenance window per quarter.
Third-party API keys: rotate when an employee leaves who had access to them, or annually.
Cloud IAM keys: prefer instance roles / workload identity; long-lived keys rotated quarterly.
Signing keys (JWT, etc.): rotate every 6-12 months with a grace period.
TLS certs: automated via cert-manager / Let’s Encrypt; verify the renewal works.

The frequency matters less than the practice of rotating. A team that has rotated zero secrets in two years has secrets they cannot rotate (because they don’t remember which apps use them).
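A rotation policy is only real if something checks it. The sketch below encodes the intervals from the list above and reports which secrets are overdue. The inventory format (name, kind, last-rotated date) is hypothetical, quarterly and annual are approximated as 90 and 365 days, and signing keys use the upper end of the 6-12 month range.

```python
from datetime import date, timedelta

# Intervals taken from the policy above, approximated in days.
ROTATION_INTERVALS = {
    "database_password": timedelta(days=90),
    "third_party_api_key": timedelta(days=365),
    "cloud_iam_key": timedelta(days=90),
    "signing_key": timedelta(days=365),
}


def overdue(inventory, today):
    """Return names of secrets past their interval, most overdue first.

    inventory is a list of (name, kind, last_rotated) tuples.
    """
    late = []
    for name, kind, last_rotated in inventory:
        days_late = (today - last_rotated - ROTATION_INTERVALS[kind]).days
        if days_late > 0:
            late.append((days_late, name))
    return [name for _, name in sorted(late, reverse=True)]
```

Run from a scheduled job, a non-empty result becomes a ticket. Building the inventory is the useful part: it forces the team to write down which apps use which secret.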

What to do when somebody leaves

The “engineer leaves” checklist is the test of your secrets management. The right answer is:

Revoke their personal access (SSO, GitHub, cloud accounts), automated.
Rotate any secret they had access to, possibly automated.

If step 2 is “ask around what they had access to,” you have a problem. The secrets-management system should answer “what could this user have read in the last 90 days?” from audit logs.
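If the audit log can be exported as structured events, the 90-day question is a short filter. The sketch assumes each event is a dict with `user`, `action`, `secret`, and `at` fields. That shape is made up for this example; Vault audit devices and CloudTrail each have their own formats that would need mapping first.

```python
from datetime import datetime, timedelta


def secrets_read_by(audit_events, user, now, window_days=90):
    """Return the sorted names of secrets the user read inside the window."""
    cutoff = now - timedelta(days=window_days)
    return sorted({
        event["secret"]
        for event in audit_events
        if event["user"] == user
        and event["action"] == "read"
        and event["at"] >= cutoff
    })
```

The output is the rotation list for step 2. Note that it covers recorded reads only; secrets the user was allowed to read but never did come from the access policies, not the log.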

Things to never put in a secrets manager

A few patterns that look like secrets but aren’t:

Configuration that is not sensitive (port numbers, feature flags). Put in plain config files.
Public-facing API keys (the kind a frontend ships to identify your account). They are public; don’t pretend.
Secrets that change every request (per-user JWTs). Mint at runtime, don’t store.

The secrets manager is for things that are sensitive and long-lived. Anything else belongs elsewhere.

Common antipatterns

Leaking secrets from environment variables into CI logs. A set -x in a shell script and you’ve leaked. Use the CI’s secret-injection mechanism (GitHub Actions secrets, GitLab masked variables) and never echo them.
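For scripts you control, a last line of defense is to mask known secret values before anything is written to a log. The helper below is a hypothetical sketch of that idea, not a replacement for the CI's own masking: it only catches exact matches, so a base64-encoded or truncated secret slips through.

```python
def redact(line, secret_values, mask="***"):
    """Replace every known secret value in a log line with a mask."""
    # Longest first, so a secret that contains another one is masked whole.
    for value in sorted(secret_values, key=len, reverse=True):
        if value:  # an empty string would match everywhere
            line = line.replace(value, mask)
    return line
```

Wrapping the script's output writer with `redact` means a stray debug print leaks a mask instead of the token.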

Embedding secrets in container images. Multi-stage builds have helped, but a RUN apt-get && SECRET=abc... baked into a layer means anyone with the image has the secret. Use build secrets (docker build --secret) or runtime injection.

Single shared key for everything. “Master API key” with full access to everything. Should be split: read-only for analytics, write for billing, etc.

No backup of root keys. If you lose your KMS key or your Vault unseal keys, secrets become unrecoverable. (A lost Vault root token alone can be regenerated from a quorum of unseal or recovery keys.) Have an offline backup with a clearly documented recovery procedure.

The takeaway

A .env file in 1Password is a starting point, not a destination. Pick a real secrets management system (Vault, SOPS-in-Git, cloud-native) before you have to. Establish a rotation rhythm. Make “who can read this secret” a query, not a Slack thread.

The next time an engineer leaves, the response should be one button-press, not a six-hour audit. The next time you need to rotate a database password, it should be a Friday afternoon, not a sprint.

A note from Yojji

The kind of security hygiene that turns “we know who has access to what” from a wish into a query (secrets management, rotation, audit logs) is the kind of long-haul engineering discipline Yojji’s teams put into the products they hand back to clients.

Yojji is an international custom software development company founded in 2016, with teams across Europe, the US, and the UK. They specialize in the JavaScript ecosystem, cloud platforms (AWS, Azure, GCP), and full-cycle product engineering, including the security and operations work that decides whether your secrets are auditable or scattered across team chats.
