惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

GbyAI
GbyAI
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
P
Proofpoint News Feed
L
Lohrmann on Cybersecurity
S
Secure Thoughts
Attack and Defense Labs
Attack and Defense Labs
人人都是产品经理
人人都是产品经理
Stack Overflow Blog
Stack Overflow Blog
W
WeLiveSecurity
O
OpenAI News
SecWiki News
SecWiki News
博客园 - Franky
NISL@THU
NISL@THU
Microsoft Azure Blog
Microsoft Azure Blog
T
Tor Project blog
Microsoft Security Blog
Microsoft Security Blog
aimingoo的专栏
aimingoo的专栏
Security Latest
Security Latest
H
Hacker News: Front Page
Google Online Security Blog
Google Online Security Blog
P
Privacy & Cybersecurity Law Blog
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
D
Darknet – Hacking Tools, Hacker News & Cyber Security
月光博客
月光博客
李成银的技术随笔
Spread Privacy
Spread Privacy
F
Full Disclosure
F
Fortinet All Blogs
T
The Exploit Database - CXSecurity.com
Vercel News
Vercel News
AWS News Blog
AWS News Blog
WordPress大学
WordPress大学
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
V
Visual Studio Blog
J
Java Code Geeks
博客园 - 三生石上(FineUI控件)
G
Google Developers Blog
云风的 BLOG
云风的 BLOG
博客园 - 司徒正美
Engineering at Meta
Engineering at Meta
Last Week in AI
Last Week in AI
P
Palo Alto Networks Blog
宝玉的分享
宝玉的分享
T
True Tiger Recordings
N
News and Events Feed by Topic
酷 壳 – CoolShell
酷 壳 – CoolShell
Cisco Talos Blog
Cisco Talos Blog
N
News | PayPal Newsroom
S
SegmentFault 最新的问题
Jina AI
Jina AI

DEV Community

COOKIES DPRK Hacking Trends 2026: AI‑Powered Supply Chain and Developer Environment Attacks Phone control for AI coding sessions is not a tiny terminal PayPal and Crypto Are Not Equals: How I Built a Gumroad Alternative for Restricted Countries Exploring Tech as a Content Writer I Raised Gemma 4's Token Cap. The Dense Model Stopped Refusing. React Server Components Don't Make Your App Fast by Default Multi-Stage Builds for a Next.js App — Reduce Image Size by 70% I Built a Chrome Extension That Teaches Vocabulary While You Browse Why I Walked Back from Next.js and RSC to a Plain SPA and a Separate Backend NeuralPocket: Private On-Device AI with Gemma 4 — Android & Web Github Speckit: Revolucionando o Desenvolvimento com SDD Cloud Cost Elasticity I Built a Payment System for Bangladesh—Heres Why Stripe Failed Us Polyglot Persistence in Microservices: Choosing the Right Database for Each Service Centralized Authentication for a Multi-Brand Laravel Ecosystem How I made a perfect recording button. Simple yet complex thing. Mumbli – my personal Wispr Flow Getting Paid Should Not Be a Geopolitical Nightmare: My NOWPayments Integration Story Four Layers of Validation in Kubernetes with Claude Code Prompt Flow — a visual side project for flow design, trace, and integration steps (looking for feedback) AI Citation Registry: Temporal Gaps in Government Publishing Cycles ShowDev: I built a 100% local, zero-upload PDF editor using WebAssembly JavaC Written by an AI Pipeline, Verified by Three Models. Is It Slop? Part1 Vulkan: Drawing Triangle 1 Why I Stopped Using useEffect to Sync State — and What I Use Instead Por qué dejé de usar useEffect para sincronizar estado y qué uso ahora Migrating a Long-Running WordPress Site to Payload CMS (And All The Chaos That Came With It) Hidden Partitioning: How Iceberg Eliminates Accidental Full Table Scans Azure DevOps Structure Explained: Organizations, Projects, and Repos Without the Mess A Simple React Hook for localStorage State, Expiry, and Sync I sold you on /scratchpad. Then I migrated to /note. Fixing WSL Errors on Windows 11 Your app is not Netflix. Stop building like it is. Resolving inter-service communication issue I built an email cleaner. CSV parsing took longer than the actual validators. How I Would Learn Full-Stack Development in 2026 If I Started From Zero Partition Evolution: Change Your Partitioning Without Rewriting Data What Google Play's I/O 2026 Updates Look Like From a Solo Indie Puzzle Developer Forgetting the Myth of "Ease of Integration" When Selling Digital Products with Bitcoin My 4-Step Regex Debugging Workflow (That Actually Saves Time) Stop Scraping Betting Sites: How to Build a Real-Time Sports Tracker in Python Civic Identity and Responsibility in Modern Democracy OLTP vs OLAP Are binaries really executable code ? The lie of the 80%: why software progress charts don't work What a Datacenter in Space Actually Buys You: Three Server Racks Is AI Actually Citing Your Site? How to Measure What Google Rankings Can't Accessibility - This looks like a job for a developer advocate! I built a Mac app that turns web pages into live widgets How to Teach Source Evaluation When Your Students Use ChatGPT More Context Does Not Mean More Trust RAG Series (24): Code RAG — Teaching AI to Understand Your Codebase Past the JVM Design decisions behind my “Irregular German Verbs” iOS app WordPress 7.0 "Armstrong" Is Live — Post-Release Deep Dive 🎺 Performance and Apache Iceberg's Metadata I Shipped a Bug to Production That Cost Us 3 Hours of Downtime 程序人生:在代码与时间之间 The Wrong Way to Think About XRPL Event Infrastructure What I Learned About MND, Voice Banking, and Why Assistive Tech Is Personal $1.50/Month Email Infrastructure That Beats Your $20 SendGrid Plan Cloud Unit Economics: The Metrics DevOps and FinOps Teams Actually Need Bypassing Payment Platform Restrictions Was The Best Decision I Ever Made For My Digital Product Business The Hidden Life of a Container: A Complete Lifecycle When a port is already in use, there is no interactive way to find it — so I built `port-peek` Como Sumir com o Barulho do Teclado Mecânico no Ubuntu Usando o NoiseTorch Google I/O 2026 dropped a bomb on Android tooling, and nobody's talking about it (or maybe they are 😅) Mentoring Junior Developers: What Actually Works How I Prevented Claude Code from Breaking My Architecture with 18 Tests That Run in 0.4 Seconds I Controlled an ESP32 Drone Using Only My Voice vite HMR is silently the reason ur laptop fan wont stop AI Agents Security for Developers: Don't Let Your Agents Become a Liability Single List Keyboard Handling 9 SaaS development companies worth knowing (a technical look) Material Nova — The Best VS Code Theme of 2026 Inference Routing Is Becoming an Infrastructure Placement Problem I just build a League MBTI Analytics Why I Built My Own Site with Astro, Not WordPress when I use WordPress for a Living Hello! I'm a balloon artist who started 3D modeling 7 Next.js 16 Caching Bugs That Compile Fine and Break Silently in Production I got tired of writing READMEs so I built a tool that generates them from your GitHub URL FrontGate: a Lightweight Package Proxy for Supply Chain Security Why Your Expense Tracking Architecture Keeps Breaking Stop your AI trading agent from hallucinating technical analysis Breaking the Monorepo Barrier in a Crypto Store for Digital Products Imposter Syndrome Is Something We All Struggle With at Some Point in Our Careers Moving Beyond the Black Box: How I Built a Real-Time Voice Fitness Coach using Next.js 15, Convex, & Vapi.ai How to Recover Kafka DLQ Messages After a Schema Change Broke Your Consumer From Spec-Driven Development to Attractor-Guided Engineering Githubster free tool to track your GitHub followers and unfollowers Why Bitcoin Core RPC is Too Slow for High-Frequency Trading (And How to Fix It) Why Reading Food Labels Shouldn't Feel Like Decoding a Chemistry Exam I built a "brain" for AI coding agents — it never forgets and never stops How to Build a Local LLM Agent to Automate Work List Generation from Monthly Reports (With Jira Integration) Controlling Employee AI Usage on Managed Devices: Browser Controls, Cloudflare AI Gateway, and AWS Bedrock When Global Payment Gateways Fail, Local Solutions Shine LeetCode Solution: 13. Roman to Integer End-to-End Observability for vLLM and TGI: from DCGM to Tokens
When the API literally burned your database after a typo
mary moloyi · 2026-05-21 · via DEV Community

The Problem We Were Actually Solving

We needed a staging environment that could tolerate the stupidity of humans. Not a toy cluster that looked like production but couldnt survive a mis-typed curl flag. The real requirement was: if an engineer turns staging into a dumpster fire at 3 am, nothing outside staging should notice. We also needed to ship a new subscription checkout flow for creators in countries where PayPal blocks transactions. The flow had to store settlement schedules, retry failed charges, and emit events to Kafka so analytics could bill creators in USD without touching the blocked jurisdiction. The first cut used DynamoDB with on-demand capacity and TTLs, but the finance team vetoed it because the eventual-consistency model could under-charge a creator in Kazakhstan by 0.03 USD and we wouldnt know for 12 hours.

What We Tried First (And Why It Failed)

We started with a Terraform module for RDS Postgres 14, parameter group set to db.t3.medium, publicly accessible = true, and storage_encrypted = false. It passed the linting stage because the linter only checked for AWS tags. We deployed staging with terraform apply -auto-approve -var environment=staging. Two weeks later an intern ran a chaos experiment that killed the master node; Prometheus screamed about 503s on /health but the auto-scaling policy had cooldown = 300 seconds and the replacement node took 7 minutes to come up because the init script downloaded 1.2 GB of fonts for a demo dashboard. The payment service still didnt retry DNS, so the first 480 requests failed.

We also tried a separate staging Kafka cluster (msk.t3.small, three brokers) to test exactly-once semantics. The topic auto-created with retention.ms=604800000, which meant a single misconfigured producer could retain a week of test messages and run the cluster out of disk in six hours. Kafka Manager showed the disk usage in red, but the on-call rotation had no threshold rule for broker disk < 30 GB. The staging alert router pointed to a Slack webhook that posted to #staging-alerts; the channel had 477 muted threads and the message was scrolled away before anyone noticed.

The Architecture Decision

We rebuilt staging to be disposable and observable.

First, we replaced the Terraform RDS module with an ephemeral Postgres in Kubernetes using the zalando/postgres-operator. The operator runs a primary pod with a sidecar pgbouncer, a standby set, and a volume snapshot every hour. The snapshot lands in an S3 bucket with SSE-KMS, so even if the cluster melts, a simple kubectl apply -f restore.yaml brings it back in five minutes. We turned publicly accessible off and added an AWS PrivateLink endpoint so the Kubernetes cluster could reach the DB without an internet gateway. The terraform apply now uses –target=module.vpc –target=module.eks first to prevent route table race conditions.

Second, we moved the checkout retry logic from HTTP to a durable queue. Instead of the legacy PHP helper that multiplied null by a million, we wrote a Go service that consumes from an SQS FIFO queue named checkout-payments.fifo with message group ID set to the creators UUID. The queue has visibility timeout 300 seconds and maximum receive count 3. The service writes to Postgres in a transaction, then publishes a Kafka event only after commit. We disabled TTL on the queue because we never want a subscription schedule to evaporate during an outage. The service exports a Prometheus metric checkout_service_retry_count{queue=fifo} that fires an alert if the count > 5 per minute.

Third, we built a nightly chaos pipeline that runs in staging: it randomly kills the Postgres primary, injects 500 ms latency on the PrivateLink, and rewrites a checkout row to status=failed_then_retry=true. The pipeline is scheduled by GitHub Actions at 02:00 UTC and posts results to a dedicated Slack channel. If the primary doesnt recover within five minutes, the pipeline pages the on-call rotation. The chaos job uses terraform destroy -target=module.staging_db followed by terraform apply to ensure


Treated the payment platform as infrastructure. Found the single point of failure. This is the replacement I put in place: https://payhip.com/ref/dev4