Public postmortem: database connection exhaustion
Justin Duke · 2026-04-01

Yesterday (March 31st), Buttondown experienced two periods of downtime — approximately seven minutes and thirteen minutes, respectively — both stemming from the same root cause: database connection exhaustion.

To be specific: our database itself was healthy. Queries were fast, CPU and memory were fine, replication lag was nominal. The problem was simpler and, frankly, more embarrassing than that: we hit the ceiling on the number of connections our database was configured to accept. Once that ceiling was hit, new requests couldn't acquire a connection and failed.

First off, apologies for the disruption — particularly because this happened twice in one day.

How did we detect the issue?

This is where things get uncomfortable. Our health checker mostly reported things as fine. The health check endpoint returned 200s for the majority of requests because the checker's requests happened to land on workers that already held open connections. Think of it like a house party where the front door has collapsed: if you're already inside, or you know where the back door is, everything seems fine. Our external monitoring was essentially getting lucky — squeezing through just often enough to not trigger alerts.

We were alerted by user reports and our own manual observation, not by automated systems. That's not acceptable.

How did we mitigate the issue?

First incident: We identified the connection exhaustion, killed active queries to free up connections, and earmarked follow-up work for later. Downtime: ~7 minutes.

Second incident: Same root cause, but this time the connection count was so thoroughly saturated that we couldn't even connect to the database to kill queries. The tool we'd used to fix the first incident required the very resource that was exhausted. We had to restart the database to force-kill the ongoing queries. Downtime: ~13 minutes.
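
To make the lockout in the second incident concrete, here is a toy model of a connection ceiling. It is an illustration under assumptions, not Buttondown's tooling or a real database driver: the `ConnectionLimit` class, its `reserved_for_admin` parameter, and the numbers are hypothetical. In PostgreSQL, the `superuser_reserved_connections` setting plays a comparable role; the post does not say how the database in question was configured.

```python
class ConnectionLimit:
    """Toy model of a database connection ceiling (not a real driver or pool)."""

    def __init__(self, max_connections: int, reserved_for_admin: int = 0) -> None:
        if not 0 <= reserved_for_admin <= max_connections:
            raise ValueError("reserved_for_admin must be between 0 and max_connections")
        self.max_connections = max_connections
        self.reserved_for_admin = reserved_for_admin
        self.in_use = 0

    def try_connect(self, is_admin: bool = False) -> bool:
        """Return True and take a slot if one is available to this caller."""
        # Ordinary clients may only use the non-reserved part of the ceiling.
        if is_admin:
            limit = self.max_connections
        else:
            limit = self.max_connections - self.reserved_for_admin
        if self.in_use >= limit:
            return False
        self.in_use += 1
        return True


def saturate(limit: ConnectionLimit) -> int:
    """Open ordinary connections until the limit refuses; return how many succeeded."""
    opened = 0
    while limit.try_connect():
        opened += 1
    return opened


# No reserve: once application traffic fills the ceiling, the operator is locked out too.
shared = ConnectionLimit(max_connections=5)
saturate(shared)
admin_ok_without_reserve = shared.try_connect(is_admin=True)

# One reserved slot: application traffic stops at 4, and the operator can still get in.
reserved = ConnectionLimit(max_connections=5, reserved_for_admin=1)
saturate(reserved)
admin_ok_with_reserve = reserved.try_connect(is_admin=True)

print(admin_ok_without_reserve, admin_ok_with_reserve)  # False True
```

The reserve costs ordinary traffic one slot, which is why it is usually kept small.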

How will we prevent this from happening again?

Five things, roughly in order of "should have already existed" to "genuinely new investment":

  1. Reserved administrative connections. This is the single highest-leverage change. Most Postgres configurations support reserving a small number of connections specifically for administrative access. If we'd had even one reserved connection for ops, we could have killed queries during the second incident the same way we did during the first. The reason the second incident lasted nearly twice as long as the first is that we were locked out of our own fix. That won't happen again.

  2. Database-level alerting. We're adding direct monitoring on connection count as a percentage of the configured maximum. This is distinct from our end-to-end health checking — it doesn't care whether HTTP requests are succeeding. It watches the database itself and alerts when we're approaching capacity. By the time you read this, this should be live.

  3. Health check hardening. Our health check endpoint currently returns a static 200 — it doesn't actually verify that the process can acquire a database connection. We're changing it to attempt a lightweight query so that connection exhaustion is immediately visible to our health checker and load balancer. A health check that can't detect the most common failure mode isn't much of a health check.

  4. Connection headroom review. The connection limit that turned out to be too low had been set a long time ago and never revisited as our traffic patterns changed. We're adding connection limits to our quarterly capacity review so this kind of slow drift doesn't catch us off guard again.

  5. Out-of-band recovery tooling. Beyond reserved connections, we're documenting and scripting a fallback for when even administrative access is blocked: force-recycling workers to release connections without needing a database connection at all. Not as surgical as killing individual queries, but it releases connections immediately and gives us a way out when nothing else works.
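
Items 2 and 3 in the list above can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not Buttondown's implementation: the function names, the 0.8 threshold, and the injected `run_probe_query` callable are all hypothetical. In PostgreSQL, the two inputs to the alert could come from `SELECT count(*) FROM pg_stat_activity` and `SHOW max_connections`, but the post does not say how the real monitor gathers them.

```python
from typing import Callable


def connection_utilization(active: int, max_connections: int) -> float:
    """Fraction of the configured connection ceiling currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections


def should_alert(active: int, max_connections: int, threshold: float = 0.8) -> bool:
    """Alert once usage reaches the threshold (0.8 is an illustrative value)."""
    return connection_utilization(active, max_connections) >= threshold


def health_check(run_probe_query: Callable[[], object]) -> int:
    """Return an HTTP status: 200 if a lightweight query succeeds, 503 otherwise."""
    try:
        # In a real service this would acquire a connection and run
        # something like "SELECT 1"; here it is an injected stand-in.
        run_probe_query()
    except Exception:
        return 503
    return 200


def exhausted_probe() -> None:
    """Stand-in for a probe that cannot acquire a connection."""
    raise ConnectionError("too many connections")


print(should_alert(active=85, max_connections=100))  # True
print(health_check(lambda: 1))                       # 200
print(health_check(exhausted_probe))                 # 503
```

Passing the probe in as a callable keeps the check independent of any particular driver or framework. The important property is that the check fails whenever the process cannot obtain a connection, rather than always answering 200.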