IT Notes - recovery

Stefano Marinelli · 2024-12-28 · via IT Notes - recovery

Yesterday afternoon, a developer notified me that Sentry was down again. I checked: 200 gigabytes of database, filled in just a few hours. And it reminded me of the night, years ago, when I almost "died" because of a full Sentry.

A client had launched a new service with great fanfare. After a few days, the attention was through the roof. While some of his devs were pushing to move "to the Cloud", he had decided against it. He always had a vision similar to mine. Thanks to his competence and clear ideas, he had managed to convince several investors to sponsor the project. Things were going well. This person was highly skilled, both technically and administratively, but, after the initial development, he became very absorbed in everything surrounding the project.

The two dedicated servers in place were handling the load with ease, even under heavy traffic. Just over €100 per month for thousands of concurrent users. Good implementation, good optimization, backups every 5 minutes via zfs send and zfs receive to a third backup server.

The only issue: some devs had requested Sentry for monitoring. In this case too, we opted for a self-hosted setup. A dedicated server was set up just for Sentry: 2 TB of fast storage. After a week, it had used just 2 GB of disk space. Everything was fine, I was feeling confident. But...

...one afternoon, something went wrong. Suddenly, the web apps on both dedicated servers became extremely slow. Then unreachable, despite their minimal load. Panic ensued. Some devs started chanting the usual mantra, "we need more powaaaar!!!", but in reality the servers were under zero load. What they did have was a ton of outgoing network connections.

I investigated and realized that everything was grinding to a halt when calls to Sentry were made. I logged into the Sentry server: full. And this wouldn't have been an issue if it weren't for the fact that the production apps would freeze if Sentry didn’t respond. It had been implemented in a blocking way, with a timeout of 300 seconds. The notifications about that server were sent directly to the devs, but it seems that the common alias they provided me actually pointed to nothing more than a test mailbox, which was abandoned right after the account was created.
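For illustration only, here is a minimal Python sketch of the opposite approach: error reporting that cannot stall the request path. The post does not show the application's code or even its language, so everything here is an assumption: the ErrorReporter class, the http_send helper, the 2-second timeout and the sentry.invalid URL are all hypothetical names and values. Events go into a small bounded queue and a background thread delivers them; if the error tracker is slow or down, events are dropped instead of blocking the caller.

```python
import json
import queue
import threading
import urllib.request


def http_send(event, url="http://sentry.invalid/api/store/", timeout=2.0):
    """Hypothetical sender: POST the event as JSON with a short timeout.

    The URL is a placeholder, not a real endpoint from the post.
    """
    body = json.dumps(event).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout):
        pass


class ErrorReporter:
    """Queue events and deliver them from a background thread."""

    def __init__(self, send, max_pending=100):
        self._send = send
        self._queue = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # only updated by the threads calling report()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def report(self, event):
        """Never blocks: returns False if the event had to be dropped."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        while True:
            event = self._queue.get()
            try:
                self._send(event)
            except Exception:
                # A failing error tracker must never take the app down.
                pass
            finally:
                self._queue.task_done()
```

With something like ErrorReporter(http_send), a call to report() returns immediately whether or not the tracker answers. The trade-off is deliberate: losing some error events is cheaper than freezing production for 300 seconds per call.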

The project manager identified the issue as a deployment made just a few hours earlier by a dev. They rolled back the changes. Everything returned to normal. We chalked it up as a bad experience. The dev in question began arguing that we had taken the wrong approach because, in the cloud, "this wouldn’t have happened". I thought to myself (but kept it to myself) that if he had written proper code, this wouldn't have happened either. But in his eyes, I was probably just the sysadmin being a pain and judging the "quality" of his work. In the cloud, much of this is often hidden by autoscaling or other abstractions.

Two days later, I had a high fever. The doctor prescribed some medication, but I had a pretty bad allergic reaction to one of them. By the evening, I was feeling awful and went to bed. I was exhausted, so I fell asleep immediately. In the middle of the night, a series of notifications woke me up: everything was down again. Even though I was sick, I got out of bed and rushed to the computer. Another deployment had been made two hours earlier, again by that same dev. Sentry was down again. Full again.

This time, I acted urgently and independently. I shut down Sentry and configured the reverse proxy to respond with a "200" to every call, just to keep the application running. Emptying Sentry would have been complicated, given that a VACUUM FULL in PostgreSQL rewrites each table into a new file and therefore requires free disk space - which we didn’t have - and time. But I wasn't feeling well. Not at all. I got up, and my wife followed me, seeing how unwell I was. I didn't even check if the alerts had cleared; I ran to the bathroom. That's all I remember. My wife heard a thud; I had fainted and hit my head on the floor. Luckily, I had been sitting when I fell, so the bump wasn't too bad. Shortly after, I woke up. The first thing my wife, worried but knowing me well, said was: "The alerts are clear; the servers are up". I relaxed. I had a double espresso, pulled myself together, and went back to bed, though I couldn't sleep.
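The "answer 200 to everything" stopgap was done in the reverse proxy, and the post does not give that configuration. As a rough stand-in, the same idea can be sketched with Python's standard library: a tiny server that accepts any request, throws the payload away and replies 200 at once, so blocking clients get their answer immediately. The names AlwaysOK and make_stub, and the empty JSON reply body, are assumptions made for this sketch.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class AlwaysOK(BaseHTTPRequestHandler):
    """Reply 200 to every request and discard whatever was sent."""

    def _ok(self):
        # Read and discard the body when a Content-Length is given.
        # Chunked request bodies are not handled in this sketch.
        length = int(self.headers.get("Content-Length") or 0)
        if length:
            self.rfile.read(length)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", "2")
        self.end_headers()
        if self.command != "HEAD":
            self.wfile.write(b"{}")

    # Same behaviour for the common HTTP methods.
    do_GET = do_POST = do_PUT = do_HEAD = _ok

    def log_message(self, format, *args):
        # Stay silent: the whole point is to stop filling disks.
        pass


def make_stub(host="127.0.0.1", port=0):
    """Build the stub server; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer((host, port), AlwaysOK)
```

Calling make_stub(port=9000).serve_forever() would run it on a fixed port. Like the original workaround, this silently loses every event, so it only buys time until the real cause is fixed.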

I sent an email to the project manager and told him: I can't keep this setup running. I'm not capable of jumping out of bed every time Sentry crashes and everything stops because of it. It fills up too quickly, and I don't think there's any reason to log and retain that much data. Even less so for the calls to be blocking. In my opinion, that whole part of the development needs to be reviewed.

The dev kept insisting that we needed to move to the cloud to avoid these issues. I pointed out that in the cloud, it would have been the same unless we set extremely high limits. In that case, it would have been costly, and we'd still just be kicking the problem down the road. Code needs to be written properly; you can't just waste money and resources endlessly to cover up inefficiencies.

Since replacing the dev wasn't an option for various reasons, they decided to humor him and moved Sentry to a cloud instance, partly to free me from the situation. I was happy to only manage the production servers. It's funny how nowadays, "metrics" and "error tracking" seem to matter more than service stability and efficiency. Anyway, they didn't set strict limits - but they did fix the code. But... a few days later, the dev once again made an "incorrect" commit, and the wild logging resumed. This time, no one noticed: the cloud provider's alerts were ignored, ending up in that secondary project mailbox which, as it turned out, no one checked anymore after the accounts were created.

This went on for a week. Then two. Then three. The dev, at this point, sent an email "boasting" about how his idea of moving to the cloud had been successful for service continuity. After a month, the bill arrived. No one had realized what was happening. They never told me the amount, but I know a good portion of the remaining budget for the project's development and promotion went up in smoke. And when an investor found out what had happened and how all that money had been burned, they pulled out. I can only imagine: if in the early days they filled 2 TB in just a few hours, with traffic increasing daily over a month...

In the end, the project failed, and this event, while not the sole cause, significantly undermined the credibility of how the money was managed.

Basically, I almost died for nothing. :-D