IT Notes
Stefano Marinelli · 2024-12-28 · via IT Notes - stories

Yesterday afternoon, a developer notified me that Sentry was down again. I checked: 200 gigabytes of database, filled in just a few hours. And it reminded me of the night, years ago, when I almost "died" because of a full Sentry.

A client had launched a new service with great fanfare. After a few days, the attention was through the roof. While some of his devs were pushing to move "to the Cloud", he had decided against it. He always had a vision similar to mine. Thanks to his competence and clear ideas, he had managed to convince several investors to sponsor the project. Things were going well. This person was highly skilled, both technically and administratively, but, after the initial development, he became very absorbed in everything surrounding the project.

The two dedicated servers in place were handling the load with ease, even under heavy traffic. Just over €100 per month for thousands of concurrent users. Good implementation, good optimization, backups every 5 minutes via zfs-send and zfs-receive to a third backup server.

The only issue: some devs had requested Sentry for monitoring. In this case too, we opted for a self-hosted setup. A dedicated server was set up just for Sentry: 2 TB of fast storage. After a week, it had used just 2 GB of disk space. Everything was fine, I was feeling confident. But...

...one afternoon, something went wrong. Suddenly, the web apps on both dedicated servers became extremely slow. Then unreachable, despite their minimal load. Panic ensued. Some devs started chanting the usual mantra, "we need more powaaaar!!!" In reality, though, the servers were under practically zero load; what they did have was a ton of outgoing network connections.

I investigated and realized that everything was grinding to a halt when calls to Sentry were made. I logged into the Sentry server: full. And this wouldn't have been an issue if it weren't for the fact that the production apps would freeze if Sentry didn’t respond. It had been implemented in a blocking way, with a timeout of 300 seconds. The notifications about that server were sent directly to the devs, but it seems that the common alias they provided me actually pointed to nothing more than a test mailbox, which was abandoned right after the account was created.
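To make the difference concrete, here is a minimal sketch of the non-blocking alternative: the application hands the event to a bounded in-memory queue and moves on, while a background thread does the actual sending with a short timeout. This is not the Sentry SDK and not the code from that project; `NonBlockingReporter`, `http_send`, the URL, the 2-second timeout and the queue size are all hypothetical names and values chosen for illustration.

```python
import queue
import threading
import urllib.request


class NonBlockingReporter:
    """Hand error events to a background thread so callers never wait."""

    def __init__(self, send, max_pending=1000):
        self._send = send                      # callable that delivers one event
        self._queue = queue.Queue(maxsize=max_pending)
        self.dropped = 0                       # events discarded because the queue was full
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def report(self, event):
        """Queue an event and return immediately; drop it if the queue is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        while True:
            event = self._queue.get()
            try:
                self._send(event)
            except Exception:
                # A dead or full monitoring server must never hurt production.
                pass
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until every queued event has been handled (useful at shutdown)."""
        self._queue.join()


def http_send(event, url="http://sentry.example.invalid/ingest", timeout=2.0):
    """Hypothetical sender: POST raw bytes, give up after `timeout` seconds."""
    request = urllib.request.Request(url, data=event, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()
```

With something like `reporter = NonBlockingReporter(http_send)`, a call to `reporter.report(b"...")` costs the request path a queue insertion at most. If the monitoring server is full or unreachable, events are lost, but the application keeps serving users.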

The project manager identified the issue as a deployment made just a few hours earlier by a dev. They rolled back the changes. Everything returned to normal. We chalked it up as a bad experience. The dev in question began arguing that we had taken the wrong approach because, in the cloud, "this wouldn’t have happened". I thought to myself (but kept it to myself) that if he had written proper code, this wouldn't have happened either. But in his eyes, I was probably just the sysadmin being a pain and judging the "quality" of his work. In the cloud, much of this is often hidden by autoscaling or other abstractions.

Two days later, I had a high fever. The doctor prescribed some medication, but I had a pretty bad allergic reaction to one of them. By the evening, I was feeling awful and went to bed. I was exhausted, so I fell asleep immediately. In the middle of the night, a series of notifications woke me up: everything was down again. Even though I was sick, I got out of bed and rushed to the computer. Another deployment had been made two hours earlier, again by that same dev. Sentry was down again. Full again.

This time, I acted urgently and independently. I shut down Sentry and configured the reverse proxy to respond with a "200" to every call, just to keep the application running. Emptying Sentry would have been complicated, given that a VACUUM FULL in PostgreSQL requires space - which we didn't have - and time. But I wasn't feeling well. Not at all. I got up, and my wife followed me, seeing how unwell I was. I didn't even check if the alerts had cleared; I ran to the bathroom. That's all I remember. My wife heard a thud; I had fainted and hit my head on the floor. Luckily, I had been sitting when I fell, so the bump wasn't too bad. Shortly after, I woke up. The first thing my wife, worried but knowing me well, said was: "The alerts are clear; the servers are up". I relaxed. I had a double espresso, pulled myself together, and went back to bed, though I couldn't sleep.
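For readers who want to picture the "answer 200 to everything" stopgap, here is a rough stand-in written as a tiny Python server. On the night in question it was done in the reverse proxy configuration, which is not reproduced here; `AlwaysOK` and `make_stub` are hypothetical names, and the empty JSON body is an assumption about what a client would tolerate.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class AlwaysOK(BaseHTTPRequestHandler):
    """Accept any request, discard its body, and answer 200 immediately."""

    def _ok(self):
        # Read and throw away the payload so the client can finish sending.
        length = int(self.headers.get("Content-Length", 0))
        if length:
            self.rfile.read(length)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"{}")

    do_GET = _ok
    do_POST = _ok
    do_PUT = _ok

    def log_message(self, format, *args):
        # Stay silent: the whole point is to stop filling disks.
        pass


def make_stub(host="127.0.0.1", port=0):
    """Build the stub server; port 0 lets the OS pick a free port."""
    return HTTPServer((host, port), AlwaysOK)


# Usage (blocks until interrupted):
#   make_stub(port=9000).serve_forever()
```

The events are simply thrown away, so this only buys time. It keeps blocking clients from hanging while the real problem gets fixed.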

I sent an email to the project manager and told him: I can't keep this setup running. I'm not capable of jumping out of bed every time Sentry crashes and everything stops because of it. It fills up too quickly, and I don't think there's any reason to log and retain that much data. Even less so for the calls to be blocking. In my opinion, that whole part of the development needs to be reviewed.
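One way to keep a chatty deployment from filling the disk is to cap how many events the application is allowed to emit in the first place. The sketch below uses a token bucket for that; it is only an illustration of the idea, not something the project used, and `EventBudget` with its parameters is a hypothetical name.

```python
import time


class EventBudget:
    """Token bucket: allow at most `burst` events at once, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate = float(rate_per_second)
        self.capacity = float(burst)
        self._clock = clock            # injectable so the behaviour can be tested
        self._tokens = float(burst)    # start with a full bucket
        self._last = clock()

    def allow(self):
        """Return True if one more event may be sent right now."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

An application would check `budget.allow()` before reporting and silently skip the event otherwise. A buggy commit that logs in a loop then costs a bounded amount of storage per hour instead of terabytes.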

The dev kept insisting that we needed to move to the cloud to avoid these issues. I pointed out that in the cloud, it would have been the same unless we set extremely high limits. In that case, it would have been costly, and we'd still just be kicking the problem down the road. Code needs to be written properly; you can't just waste money and resources endlessly to cover up inefficiencies.

Since replacing the dev wasn't an option for various reasons, they decided to humor him and moved Sentry to a cloud instance, partly to free me from the situation. I was happy to only manage the production servers. It's funny how nowadays, "metrics" and "error tracking" seem to matter more than service stability and efficiency. Anyway, they didn't set strict limits - but they did fix the code. But... a few days later, the dev once again made an "incorrect" commit, and the wild logging resumed. This time, no one noticed: the cloud provider's alerts were ignored, ending up in that secondary project mailbox which, as it turned out, no one checked anymore after the accounts were created.

This went on for a week. Then two. Then three. The dev, at this point, sent an email "boasting" about how his idea of moving to the cloud had been successful for service continuity. After a month, the bill arrived. No one had realized what was happening. They never told me the amount, but I know a good portion of the remaining budget for the project's development and promotion went up in smoke. And when an investor found out what had happened and how all that money had been burned, they pulled out. I can only imagine: if in the early days they filled 2 TB in just a few hours, with traffic increasing daily over a month...

In the end, the project failed, and this event, while not the sole cause, significantly undermined the credibility of how the money was managed.

Basically, I almost died for nothing. :-D