IT Notes

Stefano Marinelli · 2024-12-28 · via IT Notes - horrorstories

Yesterday afternoon, a developer notified me that Sentry was down again. I checked: 200 gigabytes of database, filled in just a few hours. And it reminded me of the night, years ago, when I almost "died" because of a full Sentry.

A client had launched a new service with great fanfare. After a few days, the attention was through the roof. While some of his devs were pushing to move "to the Cloud", he had decided against it. He always had a vision similar to mine. Thanks to his competence and clear ideas, he had managed to convince several investors to sponsor the project. Things were going well. This person was highly skilled, both technically and administratively, but, after the initial development, he became very absorbed in everything surrounding the project.

The two dedicated servers in place were handling the load with ease, even under heavy traffic. Just over €100 per month for thousands of concurrent users. Good implementation, good optimization, backups every 5 minutes via zfs-send and zfs-receive to a third backup server.

The only issue: some devs had requested Sentry for monitoring. In this case too, we opted for a self-hosted setup. A dedicated server was set up just for Sentry: 2 TB of fast storage. After a week, it had used just 2 GB of disk space. Everything was fine, I was feeling confident. But...

...one afternoon, something went wrong. Suddenly, the web apps on both dedicated servers became extremely slow. Then unreachable, despite their minimal load. Panic ensued. Some devs started chanting the usual mantra, "we need more powaaaar!!!", but in reality the servers were under zero load and had a ton of outgoing network connections.

I investigated and realized that everything was grinding to a halt when calls to Sentry were made. I logged into the Sentry server: full. And this wouldn't have been an issue if it weren't for the fact that the production apps would freeze if Sentry didn’t respond. It had been implemented in a blocking way, with a timeout of 300 seconds. The notifications about that server were sent directly to the devs, but it seems that the common alias they provided me actually pointed to nothing more than a test mailbox, which was abandoned right after the account was created.
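For illustration, here is a minimal sketch of the non-blocking alternative: the request handler only drops the event into a bounded in-memory queue, and a background thread does the actual sending with a short timeout. This is a generic pattern, not the Sentry SDK's API and not the code that was actually deployed; `FireAndForgetReporter`, `http_send`, the 2-second timeout, the queue size and the URL are all hypothetical names and values chosen for the example.

```python
import queue
import threading
import urllib.request


class FireAndForgetReporter:
    """Hypothetical helper: request handlers never wait on the error tracker."""

    def __init__(self, send, max_pending=100):
        self._send = send                      # callable that delivers one event
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0                       # events discarded because the queue was full
        threading.Thread(target=self._drain, daemon=True).start()

    def report(self, event):
        """Queue an event and return immediately; drop it if the queue is full."""
        try:
            self._pending.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        # Runs in the background thread: slow or failing sends only delay
        # other queued events, never the application itself.
        while True:
            event = self._pending.get()
            try:
                self._send(event)
            except Exception:
                pass  # a broken tracker must not take the app down
            finally:
                self._pending.task_done()


def http_send(event, url="http://sentry.example.invalid/ingest", timeout=2.0):
    """Assumed transport: POST the event with a short timeout (URL is a placeholder)."""
    req = urllib.request.Request(url, data=event.encode("utf-8"), method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()
```

With something like `reporter = FireAndForgetReporter(http_send)`, a full or unreachable tracker costs a few lost events instead of a frozen production app.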

The project manager traced the issue to a deployment made just a few hours earlier by a dev. They rolled back the changes. Everything returned to normal. We chalked it up as a bad experience. The dev in question began arguing that we had taken the wrong approach because, in the cloud, "this wouldn’t have happened". I thought (but kept it to myself) that if he had written proper code, this wouldn't have happened either. But in his eyes, I was probably just the sysadmin being a pain and judging the "quality" of his work. In the cloud, much of this is often hidden by autoscaling or other abstractions.

Two days later, I had a high fever. The doctor prescribed some medication, but I had a pretty bad allergic reaction to one of them. By the evening, I was feeling awful and went to bed. I was exhausted, so I fell asleep immediately. In the middle of the night, a series of notifications woke me up: everything was down again. Even though I was sick, I got out of bed and rushed to the computer. Another deployment had been made two hours earlier, again by that same dev. Sentry was down again. Full again.

This time, I acted urgently and independently. I shut down Sentry and configured the reverse proxy to respond with a "200" to every call, just to keep the application running. Emptying Sentry would have been complicated, given that a VACUUM FULL in PostgreSQL rewrites each table into a new copy and so requires free space - which we didn’t have - and time. But I wasn't feeling well. Not at all. I got up, and my wife followed me, seeing how unwell I was. I didn't even check if the alerts had cleared; I ran to the bathroom. That's all I remember. My wife heard a thud; I had fainted and hit my head on the floor. Luckily, I had been sitting when I fell, so the bump wasn't too bad. Shortly after, I woke up. The first thing my wife, worried but knowing me well, said was: "The alerts are clear; the servers are up". I relaxed. I had a double espresso, pulled myself together, and went back to bed, though I couldn't sleep.
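The "answer 200 to everything" trick was done in the reverse proxy, whose configuration is not shown here. As a stand-in, the same idea can be sketched with Python's standard library: a tiny server that discards whatever it receives and always replies with an empty 200. The names `AlwaysOK` and `make_server` are made up for this example, and it is only a stopgap illustration, not a replacement for fixing the blocking calls.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class AlwaysOK(BaseHTTPRequestHandler):
    """Stub endpoint: swallow the request and answer 200 with an empty body."""

    def _ok(self):
        # Drain and discard the request body so the client is not left hanging.
        length = int(self.headers.get("Content-Length") or 0)
        if length:
            self.rfile.read(length)
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    # Same answer regardless of method.
    do_GET = do_POST = do_PUT = _ok

    def log_message(self, format, *args):
        # Stay silent: the whole point is to stop writing data to disk.
        pass


def make_server(host="127.0.0.1", port=0):
    """Build the stub server; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer((host, port), AlwaysOK)

# Example (blocks until interrupted):
#   make_server(port=8080).serve_forever()
```

The application keeps believing its events were accepted, which is exactly why it stops freezing, and exactly why those events are lost.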

I sent an email to the project manager and told him: I can't keep this setup running. I'm not capable of jumping out of bed every time Sentry crashes and everything stops because of it. It fills up too quickly, and I don't think there's any reason to log and retain that much data. Even less so for the calls to be blocking. In my opinion, that whole part of the development needs to be reviewed.
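To make the "no reason to log that much" point concrete, here is one possible client-side guard, sketched under my own assumptions: a fixed-window budget that lets through at most N events per time window and counts the rest as suppressed. `EventBudget` and its parameters are hypothetical; nothing like this is described as having been part of the project.

```python
import time


class EventBudget:
    """Hypothetical guard: allow at most `max_events` per `per_seconds` window."""

    def __init__(self, max_events, per_seconds, clock=time.monotonic):
        self.max_events = max_events
        self.per_seconds = per_seconds
        self._clock = clock            # injectable so the logic can be tested
        self._window_start = clock()
        self._used = 0
        self.suppressed = 0            # how many events were held back in total

    def allow(self):
        """Return True if the event may be sent, False if the budget is spent."""
        now = self._clock()
        if now - self._window_start >= self.per_seconds:
            # A new window begins: reset the counter.
            self._window_start = now
            self._used = 0
        if self._used < self.max_events:
            self._used += 1
            return True
        self.suppressed += 1
        return False
```

A caller would check `budget.allow()` before reporting each event. A runaway loop after a bad deployment then produces a bounded number of events per minute plus a suppressed counter, rather than gigabytes of near-identical rows.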

The dev kept insisting that we needed to move to the cloud to avoid these issues. I pointed out that in the cloud, it would have been the same unless we set extremely high limits. In that case, it would have been costly, and we'd still just be kicking the problem down the road. Code needs to be written properly; you can't just waste money and resources endlessly to cover up inefficiencies.

Since replacing the dev wasn't an option for various reasons, they decided to humor him and moved Sentry to a cloud instance, partly to free me from the situation. I was happy to only manage the production servers. It's funny how nowadays, "metrics" and "error tracking" seem to matter more than service stability and efficiency. Anyway, they didn't set strict limits - but they did fix the code. But... a few days later, the dev once again made an "incorrect" commit, and the wild logging resumed. This time, no one noticed: the cloud provider's alerts were ignored, ending up in that secondary project mailbox which, as it turned out, no one checked anymore after the accounts were created.

This went on for a week. Then two. Then three. The dev, at this point, sent an email "boasting" about how his idea of moving to the cloud had been successful for service continuity. After a month, the bill arrived. No one had realized what was happening. They never told me the amount, but I know a good portion of the remaining budget for the project's development and promotion went up in smoke. And when an investor found out what had happened and how all that money had been burned, they pulled out. I can only imagine: if in the early days they filled 2 TB in just a few hours, with traffic increasing daily over a month...

In the end, the project failed, and this event, while not the sole cause, significantly undermined the credibility of how the money was managed.

Basically, I almost died for nothing. :-D
