ElasticSearch broke all my nice things (a story of cascading failure)
2020-09-03 · via Posts on Noah Bailey

About two weeks ago, I upgraded my single-node Elasticsearch cluster from 6.8.6 to the latest 7.9 version. Last night, all hell broke loose…

The upgrade itself wasn’t perfect. There were some issues with my setup that the helpful “Upgrade Assistant” didn’t pick up before I had already committed. I was missing a few formerly optional parameters in my elasticsearch.yml config file, there were some odd field mappings that weren’t supported any more, and some date format issues with my grok scripts. But, after a few head scratches and some furious keyboard abuse, my system was back up and running, better than ever!

Well, as it turns out, it wasn’t exactly in good shape. In the name of progress, Elasticsearch feverishly deprecates previously recommended options. A perfectly valid configuration from only a few versions ago can now be ground zero for a complete showstopper.

In my case, the index template created by Logstash (a very long time ago) was no longer valid because of a breaking change in 7.0: it still contained a [_default_] mapping. The issue took so long to surface because a template is only applied when a new index is created, and this particular index rotates monthly to keep the shards nice and big. The update process gave no warning or error to suggest that anything was wrong.
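A check along these lines, run before the upgrade, might have flagged the problem. It is only a minimal sketch and not part of the original setup: `has_default_mapping` is a hypothetical helper that greps template JSON for a `_default_` key, and the usage line assumes the template is named `logstash` and the node listens on `localhost:9200`.

```shell
# Hypothetical helper: reads template JSON on stdin and succeeds if it
# contains a "_default_" mapping key (rejected for new indices in 7.x).
has_default_mapping() {
    grep -q '"_default_"[[:space:]]*:'
}

# Usage against a running node (not executed here):
#   curl -s localhost:9200/_template/logstash | has_default_mapping \
#       && echo "logstash template still uses a _default_ mapping"
```

A plain grep is crude, but it needs nothing beyond the tools already on the box.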

The next morning, I noticed that there weren’t any new security alerts. That was strange, because there’s always some new and exciting Mirai botnet variant poking around. Later that day I noticed that my Nextcloud was unresponsive when uploading photos. By the evening when work was done, I did some investigation. What I found was that Logstash and Elasticsearch had broken pretty much everything.

The first clue was the steadily increasing storage chart for my elastic virtual machine, which went from about 84% to 100% full in only a couple of hours. Curious, I ssh’ed in to check. What I found was some absolutely monstrous syslog files, possibly some of the largest I’ve ever seen:
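A chart only helps if somebody is looking at it. A small threshold check could raise the alarm earlier. The sketch below is an assumption of mine rather than anything I actually had running: `disk_over` is a hypothetical helper, and the 90% limit in the usage line is arbitrary.

```shell
# Hypothetical helper: succeeds if the filesystem holding PATH is at or
# above LIMIT percent full.
disk_over() {
    path=$1
    limit=$2
    # df -P prints one header line and one data line; column 5 is the use percentage
    used=$(df -P "$path" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    [ "$used" -ge "$limit" ]
}

# Usage, e.g. from cron (not executed here):
#   disk_over /var/log 90 && echo "/var/log is over 90% full"
```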

$ ls -lah /var/log/syslog*
-rw-r----- 1 syslog adm  11G Sep  1 17:40 /var/log/syslog
-rw-r----- 1 syslog adm  14G Sep  1 06:25 /var/log/syslog.1

Using less, I looked into each file. Both were full of messages much like this one, duplicated millions and millions of times:

Sep  1 06:25:04 elastic logstash[873]: [2020-09-01T06:25:04,052][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-2020.09", :_type=>"_doc", :routing=>nil}, #<LogStash::Event:0x109be9b2>], :response=>{"index"=>{"_index"=>"logstash-2020.09", "_type"=>"_doc", "_id"=>nil, "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"[_default_] mappings are not allowed on new indices and should no longer be used. See [https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html#default-mapping-not-allowed] for more information."}}}}

I surmised that each of these log entries was created by one attempt to create the new index. Each attempt to create the new index was caused by one event processed by Logstash. And, of course, each event was a syslog message, an httpd log, or a suricata network event. Naturally, I have many systems that auto-update at midnight on the 1st of the month followed by a reboot. Each of these triggers quite a bit of network traffic and therefore creates many suricata logs. Thus, the perfect storm was created: dozens of devices updating at once produced millions of error logs that filled up my logging server.
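Paging through a multi-gigabyte file with less works, but a per-program tally shows who is responsible much faster. Here is a rough sketch, assuming the traditional syslog line layout seen in the message above; `syslog_top_programs` is a hypothetical name.

```shell
# Hypothetical helper: counts syslog lines per program name, busiest first.
# Assumes the "Mon DD HH:MM:SS host program[pid]: message" layout.
syslog_top_programs() {
    awk '{
        tag = $5
        sub(/\[[0-9]+\]/, "", tag)   # drop the [pid] part
        sub(/:$/, "", tag)           # drop the trailing colon
        count[tag]++
    }
    END { for (t in count) printf "%d %s\n", count[t], t }' "$@" |
        sort -rn | head -n 5
}

# Usage (not executed here):
#   syslog_top_programs /var/log/syslog /var/log/syslog.1
```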

After reading the linked document, I determined that the fastest way to get back up and running was to stop Logstash, remove the index template that it had created, restart Elasticsearch, then allow Logstash to regenerate the template on startup. The template has to be deleted while Elasticsearch is still running, since the request goes through its HTTP API. It was a fairly quick fix:

sudo systemctl stop logstash

# Elasticsearch must be up for this request to succeed
curl -X DELETE localhost:9200/_template/logstash

sudo systemctl restart elasticsearch
sleep 120  # coffee break
sudo systemctl start logstash
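The fixed two-minute sleep is a guess at how long Elasticsearch needs. A polling loop is one alternative. This is a sketch under assumptions: `retry_until` is a hypothetical helper, and the usage line only confirms that the HTTP API on `localhost:9200` answers, not that every shard has recovered.

```shell
# Hypothetical helper: run a command up to TRIES times, DELAY seconds apart,
# and succeed as soon as the command does.
retry_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        "$@" >/dev/null 2>&1 && return 0
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1
}

# Usage (not executed here):
#   retry_until 60 2 curl -fsS localhost:9200/_cluster/health &&
#       sudo systemctl start logstash
```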

Soon after, the system was back up and running. However, it turned out that there was more damage than I first thought.

In addition to the Suricata network IDS, I also use the Wazuh host IDS. Wazuh is based on OSSEC, a trusty alerting engine for the paranoid and security-minded sysadmin. Part of OSSEC’s functionality is to read system logs in real time to detect potentially malicious activity. Normally this is a very useful feature, but that fateful night it caused the death of a second system.

You see, each time a message containing the word error is logged, an event is forwarded to the OSSEC server. This is a good thing for it to do, since errors are usually good to know about. In this case, there were millions of errors, so the effort of decoding and indexing all of these events caused tremendous IO load on both the OSSEC server VM and the underlying KVM hypervisor. By about 09:00, its disk had also filled up with logs.

Now, most homelabbers thin provision. It’s the dirty secret we tell ourselves is okay. In my case, both my KVM systems were overprovisioned. When the syslog hurricane hit, both VMs on their respective hosts hit the lid on storage, causing all the other VMs to lock up after their devices were put into read-only mode. Once I had deleted some old indices and syslogs, I was able to shrink both qcow disks using qemu-img, de-ballooning the drives and unlocking IO for the rest of my machines.
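For reference, the shrinking step can be sketched with `qemu-img convert`, which writes a fresh image and leaves out clusters that are unallocated or entirely zero. The sketch rests on several assumptions: the VM is shut down, the space freed inside the guest has been zeroed or discarded first, and the host has room for the temporary copy. `compact_qcow2` and the `QEMU_IMG` override are hypothetical names, and the path in the usage line is illustrative.

```shell
# Hypothetical helper: rewrite a qcow2 image in place via a temporary copy.
compact_qcow2() {
    src=$1
    tmp="$src.compact.$$"
    # convert rewrites the image, skipping unallocated and all-zero clusters
    if "${QEMU_IMG:-qemu-img}" convert -O qcow2 "$src" "$tmp"; then
        mv "$tmp" "$src"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Usage (not executed here):
#   compact_qcow2 /var/lib/libvirt/images/elastic.qcow2
```

The new file is owned by whoever runs the command, so ownership and permissions may need to be restored before the VM is started again.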

By the end of this ordeal, I had learned some important lessons about overprovisioning and the very limited performance of cheap Celeron NUCs. I’ve probably done permanent damage to the off-brand SSDs as well, but that was always a risk I was willing to take.

But really, my frustrations are directed at Elastic Stack. I don’t understand why it’s necessary to deprecate functionality that was once the default so quickly, and I’m very frustrated by the lack of error rate limiting in Logstash. All of this could have been avoided if Elastic services were more focused on stability over flashy features that most deployments don’t use.

I’m considering moving away from the ELK stack after this. It’s caused too many issues for what it gives me, and requires more maintenance than most of the actual applications I self-host.

As somebody in the software industry said, “your competition isn’t competitors, it’s Microsoft Excel”. I think in this case, the competition isn’t another log server, it’s grep itself.