ElasticSearch broke all my nice things (a story of cascading failure)
2020-09-03 · via Posts on Noah Bailey

About two weeks ago, I upgraded my single-node ElasticSearch cluster from 6.8.6 to the latest 7.9 version. Last night, all hell broke loose…

The upgrade itself wasn’t perfect. There were some issues with my setup that the helpful “Upgrade Assistant” didn’t pick up before I had already committed. I was missing a few formerly optional parameters in my elasticsearch.yml config file, there were some odd field mappings that weren’t supported any more, and some date format issues with my grok scripts. But, after a few head scratches and some furious keyboard abuse, my system was back up and running, better than ever!

Well, as it turns out, it wasn’t exactly in good shape. In the name of progress, ElasticSearch feverishly deprecates previously recommended options. A perfectly valid configuration from only a few versions ago seems to now be ground zero for a complete showstopper.

In my case, the issue was that the index template created by Logstash (a very long time ago) was no longer valid because of a breaking change. The issue took so long to surface because this particular index rotates on a monthly basis to keep the shards nice and big, so the broken template was not used until the first new index of the following month. Nothing during the upgrade process, neither a warning nor an error, suggested that the template had become invalid.

The next morning, I noticed that there weren’t any new security alerts. That was strange, because there’s always some new and exciting Mirai botnet variant poking around. Later that day I noticed that my Nextcloud was unresponsive when uploading photos. That evening, once work was done, I investigated. What I found was that Logstash and Elasticsearch had broken pretty much everything.

The first clue was the steadily increasing storage chart for my elastic virtual machine, which went from about 84% to 100% full in only a couple of hours. Curious, I ssh’ed in to check. What I found were some absolutely monstrous syslog files, possibly some of the largest I’ve ever seen:

$ ls -lah /var/log/syslog*
-rw-r----- 1 syslog adm  11G Sep  1 17:40 /var/log/syslog
-rw-r----- 1 syslog adm  14G Sep  1 06:25 /var/log/syslog.1

Using less, I looked into each file. Both contained messages much like this one, duplicated millions and millions of times:

Sep  1 06:25:04 elastic logstash[873]: [2020-09-01T06:25:04,052][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-2020.09", :_type=>"_doc", :routing=>nil}, #<LogStash::Event:0x109be9b2>], :response=>{"index"=>{"_index"=>"logstash-2020.09", "_type"=>"_doc", "_id"=>nil, "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"[_default_] mappings are not allowed on new indices and should no longer be used. See [https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html#default-mapping-not-allowed] for more information."}}}}

I surmised that each of these log entries was created by one attempt to create the new index, and that each attempt was triggered by one event processed by Logstash. And, of course, each event was a syslog message, an httpd log, or a suricata network event. Naturally, I have many systems that auto-update at midnight on the 1st of the month, followed by a reboot. Each of these triggers quite a bit of network traffic and therefore creates many suricata logs. Thus, the perfect storm was created: dozens of devices updating at once caused millions of error logs to fill up my logging server.
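To put a number on a flood like this without paging through gigabytes in less, the failed events can be counted directly. This is only a minimal sketch, not something taken from the incident itself: the function name count_index_failures is made up for illustration, and it assumes the warning text matches the log line shown above.

```shell
# count_index_failures FILE
# Prints how many lines in FILE report an event that Elasticsearch rejected.
# (Hypothetical helper; the search string is copied from the log line above.)
count_index_failures() {
    # -c counts matching lines, -F treats the pattern as a fixed string
    grep -c -F 'Could not index event to Elasticsearch' "$1"
}

# Example (path as in the listing above):
#   count_index_failures /var/log/syslog
```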

After reading the linked document, I determined that the fastest way to get back up and running was to stop Logstash, remove the index template that it had created, restart Elasticsearch, and then allow Logstash to re-generate the template on first startup. It was a fairly quick fix:

# Stop Logstash so it stops retrying (and logging) every failed event
sudo systemctl stop logstash

# Delete the stale template; Elasticsearch must still be running to answer this request
curl -X DELETE localhost:9200/_template/logstash

sudo systemctl restart elasticsearch
sleep 120  # coffee break while the node comes back up
sudo systemctl start logstash
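
A check along these lines, run any time before the monthly rollover, might have flagged the stale template in advance. It is a rough sketch under stated assumptions: the function name has_default_mapping is hypothetical, it does a plain text match rather than parsing the JSON, and the example assumes Elasticsearch is listening on localhost:9200 as in the commands above.

```shell
# has_default_mapping
# Reads template JSON on stdin and exits 0 if it contains a "_default_"
# mapping key, which new indices in 7.x reject (see the error message above).
# (Hypothetical helper; a text match, not a real JSON parse.)
has_default_mapping() {
    grep -q -F '"_default_"'
}

# Example against a live node:
#   curl -s localhost:9200/_template/logstash | has_default_mapping \
#       && echo "logstash template still uses _default_ mappings"
```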

Soon after, the system was back up and running. However, it turned out that there was more damage than I first thought.

In addition to the Suricata network IDS, I also use the Wazuh host IDS. Wazuh is based on OSSEC, a trusty alerting engine for the paranoid and security-minded sysadmin. Part of OSSEC’s functionality is to read system logs in real time to detect potentially malicious activity. Normally this is a very useful feature, but that fateful night it caused the death of a second system.

You see, each time a message containing the word “error” is logged, an event is forwarded to the OSSEC server. This is a good thing for it to do, since errors are usually good to know about. In this case, there were millions of errors, so the effort of decoding and indexing all of these events caused tremendous IO load on both the OSSEC server VM and the underlying KVM hypervisor. By about 09:00, this disk had also filled up with logs.

Now, most homelabbers thin provision. It’s the dirty secret we tell ourselves is okay. In my case, both my KVM systems were overprovisioned. When the syslog hurricane hit, both VMs on their respective hosts hit the lid on storage, causing all the other VMs to lock up after their devices were put into read-only mode. Once I had deleted some old indices and syslogs, I was able to shrink both qcow disks using qemu-img, de-ballooning the drives and unlocking IO for the rest of my machines.
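
How far a host is overcommitted is simple arithmetic: add up the virtual disk sizes handed out to the VMs and compare the total with what the host can actually store. The sketch below shows that calculation only. The function name overcommit_pct is hypothetical, and the sizes in the example are invented for illustration, not taken from my hosts.

```shell
# overcommit_pct HOST_GB VM_GB [VM_GB...]
# Prints the provisioned virtual disk space as a whole-number percentage of
# the host's real capacity. Anything over 100 means the VMs could, in
# principle, fill the host. (Hypothetical helper; integer gigabytes only.)
overcommit_pct() {
    host=$1
    shift
    total=0
    for size in "$@"; do
        total=$((total + size))
    done
    echo $((total * 100 / host))
}

# Example with made-up numbers: three 200 GB thin disks on a 500 GB host
#   overcommit_pct 500 200 200 200    # prints 120
```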

By the end of this ordeal, I’ve learned some important lessons about overprovisioning and the very limited performance of cheap Celeron NUCs. I’ve probably done permanent damage to the cheap off-brand SSDs as well, but that was always a risk I was willing to take.

But really, my frustrations are directed at Elastic Stack. I don’t understand why it’s necessary to deprecate functionality that was once the default so quickly, and I’m very frustrated by the lack of error rate limiting in Logstash. All of this could have been avoided if Elastic services were more focused on stability over flashy features that most deployments don’t use.
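
To illustrate what rate limiting means here, a small filter can cap how many lines get through per second. This is a generic sketch, not a Logstash feature or setting: the function name limit_per_second is made up, and it assumes classic syslog lines whose first three fields are the timestamp (for example "Sep  1 06:25:04"), as in the log excerpt above.

```shell
# limit_per_second MAX
# Copies stdin to stdout, but passes at most MAX lines for each distinct
# syslog timestamp (fields 1-3, one-second resolution) and drops the rest.
# (Hypothetical helper; only remembers the current second, so memory is constant.)
limit_per_second() {
    awk -v max="$1" '
        {
            key = $1 " " $2 " " $3          # e.g. "Sep 1 06:25:04"
            if (key != last) { last = key; n = 0 }
            if (++n <= max) print
        }'
}

# Example: keep at most 5 lines per second from a noisy log
#   limit_per_second 5 < /var/log/syslog
```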

I’m considering moving away from the ELK stack after this. It’s caused too many issues for what it gives me, and requires more maintenance than most of the actual applications I self-host.

As somebody in the software industry said, “your competition isn’t competitors, it’s Microsoft Excel”. I think in this case, the competition isn’t another log server, it’s grep itself.