Five Clusters. Five Lessons. One Production System.
Amal Zachari · 2026-05-23 · via DEV Community

Most engineers talk about Kubernetes like it's something you set up once and forget. I set it up five times. Each time because the last version taught me something I couldn't have learned any other way.

This is that story.

The honest version of each stage:

Stage 1. I just need HA.

Stage 2. I just need cheaper HA.

Stage 3. I just need multi-cloud ARM scheduling.

Stage 4. I just need BGP mesh ingress.

Stage 5. I have achieved zero trust enlightenment.

Every stage felt like a small reasonable ask. The cluster had other ideas.

Stage 1: The Naive But Functional Cluster

The first cluster wasn't elegant. It wasn't supposed to be.

Three to five Hetzner dedicated nodes running k3s. Three DigitalOcean VPS running nginx as a makeshift load balancer. A DigitalOcean cloud load balancer sitting in front of those three VPS.

I followed Techno Tim's k3s HA install guide as the foundation. Solid starting point for getting a highly available cluster off the ground. Replaced his default Traefik ingress with nginx and added Rancher dashboard on top for visibility. Everything after that was documentation and experimentation.

Before this everything ran on plain VPS. Individual servers. No orchestration. Moving to Kubernetes meant migrating workloads one by one. Old VPS running stable, new cluster being tested in parallel, services moved gradually. If something broke I reverted. Nothing burned.

The cluster itself was stable. No memorable incidents with the nodes or networking. The one persistent irritation was MongoDB. Running Bitnami's MongoDB HA chart. One write replica, two read replicas. Randomly and without warning one of the pods would enter a crashloop. Logs pointed vaguely at lock contention but nothing actionable. A restart brought it back clean every time.

The read-heavy workload saved me. A local cache layer absorbed most requests so when a replica went down users barely felt it. I'd notice eventually. Not through alerts, just observation. Restart the StatefulSet and move on.

It happened often enough to be annoying. Never often enough to be a crisis. The cluster was otherwise stable. This was just a ghost that came with the Bitnami chart.

It followed me into the next stage. And the one after that.

Cost pressure eventually killed this setup.

Stage 2: The WireGuard Mesh

Three DigitalOcean VPS plus one cloud load balancer to serve fewer than a thousand requests a month. Those thousand requests were me hitting the Rancher dashboard. Every actual workload was outbound only. Bots, task processors, queue consumers. Nothing needed to receive HTTP from the internet. Kubernetes was not there to serve traffic. The nginx tier existed for one admin panel. The math didn't make sense.

I wanted to move to Contabo. Not cheaper in absolute terms but more efficient for my actual workload requirements. Hetzner dedicated nodes were sitting with RAM mostly idle. I was paying for resources I wasn't using.

The problem was Contabo had no private network at the time.

So I built one.

WireGuard. I found it while researching private networking options. I looked at Kilo as well, a Kubernetes-native WireGuard mesh operator, but it felt too complex for what I needed. Netclient handled mesh generation automatically. WireGuard configurations for multiple peers without the manual key exchange headache.

A single DigitalOcean VPS became the entry point. Running the Netmaker server and dashboard, the control plane that the Netclient agents register with. All the Contabo nodes connected through it via mesh. First try. Everything came up clean.
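To make "mesh generation" concrete, here is a minimal sketch of what such a tool has to produce: for every node, one WireGuard [Peer] section per other node. This is an illustration only, not how Netclient works internally. The `Node` class, `peer_sections`, `tunnel_count`, the keys, and all addresses are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    public_key: str  # placeholder string, not a real WireGuard key
    endpoint: str    # public host:port the node is reachable on
    mesh_ip: str     # private address of the node inside the mesh


def peer_sections(nodes: list[Node], me: Node) -> str:
    """Render the [Peer] sections for `me`: one per *other* node."""
    sections = []
    for node in nodes:
        if node.name == me.name:
            continue  # a node never peers with itself
        sections.append(
            "[Peer]\n"
            f"PublicKey = {node.public_key}\n"
            f"AllowedIPs = {node.mesh_ip}/32\n"
            f"Endpoint = {node.endpoint}\n"
        )
    return "\n".join(sections)


def tunnel_count(n: int) -> int:
    """Number of point-to-point tunnels in a full mesh of n nodes."""
    return n * (n - 1) // 2


nodes = [
    Node("contabo-1", "<pubkey-1>", "198.51.100.11:51820", "10.10.0.1"),
    Node("contabo-2", "<pubkey-2>", "198.51.100.12:51820", "10.10.0.2"),
    Node("contabo-3", "<pubkey-3>", "198.51.100.13:51820", "10.10.0.3"),
]

print(peer_sections(nodes, nodes[0]))
print("tunnels in a 5-node full mesh:", tunnel_count(5))
```

The tunnel count grows quadratically, which is why exchanging keys by hand stops being fun after a handful of nodes.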

The three nginx VPS and the cloud load balancer disappeared. Previously every request passed through a DigitalOcean cloud load balancer and then three nginx VPS before it reached Klipper. An entire layer of infrastructure serving as a middleman. Replaced now by Klipper speaking directly to users. The elaborate routing chain had been architectural overkill from the start.

Contabo had weird reliability. Nodes would occasionally behave unexpectedly in ways that didn't make sense on paper. But the workload was tolerant to disruption and the cost saving from eliminating the most expensive components made the occasional weirdness acceptable.

The MongoDB crashloop followed. Still there. Still managed by cache and manual restarts. Not catastrophic. Just persistent.

Stage 3: ARM Workers and the First Multi-Cloud Cluster

ARM compute is cheap. Oracle's free tier ARM instances are extremely cheap. As in free.

A Network Chuck video appeared in my YouTube recommendations one day. Kubernetes on Raspberry Pi. Watched it casually. The part that stuck was his explanation of why he ran the Rancher dashboard on an amd64 node rather than the ARM64 ones. The reasoning was practical. Certain workloads and management tooling simply behaved better on amd64. That one observation shaped the architecture before I'd even started.

Master nodes stayed on amd64. Workers went on ARM64. Clean separation from the beginning.

I extended the Netclient WireGuard mesh to include Oracle ARM nodes in a different region. Multi-cloud, before that was a common thing to talk about. Not for ideological reasons. Purely because free ARM compute was sitting there unused. The scale didn't justify multi-cloud architecture. The price did. Free compute is free compute.

Multi-arch builds handled through GoReleaser. On every GitHub commit, GitHub Actions triggered GoReleaser. Publishing multi-arch Docker images and binaries as releases automatically. Services with amd64-only dependencies got node affinity rules keeping them on the Contabo nodes. Cross-architecture scheduling handled through proper affinity configuration rather than hoping things landed in the right place.
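The affinity rule described above can be sketched as follows. This builds the `affinity` fragment of a pod spec as a Python dict, using the well-known `kubernetes.io/arch` node label. The helper name `arch_affinity` is hypothetical, and the article does not show the actual manifests, so treat this as one plausible shape rather than the author's configuration.

```python
import json


def arch_affinity(archs: list[str]) -> dict:
    """Pod `affinity` fragment that hard-pins scheduling to the given CPU
    architectures via the well-known kubernetes.io/arch node label."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "kubernetes.io/arch",
                                "operator": "In",
                                "values": archs,
                            }
                        ]
                    }
                ]
            }
        }
    }


# A service with amd64-only dependencies stays off the ARM64 workers.
print(json.dumps(arch_affinity(["amd64"]), indent=2))
```

A required rule is the strict choice: if no amd64 node has room, the pod stays Pending instead of landing on ARM64 and crashing on the wrong binary.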

Oracle's networking was strange. Occasionally a worker node would drift out of WireGuard mesh sync and require a restart. Never more than one at a time. Enough redundancy that it didn't cause visible issues. Whether that was a free tier limitation or something inherent to their networking I never definitively determined.

The MongoDB crashloop was still there. Quietly persistent across three different infrastructure setups now.

What ended it was a new client. Twenty Discord bots. Each one 512MB RAM at startup, 2GB at peak load. Discord.js cache is RAM hungry by design. The library caches guild members, messages, and state aggressively. Twenty bots at peak meant up to 40GB of RAM requirement for workloads that could not be disrupted.
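The sizing arithmetic, worked through with the figures from the paragraph above. The only assumption is treating 1 GB as 1024 MB; the helper name `total_gb` is hypothetical.

```python
BOTS = 20
STARTUP_MB = 512   # per bot at startup
PEAK_MB = 2048     # per bot at peak load (2GB)


def total_gb(count: int, per_instance_mb: int) -> float:
    """Total memory for `count` identical instances, in GB."""
    return count * per_instance_mb / 1024


startup_gb = total_gb(BOTS, STARTUP_MB)
peak_gb = total_gb(BOTS, PEAK_MB)

print(f"startup: {startup_gb:.0f} GB, peak: {peak_gb:.0f} GB")
```

The number that matters for capacity planning is the peak figure, because these workloads could not be disrupted when every bot hit peak at once.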

Oracle free tier ARM nodes are not where you run non-disruptable RAM-intensive production workloads.

Time to go back to dedicated hardware.

Stage 4: Real High Availability

Five Hetzner nodes. Dual network. This is where the cluster started looking like something serious.

The private VLAN handled all internode communication. k3s configured via args to bind to the private IP. Not 0.0.0.0. Cluster join happened over private IPs not public ones. The public interface existed for egress only at the node level.
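A sketch of what binding to the private IP can look like. The article only says k3s was "configured via args", so the specific flags here (`--node-ip`, `--advertise-address`, `--bind-address`) are an assumption about which k3s server options were involved. The helper `k3s_server_args` and the address `10.0.0.2` are hypothetical.

```python
import ipaddress


def k3s_server_args(private_ip: str) -> list[str]:
    """Build k3s server arguments that pin the node and the API listener
    to one private address instead of the 0.0.0.0 wildcard."""
    addr = ipaddress.ip_address(private_ip)  # raises ValueError if malformed
    if addr.is_unspecified:
        raise ValueError("refusing wildcard address; pass the node's private IP")
    return [
        "server",
        f"--node-ip={private_ip}",            # address the node registers with
        f"--advertise-address={private_ip}",  # address the API server advertises
        f"--bind-address={private_ip}",       # address the listener binds to
    ]


print(" ".join(["k3s", *k3s_server_args("10.0.0.2")]))
```

Rejecting the wildcard in the helper keeps a copy-paste mistake from quietly exposing the API server on the public interface.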

The goal with CNI was specific. Klipper, k3s's default service load balancer, requires ports 80 and 443 open on every node to function. I didn't want that. I wanted a single advertised IP for ingress, not ports splayed across every node in the cluster.

Calico with BGP was the answer. Or so I thought.

The setup was architecturally sound. Five Hetzner dedicated nodes running a BGP mesh via Calico. MetalLB advertising a private IP range for load balancer services. That private range connected to Hetzner's cloud network via static routes. Each node acting as a router. A Hetzner cloud load balancer sitting in front, pointing to the MetalLB advertised IP. Clean. No ports open on individual nodes. Exactly what I wanted.

Keepalived handled the cluster join entry point. A floating IP that survived node failures. I tested this manually during bootstrap. Simulated a node failure. Keepalived moved the IP cleanly to a healthy node. That part worked exactly as expected.

The BGP setup worked too. Technically. I ran it for a month.

Then I measured HTTP latency.

The requests going through the Hetzner cloud network into the BGP mesh and back out were noticeably slow. Not broken. Just slow in a way that didn't make sense given the hardware. I only had ingress exposed so I couldn't measure other protocols. Whether it was Hetzner Cloud's routing adding overhead or something in the mesh I never definitively diagnosed.

I reverted. Wrote it off as a cloud networking quirk. The internode BGP mesh stayed. Calico peers between all five nodes remained stable. But MetalLB and the cloud LB routing experiment went back to Keepalived managing a floating IP with Hetzner API hooks for failover.

Something else happened at this stage. The MongoDB crashloop stopped. I don't know exactly when. I don't know exactly why. It had followed me through three different infrastructure setups. Different providers, different CNIs, different everything. Then somewhere around moving to Calico it just stopped. Never diagnosed. Never explained. Just gone.

Around a hundred workloads running at this point. Bots, websites, the WebSocket infrastructure, client services accumulated over years.

What ended Stage 4 wasn't a failure. It was a conference.

Interlude: The Homelab Test

Before this I validated experiments on DigitalOcean VPS. Spin one up, test the idea, tear it down. Clean. Disposable. No commitment.

But before committing rke2 and Cilium to production I wanted to test on real hardware. Not a cloud VM. Actual bare metal.

I had an old PC. Booted Ventoy, loaded a gparted live image, used dd to write Debian directly to the disk. Configured SSH and networking. Came back to my main machine and SSHed in. Bare metal provisioning the old fashioned way. No cloud console, no managed image, just a disk, Ventoy, and a Debian ISO.

Installed rke2. Installed Cilium. Configured BGP and set static routes in my home router to expose services publicly. Ran a PostgreSQL instance. One database per project, cleaner than running everything on my main machine. Threw a Minecraft server and an Ark Survival server on it so friends could connect.

It worked. Everything I planned for Stage 5 behaved the way I expected on real hardware before I committed it to production.

That's the point of a homelab. Not to show off. Not to run unnecessary infrastructure. To answer the question before production has to.

Stage 5 wasn't a leap of faith. It was a validated decision.

Stage 5: Zero Trust and the Hardened Cluster

The current cluster. OVH this time, not Hetzner.

OVH has better DDoS protection. Their uptime guarantees are stronger.

rke2 replaced k3s as the distribution. The operational feel is similar. rke2 is heavier. Larger resource footprint. But for a cluster running serious production workloads that weight is appropriate.

Cilium replaced Calico as CNI.

I saw Cilium at KubeCon. But the real reason was simpler. I had never successfully run a firewall alongside Kubernetes. Every attempt broke with CNI changes. iptables rules conflicting with CNI networking. Configurations that worked until a CNI update silently invalidated them. I gave up trying to run a traditional firewall on Kubernetes nodes entirely.

Cilium's host firewall was different. It runs at the eBPF layer, below where CNI conflicts happen. That was the actual problem I needed solved. KubeCon introduced me to the solution. The L2 load balancer announcement was a bonus.

L2 announcement for load balancer IPs. A few YAML manifests and the cluster manages IP assignment itself. The goal from Stage 4 finally solved. Differently than expected but solved. Keepalived still exists for the cluster join endpoint but the complex API hook dance for floating IP management is gone.
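Roughly what those "few YAML manifests" contain, built here as Python dicts and printed as JSON. The resource kinds are Cilium's. The `apiVersion` values and the `blocks` field follow the Cilium documentation as I understand it and have changed between releases, so check them against the installed version. Names, the interface regex, and the address range are hypothetical.

```python
import json


def lb_ip_pool(name: str, cidr: str) -> dict:
    """Pool of addresses Cilium may assign to LoadBalancer services."""
    return {
        "apiVersion": "cilium.io/v2alpha1",  # assumption: differs by release
        "kind": "CiliumLoadBalancerIPPool",
        "metadata": {"name": name},
        "spec": {"blocks": [{"cidr": cidr}]},
    }


def l2_announcement_policy(name: str, interface_regex: str) -> dict:
    """Policy telling nodes to answer ARP for LoadBalancer IPs on the
    matching interfaces."""
    return {
        "apiVersion": "cilium.io/v2alpha1",  # assumption: differs by release
        "kind": "CiliumL2AnnouncementPolicy",
        "metadata": {"name": name},
        "spec": {
            "interfaces": [interface_regex],
            "loadBalancerIPs": True,
            "externalIPs": False,
        },
    }


pool = lb_ip_pool("ingress-pool", "192.0.2.240/29")
policy = l2_announcement_policy("announce-lb", "^eth[0-9]+")
for manifest in (pool, policy):
    print(json.dumps(manifest, indent=2))
```

With both objects in place, a Service of type LoadBalancer gets an address from the pool and one node at a time answers for it on the local network.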

Host firewall via CiliumClusterwideNetworkPolicy. The public IP of every node is egress only. Nodes can initiate connections to the world. The world cannot initiate connections to nodes. Not even ping. Except from monitoring systems and my static IP for SSH. Enforced at the kernel level on every node via eBPF. Not a firewall rule sitting in front of the infrastructure. The infrastructure itself is the firewall.
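A minimal sketch of a host policy with that shape, assuming Cilium's host firewall feature is enabled. The policy name, the CIDRs, and the choice to select nodes by the `kubernetes.io/os` label are illustrative assumptions, not the author's manifest. Once any ingress rule selects a node, everything not listed is dropped. No egress section is present, so outbound traffic stays unrestricted.

```python
import json


def host_firewall_policy(admin_cidr: str, monitoring_cidrs: list[str]) -> dict:
    """Host-level ingress lockdown: cluster traffic, SSH from one admin
    address, and anything from the monitoring ranges. Nothing else."""
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumClusterwideNetworkPolicy",
        "metadata": {"name": "host-ingress-lockdown"},  # hypothetical name
        "spec": {
            # nodeSelector (not endpointSelector) makes this a host policy.
            "nodeSelector": {"matchLabels": {"kubernetes.io/os": "linux"}},
            "ingress": [
                # Keep node-to-node and pod-to-node traffic working.
                {"fromEntities": ["cluster"]},
                # SSH only from the admin's static address.
                {
                    "fromCIDR": [admin_cidr],
                    "toPorts": [{"ports": [{"port": "22", "protocol": "TCP"}]}],
                },
                # Monitoring systems may reach the node, including ping.
                {"fromCIDR": monitoring_cidrs},
            ],
        },
    }


policy = host_firewall_policy("203.0.113.10/32", ["198.51.100.0/28"])
print(json.dumps(policy, indent=2))
```

Testing a policy like this on a single labelled node first is the safer route, since a mistake in the cluster rule can lock a node out of its own control plane.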

Ingress hardened to Cloudflare only. Traefik as the ingress controller. Cilium network policy restricts ingress traffic to Cloudflare IP ranges only at the network level. Traefik middleware handles mTLS (Cloudflare's Authenticated Origin Pulls) on top of that. Two layers of verification before any request reaches a service.
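The first of those two layers boils down to a source-address check. Here is a small sketch of that decision using Python's `ipaddress` module. The two ranges are only a sample of Cloudflare's published list, which is longer and changes over time, so a real policy should be generated from the current list. The function name `allowed_source` is hypothetical.

```python
import ipaddress

# Sample only: a real allow list should come from Cloudflare's published ranges.
SAMPLE_CLOUDFLARE_RANGES = ["173.245.48.0/20", "104.16.0.0/13"]


def allowed_source(ip: str, ranges: list[str] = SAMPLE_CLOUDFLARE_RANGES) -> bool:
    """True if `ip` falls inside any allowed CIDR, which is the same
    decision a CIDR-based ingress policy makes per connection."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


for source in ("104.16.0.1", "8.8.8.8"):
    verdict = "allow" if allowed_source(source) else "drop"
    print(f"{source}: {verdict}")
```

An address check alone only proves the packet came through Cloudflare's network, not through your own Cloudflare zone, which is the gap the mTLS layer closes.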

Namespace isolation by default. Network policies deny all ingress by default. Services explicitly whitelist what they need. DNS, shared resources, specific cross-namespace communication. The model is incoming-decides. Each namespace declares what it accepts rather than what it sends. No traffic sniffing between namespaces is possible by policy.
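The incoming-decides model can be sketched with two standard Kubernetes NetworkPolicy objects: one that denies all ingress to a namespace, and one that opens it to a single named source namespace. The article's real policies may well be Cilium-specific. Both helper names and the namespace names are hypothetical.

```python
import json


def default_deny_ingress(namespace: str) -> dict:
    """Select every pod in the namespace and allow no ingress at all."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def allow_from_namespace(namespace: str, source: str) -> dict:
    """The receiving namespace declares which namespace may call in."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-from-{source}", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [
                        {
                            "namespaceSelector": {
                                "matchLabels": {
                                    # Label Kubernetes sets on every namespace.
                                    "kubernetes.io/metadata.name": source
                                }
                            }
                        }
                    ]
                }
            ],
        },
    }


deny = default_deny_ingress("bots")
allow = allow_from_namespace("bots", "shared")
for manifest in (deny, allow):
    print(json.dumps(manifest, indent=2))
```

Policies are additive, so the deny object sets the baseline and each allow object is an explicit, reviewable exception owned by the namespace that receives the traffic.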

The cluster currently runs bots, websites, and accumulated client workloads. It is boring in the useful sense. Deployable, diagnosable, recoverable.

What Five Clusters Taught Me

Cost pressure is a better architect than planning. Every significant change in this infrastructure happened because something was too expensive relative to its value. The nginx VPS layer. The idle Hetzner RAM. The Oracle free tier nodes that couldn't handle non-disruptable workloads. Cost pressure forced clarity about what the infrastructure actually needed to do.

The first version is supposed to be wrong. One video and documentation got the first cluster running. That was enough. The wrongness of the first version taught me what the second version needed to be.

Persistent problems have hidden causes. MongoDB crashed across three different providers, two different CNIs, and multiple cluster rebuilds. Then stopped. I never found the root cause. Sometimes production systems have ghosts you learn to live with until they quietly leave on their own.

Security is a journey not a configuration. Stage 1 had no meaningful firewall story. Stage 5 has host-level eBPF enforcing zero trust at the kernel. That's not a single decision. That's five iterations of failing to run a firewall properly until the right tool existed.

Simplicity compounds. Every stage removed something. The nginx VPS layer. The cloud load balancer. The WireGuard entry point VPS. The Keepalived API hooks. Each removal made the system more reliable not less. The most complex part of Stage 1 was the load balancing layer doing the least important work.

Context beats technology. Multi-cloud sounds impressive. The reality was free Oracle ARM compute that cost nothing to add. Cilium sounds cutting edge. The reality was a persistent firewall problem that nothing else solved cleanly. The technology follows the problem. Not the other way around.

Validate before you migrate. I didn't move production to rke2 and Cilium based on a KubeCon talk. I went home, provisioned a bare metal server from scratch, and ran the stack on real hardware first. The homelab wasn't a hobby. It was due diligence.

Security - Five Stage Evolution

Stage | Security Posture
--- | ---
Stage 1 | No meaningful firewall · MongoDB on crashloop · Manual restart as incident response
Stage 2 | WireGuard mesh · First network isolation layer · Removed public-facing VPS middlemen
Stage 3 | Node affinity arch-based isolation · Multi-cloud scheduling boundaries · GoReleaser controlled build pipeline
Stage 4 | Private VLAN for internode comms · Public interface egress only · Keepalived tested at bootstrap
Stage 5 | eBPF host firewall nodes egress only · Cloudflare-only ingress + mTLS · Namespace isolation by default · No traffic sniffing between namespaces

Stage 1 ░░░░░░░░░░░░░░░░░░░░

Stage 2 ████░░░░░░░░░░░░░░░░

Stage 3 ████████░░░░░░░░░░░░

Stage 4 ████████████░░░░░░░░

Stage 5 ████████████████████

Security is a journey not a configuration. Five iterations of failing to run a firewall properly until the right tool existed.