Longhorn Volume Health: The Gap Between 'Healthy' and Actually Working

I once spent four hours debugging a PostgreSQL pod that was stuck in a crash loop with Input/output error across every single log line. I opened the Longhorn UI, and there it was: a bright green "Healthy" badge next to the volume. The replicas were synchronized, the nodes were up, and the dashboard insisted everything was perfect.

The reality was a stale mount on the worker node that had survived a pod migration, leaving the filesystem in a read-only state that Longhorn's control plane didn't care about.

If you're running stateful workloads on bare metal, you've probably already realized that Longhorn is great until it isn't. It simplifies distributed storage, but it introduces a layer of abstraction that can lie to you. You need to know the difference between "Control Plane Healthy" and "Data Plane Functional."

The Illusion of Health

In Longhorn, "Healthy" usually just means the replicas are in sync and the volume is attached to a node. It does not mean the application can actually write to the disk. I've hit this multiple times where the volume is technically healthy, but the pod is screaming because of permission mismatches or stale mounts.

The most common culprit is the mount layer. When a pod moves from Node A to Node B, Kubernetes expects the volume to detach and re-attach. Sometimes, the detach fails or the mount stays active on the old node. Longhorn might show the volume as attached to the new node, but the OS on the worker is still holding onto a ghost mount.

If you see I/O error in your logs but the UI is green, stop looking at the UI. You need to check the actual mount point on the worker node.
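A quick way to run that check is to look for Longhorn-backed mounts the kernel has flipped to read-only. Here's a minimal sketch. The function name is made up for illustration, and it assumes the Longhorn block devices show up under /dev/longhorn/ on the worker node.

```shell
# Print the mount point of every Longhorn-backed filesystem mounted read-only.
# Reads /proc/mounts by default; pass another file to try the filter offline.
# Assumption: Longhorn block devices appear as /dev/longhorn/<volume-name>.
find_ro_longhorn_mounts() {
    mounts_file="${1:-/proc/mounts}"
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk '$1 ~ /^\/dev\/longhorn\// {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $2; break }
    }' "$mounts_file"
}

# Usage on the worker node:
#   find_ro_longhorn_mounts
```

A volume that a pod mounts with readOnly: true will show up here too, so treat the output as a list of suspects, not a verdict.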

Solving the Stale Mount Trap

When a volume gets stuck, the "happy path" is to let Kubernetes handle the detachment. In reality, you often have to force the issue.

The first thing I try is scaling the deployment to zero. This forces Kubernetes to send the detach signal to the CSI driver.

# Partial manifest: only the field that changes is shown
# Scale to 0 to break the lock and force volume detach
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-db
spec:
  replicas: 0
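If you'd rather not edit the manifest, the same thing can be done from the command line. This is a sketch, not a polished tool: release_volume is a hypothetical helper, and the namespace, Deployment, and PV names you pass in are placeholders for your own.

```shell
# Scale a Deployment to zero, then list any VolumeAttachment objects that
# still reference the given PV.
# Arguments (all placeholders): <namespace> <deployment> <pv-name>
release_volume() {
    ns="$1"; deploy="$2"; pv="$3"
    kubectl -n "$ns" scale deployment "$deploy" --replicas=0 || return 1
    # Keep the header row plus rows whose PV column matches.
    kubectl get volumeattachments \
        -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached \
        | awk -v pv="$pv" 'NR == 1 || $2 == pv'
}

# Usage:
#   release_volume default postgres-db pvc-xxxx-xxxx
```

Detaching takes a moment, so rerun the listing after a few seconds. If the attachment is still there, or it points at the old node, that's the volume to chase on the worker itself.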

If that doesn't work, you have to go into the worker node via SSH. I've found that manually unmounting the path usually clears the deadlock. Be careful here: if you unmount a volume that is actually being written to, you're asking for filesystem corruption.

# Check for stale mounts on the worker node
mount | grep longhorn

# If you find a mount that shouldn't be there
umount -l /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-xxxx-xxxx/mounted

The -l (lazy) unmount is the secret here. It detaches the filesystem from the hierarchy immediately, even if the resource is busy, and cleans up the references once the resource is no longer in use.

The Capacity Lie: Snapshot Bloat

Capacity management in Longhorn is where most people run into their first "production" outage. You set up a 100GB PVC, and a few months later, your node disks are at 95% capacity even though your application is only using 20GB of data.

This is snapshot bloat. Longhorn snapshots are incremental, but if you have a high-churn database (like Postgres or MariaDB) and a reckless snapshot schedule, those increments add up.

I learned this the hard way when I set up an hourly snapshot policy without a strict retention limit. The snapshots were accumulating on detached volumes that I had forgotten to delete. Longhorn doesn't automatically purge snapshots for volumes that aren't currently attached to a pod unless you explicitly tell it to.

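To fix this, I moved the schedule into a RecurringJob with a strict retain count, and I keep the allow-recurring-job-while-volume-detached setting disabled so Longhorn doesn't attach idle volumes just to snapshot them. This prevents the system from wasting I/O and space on volumes that aren't even active.

apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: daily-backup-critical
  namespace: longhorn-system
spec:
  task: snapshot
  cron: "0 2 * * *" # 2 AM daily
  retain: 7         # Keep only the 7 most recent snapshots
  concurrency: 2    # Volumes processed in parallel
  groups:
    - default       # Applies to volumes in the default group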

If you're already in a capacity crisis, don't just delete PVCs. Check for orphaned replicas. Sometimes a PVC is deleted from K8s, but the Longhorn volume remains in the UI as "detached." These are ghosts eating your disk space. Purge them manually from the UI or via the API.
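To find those ghosts without clicking through the UI, you can list the Longhorn Volume objects and keep only the detached ones. A rough sketch follows. It assumes Longhorn lives in the longhorn-system namespace and that the Volume resource exposes .status.state, .spec.size, and .status.actualSize, so check the field names against your version. Both function names are hypothetical.

```shell
# Keep the header row and every row whose STATE column is "detached".
# Reads the table on stdin so the filter can be checked without a cluster.
filter_detached() {
    awk 'NR == 1 || $2 == "detached"'
}

# On a live cluster (namespace and field paths are assumptions, see above):
list_detached_volumes() {
    kubectl -n longhorn-system get volumes.longhorn.io \
        -o custom-columns=NAME:.metadata.name,STATE:.status.state,SIZE:.spec.size,ACTUAL:.status.actualSize \
        | filter_detached
}
```

Before deleting anything, cross-check the names against kubectl get pv. A detached volume that still has a PV is probably just a scaled-down workload, not an orphan.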

Permissions and the SecurityContext Gap

Another "health" issue that doesn't show up in monitoring is the Permission denied error. Longhorn mounts volumes as root by default. If you're running a container as a non-root user (which you should be), the application will fail to write to the volume immediately upon startup.

I ran into this with an n8n deployment. The pod was "Running," the volume was "Healthy," but the logs were a wall of permission errors.

The fix isn't to chmod 777 the volume (don't do that). The fix is using the fsGroup in the securityContext. This tells Kubernetes to change the ownership of the volume to a specific GID when it's mounted.

spec:
  securityContext:
    fsGroup: 1000 # GID that will own the volume; match the application's group
  containers:
    - name: n8n-app
      image: n8nio/n8n:latest
      # ... rest of config
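To confirm the fsGroup change actually took effect, test the one thing the dashboard can't see: whether the application user can write to the mount. A minimal probe, meant to be run inside the container (for example through kubectl exec). The function name is made up, and /data stands in for wherever your volume is mounted.

```shell
# Return 0 if the current user can create and delete a file in the directory.
can_write() {
    dir="$1"
    probe="$dir/.write-probe.$$"
    touch "$probe" 2>/dev/null || return 1
    rm -f "$probe"
}

# Usage inside the container (the path is a placeholder):
#   can_write /data && echo "volume is writable" || echo "volume is NOT writable"
```

Running id in the same shell shows whether the fsGroup GID made it into the process's supplementary groups.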

For databases, I also recommend being explicit about the data directory. Some images default to a path that might conflict with how the volume is mounted. I always override the data path to a sub-directory to avoid issues with the lost+found folder that Linux creates on the root of the volume.

env:
  - name: PGDATA
    value: "/var/lib/postgresql/data/pgdata"

Monitoring That Actually Matters

If you want to stop guessing, you need to move beyond the Longhorn UI. I use Prometheus and Grafana to track the actual replication state.

The metric I watch most closely is longhorn_volume_robustness, which reports 1 for healthy, 2 for degraded, and 3 for faulted. If a volume drops from healthy to degraded or faulted, I want an alert before the application notices.

One specific thing to watch for is the configured replica count versus the number of healthy replicas. If you have 3 replicas but only 2 are healthy, you've lost your safety margin: one more disk failure leaves you with a single copy of the data. This is a silent killer because the volume drops to "Degraded" but keeps serving I/O, so nothing looks wrong from the application side.
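If you don't have alerting wired up yet, you can get the same signal straight from the metrics endpoint. The sketch below assumes the longhorn_volume_robustness gauge (0 unknown, 1 healthy, 2 degraded, 3 faulted) with a volume label. The function name and the LONGHORN_METRICS_URL variable are placeholders of mine, not something Longhorn ships.

```shell
# Read Prometheus text-format metrics on stdin and print "<volume> <state>"
# for every volume whose robustness value is not 1 (healthy).
unhealthy_volumes() {
    awk '
        index($0, "longhorn_volume_robustness{") == 1 {
            state = $NF
            if (state == 1) next
            label = "unknown"
            if (state == 2) label = "degraded"
            if (state == 3) label = "faulted"
            if (match($0, /volume="[^"]*"/)) {
                # Skip the 8 characters of volume=" and drop the closing quote.
                name = substr($0, RSTART + 8, RLENGTH - 9)
                print name, label
            }
        }'
}

# Usage (the URL variable is a placeholder for your metrics endpoint):
#   curl -s "$LONGHORN_METRICS_URL" | unhealthy_volumes
```

Empty output means every volume reported healthy at the moment of the scrape. It's a spot check, not a replacement for a real alert rule.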

I've integrated these alerts into my general infrastructure monitoring. If you're managing this at scale, I highly recommend looking into predictive maintenance consulting to set up these thresholds before you hit a "disk full" panic at 3 AM.

Gotchas and Tradeoffs

I've considered using Rook-Ceph for larger workloads, and while it's more powerful, it's a nightmare to manage in a small cluster. Longhorn is the right choice for most homelabs and small production setups, but you have to accept the tradeoffs:

CPU Overhead: Longhorn runs an engine process plus one replica process per replica for every volume. If you have 100 small volumes, your CPU usage will climb just from that per-volume overhead.
Disk Pressure: Longhorn volumes are thin-provisioned, so the PVC size tells you little about what is actually consumed on disk once snapshots pile up. You need to monitor the actual node disk usage, not just the PVC usage.
PDB Conflicts: If you have strict Pod Disruption Budgets (PDBs), you might find that kubectl drain hangs forever because Longhorn is struggling to move a volume. I've written about this in Pod Disruption Budgets: Why kubectl drain Gets Stuck on Longhorn.

Lessons Learned

The biggest takeaway from managing Longhorn on bare metal is that the storage layer is not a "set and forget" component.

If you're building out your storage, start with a solid foundation. I've detailed the initial setup in Kubernetes Storage on Bare Metal: Longhorn in Practice, but the operational side is where the real work is.

Always assume the UI is lying to you. When a pod fails, check the logs for I/O errors first, then check the worker node mounts, and only then trust the green checkmark in the Longhorn dashboard. Use fsGroup for every stateful app, set strict retention on your snapshots, and for the love of your sanity, exclude detached volumes from your backup schedules.
