Your Runbook Is Written. Nobody Runs It.

The runbook was pinned. Owners in the footer. Audit passed. Then a routine change hits, the on-call opens the page, step four does not match what production actually is, and the sane move is to close the tab and do what the last hero did ... because nobody has treated "bring the doc back to truth" as release-critical work. That is not a missing wiki problem. That is a trust problem wearing a Confluence costume. The link is green. The organization is lying to itself in polite font.

The obvious failure is nobody writes it down after the firefight. Embarrassing. Loud. You can fix it with shame.

The dangerous failure is somebody did write it down, the template looks healthy, leadership pastes a URL like they just saved the company ... and the team learns the real rule anyway. The real rule is the DM thread. The real rule is whoever was on call last time. The real rule is whatever gets the release out before the executive thread wakes up again. Paper compliance feels adult. It is still a lie, and you have sat in a review where everyone nods while knowing it.

I used to think reliability was mostly capture. Turns out capture is the easy half. The hard half is enforcement when nobody is clapping.

Midnight in the incident channel ... everyone is a saint. Tuesday at 2 PM is where your religion gets tested. The sprint is tight. A PM is asking for a small exception. The runbook update is fifteen minutes nobody thinks they have. Nobody is confused about what gets rewarded in that moment. Skip the doc. Ship the thing. Be the hero who unblocks.

The incident thread gets executive attention. The runbook ticket floats. Everyone agrees it matters in principle. Nobody loses their job when it slips in practice. So the system does what systems always do when behavior is not backed by consequence. It stores risk in humans again. Then one person goes on PTO and the whole room learns what "distributed" never meant.

A runbook nobody runs is a liability dressed up as maturity.

Half a bridge

Documentation answers a shallow question ... can we describe how this works? A behavior contract answers the harder one ... will we behave the same way next week when nobody is watching?

If your reliability strategy stops at documentation, you built half a bridge. The other half is what leadership treats as non-negotiable during normal work, not only during incident theater. That is why I care whether decision records ever leave the doc and why I separate real debt from strong opinions dressed up as risk. Words that do not change behavior are expensive storytelling. Words linked to gates, owners, and verification change behavior.

Behavior contracts show up in boring places. A release checklist that cannot be marked done without a link to the updated runbook section. A named owner who verifies steps in staging (not only in a postmortem doc). A leadership review that asks for the diff in the runbook the same way it asks for the diff in code. If those are missing, your runbook is not operational memory. It is a comfort object you hug when auditors show up. I say that with love. I have hugged a few.

Same tax, slower burn

If you want the human shape of this pattern, it is the same Brent-shaped week where one person on PTO can stall three lanes before lunch. Heroics hide missing operational memory. Runbooks nobody runs are the slower version of the same tax. You do not feel it until the strong engineer is out and the org discovers it has been mistaking access to a URL for distributed capability.

Here is the social dynamic that quietly murders you. The firefight earns status because it is visible and it is fast and it makes everybody feel like something important happened. Prevention earns patience, and patience is always the first thing that gets cut when the roadmap shows up with a quarter-shaped deadline.

So you get this weird split. The outage gets the war room energy. The runbook follow-up gets a ticket that sits there while people "get back to real work." Leadership can praise mitigation in a standup because mitigation is a clean story. Recurrence work is messier, so it never wins a real slot on anyone's schedule, and after a while the org starts saying "we learned a lot" like that phrase is magic. If nothing in the system changed, you did not learn. You narrated. And the next release is not obligated to applaud your narrative.

The confession I owe

Earlier in my career, I fed this beast without meaning to.

Incident response felt like leadership because it felt alive. I showed up in channels. I asked good questions. I also rewarded velocity in planning forums in ways that made the acceptable answer obvious. The team optimized for what I treated as urgent. Prevention work is rarely urgent until it is catastrophic, so it kept getting deferred, and deferred work always looks rational until the bill arrives.

The ugly truth is I confused presence with leadership. I confused motion with maturity.

The fix is not guilt. The fix is changing what gets treated as incomplete.

What actually moved the needle

The shift that moved behavior for us was embarrassingly small on paper and expensive in discipline. We stopped treating incident closure as a single state. Mitigated meant customer impact was controlled. Closed meant prevention was verified and shipped. Most teams collapse those into one checkbox, and that is where follow-through dies. The ticket closes when the pager quiets down, the harder prevention work becomes a "later" task, and later rarely survives the next priority wave.

Splitting those states forced the conversation nobody wants in the moment and everybody wants in retrospect. No one could claim completion without prevention evidence. Release leads could see open risk in plain language. Leadership reviews stopped rewarding fast mitigation alone.

If you want every step on the habit side, start with mitigated versus actually closed. Reliability is not a documentation problem. It is a courage problem about what you enforce when the room is tired ... and who gets embarrassed when you say "we are not done."

Twenty minutes before you ship

If you want something practical that does not require a new platform, run a twenty-minute reliability follow-through review before release. Ask what failed last time in a way that could repeat, what changed in the runbook and in ownership, which action is still open, and whether this release should be gated until that is done.

If you never ask the gating question, you are doing ceremony. If you ask it and never hold the line, you are doing theater. Theater is expensive. It buys calm meetings and expensive weekends.

Speed without recurrence control is fake speed. You are borrowing from future stability, and the interest rate is ugly.

If you only visit reliability in a deck

If you are a director or VP who has not personally verified that your teams run their operational docs under normal load, not only during audits, you do not get to claim "we are mature about reliability." Maturity is what happens when the boring step survives contact with a busy Tuesday.

This is the same posture as staying close enough to the system that your judgment is not theoretical. I still read pull requests to understand how teams think across regions. I still care what happens after incidents because that is where culture becomes real. If your only relationship to reliability is a dashboard and a quarterly review, you are not managing reliability. You are managing narrative. Narrative does not reboot production.

If the same failure can happen twice, treat the second time like a leadership signal, not bad luck dressed up as engineering mystery.

What I measure when I do not trust vibes

Pick thirty days and track three numbers without drama. Repeat incidents by service area, percent of incidents with runbook updates completed before the next release, and percent of follow-through actions closed by due date. If repeat rate is flat and closure rate is low, your incident program is still response-heavy. If repeat rate drops while closure rate rises, your leadership habits are improving, and reliability is finally compounding. Numbers do not replace judgment. They stop leadership from lying to itself with heroic language.

Fix the foundation before you polish the toy

If your team is debating another model wrapper while the foundation squeaks, fix the system before you polish the prompt. AI will happily help you ship more change into a brittle operational surface. It will not volunteer to be the adult in the room when your runbook is ornamental. The problems have not changed, only the marketing around them ... and "more intelligence" does not substitute for "we agreed how we behave when nobody is watching."

When the humans align first

My team learned a version of this the hard way during the AI tooling wave, before we deserved to talk about agents like adults. We rolled tooling out, pull requests got bigger, review time did not magically grow to match, and the codebase started collecting inconsistent patterns like a junk drawer with a CI/CD pipeline. That is not a metaphor I use for cuteness. It is what it feels like when output accelerates faster than shared taste.

The fix was not a lecture. It was boring alignment work ... patterns, logging expectations, error handling defaults, architecture notes in markdown, the kind of stuff nobody wants to present at a town hall. Once humans agreed what "good" meant in our repo, the tools had something to amplify besides chaos. Guardrails stopped being a vibe and became something you could actually review. If you skip that step, your runbook is not the only ornamental artifact in the building. Your whole engineering culture is.

This week

Link the runbook to the release gate, or stop pretending it is operational memory. Borrow the habit that mattered when we split mitigated from actually closed. If critical context still lives in one person's head, your runbook is probably theater no matter how pretty the wiki is.

You do not need more templates. You need fewer stories you tell yourself while the next on-call discovers the sanctioned path and the real path are not the same thing.

So pick the fight you have been avoiding. The boring one. The one that does not look good on LinkedIn. That is the one that keeps you out of the ditch.

One email a week from The Builder's Leader. The frameworks, the blind spots, and the conversations most leaders avoid. Subscribe for free.

推荐订阅源

DEV Community