On-Call Wellness: Protecting Your Engineers from Burnout

The On-Call Burnout Epidemic

I watched three senior SREs leave our team in six months. Exit interviews all said the same thing: on-call was unsustainable.

We were spending $500K+ recruiting replacements for a problem that could have been fixed with $0 and better practices.

The Warning Signs

Before someone quits, they show these signals:

Cynicism in post-mortems — "This will never get fixed"
Alert numbness — Slow to respond, missed pages
Vacation avoidance — "I can't take time off, who would cover?"
Scope creep rejection — "That's not my problem"
Meeting silence — Previously engaged, now checked out

If you see three or more of these in someone on your team, they're already halfway out the door.

What We Changed

1. Hard Cap on Pages

We set a maximum of 2 pages per 8-hour on-call shift. If someone gets paged more than that, the secondary automatically takes over and the incident is escalated as a process failure.

on_call_policy:
  max_pages_per_shift: 2
  shift_duration_hours: 8
  overflow_action: "escalate_to_secondary"
  overflow_review: "weekly_ops_review"

2. Follow-the-Sun Rotation

We stopped asking people to be on-call at 3am. With team members across US timezones, we created overlapping business-hours shifts:

Shift A (Eastern):  6am - 2pm ET
Shift B (Central):  11am - 7pm CT  
Shift C (Pacific):  2pm - 10pm PT
Overnight:          Managed by alert automation + escalation

Nobody gets paged between 10pm and 6am unless it's a true P1.

3. On-Call Compensation

We implemented:

$500 flat fee per on-call week
$200 per off-hours page
Comp day after any overnight incident > 30 minutes
On-call swaps require zero management approval

4. The "Toil Budget"

Each engineer gets a toil budget: maximum 30% of their time on operational work. If toil exceeds 30%, they're pulled from on-call until the team automates the excess.

Weekly toil tracking:
  Alert response:     4 hours
  Manual deployments: 2 hours  
  Config updates:     1 hour
  Ad-hoc debugging:   3 hours
  ─────────────────────────
  Total:              10 hours (25% of 40hr week) ✓

5. Quarterly On-Call Reviews

Every quarter, we review:

Pages per person
Off-hours disruptions
Toil percentages
Team sentiment survey
Attrition risk signals

The Results

Metric	Before	After (6 months)
Attrition rate	40%/year	8%/year
Pages per shift	4.7	1.2
Off-hours pages	12/week	2/week
Team NPS	-15	+45
Recruitment cost saved	-	~$400K/year

The Key Insight

On-call wellness isn't a perk. It's a business decision. Replacing a senior SRE costs $150-200K in recruiting, onboarding, and lost productivity. Preventing burnout costs almost nothing.

If you're looking to reduce on-call toil and protect your team from burnout, check out what we're building at Nova AI Ops.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

推荐订阅源

DEV Community