Prevent after-hours paging burnout

When to apply this runbook

Getting paged for the services you own is a common DevOps practice. However, excessive paging (especially during non-work hours) can cause burnout and may indicate broader issues. In our experience, managers are not always fully aware of the pain engineers go through because of paging. Here are the signals to look for:

  • One or more engineers gradually becoming disengaged or negative during team meetings, or looking very tired.
  • Conversely, teammates reporting zero complaints about after-hours pages even though anecdotal evidence suggests the paging load is high.
  • A higher degree of conflict between people about SLAs, post-mortems, etc., or an unexplained avoidance of conflict.

Instructions

  1. Get the numbers

    First, make sure you are on top of the rate of pages, especially after-hours:

    • In many cases, your alerting / paging software will have an analytics section where you can look at trends for each rotation.
    • For more complex cases, you might need to script against several systems or rely on proxy metrics such as message counts in an #incident Slack channel.
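    If your paging tool lets you export raw page records, a short script can produce the after-hours count per engineer. The sketch below is only an illustration: the `(engineer, paged_at)` tuple shape, the 09:00 to 18:00 weekday working hours, and the sample data are all assumptions, and timestamps are assumed to already be in each engineer's local time.

```python
from collections import Counter
from datetime import datetime, time

# Assumed working hours; adjust to your team's actual schedule.
WORK_START = time(9, 0)
WORK_END = time(18, 0)


def is_after_hours(paged_at: datetime) -> bool:
    """Return True if the page arrived on a weekend or outside working hours.

    `paged_at` is assumed to already be in the engineer's local time.
    """
    if paged_at.weekday() >= 5:  # Saturday = 5, Sunday = 6
        return True
    return not (WORK_START <= paged_at.time() < WORK_END)


def after_hours_counts(pages):
    """Count after-hours pages per engineer.

    `pages` is an iterable of (engineer, paged_at) tuples, a hypothetical
    shape for data exported from your paging tool.
    """
    counts = Counter()
    for engineer, paged_at in pages:
        if is_after_hours(paged_at):
            counts[engineer] += 1
    return counts


# Made-up sample data for illustration only.
pages = [
    ("alice", datetime(2024, 3, 4, 2, 15)),   # Monday, 02:15
    ("alice", datetime(2024, 3, 4, 14, 0)),   # Monday, 14:00 (working hours)
    ("bob", datetime(2024, 3, 9, 11, 30)),    # Saturday, 11:30
    ("alice", datetime(2024, 3, 5, 22, 45)),  # Tuesday, 22:45
]
print(after_hours_counts(pages))
```

    Counting per engineer rather than per rotation makes it easier to spot one person absorbing most of the load.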

    Second, get a pulse on the team:

    • We recommend running a mini-survey with questions like "What is your level of energy compared to the past month?". Look for negative answers as well as answers that seem too good to be true. The latter usually mean that your team is already past the learned helplessness stage, a clear sign that you should act immediately.
    • Bring up the topic in 1:1s.
  2. Take action

    To prevent burnout:

    • Assess the signal-to-noise ratio of every alert your team receives. Cut the noisy alerts aggressively.
    • For teams that own central parts of the stack (e.g. databases), you might need to create much larger rotations and train more people. The volume of pages might be naturally higher due to the sheer scale of these services.
    • Prioritize solving the root causes behind the instability of your services.
    • Establish goals around a maximum number of pages per week for each team member. Hold yourself, not the engineers, accountable for meeting these goals.
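    A weekly page goal is easier to hold yourself to when breaches are flagged automatically. The following is a minimal sketch, assuming pages are available as `(engineer, day)` tuples and using a placeholder limit of 2 pages per week; both the data shape and the limit are hypothetical and should be replaced with your own.

```python
from collections import Counter
from datetime import date

MAX_PAGES_PER_WEEK = 2  # placeholder goal, not a recommendation


def weekly_breaches(pages, limit=MAX_PAGES_PER_WEEK):
    """Return {(iso_year, iso_week, engineer): count} for counts above `limit`.

    `pages` is an iterable of (engineer, day) tuples, a hypothetical shape
    for data exported from your paging tool.
    """
    counts = Counter()
    for engineer, day in pages:
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1], engineer)] += 1
    return {key: n for key, n in counts.items() if n > limit}


# Made-up sample data for illustration only.
pages = [
    ("alice", date(2024, 3, 4)),
    ("alice", date(2024, 3, 5)),
    ("alice", date(2024, 3, 7)),
    ("bob", date(2024, 3, 6)),
    ("alice", date(2024, 3, 12)),  # falls in the following ISO week
]
breaches = weekly_breaches(pages)
print(breaches)
```

    Grouping by ISO week keeps the buckets aligned with Monday-to-Sunday weeks, which usually matches how rotations are scheduled.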
  3. Create a self-correcting process

    To prevent things from getting worse again:

    • Set up a script or other automation that sends you an after-hours paging report on a daily or weekly basis.
    • Regularly survey engineers about page load.
    • Make after-hours work a regular topic of retrospectives.
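    The recurring report can be as simple as a plain-text summary that compares the current period with the previous one. This sketch covers only the formatting step and assumes you already have per-engineer after-hours counts for both periods; the function name, the dictionary inputs, and the sample numbers are illustrative, and scheduling and delivery (cron, email, chat) are left to whatever tooling you already use.

```python
def format_report(current, previous):
    """Render a plain-text after-hours report.

    `current` and `previous` map engineer name -> after-hours page count
    for this period and the one before it (hypothetical inputs).
    """
    lines = ["After-hours pages (this period vs. previous)"]
    for engineer in sorted(set(current) | set(previous)):
        now = current.get(engineer, 0)
        before = previous.get(engineer, 0)
        delta = now - before  # positive means the load got worse
        lines.append(f"{engineer}: {now} ({delta:+d})")
    lines.append(f"Total: {sum(current.values())}")
    return "\n".join(lines)


# Made-up sample data for illustration only.
report = format_report({"alice": 4, "bob": 1}, {"alice": 2, "carol": 3})
print(report)
```

    Showing the change next to each count makes a slow upward drift visible before it becomes the new normal.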

Going further

Okay automates the process of receiving after-hours paging reports. Please reach out to us if you need help!