
The Shaq Effect: Incident Management Hero Syndrome

We call it the Shaq effect, and you may not know it, but you’ve probably seen it.

The 2 a.m. outage page goes out, and there’s one engineer who’s always the first to respond. They identify the problem, determine the affected services or product areas, fix the issue (or know who to call to fix the issue), wake up the VP, draft the messages to send to customers and stakeholders, create the tickets to address why things went bad. Then at 9 a.m., they go back to the job they were hired to do. Backs are patted, and life goes back to normal … until the next 2 a.m. page.

At FireHydrant, we call this scenario — where a single “hero” is almost solely responsible for the entire incident management program — the Shaq effect.

Shaquille O’Neal is one of the most celebrated NBA players of all time. He played for six teams over his 19-year career and won countless awards — and for good reason. When the team needed two points, they knew they could throw the ball to Shaq, and let him go to work. He’d inevitably push someone around (hard not to do at 7’1” and 300+ pounds), dunk the ball, and the crowd would go wild.

The skill set (and probably the size) are different, but there are a lot of engineers playing the Shaq role at their company. And it may not initially seem like a problem. Games get won, incidents get resolved, it’s true. But when one person is saving the day every time, you’re not setting your organization or team up for success.

A win doesn’t always have to come from a backboard-shattering slam dunk, and relying on the most dominant player to save the day isn’t a scalable solution for an organization’s continued success.

If you see yourself in this scenario, you’re not alone. So many of the teams we talk to feel this pain. I was a Shaq myself at a previous company. It’s time to pass the ball and invest in the kind of strategic incident management practice that will help you make gains in overall reliability. I recommend starting with just two small steps and growing from there. This will require some work on the front end from your Shaq, but it’ll pay off in your long game.

  1. Document what Shaq does during an incident

As a company’s technology platform evolves, subject matter experts and incident responders can include engineers from varying technical backgrounds and specialties — who may not have the technical, tribal, or social knowledge to understand all of the intertwined components in operation. It becomes more important than ever to create a single source of truth for your service catalog, dependencies documentation, and incident management communication workflow. If that currently all lives in your Shaq’s head, what happens when they’re not around? Your team might be looking at a big fat L on that day.

As a small first step, start formalizing what we call an incident management runbook. This doesn’t have to mean setting aside a full day to write a step-by-step process. It can instead look like your Shaq simply “talking aloud” during an incident.

The next time they respond to a page, ask them to start a thread in the incident channel where they literally just think out loud. The key here is that they overcommunicate; they shouldn’t assume the reasoning behind any action is already understood. It can be helpful for them to answer questions like:

  • I just got paged, what’s the first thing I do?
  • Where are the places I look to check on the status of our services?
  • How do I know who to call when I discover what service is down?
  • How do I know how to revert the last deploy for that service?
  • What impact does this incident have on customers and internal teams?
  • What are my thoughts on how to fix this issue going forward?

By doing this, you can document and operationalize the steps your Shaq takes organically, in their head, during an incident. Once you’ve agreed on those steps, the idea is to eventually get alignment by adding them to the company wiki or an incident management tool, breaking that knowledge silo.
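If it helps to picture what that documentation might look like, here is a minimal sketch in Python that treats the questions above as a runbook template and collects the notes from a "think out loud" thread under each one. Everything in it is hypothetical and for illustration only: the `Runbook` and `RunbookStep` names, the example notes, and the plain-text output format are assumptions, not a FireHydrant feature or API.

```python
from dataclasses import dataclass, field


@dataclass
class RunbookStep:
    question: str  # the prompt the responder answers out loud
    notes: list[str] = field(default_factory=list)  # what they said or did


@dataclass
class Runbook:
    title: str
    steps: list[RunbookStep]

    def record(self, step_index: int, note: str) -> None:
        """Attach a note from the incident thread to one of the questions."""
        self.steps[step_index].notes.append(note)

    def unanswered(self) -> list[str]:
        """Questions that still live only in someone's head."""
        return [step.question for step in self.steps if not step.notes]

    def to_wiki_text(self) -> str:
        """Render a plain-text draft that could be pasted into a wiki page."""
        lines = [self.title, ""]
        for number, step in enumerate(self.steps, start=1):
            lines.append(f"{number}. {step.question}")
            for note in step.notes:
                lines.append(f"   - {note}")
        return "\n".join(lines)


# The questions from the list above, used as the template.
QUESTIONS = [
    "I just got paged, what's the first thing I do?",
    "Where are the places I look to check on the status of our services?",
    "How do I know who to call when I discover what service is down?",
    "How do I know how to revert the last deploy for that service?",
    "What impact does this incident have on customers and internal teams?",
    "What are my thoughts on how to fix this issue going forward?",
]


def new_runbook(title: str) -> Runbook:
    return Runbook(title, [RunbookStep(question) for question in QUESTIONS])


if __name__ == "__main__":
    # Hypothetical notes copied from an incident thread.
    runbook = new_runbook("Incident runbook (draft)")
    runbook.record(0, "Acknowledge the page and open a thread in the incident channel.")
    runbook.record(3, "Describe the rollback steps and link to where they are documented.")
    print(runbook.to_wiki_text())
    print()
    print("Still undocumented:", len(runbook.unanswered()), "questions")
```

The `unanswered` check is the useful part of the design: after each incident it shows which steps are still undocumented, so the runbook can fill in gradually instead of requiring a full day of writing.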

  2. Turn your star player into a coach

You know who looks up to Shaq? Everyone (and not only ’cause he’s a tall dude). And the same is probably true of your Shaq. But the other members of the team also need the opportunity to learn, to expand their own skills, and even to bask in the hero’s glow once in a while.

The next time an incident arises, ask your Shaq to take on the Phil Jackson role and serve as coach. They can still provide guidance, give responders the info they need (or better yet, help them figure out where to find it), and lend a hand when they’re asked. But they’ll also be providing other members of the team the opportunity to step up and learn or hone new skills. It’ll also build their confidence in an area that many engineers feel insecure about.

Once Shaq has coached a couple of times, maybe they miss a game or two. That gives your hero (who, let’s face it, might be facing burnout at this point) a break. It also flags a weakness up the chain of command, because there’s no better way to do that than by demonstrating what happens when your single point of failure isn’t around to fix things. That can lead to the allocation of resources you might need to build a more formal incident program.

A winning strategy

These two first steps can get you started on a path that moves your team away from whack-a-mole-style incident response to more strategic and holistic incident management. If your team is suffering from the Shaq effect, you might be winning games, but you may also be unintentionally masking the need for better incident management practices. This isn’t your fault though, and you’re not alone. By helping our companies shift toward a better incident management posture, we can improve things for our customers, for our teammates, and for ourselves.


-Malcolm Preston, Staff Software Engineer, FireHydrant