Another one of my mantras (witticisms? sayings?) is:
People don’t remember what you did when things were going well. People remember what you do when things go wrong.
If a leader walks up to your desk and things are going great, then they usually don’t remember the interaction. But if they walk into your office because the website is down, they’re definitely going to remember how you behaved, how quickly you solved the problem, and how you kept the problem from ever happening again.
Many people have told me that I’m calm and focused during a crisis. This is partly because of this mantra – people remember what you do when things go wrong. But it also goes back to my childhood.
I was an excellent swimmer (e.g., I set the all-time record on my community swim team for the 14-and-under 50m breaststroke). And excellent swimmers become lifeguards. And lifeguards learn CPR and first aid. And the relevant bit – how to focus in a crisis – has stuck with me.
Within the professional setting, I learned that when things go wrong, that’s the time for you to think very hard about what the BEST thing you can do is, and then DO IT. Don’t let fear hold you back or drive up your anxiety. Don’t hesitate, but be slow, thoughtful, and calm. In a crisis, slow is smooth, and smooth is fast.
And I also learned that the best way to deal with a crisis is to be prepared for it, train for it, and make dealing with it routine. That’s why CPR has a book, why ER doctors ask you what day it is when you roll into the room, and why critical software systems have disaster recovery plans.
Good leaders invest time to ensure that their teams’ best practices and standard operating procedures (SOPs) are well documented. Ideally, get them written before the code goes live, but realistically, as soon after the launch as you can. Document what you should do when things go wrong, BEFORE they go wrong, because eventually things will go spectacularly wrong.
The first and most important thing is “stop the bleeding.” On the internet, this usually distills down to “never roll forward; always roll back,” because 9 times out of 10, that’ll stop the problem. Yes, it means that feature you rolled out earlier today isn’t live anymore, but hey, the checkout page is back, and we’re back to selling products, so that’s a good thing.
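To make “roll back” concrete, here’s a minimal sketch in Python. Everything in it is invented for illustration (the Release class, the in-memory history); in real life this information lives in your deploy tooling. The point is that the rollback decision needs nothing beyond “what was the last version that was healthy?”:

```python
from dataclasses import dataclass

# Hypothetical sketch of "never roll forward; always roll back."
# The Release records stand in for whatever your deploy tooling actually tracks.

@dataclass
class Release:
    version: str
    healthy: bool  # did monitoring look good while this release was live?

def pick_rollback_target(history: list[Release]) -> Release:
    """Return the newest known-good release -- the last good state."""
    for release in history:  # history is ordered newest-first
        if release.healthy:
            return release
    raise RuntimeError("no known-good release to roll back to; time to debug forward")

if __name__ == "__main__":
    history = [
        Release("v42", healthy=False),  # today's launch; checkout broke
        Release("v41", healthy=True),   # yesterday's release: roll back to this
        Release("v40", healthy=True),
    ]
    print(f"rolling back to {pick_rollback_target(history).version}")
```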
During this phase, don’t ask why it happened. Don’t focus on finding the bug in the code, and don’t try to blame someone for writing that bug. Sometimes, yeah, you need to understand the bug to understand the problem. But most of the time? Roll back the deployment. Get to the last good state. Stop the bleeding.
But sometimes rolling back isn’t an option, or doesn’t solve the problem. In those cases, take in all the information available. Scour your dashboards. Listen carefully, find the thing that smells fishy, and relentlessly poke at every assumption. When you’re brainstorming hypotheses, look for one that explains more symptoms than the others. Then look for other signs and symptoms that would be present if that hypothesized root cause were true. If they’re there, you’re closer to the truth. If not, try another idea. Again, your goal is to stop the bleeding; prioritize whatever does that.
Once you’ve stopped the bleeding, then start doing the investigation into why it happened. This can happen right away, or the next day, or during the next ops review meeting, but it has to happen. If you want to learn how to do that, check out the “5 whys” [https://en.wikipedia.org/wiki/Five_whys] for a great summary. Amazon uses the 5 whys because ultimately, you need to be able to ask, and then answer, the real question: how can we do better next time, or ideally, keep this from ever happening again?
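As a purely hypothetical illustration (the outage and every answer below are invented, just to show the shape of the exercise), a 5-whys chain might look like this. Each answer becomes the next question, and the last answer should be a process gap, not a person:

```python
# Invented 5-whys chain for a made-up checkout outage.
# The point: keep asking "why" until the answer is a process you can fix.
five_whys = [
    ("Why did checkout go down?", "The new pricing service started returning errors."),
    ("Why did it return errors?", "It couldn't open connections to its database."),
    ("Why couldn't it open connections?", "The connection pool was sized for test traffic."),
    ("Why was it sized for test traffic?", "The launch checklist says nothing about load settings."),
    ("Why doesn't the checklist cover that?", "Nobody owns keeping the launch checklist current."),
]

for question, answer in five_whys:
    print(f"{question}\n  -> {answer}")
```

The fix that matters isn’t “bump the pool size” (though you should); it’s “give the launch checklist an owner.”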
To answer that question, you’ll have to ask others: What changes to our processes should we make to prevent a recurrence? Did the SOP tell you how to identify the root cause? Did it tell you how to stop the bleeding? Was it clear enough for you? Do we need to update it? If things had gone worse, how would they have done so? Assuming that worst-case scenario, what should we do that we didn’t do this time?
Every time an engineer gets paged due to an issue, as a leader, you should ask those questions. My goal was always to create a culture where we get paged less often next week than we did this week. This kind of continuous improvement means that two weeks (or six months, or two years) from now, we’ll be building new code, rather than fixing the code we’re writing today.