Have you ever caused a global outage?

3 points, posted 10 hours ago
by soychai

1 Comment

soychai

10 hours ago

I run the web series Humans of Reliability, and in our most recent episode I interviewed Chris Ferraro (VP of Platform Eng at Garner Health), and he talked about the time he caused a global outage at Microsoft.

Here's a (slightly condensed) retelling of his story:

I'm one of three people (at least that I know of) who’ve brought Microsoft down globally. We were making a group policy change, two of us at the same time. There was a bug in the software and voila. Nobody cares exactly how it happened, it just happened. We were completely down for 15 minutes. Fortunately, one of my engineering partners, Jason Hughes, and I had put in some reliability tooling right before this particular change, and we were alerted promptly. But it was still a global outage, and there’s really no “good” or “short” global outage, especially at Microsoft’s scale. I'm not gonna blow any sunshine up anybody here. We were able to respond quickly and bring Microsoft back up quickly, but it was a bad day.

I sat through the postmortem, and that was memorable to say the least. I've never had that many people in a room who are all concerned about what I was about to say. I think it's probably the most formative event in my life when it comes to being able to manage through chaos and adversity, and now being able to really be there for engineers when things go wrong—because they will. It showed me how we can all come together in those moments and make the situation better, not worse. But it definitely was the only moment in my career I ever thought “shoot, should I just walk out the door right now?”

The thing that kept me coming back, at least in the immediate sense, was curiosity. I kept thinking “How the F did that happen?” I went home and tried to recreate it. It didn't work. But I was like, the only thing that was anomalous was this one thing, so I was able to go back to the lab and keep trying to figure it out. Thank you to my managers for allowing me to have that crack at it, because I think I would have gone insane without it, but I went back. I found the thing that I thought was anomalous. I kept going back and I tested in the lab and we nailed it. So initially, it was just this burning curiosity, but long term—man, problem solving is fun. That’s sort of what life is about, it’s the reason why I was an engineer in the first place.

In that moment, I also had an engineering manager at Microsoft who gave me some feedback, and he chose such a harsh moment and delivery in doing so. I viscerally remember it. But that lesson came to fruition when I was a CTO at a crypto startup. One of my engineers brought down prod by making a change to dev. Everybody knows environmental separation, we all say it's this golden way, but we never do it right the first time. Anyway, this happened to this engineer and I just looked at him and I remembered what that felt like, and that I could make it better or I could make it worse. So I said, “Hey, you brought down a crypto startup with no customers. We're gonna survive. I'm here. I brought down Microsoft globally. Let's push through. Don't worry. All your friends right now, they're just upset they got to work the weekend, they're not upset with you. We're gonna get through it. You're a great engineer. Let’s play on.”