What are your worst on-call stories?

18 points, posted 10 months ago
by holtonbeeps

Item id: 41602252

5 Comments

al_borland

9 months ago

I was pressured into relocating to a different state. On a holiday weekend I drove 6 hours home to see my family. Just as everyone was showing up, I got a call from my boss saying he desperately needed me on a huge outage. I retreated to a spare room and got on the call. There didn't seem to be much for me to actually do. At one point he asked me to do a VLOOKUP in Excel, but someone else took care of it, as they were already managing the spreadsheet.

The boss told me I should be in the office the next day, as there would be a lot of cleanup to do. So once I got off the call, I drove the 6 hours back. 12 hours in the car for maybe 2 hours of seeing the family.

The next day I walked into the office and the ops team asked, “What are you doing here? There’s nothing for you to do, go home.”

Holiday weekend killed, family time pushed off another 6 months until the next holiday worth the 12-hour drive, all for optics and being available in case they needed me, when they didn’t actually need me.

The worst on-call is the on-call that never should have happened, because your boss doesn’t understand what you do and has no boundaries.

cdaringe

10 months ago

Explosion at the factory. No humans were injured, but the R&D and manufacturing labs all halted due to side effects of the explosion. On-call back then was just answering your phone. I foolishly answered. It was wrong of my management to place such heavy responsibility on me to lead our sliver of the recovery. I was fresh out of school; I did not understand at a low level how the products, the facilities, or the business operations fundamentally worked, which would have helped immensely. Sure, there were senior engineers around, but they had other tasks in those critical moments as well. It took a while to feel supported. They really needed an experienced engineer running the show, not the most junior one. Anyway, I worked for a month straight (longer days, weekends). A handful of us did. I was actively training for bike racing back then, and I knew my KPIs. Let’s just say that my fitness tanked after being uber stressed out for a month straight.

ezekg

10 months ago

I wrote about my worst on-call experience here [0]. Back in Feb of this year, I had a unique customer workload take down my SaaS in the middle of the night, 2 nights in a row. It took a long time to find the root cause, but it ended up being an inefficient uniqueness lib I was using for background jobs. This particular customer's workload was queuing up millions of background jobs in Redis at a certain time every day, but each time a job was queued, the entire job set was iterated (synchronously) to assert uniqueness. Obviously, this didn't scale.
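In other words, every enqueue scanned the whole set of already-queued jobs, so enqueuing n jobs costs on the order of n^2 comparisons. Here's a minimal Python/Redis sketch of that failure mode and the usual constant-time alternative (hypothetical names, not the actual library or Keygen's code):

    # Hypothetical sketch: contrast an O(n)-per-enqueue uniqueness scan
    # with an O(1) membership check backed by a Redis set.
    import json
    import redis

    r = redis.Redis()

    def enqueue_naive(queue, job):
        """Walks every queued job to assert uniqueness -- O(n) per enqueue,
        so queuing millions of jobs degenerates into O(n^2) total work."""
        payload = json.dumps(job, sort_keys=True)
        for existing in r.lrange(queue, 0, -1):   # full synchronous scan
            if existing.decode() == payload:
                return False                      # duplicate, skip it
        r.rpush(queue, payload)
        return True

    def enqueue_with_lock_set(queue, job):
        """Keeps a side set of uniqueness keys; SADD's return value is the
        duplicate check -- O(1) per enqueue."""
        payload = json.dumps(job, sort_keys=True)
        if r.sadd(queue + ":unique", payload) == 0:
            return False                          # already queued
        r.rpush(queue, payload)
        return True

The second version pushes the membership test into Redis itself, which is roughly what set- or lock-key-based uniqueness plugins do.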

I'd rank these 2 nights as the most stressful times of my career. I wrestled with sleeplessness, hopelessness, imposter syndrome, etc.

[0]: https://keygen.sh/blog/that-one-time-keygen-went-down-for-5-...

aristofun

10 months ago

After half a year in production, some malicious user finally hit my code with an injection that brought down the high-throughput system.

This was surprising for 2 reasons:

1. That it took so long for anyone to come up with the injection.

2. That the injection was pretty silly and non-obvious (which kinda explains the first point, lol). It wasn't your typical SQL/JS/CSS injection; it had something to do with localization and the cumbersome way it was implemented in some Java libraries.

noop_joe

10 months ago

One of the most difficult challenges with incidents is dispelling the initial conjecture. Something bad happens and a lot of theories flood the discussion. Engineers work to prove or disprove those theories, but the story around one of them can take on a life of its own outside the dev team. What ends up happening post-incident is a lot of work not only to show that the problem was the result of XYZ, but also that it definitely wasn't the result of ABC.

I was responsible for wsj.com for a few years. The homepage, articles and section fronts were considered dial-tone services (cannot under any circumstances go down). My job was to lead the transition from the on-prem site to the redesigned cloud site. As you can imagine, there were a few hiccups along that journey.

One particular incident came when reporters broke the news of a few unrelated industry computer system failures (including in finance). Because the story involved a financial system, people flocked to wsj.com, and the spike in traffic was so large it knocked us out. Now other news outlets were reporting wsj.com as down. Unfortunately, there was a perception that all of these incidents were related, part of a coordinated hacking event.

For every minute of service interruption, I would need to spend hours post-incident making sure the causes were understood and verified, and that stakeholders knew what they were.

All in all, the on-call experiences were fine. Sure, people were tired if incidents happened in the middle of the night, but the team was supportive and there was a culture of direct problem solving that didn't add _extra_ stress.

Fun stuff.