ezekg
9 hours ago
I wrote about my worst on-call experience here [0]. Back in February of this year, I had a unique customer workload take down my SaaS in the middle of the night, 2 nights in a row. It took a long time to find the root cause, but it ended up being an inefficient uniqueness lib I was using for background jobs. This particular customer's workload was queuing up millions of background jobs in Redis at a certain time every day, but each time a job was queued, the entire job set was iterated (synchronously) to assert uniqueness. The cost of every enqueue grew with the number of jobs already queued, so the total work grew roughly quadratically. Obviously, this didn't scale.
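To make the scaling problem concrete, here's a rough sketch. It is not the actual library's code: it uses plain in-memory Python instead of Redis, and every name in it (`ScanningQueue`, `DigestQueue`, `job_digest`, `SyncJob`) is made up for illustration. The first queue asserts uniqueness by walking every queued job; the second keeps a set of job digests, so each check is a single lookup.

```python
import hashlib
import json


def job_digest(job_class, args):
    """Stable fingerprint: the same class and args always give the same digest."""
    payload = json.dumps([job_class, args], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ScanningQueue:
    """Hypothetical queue that scans every queued job on enqueue: O(n) each."""

    def __init__(self):
        self.jobs = []
        self.comparisons = 0  # counts how many queued jobs were inspected

    def enqueue(self, job_class, args):
        for queued_class, queued_args in self.jobs:
            self.comparisons += 1
            if queued_class == job_class and queued_args == args:
                return False  # duplicate found, job rejected
        self.jobs.append((job_class, args))
        return True


class DigestQueue:
    """Hypothetical queue that tracks digests in a set: O(1) average each."""

    def __init__(self):
        self.jobs = []
        self.digests = set()
        self.lookups = 0  # counts set membership checks

    def enqueue(self, job_class, args):
        digest = job_digest(job_class, args)
        self.lookups += 1
        if digest in self.digests:
            return False  # duplicate found, job rejected
        self.digests.add(digest)
        self.jobs.append((job_class, args))
        return True


if __name__ == "__main__":
    n = 2000
    scanning, digest = ScanningQueue(), DigestQueue()
    for i in range(n):
        scanning.enqueue("SyncJob", [i])
        digest.enqueue("SyncJob", [i])
    # Scanning inspects n * (n - 1) / 2 jobs in total; the digest queue does n lookups.
    print(scanning.comparisons, digest.lookups)
```

With n unique jobs, the scanning version inspects n(n-1)/2 queued jobs in total, while the digest version does n lookups. At millions of jobs that gap is the difference between a normal night and an outage. In Redis, a per-job digest key (for example via `SET` with the `NX` option) would be one way to get the same constant-time check, though that is my assumption and not necessarily how any particular library does it.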
I'd rank these 2 nights as the most stressful times of my career. I wrestled with sleeplessness, hopelessness, imposter syndrome, etc.
[0]: https://keygen.sh/blog/that-one-time-keygen-went-down-for-5-...