time0ut
4 months ago
Interesting day. I've been on an incident bridge since 3AM. Our systems have mostly recovered now with a few back office stragglers fighting for compute.
The biggest miss on our side is that, although we designed a multi-region-capable application, we could not run the failover process because our security org migrated us to Identity Center and only put it in us-east-1, hard-locking the entire company out of the AWS control plane. By the time we'd gotten the root credentials out of the vault, things were coming back up.
Good reminder that you are only as strong as your weakest link.
SOLAR_FIELDS
4 months ago
This reminds me of the time Google’s Paris data center flooded and caught fire a few years ago. We weren’t actually hosting compute there, but we were hosting compute in a nearby AWS EU datacenter, and it just so happened that DNS resolution for our Google services routed to Paris first because it was the closest. The temp fix was pretty fun: that was the day I found out that the /etc/hosts of every pod in a deployment can easily be modified globally in Kubernetes, AND that it was compelling enough to actually want to do it. Normally you would never want a /etc/hosts entry controlling routing in kube like this, but this temporary kludge of a shim was the perfect level of abstraction for the problem at hand.
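To illustrate the kind of thing I mean, here's a rough sketch of that sort of patch using the Kubernetes Python client (the deployment name, namespace, hostname, and IP below are made up for the example, not what we actually ran):

    # Pin a hostname to a fixed IP for every pod in a deployment by adding a
    # hostAliases entry to the pod template; Kubernetes renders hostAliases
    # into each container's /etc/hosts.
    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    patch = {
        "spec": {
            "template": {
                "spec": {
                    "hostAliases": [
                        # Placeholder hostname and TEST-NET IP for illustration.
                        {"ip": "203.0.113.10", "hostnames": ["www.googleapis.com"]}
                    ]
                }
            }
        }
    }

    # Strategic-merge patch; the changed pod template triggers a rolling restart.
    apps.patch_namespaced_deployment(name="api", namespace="prod", body=patch)

Because hostAliases lives in the pod template, the change rolls out to every replica, and it's just as easy to rip back out once the real resolver is healthy again.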
1970-01-01
4 months ago
I remember Facebook had a similar story when they botched their BGP update and couldn't even access the vault. If you have circular auth, you don't have anything when somebody breaks DNS.
ttul
4 months ago
There is always that point you reach where someone has to get on a plane with their hardware token and fly to another data centre to reset the thing that maintains the thing that gives keys to the thing that makes the whole world go round.
vladvasiliu
4 months ago
> Identity Center and only put it in us-east-1
Is it possible to have it in multiple regions? Last I checked, it only accepted one region. You needed to remove it first if you wanted to move it.
barbazoo
4 months ago
Wow, you really *have* to exercise the region failover to know if it works, eh? And I imagine that confidence gets weaker the longer it’s been since the last failover, too. Thanks for sharing what you learned.
shawabawa3
4 months ago
for what it's worth, we were unable to log in with root credentials anyway
i don't think any method of auth was working for accessing the AWS console
reenorap
4 months ago
It's actually a good reminder that if you don't test the failover process, you have no failover process. The CTO or VP of Engineering should be held accountable for not making sure that the failover process is tested multiple times a month and is seamless.
hinkley
4 months ago
Too much armor makes you immobile. Will your security org be taken to task for this? This should permanently slow down all of their future initiatives, because it’s clear they have been running “faster than possible” for some time.
Who watches the watchers?
ej_campbell
4 months ago
Totally ridiculous that AWS doesn't make it multi-region by default, or at least warn you heavily that your multi-region service is tied to a single region for identity.
The usability of AWS is so poor.
ct520
4 months ago
I always find it interesting how many large enterprises have all these DR guidelines but never actually test them. Glad to hear that everything came back alright.
ransom1538
4 months ago
People will continue to purchase Multi-AZ and multi-region even though you have proved what a scam it is. If the east region goes down, ALL of Amazon goes down; feel free to change my mind. STOP paying double rates for multi-region.
ozim
4 months ago
Sounds like a lot of companies need to update their BCP after this incident.
michaelcampbell
4 months ago
"If you're able to do your job, InfoSec isn't doing theirs"