croemer
10 hours ago
Preliminary post incident review: https://azure.status.microsoft/en-gb/status/history/
Timeline
15:45 UTC on 29 October 2025 – Customer impact began.
16:04 UTC on 29 October 2025 – Investigation commenced following monitoring alerts being triggered.
16:15 UTC on 29 October 2025 – We began the investigation and started to examine configuration changes within AFD.
16:18 UTC on 29 October 2025 – Initial communication posted to our public status page.
16:20 UTC on 29 October 2025 – Targeted communications to impacted customers sent to Azure Service Health.
17:26 UTC on 29 October 2025 – Azure portal failed away from Azure Front Door.
17:30 UTC on 29 October 2025 – We blocked all new customer configuration changes to prevent further impact.
17:40 UTC on 29 October 2025 – We initiated the deployment of our ‘last known good’ configuration.
18:30 UTC on 29 October 2025 – We started to push the fixed configuration globally.
18:45 UTC on 29 October 2025 – Manual recovery of nodes commenced while gradual routing of traffic to healthy nodes began after the fixed configuration was pushed globally.
23:15 UTC on 29 October 2025 – PowerApps mitigated its dependency, and customers confirmed mitigation.
00:05 UTC on 30 October 2025 – AFD impact confirmed mitigated for customers.
xnorswap
9 hours ago
33 minutes from impact to status page for a complete outage is a joke.
neya
8 hours ago
In Microsoft's defense, Azure has always been a complete joke. It's extremely developer unfriendly, buggy and overpriced.
michaelt
8 hours ago
If you call that defending Microsoft, I'd hate to see what attacking them looks like :)
dijit
7 hours ago
Not to put too fine a point on it, but if I have a dark passenger in my tech life it is almost entirely caused by what Microsoft wants to inflict on humanity, and, more importantly, how successful they are at doing it.
amelius
5 hours ago
In commenter's defense, their comment makes no sense.
dude250711
7 hours ago
Save it for when they stick Copilot into Azure portal.
alias_neo
6 hours ago
Ha, you haven't used it recently, have you? Copilot is already there, and it can't do a single useful thing.
Me: "How do I connect [X] to [Y] using [Z]?"
Copilot: "Please select the AKS cluster you'd like to delete"
antonvdi
6 hours ago
They're already doing that.
nflekkhnnn
5 hours ago
Actually, one of the inventors of k8s was the project lead for Copilot in the Azure portal, and deployed it over a year ago.
madjam002
7 hours ago
My favourite was the Azure CTO complaining that Git was unintuitive, clunky and difficult to use
macintux
6 hours ago
Isn’t it?
Hilift
6 hours ago
Ironically, the GitHub Desktop Windows app is quite nice.
dspillett
6 hours ago
Yes. But the point is that, compared to Azure in places, the statement was very much the pot commenting on the kettle's sooty arse. And Git makes no pretence of being particularly friendly, just of doing a particular job efficiently.
lawgimenez
5 hours ago
Sounds like he’s describing Windows phone.
ac2u
4 hours ago
Feel like I have to defend Windows Phone here; I liked it! Although I swore off the platform after the hardware I bought wasn't eligible for the Windows Phone 8 upgrade, even though it was less than two years old. They punished early adopters.
sofixa
4 hours ago
> In Microsoft's defense, Azure has always been a complete joke. It's extremely developer unfriendly, buggy and overpriced.
Don't forget extremely insecure. There's a critical cross-tenant CVE with trivial exploitation roughly every quarter, and it has been like that for years.
sfn42
6 hours ago
I've only used Azure; to me it seems fine-ish. Some things are rather overcomplicated and it's far from perfect, but I assumed the other providers were similarly complicated and imperfect.
Can't say I've experienced many bugs in there either. It definitely is overpriced, but I assume they all are?
thayne
an hour ago
Unfortunately, that is also typical. I've seen it take longer than that for AWS to update their status page.
The reason is probably that changes to the status page require executive approval, because false positives could lead to bad publicity and potentially to having to reimburse customers for failing to meet SLAs.
ape4
an hour ago
Perhaps they could set the time to when it really started after executive approval.
HeavyStorm
37 minutes ago
I've been on bridges where people _forgot_ to send comms for dozens of minutes. Too many inexperienced people around these days.
imglorp
5 hours ago
That's about how long it took to bubble up three levels of management and then go past the PR and legal teams for approvals.
infaloda
8 hours ago
More importantly:
> 15:45 UTC on 29 October 2025 – Customer impact began.
> 16:04 UTC on 29 October 2025 – Investigation commenced following monitoring alerts being triggered.
A 19-minute delay in alerting is a joke.
Xss3
4 hours ago
That does not say it took 19 minutes for alerts to appear. "Following" could mean any amount of time.
sbergot
8 hours ago
And for a while the status was "there might be issues on Azure portal".
ambentzen
7 hours ago
There might have been, but they didn't know because they couldn't access it. Could have been something totally unrelated.
onionisafruit
10 hours ago
At 16:04 “Investigation commenced”. Then at 16:15 “We began the investigation”. Which is it?
ssss11
8 hours ago
Quick coffee run before we get stuck in, mate.
ozim
8 hours ago
Load some carbs with chocolate chip cookies as well, that’s what I would do.
You don’t want to debug stuff with low sugar.
normie3000
5 hours ago
One crash after another
red-iron-pine
3 hours ago
burn a smoko and take a leak
not_a_bot_4sho
10 hours ago
I read it as the second investigation being specific to AFD, the first more general.
onionisafruit
4 hours ago
I think you’re right. I missed that subtlety on first reading.
oofbey
an hour ago
“Our protection mechanisms, to validate and block any erroneous deployments, failed due to a software defect which allowed the deployment to bypass safety validations.”
Very circular way of saying “the validator didn’t do its job”. This is AFAICT a pretty fundamental root cause of the issue.
It’s never good enough to have a validator check the content and hope that it finds all the issues. Validators are great and can speed a lot of things up, but because they are independent code paths they will always miss something. For critical services you have to assume the validator will be wrong, and be prepared to contain the damage WHEN it is wrong.
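To make that concrete, here's a minimal sketch of what "contain the damage when the validator is wrong" can look like: validation as one layer, plus a staged rollout with health checks and automatic rollback as the containment layer. All the names here (REGIONS, validate_config, deploy_to, health_check, rollback) are hypothetical illustrations, not anything from AFD's actual pipeline.

    # Hypothetical sketch: the validator is treated as fallible, and a staged
    # rollout with health checks limits the blast radius when it misses a bad
    # config. None of these names are real Azure/AFD APIs.

    REGIONS = ["canary", "region-1", "region-2", "region-3"]

    def validate_config(config: dict) -> bool:
        # Independent validation path; assume it can be wrong (that's the point).
        return "routes" in config

    def deploy_to(region: str, config: dict) -> None:
        print(f"deploying to {region}")

    def health_check(region: str) -> bool:
        # In reality: probe error rates / latency after the deploy settles.
        return True

    def rollback(region: str, last_known_good: dict) -> None:
        print(f"rolling back {region} to last known good")

    def staged_rollout(config: dict, last_known_good: dict) -> bool:
        if not validate_config(config):
            return False  # cheap early exit when the validator does catch it

        deployed = []
        for region in REGIONS:
            deploy_to(region, config)
            deployed.append(region)
            if not health_check(region):
                # The validator missed something: revert only the regions
                # touched so far instead of letting it reach the whole fleet.
                for r in deployed:
                    rollback(r, last_known_good)
                return False
        return True

    if __name__ == "__main__":
        staged_rollout({"routes": []}, last_known_good={"routes": ["previous"]})

The canary stage is doing the real work: it turns "the validator must be perfect" into "a bad change can hurt at most one stage before it gets rolled back", which is exactly the kind of containment the PIR suggests was missing.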