gucci-on-fleek
3 months ago
> This showed up to Internet users trying to access our customers' sites as an error page indicating a failure within Cloudflare's network.
As a visitor to random web pages, I definitely appreciated this—much better than their completely false “checking the security of your connection” message.
> The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions
Also appreciate the honesty here.
> On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. […]
> Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal.
Why did this take so long to resolve? I read through the entire article, and I understand why the outage happened, but when most of the network goes down, why wasn't the first step to revert any recent configuration changes, even ones that seem unrelated to the outage? (Or did I just misread something and this was explained somewhere?)
Of course, the correct solution is always obvious in retrospect, and it's impressive that it only took 7 minutes between the start of the outage and the incident being investigated, but a further three hours to restore core traffic and nearly six hours total for everything to be back to normal isn't great.
eastdakota
3 months ago
Because we initially thought it was an attack. And then when we figured it out we didn’t have a way to insert a good file into the queue. And then we needed to reboot processes on (a lot of) machines worldwide to get them to flush their bad files.
gucci-on-fleek
3 months ago
Thanks for the explanation! This definitely reminds me of the CrowdStrike outage last year:
- A product depends on frequent configuration updates to defend against attackers.
- A bad data file is pushed into production.
- The system is unable to easily/automatically recover from bad data files.
(The CrowdStrike outage was quite a bit worse though, since it took down the entire computer and remediation required manual intervention on thousands of desktops, whereas parts of Cloudflare were still usable throughout the outage and the issue was fully resolved within a few hours.) A sketch of what addressing the third point might look like is below.
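To make the third point concrete, here's a minimal sketch of a consumer that keeps serving with its last known-good file when a new one is unusable. The JSON format and field names are guesses for illustration, and the 200-feature cap is the limit mentioned elsewhere in the thread; this isn't how either product actually works:

    import json

    class FeatureConfig:
        """Holds the active feature file; a candidate that fails to load
        or breaks a basic invariant never replaces it."""

        def __init__(self, limit=200):
            self.limit = limit        # hard cap on feature count
            self.active = None        # last known-good config

        def try_update(self, raw_bytes):
            """Return True if the candidate was accepted, False if we
            kept the previous known-good config instead."""
            try:
                candidate = json.loads(raw_bytes)
                features = candidate["features"]
                if len(features) > self.limit:
                    raise ValueError(
                        f"{len(features)} features exceeds limit of {self.limit}")
            except (ValueError, KeyError, TypeError) as err:
                # Degrade to stale rules instead of dropping traffic.
                print(f"rejected new feature file, keeping previous one: {err}")
                return False
            self.active = candidate
            return True

The point being that `active` only ever changes to something that has already loaded successfully, so a bad publish degrades to "slightly stale bot rules" rather than "no traffic".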
cocoa19
3 months ago
It might remind you of Crowdstrike because of the scale.
Outages are, in the large majority of cases, caused by change: either deployments of new versions or configuration changes.
harivyom
3 months ago
Zone your deployments first (blue/green): keep a small blue zone, test the change there, and only expand to the green zones once it works. A rough sketch of what I mean is below.
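Very roughly, the promotion logic I have in mind, with every name and number made up for illustration:

    import time

    # Zones ordered from smallest blast radius to largest (hypothetical names).
    ZONES = ["blue-canary", "green-1", "green-2", "green-rest"]

    def rollout(new, previous, deploy, error_rate,
                threshold=0.01, soak_seconds=300):
        """Push `new` one zone at a time; if a zone's error rate climbs past
        the threshold, restore `previous` there and stop expanding."""
        for zone in ZONES:
            deploy(zone, new)
            time.sleep(soak_seconds)       # let the zone soak before expanding
            if error_rate(zone) > threshold:
                deploy(zone, previous)     # roll that zone back
                return f"halted at {zone}"
        return "fully rolled out"

The obvious tension, as the post mortem quote further down points out, is that a file refreshed every few minutes leaves very little time to soak.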
A configuration file should not grow unbounded! That's a design failure here, and I want to understand it.
tptacek
3 months ago
Richard Cook #18 (and #10) strikes again!
https://how.complexsystems.fail/#18
It'd be fun to read more about how you all procedurally respond to this (but maybe this is just a fixation of mine lately). Like: are you tabletopping this scenario? Are teams building out runbooks for how to quickly resolve this? What's the balancing test for "this needs a functional change to how our distributed systems work" vs. "instead of layering additional complexity on, we should just have a process for quickly, and maybe even speculatively, restoring this part of the system to a known good state in an outage"?
asenchi
3 months ago
This document by Dr. Cook remains _the standard_ for systems failure. Thank you for bringing it into the discussion.
philipwhiuk
3 months ago
Why was WARP in London disabled temporarily? That change wasn't mentioned in the RCA despite being called out in an update.
For London customers this temporarily made the impact more severe.
eastdakota
3 months ago
We incorrectly thought at the time that it was attack traffic coming in via WARP into LHR. In reality, it was just that the failures started showing up there first, because of how the bad file propagated and where in the world it was working hours.
aaronmdjones
3 months ago
Probably because it was the London team that was actively investigating the incident and initially came to the conclusion that it may be a DDoS while being unable to authenticate to their own systems.
noir_lord
3 months ago
Given the time of the outage that makes sense; they’d mostly be within their working day (if such a thing applies to us anymore).
dbetteridge
3 months ago
Question from a casual bystander: why not have a virtual/staging mini node that receives these feature file changes first and catches errors, vetoing the full production push?
Or do you have something like this, but the specific DB permission change in this context only failed in production?
forsalebypwner
3 months ago
I think the reason for this is the nature of the file being pushed; from the post mortem:
"This feature file is refreshed every few minutes and published to our entire network and allows us to react to variations in traffic flows across the Internet. It allows us to react to new types of bots and new bot attacks. So it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly."
gizmo686
3 months ago
In this case, the file fails quickly. A pretest that consists of just attempting to load the file would have caught it. Minutes is more than enough time to perform such a check.
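i.e. something as simple as the sketch below, run against the candidate file before it is published. The file format and field names here are guesses for illustration, and the 200-feature cap is the limit discussed elsewhere in the thread:

    import json
    import sys

    MAX_FEATURES = 200   # the feature cap mentioned elsewhere in the thread

    def preflight(path):
        """Load the candidate feature file the same way the consumer
        would; any exception here should fail the publish step."""
        with open(path, "rb") as f:
            data = json.load(f)
        features = data["features"]
        assert len(features) <= MAX_FEATURES, (
            f"{len(features)} features > {MAX_FEATURES}")
        names = [feat["name"] for feat in features]
        assert len(names) == len(set(names)), "duplicate feature names"

    if __name__ == "__main__":
        preflight(sys.argv[1])   # any exception gives a non-zero exit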
prawn
3 months ago
Just asking out of curiosity, but roughly how many staff would've been involved in some way in sorting out the issue? Either outside regular hours or redirected from their planned work?
mindentropy
3 months ago
Is there some way to check the sanity of the configuration change, monitor it, and then revert to an earlier working configuration if things don't work out?
tetec1
3 months ago
Yeah, I can imagine that this insertion was some high-pressure job.
globalise83
3 months ago
The computer science equivalent of choosing between the red, green and blue wires when disarming a nuke with 15 seconds left on the clock
dylan604
3 months ago
Is it though? Or is it an "oh, this is such a simple change that we really don't need to test it" attitude? I'm not saying this applies to TFA, but some people are so confident that no pressure is felt.
However, you forgot that the lighting conditions are such that only the red lights from the klaxons are showing, so you really can't differentiate the colors of the wires.
hbbio
3 months ago
Thx for the explanation!
Side thought, as we're working on 100% onchain systems (for digital asset security, so different goals):
Public chains (e.g. EVMs) can act as a tamper-evident gate that only promotes a new config artifact if (a) a review delay has elapsed or a multi-sig review has passed, and (b) a succinct proof shows the artifact satisfies safety invariants: ≤200 features, deduplicated, schema X, etc.
That could have blocked propagation of the oversized file long before it reached the edge :)
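The gate itself is simple; in pseudo-Python it's roughly the predicate below. Every constant and field name is a placeholder, and in practice condition (b) would be a succinct proof verified by a contract rather than a re-run of the check:

    import time

    REVIEW_DELAY = 3600      # seconds an artifact must wait without approvals
    REQUIRED_SIGNERS = 2     # enough multi-sig approvals to skip the delay
    MAX_FEATURES = 200

    def invariants_hold(artifact):
        """The property a succinct proof would attest to."""
        features = artifact["features"]
        names = [f["name"] for f in features]
        return (len(features) <= MAX_FEATURES
                and len(names) == len(set(names))                          # deduped
                and all(f.keys() == {"name", "type"} for f in features))   # "schema X"

    def may_promote(artifact, submitted_at, approvals):
        reviewed = (time.time() - submitted_at >= REVIEW_DELAY
                    or len(approvals) >= REQUIRED_SIGNERS)
        return reviewed and invariants_hold(artifact)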
chrismorgan
3 months ago
> much better than their completely false “checking the security of your connection” message
The exact wording (which I can easily find, because a good chunk of the internet gives it to me, because I’m on Indian broadband):
> example.com needs to review the security of your connection before proceeding.
It bothers me how this bald-faced lie of a wording has persisted.
(The “Verify you are human by completing the action below.” / “Verify you are human” checkbox is also pretty false, as ticking the box in no way verifies you are human, but that feels slightly less disingenuous.)
eastdakota
3 months ago
Next time open your dev console in your window and look at how much is going on in the background.
jpadkins
3 months ago
Checking the checkbox does verify you are human, for the most part.