hackernews client

Cloudflare outage should not have happened

144 pointsposted 18 hours ago

(ebellani.github.io)

244 Comments

vessenes

17 hours ago

"If they had a perfectly normalized database, no NULLing and formally verified code, this bug would not have happened."

That may be. What's not specified there is the immense, immense cost of driving a dev org on those terms. It limits, radically, the percent of engineers you can hire (to those who understand this and are willing to work this way), and it slows deployment radically.

Cloudflare may well need to transition to this sort of engineering culture, but there is no doubt that they would not be in the position they are in if they started with this culture -- they would have been too slow to capture the market.

I think critiques that have actionable plans for real dev teams are likely to be more useful than what, to me, reads as a sort of complaint from an ivory tower. Culture matters, shipping speed matters, quality matters, team DNA matters. That's what makes this stuff hard (and interesting!)

dfabulich

17 hours ago

That's entirely right. Products have to transition from fast-moving exploratory products to boring infrastructure. We have different goals and expectations for an ecommerce web app vs. a database, or a database vs. the software controlling an insulin pump.

Having said that, at this point, Cloudflare's core DDOS-protection proxy should now be built more like an insulin pump than like a web app. This thing needs to never go down worldwide, much more than it needs to ship a new feature fast.

jacquesm

17 hours ago

Precisely. This is key infrastructure we're talking about not some kind of webshop.

simlevesque

17 hours ago

Yeah but the anti-DDOS feature needs to react to new methods all the time, it's not a static thing you build once and it works forever.

An insulin pump is very different. Your human body, insulin, and physics aren't changing any time soon.

jacquesm

17 hours ago

You are simplifying the control software of an insulin point to a degree that does not match reality. I'm saying that because I actually reviewed the code of one and the amount of safety consciousness on display there was off the charts compared to what you usually encounter in typical web development. You also under-estimate the dynamic nature of the environment these pumps operate in as well as the amount of contingency planning that they embody, failure modes of each and every part in the pump were taken into consideration, and there are more such parts that you are most likely aware of. This includes material defects, defects as a result from abuse, wear & tear, parts being simply out of spec and so on.

To see this as the typical firmware that ships with say a calculator or a watch is to diminish the accomplishment considerably.

mikestorrent

11 hours ago

Thanks for spelling that out. It's so often tempting to be reductionist about things, but there is often a tremendous amount of thankless engineering inside products that we are privileged to consider as being somewhat boring. It takes a lot of work to make something so dynamic and life-critical and make it reliable enough to be considered simple, when it's anything but.

syockit

7 hours ago

The point still stands. The human body still isn't going change. That's why insulin pump can afford to have all kinds of rigorous engineering, while web-facing infrastructure on the other hand needs to be able to quickly adapt to changes.

jacquesm

2 hours ago

> That's why insulin pump can afford to have all kinds of rigorous engineering, while web-facing infrastructure on the other hand needs to be able to quickly adapt to changes.

The only reason we have a web in the first place is because of rigorous engineering. The whole thing was meant to be decentralized, if you're going to purposefully centralize a critical feature you are not going to get away with 'oh we need to quickly adapt to changes so let's abandon rigor'.

That's just irresponsible. In that case we'd be better off without CF. And I don't see CF arguing this, in fact I'm pretty sure that CF would be more than happy to expend the extra cycles so maybe stop attempting to make them look bad?

joshuamorton

16 hours ago

I had a former coworker who moved from the medical device industry to similar-to-cloudflare-web software. While he had some appreciation for the validation and intense QA they did (they didn't use formal methods, just heavy QA and deep specs), it became very clear to him very clearly that those approaches don't work with speed-of-release as a concern (his development cycles were annual, not weekly or daily). And they absolutely don't work in contexts where user-abuse or reactivity are necessary. The contexts are just totally different.

jacquesm

2 hours ago

It is perfectly possible to engineer for faster cycles without losing control over what your code can and can not do. It is harder, for sure. But I do not think it is a matter of this absolutely not working, that's black-and-white and it never is black and white, it is always some shade of gray.

For instance: validating a configuration before loading it is fairly standard practice, as are smoke tests and gradual roll-outs. Configuration fuck-ups are fairly common so you engineer with that in mind.

MichaelZuo

10 hours ago

If humans beings had a small chance to transform into say quadrupeds or suddenly grow tenatacles, extra hearts, organs, etc., in any given year… then wouldn’t designing a safe insulin pump literally be impossible?

jacquesm

2 hours ago

Compared to what they are already doing it would be marginally more difficult.

saghm

16 hours ago

All the more reason to be careful about relying on humans to avoid making mistakes when changing it rather than moving quickly and letting things fail in production.

wathef

17 hours ago

an insulin pump is a good metaphor; insulin as a hormone has a lot of interactions and the pump itself, if not wanting to unalive its user, should (most do not) account for external variables, such as: exercise, heart rate, sickness, etc. these variables are left for the user to deal with, and in this case, is a subpar experience in managing a condition.

Aperocky

17 hours ago

> This thing needs to never go down worldwide

Quantity introduce a quality all of its own in terms of maintenance.

bambax

17 hours ago

But does "formally verified code" really go in the same bag as "normalized database" and ensuring data integrity at the database level? The former is immensely complex and difficult; the other two are more like sound engineering principles?

necovek

17 hours ago

This bug might not have, but others would. Formal verification methods still rely on humans to input the formal specification, which is where problems happen.

As others point out, if they didn't really ship fast, they certainly would not have become profitable, and they would definitely not have captured the market to the extent they have.

But really, if the market was more distributed, and Cloudflare commanded 5% of the web as the biggest player, any single outage would have been limited in impact. So it's also about market behaviour: yet "nobody is fired for choosing IBM" as it used to go 40 years ago.

mosura

17 hours ago

Software people, especially coming through Rust, are falling into the old trap of believing if code is bug free it is reliable: it isn’t because there is a world of faults outside, including but not limited to the developer intentions.

This inverts everything because structuring to be fault tolerant, of the right things, changes what is a good idea almost entirely.

kjgkjhfkjf

8 hours ago

To be fair to Rust, the issue was an "unwrap" in the Rust code[0]. "unwrap" means "if the operation did not succeed then panic". Production Rust code should not use "unwrap", and should instead have logic to handle the failure case.

You don't need exotic formal verification methods to enforce this best practice. You just need a linter.

[0] https://blog.cloudflare.com/18-november-2025-outage/#memory-...

ViewTrick1002

16 hours ago

Rust generally forces you to acknowledge these faults. The problem is managing them in a sane way, which for Rust in many cases simply is failing loudly.

Compared to than many other languages which preferring chugging along and hoping that no downstream corruption happens.

gishh

12 hours ago

One of the backbones of the modern internet failed. Specifically, code written in rust failed.

The internet had a brown out because one of the most utilized companies on the web had a bug in their rust codebase. You’re excusing that away.

The amount of copium here is kind of embarrassing.

JuniperMesos

10 hours ago

There were several different components internal to Cloudflare that failed in a complex distributed systems context; the Rust failure is garnering more attention partially because it was a very legible failure, which also makes it easy for Cloudflare to fix this bug and all similar bugs quickly. The Cloudflare postmortem is a pro-Rust argument. It's also an argument that too many institutions rely on Cloudflare, which is a harder problem to solve.

saati

11 hours ago

Was it a memory error or a data race? No. Rust only promises that those won't happen in safe Rust. What is embarrassing is trying to pin this on a specific programming language.

mikestorrent

11 hours ago

What is embarrassing is that a language with a culture hell-bent on dominating the internet through largely unnecessary rewrites of existing tooling with the small justification of being "more secure" ended up being the culprit behind something of this scale.

It'd be different if this was in Ruby or PHP where nobody ever made any strong promises about safety and security. It's in the language-du-jour, though, and so it's ripe for critique.

(TBQH, Rustaceans, rewriting the GNU core utils is what showed me y'all just don't get it and are children playing among the ruins. In the end, you'll still have the same unix trash we have now.)

bigstrat2003

6 hours ago

This has nothing to do with the language, and it's so irritating to see people falsely claiming it is. There is nothing whatsoever about Rust that meant the engineer had to write code to the effect of

  if result.is_err() {
    panic!()
  }

That was a choice on the engineer's part, not something caused by the language. You could choose to write that code in any language. It might even be the right choice sometimes! But whether or not it was the right choice, the fact remains that responsibility stops with the programmer(s) who decided to have that code, not somehow with the language.

drob518

an hour ago

Agreed. A tool may allow the programmer to do something (with varying degrees of difficulty), but it’s always the programmer’s choice. Tools are inert by themselves. Only humans make choices.

saati

11 hours ago

But it wasn't the culprit, the code could have been in anything, or could have bubbled up errors to main, and it still would have failed with for an incorrect config.

gishh

10 hours ago

Right. So the language that espoused to eliminate errors that took down large positions of the internet, failed.

The specifics matter of course, but the mantra of rust as some safe language that should never have allowed something like this to happen, happened.

I vote we rename rust to “rustantic” in honor of human hubris.

jazzyjackson

10 hours ago

I just don't think you have the dunk you think you do. The Rust crowd is very adamant about preventing /many/ bugs. I rarely hear it recommended as a silver bullet that never fails.

igregoryca

8 hours ago

The only languages that eliminate logic bugs are formally verified ones, as the article points out. (And even then, your program is only as correct as your specification.) Ordinary Rust code is not formally verified. Anyone who claims Rust eliminates errors is either very naive or lying.

Type-safe Rust code is free from certain classes of errors. But that goes out the window the moment you parse input from the outside, because Rust types can enforce invariants (i.e. internal consistency), but input has no invariants. Rust doesn't ban you from crashing the program if you see input that violates an invariant. I don't know of any mainstream language that forbids crashing the program. (Maybe something like Ada? Not sure.)

I don't understand why you bemoan that Rust hasn't solved this problem, because it seems nigh unsolvable.

jacquesm

17 hours ago

When you're powering this large a fraction of the internet is it even an option not to work like that? You'd think that with that kind of market cap resource constraints should no longer be holding you back from doing things properly.

hdgvhicv

2 hours ago

The bug the fix here is the “powering a large fraction of the internet”

The lack of diversity is a major problem.

jacquesm

21 minutes ago

the -> to ?

And yes, agreed.

frumplestlatz

17 hours ago

I work in formal verification at a FAANG.

It is so wildly more expensive than traditional development that it is simply not feasible to apply it anywhere but absolutely the most critical paths, and even then, the properties asserted by formal verification are often quite a bit less powerful than necessary to truly guarantee something useful.

I want formal verification everywhere. I believe in provable correctness. I wish we could hire people capable of always writing software to that standard and maintaining those proofs alongside their work.

We really can’t, though. Its a frustrating reality of being human — we know how to do it better, but nearly all of even the smartest engineers we can hire are not smart enough.

dpark

17 hours ago

> we know how to do it better, but nearly all of even the smartest engineers we can hire are not smart enough.

This seems like a contradiction. If the smartest engineers you can hire are not smart enough to work within formal verification constraints then we in fact do not know how to do this.

If formal verification hinges on having perfect engineers then it’s useless because perfect engineers wouldn’t need formal verification.

trillic

14 hours ago

It’s not that we can’t do it, it’s that higher-velocity occasionally buggy code has proven time and time again to be significantly more profitable than formally verified. The juice is rarely worth the squeeze.

drob518

an hour ago

Agreed. Further, this has been true even ignoring formal verification. Who has been in the situation of making the choice to ship known-buggy code to make a release date or satisfy a customer demand for other functionality? All of us, I suspect, if we’re being honest. I certainly have.

dpark

13 hours ago

I generally agree with your assessment. But frumplestlatz also says that literally their smartest engineers are not smart enough to do formal verification.

frumplestlatz

15 hours ago

> If formal verification hinges on having perfect engineers then it’s useless because perfect engineers wouldn’t need formal verification.

It doesn’t hinge on having perfect engineers.

It hinges on engineers being able to model problems algebraically and completely, prove the equivalence of multiple such models at different layers of abstraction (including equivalence to the original code), and then prove that useful properties hold across those models.

dpark

15 hours ago

If the smartest engineers cannot do it, it doesn’t work.

This isn’t even getting to the practical question of whether it’s worth doing, given the significant additional cost. If the smartest folks you can find are not smart enough to use the framework then it’s useless.

Maybe this means the tooling is insufficient. Maybe it means the field isn’t mature enough. Whatever, if you need an IQ two standard deviations above normal and 10x as long it’s not real world useable today.

Nevermark

10 hours ago

> If the smartest engineers cannot do it, it doesn’t work.

[FIX:] ..., it doesn't work universally.

And the answer to that is pretty clear. It does not work universally. If every developer started only shipping code they had credibly formally verified, the vast majority of developers would go into shock at the scale of work to be done. Even the best "validators" would fall into career shredding pits, due to "minor" but now insurmountable dependencies in previously unverified projects. The vast majority of projects would go into unrecoverable stalls.

But formal validation can still work some of the time with the right people, on the right scale and kind of project, with the right amount of resources/time expended.

It isn't as if regular "best practices" programming works universally either. But validation is much harder.

drob518

an hour ago

> But formal validation can still work some of the time with the right people, on the right scale and kind of project, with the right amount of resources/time expended.

The problem is, it’s unclear exactly what those situations are or even should be. That lack of clarity causes us to fail to recognize when we could have applied these methods and so we just don’t. As much as I see value in formal methods, I’ve never worked with a team that has employed them. And I don’t think I’m at all unique in that.

jacquesm

15 hours ago

The big trick is - as far as I understand it - to acknowledge that systems fail and to engineer for dealing with those failures.

I support your efforts downthread for at least knowing whether or not underlying abstractions are able to generate a panic (which is a massive side effect) or are only able to return valid results or error flags. The higher level the abstraction the bigger the chance that there is a module somewhere in the stack that is able to blow it all up, at the highest level you can pretty much take it as read that this is the case.

So unless you engineer the whole thing from the ground up without any library modules it is impossible to guarantee that this is not the case and as far as I understand your argument you at least want to be informed when that is the case, or, alternatively, to cause the compiler to flag the situation down from your code as incompatible with the guarantees that you are asking for, is that a correct reading?

jacquesm

17 hours ago

Ok, let's start off with holding them to the same standards as avionics software development. The formal verification can wait.

dpark

17 hours ago

I don’t understand why anyone should want this. Why should normal engineering efforts be held to the same standards as life-critical systems? Why would anyone expect that CloudFlare DDoS protection be built to the standards of avionics equipment?

Also if we’re being fair, avionics software is far narrower in scope than just “software in general”. And even with that Boeing managed to kill a bunch of people with shitty software.

jacquesm

15 hours ago

> I don’t understand why anyone should want this.

That's ok, but then you should bow out of the conversation, which is between people that do understand why anyone should want this.

To have predictable behavior is a must have in some industries, less in others. At the level of infrastructure that is deemed critical by some - and I'm curious what JGC's position on this is - the ability to avoid this kind of outage carries a lot of value. The fact that you do not see that CF has achieved life-critical reach is one that tells me that most of this effort is probably going to waste, but I trust that John does see it my way and realizes that if there are ways to avoid these kind of issues they should be researched. Because service uptime is something very important to companies like Cloudflare.

Boeing managed to kill a bunch of people with shitty business practices, not with shitty software, the software did what it was built to do. It is the whole process around that software as well as the type certification process and regulatory oversight that failed dramatically.

dpark

14 hours ago

> That's ok, but then you should bow out of the conversation, which is between people that do understand why anyone should want this.

I was not making a statement that I am ignorant. I was saying I believe the proposal to model general software engineering after avionics is misguided and inviting you to clarify your position.

It is certainly valid to ask what CloudFlare or anyone else for that matter could learn from avionics engineering or from NASA or from civil engineering focused on large scale projects or anywhere else that good engineering practices might come from. However, there is a persistent undercurrent in discussions around software reliability and general software engineering that ignore the fact that there are major trade-offs made for different engineering efforts.

“Oh, look how reliable avionics are. We should just copy that.”

Cool, except I would bet avionics cost 100 times as much to build per line of code as anything CloudFlare has ever shipped. The design constraints are just fundamentally different. Avionics are built for a specific purpose in an effectively unchanging environment. If Cloudflare built their offerings in the same way, they would never ship new features, the quality of their request filtering would plummet as adversaries adjusted faster than CloudFlare could react, and realistically they would be overtaken by a competitor within a few years at most. They aren’t building avionics, so they shouldn’t engineer as if they are. Their engineering practices should reflect the reality of the environment in which they are building a product.

This is no different than people who ask, “Why don’t we build software the way we build bridges?” Because we’re not building bridges. Most bridges look exactly like some other bridge that was built 10 miles away. That’s nothing like building new software. That’s far more like deploying a new instance of existing software with slightly different config. And this is not to say that there is nothing for software engineers to learn from bridge building, but reductive “just do it like them” thinking is not useful.

> Boeing managed to kill a bunch of people with shitty business practices, not with shitty software, the software did what it was built to do.

The software was poorly designed. No doubt it was implemented the spec. Does that change the fact that the sum total of the engineering yielded a deadly result? There is no papering over the fact that “building to avionics standards” led direct to the deaths of 346 people in this case.

jacquesm

14 hours ago

> I was not making a statement that I am ignorant.

ok.

> I was saying I believe the proposal to model general software engineering after avionics is misguided and inviting you to clarify your position.

But we are not talking about 'general software engineering', we are talking about Cloudflare specifically and that makes a massive difference.

> It is certainly valid to ask what CloudFlare or anyone else for that matter could learn from avionics engineering or from NASA or from civil engineering focused on large scale projects or anywhere else that good engineering practices might come from. However, there is a persistent undercurrent in discussions around software reliability and general software engineering that ignore the fact that there are major trade-offs made for different engineering efforts.

I think we are all aware of those trade offs. We are focusing on a specific outage here that cost an absolute fortune and that used some very specific technical constructs and we are wondering if there would have been better alternatives either by using different constructs or by using different engineering principles.

> “Oh, look how reliable avionics are. We should just copy that.”

> Cool, except I would bet avionics cost 100 times as much to build per line of code as anything CloudFlare has ever shipped.

And there is a pretty good chance that had they done that that they would have come out ahead.

> The design constraints are just fundamentally different.

Yes, but not quite that different that lessons learned can not be transported. The main reason why aviation is different is because it is a regulated industry and - at least in the past - regulators have teeth, and without their stamp of approval you are simply not taking off with passengers on board.

> Avionics are built for a specific purpose in an effectively unchanging environment.

That is very much not the case. The environment aircraft are subject to are - and increasingly so due to climate change - dynamic to a point that would probably surprise you.

What is not changing is this: the price for unexpected outcomes in that industry is that at some point global air travel will no longer be seen as safe and that once that happens one of the engines behind our economies will start failing. In that sense the differences with Cloudflare are in fact not that large.

> If Cloudflare built their offerings in the same way, they would never ship new features, the quality of their request filtering would plummet as adversaries adjusted faster than CloudFlare could react, and realistically they would be overtaken by a competitor within a few years at most. They aren’t building avionics, so they shouldn’t engineer as if they are. Their engineering practices should reflect the reality of the environment in which they are building a product.

I do not believe that you are correct here. They could, they can afford it and they have reached a scale at which the door is firmly closed against competitors, this is not a two bit start-up anymore.

> This is no different than people who ask, “Why don’t we build software the way we build bridges?” Because we’re not building bridges. Most bridges look exactly like some other bridge that was built 10 miles away. That’s nothing like building new software. That’s far more like deploying a new instance of existing software with slightly different config.

This too does not show deep insight into the kind of engineering that goes into any particular bridge. That they look the same to you is just the outside, the interface. But how a particular bridge is anchored and engineered can be a world of a difference from another bridge in a different soil situation, even if they look identical. The big trick is that they all look like simple constructs, but they're not.

> The software was poorly designed. No doubt it was implemented the spec. Does that change the fact that the sum total of the engineering yielded a deadly result? There is no papering over the fact that “building to avionics standards” led direct to the deaths of 346 people in this case.

That is not what happened and that is not what the outcome of the accident investigation led to conclude.

Boeing fucked up, not some software engineer taking a short-cut. This was a top down managed disaster with multiple attempts to cover up the root cause and a complete failure of regulatory oversight.

dpark

12 hours ago

> I think we are all aware of those trade offs.

I'm not sure about that. This type of conversation tends toward "shit's easy syndrome" with complexities hand waved away and real trade offs given lip service consideration only. With respect to CloudFlare you specifically said "as soon as they become the cause of an outage they have invalidated their whole reason for existence". I don't know how to square black and white statements like that with an understanding of tradeoffs. A lot of companies would (and do) trade the potential for an outage against the ongoing value of CloudFlare's offerings.

> we are wondering if there would have been better alternatives either by using different constructs or by using different engineering principles.

I think what was actually said was "let's start off with holding them to the same standards as avionics software development". Not so much inquisitive as "shit's easy".

> And there is a pretty good chance that had they done that that they would have come out ahead.

How did you reach that conclusion? CloudFlare has taken a stock hit recently. Even if we attribute that 100% to their outage, they are still up 92% over the last year.

For comparison's sake, CloudFlare was founded after the 737 Max started development. I seriously doubt CloudFlare would have achieved its current success by attempting to ape avionics engineering.

> That is very much not the case. The environment aircraft are subject to are - and increasingly so due to climate change - dynamic to a point that would probably surprise you.

Did you honestly think I was referring to the actual weather? A plane built in 1970 will (assuming it's been maintained) still fly today just fine. The design constraints today are essentially the same and there are no adversaries out there changing the weather in a way that Boeing needs to continuously account for.

This is wholly different from CloudFlare, who is actively fighting botnets and other adversaries who are continuously adapting and changing tactics. The closest analog for avionics would probably be nation states that can scramble GPS.

> In that sense the differences with Cloudflare are in fact not that large.

In the sense that both are important and both happen to involve software, sure. In most other ways the differences are in fact very large.

> I do not believe that you are correct here. They could, they can afford it and they have reached a scale at which the door is firmly closed against competitors, this is not a two bit start-up anymore.

You are ignoring the reality of the situation, and it surfaces in self-contradictory statements like this. They have closed the door firmly on competition so now they need to focus on avionics-like engineering? Why? If their moat is unpassable they should just stop development and keep raking in money. The only reason that they even experienced this outage was because they are in continuous development.

The reality is that their moat is not that wide. If their adversaries or their competition outpace them, they could easily lose their customers to AWS or Azure or someone else.

> This too does not show deep insight into the kind of engineering that goes into any particular bridge. That they look the same to you is just the outside, the interface. But how a particular bridge is anchored and engineered can be a world of a difference from another bridge in a different soil situation, even if they look identical. The big trick is that they all look like simple constructs, but they're not.

Forest for the trees... I did not claim that the bridges are actually the same. But how to build foundations, how to span supports, how thick concrete needs to be and how much rebar, these are well established. Yes, there are calculations and designs but civil engineers have done an excellent job of building a large corpus of practical information that allows them to build bridges with confidence. (And this is definitely something we could learn from them.) Rarely are bridges built mostly with custom components that have never been used before.

> Boeing fucked up, not some software engineer taking a short-cut. This was a top down managed disaster with multiple attempts to cover up the root cause and a complete failure of regulatory oversight.

You're trying to hand wave this away as if I am blaming some individual Boeing engineer, but I'm not.

Engineering isn't just coding. Engineering is the planning and the designing and the building and the testing and everything else that makes the product what it is. Boeing created a system to mask the flight characteristics of their new plane, except it didn't actually work. (And also yes they lied to regulators about it.) If it actually worked it those two planes wouldn't have crashed. A product intended to make planes easier to fly is poorly engineered if it actually crashes planes.

khuey

17 hours ago

Are Cloudflare's customers willing to pay avionics software level prices?

jacquesm

17 hours ago

Given that Cloudflare's market cap is 1/2 of Boeing's and they are not making a physical product I would say: Clearly, yes.

vntok

17 hours ago

The vast majority of Cloudflare's "customers" are paying 0 to 20 dollars a month, for virtually the same protection coverage and features as most of their 200 dollars/mo customers. That's not remotely in the realm of avionics price structure, be it software or hardware.

jacquesm

16 hours ago

It is the aggregate they pay that counts here, not the individual payments.

A better comparison would be to compare this to airline passengers paying for their tickets, they pay a few hundred bucks in the expectation that they will arrive at their destination.

Besides, it is not the customers that determine Cloudflare's business model, Cloudflare does. Note that their whole business is to prevent outages and that as soon as they become the cause of an outage they have invalidated their whole reason for existence. Of course you could then turn this into a statistical argument that as long as they prevent more outages than they cause that they are a net benefit but that's not what this discussion is about, it is first and foremost about the standard of development they are held up against.

Ericsson identified similar issues in their offering long ago and created a very capable solution and I'm wondering if that would not have been a better choice for this kind of project, even if it would have resulted in more resource consumption.

dpark

16 hours ago

> as soon as they become the cause of an outage they have invalidated their whole reason for existence

This is a bar no engineering effort has ever met. “If you ever fail, even for a moment, there’s no reason for you to even exist.”

There have been 6 fatal passenger airplane crashes in the US this year alone. NASA only built 6 shuttles and 2 of those exploded, killing their crews. And these were life-preserving systems that failed.

Discussions around software engineering quality always seem to veer into spaces where we assign almost mythic properties to other engineering efforts in an attempt to paint software engineering as lazy or careless.

stoneforger

5 hours ago

The NASA example should highlight the normalisation of deviance. The Challenger o-rings had failed before and while engineers were very vocal about that, management overruled them. The foam impacts and tile loss were also a known factor in the Columbia disaster but the abort window is very small. Both point to perverse incentives: maintaining the gravy train. One comment made the point earlier that if Cloudflare were more thorough they would not have captured the market because they would be slower. Slow is smooth and smooth is fast but YMMV. At the end of the day everything can be tracked down to a system that incentivizes wealth accumulation over capability with the fixation that capability can be bought which is a lie.

pixl97

12 hours ago

Boeing only makes this class of software quality because they are forced to by law. No one does it unless there is a big expensive legal reason to do so.

jacquesm

12 hours ago

Indeed. But: if we want to call this level of infrastructural work 'software engineering' and the impact of failure is as large as it is then that's an argument for either voluntary application of a higher standard or eventual regulation and I'm pretty sure CF would prefer the former over the latter.

robocat

17 hours ago

Anyone in avionics software dev to give an opinion?

I would presume there's the same issue as parent said:

  It is so wildly more expensive than traditional development that it is simply not feasible to apply it anywhere but absolutely the most critical paths

jacquesm

17 hours ago

> Anyone in avionics software dev to give an opinion?

I've done some for fuel estimation of freighter jets (not quite avionics but close enough to get a sense for the development processes) and the amount of rigor involved in that one project made me a better developer for the rest of my career. Was it slow? Yes, it was very slow. A couple of thousand lines of code, a multiple of that in tests for a very long time compare to what it would normally take me.

But within the full envelope of possible inputs it performed exactly as advertised. The funny thing is that I'm not particularly proud of it, it was the process that kept things running even when my former games programmer mentality would have long ago said 'ship it'.

Some things you just need to do properly, or not at all.

vel0city

11 hours ago

Compare the prices for type certified parts on aircraft compared to comparable (but not proven) similar parts in the automotive space. Its crazy how much more expensive actually proving these parts perform to spec to the level required by aviation law.

It wouldn't surprise me to find doing the same kind of certifications for complex avionics software to be the same.

frumplestlatz

17 hours ago

Agreed.

I left out any commentary on `.unwrap()` from my original comment, but it’s an obvious example of something that should never have appeared in critical code.

lenkite

16 hours ago

ON HN, just a couple of years ago, a famous Rust programmer said that it is OK to use unwrap. Rustaceans supported this position. Cloudflare merely followed the community standard.

Using unwrap() in Rust is Okay

https://news.ycombinator.com/item?id=32385102 https://burntsushi.net/unwrap/

burntsushi

15 hours ago

I already had a conversation with the GP specifically: https://news.ycombinator.com/item?id=45979127

They aren't presenting a coherent philosophy. And when asked for examples, or to engage directly with examples in my blog, they can't or won't do it.

But yes, of course it's okay to use unwrap(). It's just an assertion. Assertions are fine.

frumplestlatz

15 hours ago

Result declares a type-level invariant — an assertion enforced by the compiler, not runtime — that the operation can fail.

Ignoring that is bypassing the type system. It means your types are either wrong, or your type system is incapable of modeling your true invariants.

In the case of the cloudflare error, their types were wrong. That was an avoidable failure. They needed to fix their type-level invariants, not yolo the issue with `.unwrap()`.

Your willful persistent lack of understanding doesn’t mean my philosophy is incoherent. Using `.unwrap()` is always an example of a failure to accurately model your invariants in the type system.

burntsushi

15 hours ago

Your definition of "correct" is completely incoherent. Just because an invariant that could be modeled by a type system is not modeled by the type system in any given scenario does not make it incorrect.

You can't engage with my examples and you provide none of your own. So continuing discussion with you is a waste of time.

frumplestlatz

15 hours ago

Invariants aren’t invariant if they’re variant.

This is literally what “invariant” means, and what a type system is built to model.

Declaring an invariant in the type system that you then violate is not correct code. I truly can’t even begin to guess at why you’re so voracious in your defense of this particularly poor practice.

[edit]

HN rate limits kicking in, so here’s my reply. I work for a FAANG but I’m not going to say which one. You or a relative are, with almost 100% certainty, relying on code written to that philosophy, by me, daily and widely.

burntsushi

15 hours ago

Show me code you've published that is used by real people in real systems that follows the philosophy you've espoused here. Otherwise I'm calling shenanigans.

jacquesm

14 hours ago

You are unnecessarily combative in this thread. I don't know what about the GP it is that ticks you off but they're making a lot of sense to me and I don't see why you would be loudly demanding published code when you are having a conversation about an abstract device.

burntsushi

13 hours ago

If you haven't read my blog on this topic, I suggest you do so before replying further: https://burntsushi.net/unwrap

It should very clearly state my position. And it provides the examples that I previously referenced.

The GP got a link to this blog in the previous HN thread. They dismissed it out-of-hand without engaging with it at all. And tossed in an ad hominem for good measure. So your issue with me specifically here seems completely inappropriate.

burntsushi

13 hours ago

I've presented examples. They haven't. They haven't even bothered to engage with the examples I've provided. I want to read code they've written using this philosophy so that I can see what it looks like in real world usage. Otherwise, the only code I've seen that does something similar uses formal methods. So I simply do not believe that this is practical advice for most programming.

Insisting on examples and evidence to support an argument isn't combative. It's appropriate when extraordinary claims are being made.

If you've published code using this philosophy that is used by real people in real systems, then I would be happy to take a look at that as well. If it exists, I would bet it's in a niche of a niche.

I've had these arguments before about this very topic. Some people have taken me up on this request and actually provided examples. And in 100% of those cases, it turned out there was a mismatch between what they were saying and what the code was doing.

zbentley

7 hours ago

> I work for a FAANG but I’m not going to say which one. You or a relative are, with almost 100% certainty, relying on code written to that philosophy, by me, daily and widely.

Cool story bro.

Like, even interpreted maximally charitably, your statement still doesn’t provide GP’s requested published code. Not “take my word for it” ostensibly deployed software—code; the discussion here is about code constructs for modeling invariants, not solely about runtime behavior.

I’d be interested to see that code discussed in context of the blog post GP linked, which seems to make a compelling argument.

frumplestlatz

7 hours ago

I am enjoined from providing that, and it’d be idiotic to risk my career for an HN ****-measuring contest. If one can’t understand these concepts without example code then this probably isn’t a discussion one can meaningfully contribute to.

Not being able to envision how it is in fact possible to write code with these invariants encoded in the type system is a fundamental fault in one’s ability to reason about this topic, and software correctness in general, in the first place.

zbentley

6 hours ago

> Not being able to envision how it is in fact possible to write code with these invariants encoded in the type system is a fundamental fault in one’s ability to reason about this topic, and software correctness in general, in the first place.

Code proving that it’s possible to avoid branching into an abort (the concept, not necessarily the syscall) was not what the original GP requested. Nor was a copy of your employer’s IP. Published examples which demonstrate how real-world code which intentionally calls panic() could be better written otherwise was my interpretation of the request.

And I’m requesting that, too, because I am interested in learning more about it! Please don’t assume I’m asking out of inexperience with safety critical systems, dick-measuring, faulty reasoning ability, or unfamiliarity with using type systems to avoid runtime errors (where—and this is the source of this discussion—practical and appropriate). If you work on your tone, that would make it much easier to have educating discussions in contexts like this.

dragonwriter

6 hours ago

> Result declares a type-level invariant — an assertion enforced by the compiler, not runtime — that the operation can fail.

“Can do X” is not an invariant. “Will never do X” (or “Will always do Y”) is an invariant. “Can do X” is the absence of the invariant “Will never do X”.

> Using `.unwrap()` is always an example of a failure to accurately model your invariants in the type system.

No, using .unwrap() provides a narrower invariant to subsequent code by choosing to crash the process via a panic if the Result contains an Error.

It may be a poor choice in some circumstances, and it may be a result of mistakenly believing that code returning the Result itself had failed to represent its invariants fully such that the .unwrap() would be a noop—but even there it respects and narrows the invariant declared, it doesn't ignore it—and, in any case, as it has well-defined behavior in either of the possible input cases, it is silly to describe using it as a failure accurately model invariants in the type system.

frumplestlatz

6 hours ago

“Narrowing” a compile-time invariant without a corresponding proof is formally unsound and does not “respect” the declared invariant in any reasonable sense.

What’s silly is the desire to pretend otherwise because it’s easier.

dragonwriter

6 hours ago

> “Narrowing” a compile-time invariant without a corresponding proof is formally unsound and does not “respect” the declared invariant in any reasonable sense

The invariant is that either condition X applies or condition Y applies. "Panic and stop execution if X, continue execution with the invariant Y if Y" is not unsound and does respect the original invariant in every possible sense.

It may be the wrong choice of behavior given the frequency of X occurring and the costs incurred by the decision to panic, but that’s not a type-level problem.

zbentley

6 hours ago

Formal verification is well and good, but that is not what unsoundness means.

If a proof trivially demonstrated that a given program’s behavior was indeed “proceed if a condition is satisfied, crash otherwise”, then what? Or do we not trust the verifier with branching code all of a sudden?

vablings

12 hours ago

Also, in this use case catching the panic and completely forgetting that the function was ever called in the first place is completely acceptable. In web frameworks such as Dioxus/Axum if your users request causes a panic it does not bring down the whole web server it just invalidates that specific request

antonvs

17 hours ago

And it's so easy to avoid, as well.

    #![deny(clippy::unwrap_used)]

    cargo clippy -- -D clippy::unwrap_used

Put that in your CI pipeline, and voila. Global crash averted.

echelon

17 hours ago

Rust needs to get rid of .unwrap() and its kin. They're from pre-1.0 Rust, before many of the type system features and error handling syntax sugar were added.

There's no reason to use them as the language provides lots of safer alternatives. If you do want to trigger a panic, you can, but I'd also ask - why?

Alternatively, and perhaps even better, Rust needs a way to mark functions that can panic for any reason other than malloc failures. Any function that then calls a panicky function needs to be similarly marked. In doing this, we can statically be certain no such methods are called if we want to be rid of the behavior.

Perhaps something like:

    panic fn my_panicky_function() {
      None.unwrap(); // NB: `unwrap()` is also marked `panic` in stdlib 
    }

    fn my_safe_function() {
      // with a certain compiler or Crates flag, this would fail to compile
      // as my_safe_function isn't annotated as `panic`
      my_panicky_function() 
    }

The ideal future would be to have code that is 100% panic free.

svieira

15 hours ago

All that means is that the `Failure` bubbles up to the very top of `main` (in this scenario) because we're only caring about the happy path (because we can't conceive of what the unhappy path should be other than "crash") and then hits the `panic("Well, that's unexpected")` explicitly in Place B rather than Place A (the `.unwrap`). I'm not sure how that's _better_.

jacquesm

15 hours ago

It would not because it would be a compile time error rather than run time error which is a completely different beast if I understand the argument correctly.

Dylan16807

11 hours ago

What would be a compile time error? The compiler rejecting unwrap? And then you fix that by bubbling the error case up, which fixes the compiler error and leaves you with a runtime error again. But one that's less ergonomic.

You can't force a config file loaded at run time to be correct at compile time. You can only decide what you're going to do about the failure.

jacquesm

11 hours ago

The point they are - trying, apparently - making is that if you had a flag or an annotation that you could make to a function that you do not want that function to be built on top of anything that can 'unwrap' that you can rule out some of these cases of unexpected side effects.

echelon

15 hours ago

Not really. Handler and middleware can handle this without much ceremony. The user gets to, and is informed of and encouraged to, choose.

We also don't get surprised at runtime. It's in the AST and we know at compile time.

The right API signature helps the engineer think about things and puts them in the correct headspace for systems thinking. If something is panicking under the hood, the thought probably doesn't even occur to them.

svieira

8 hours ago

Yes, but my point is that without a reasonable supervision tree and crash boundary the difference between a composition of Result-returning functions that bottoms out in main's implicit panic and an explicit panic is nil operationally.

While lexically the unwrap actually puts the unhandledness of the error case as close to the source of the issue's source as possible. In order to get that lexical goodness you'd need something much more fine grained than Result.

12_throw_away

16 hours ago

> There's no reason to use [panics] as the language provides lots of safer alternatives.

Dunno ... I think runtime assertions and the ability to crash a misbehaving program are a pretty important part of the toolset. If rust required `Result`s to be wired up up and down the entire call tree for the privilege of using a runtime assertion, I think it would be a lot less popular, and probably less safe in practice.

> Alternatively, and perhaps even better, Rust needs a way to mark functions that can panic for any reason other than malloc failures.

I 100% agree that a mechanism to prove that code can or cannot panic would be great, but why would malloc be special here? Folks who are serious about preventing panics will generally use `no-std` in order to prevent malloc in the first place.

zbentley

7 hours ago

> a mechanism to prove that code can or cannot panic would be great

As appealing as the idea of a #[cfg(nopanic)] enforcement mechanism is, I think linting for panic() is the optimum, actually.

With a more rigidly enforced nopanic guarantee, I worry that some code and coders would start to rely on it (informally, accidentally, or out of ignorance) as a guarantee of completion, not return behavior. And that’s bad; adding language features which can easily be misconstrued to obscure the fact that all programs can terminate at any time is dangerous.

Lints, on the other hand, can be loud and enforced (and tools to recursively lint source-available dependencies exist), but few people mistake them for runtime behavior enforcement.

echelon

16 hours ago

> I 100% agree that a mechanism to prove that code can or cannot panic would be great, but why would malloc be special here? Folks who are serious about preventing panics will generally use `no-std` in order to prevent malloc in the first place.

In one of the domains I work in, a malloc failure and OOMkill are equivalent. We just restart the container. I've done all the memory pressure measurement ahead of time and reasonably understand how the system will behave under load. Ideally it should never happen because we pay attention to this and provision with lots of overhead capacity, failover, etc. We have slow spillover rather than instantaneous catastrophe. Then there's instrumentation, metrics, and alerting.

A surprise bug in my code or a dependency that causes an unexpected panic might cause my application or cluster to restart in ways we cannot predict or monitor. And it can happen across hundreds of application instances all at once. There won't be advanced notice, and we won't have a smoking gun. We might waste hours looking for it. It could be as simple as ingesting a pubsub message and calling unwrap(). Imagine an critical service layer doing this all at once, which in turn kills downstream services, thundering herds of flailing services, etc. - now your entire company is on fire, everyone is being paged, and folks are just trying to make sense of it.

The fact is that the type of bugs that might trigger a user-induced panic might be hidden for a long time and then strike immediately with millions of dollars of consequences.

Maybe the team you implemented an RPC for six months ago changes their message protocol by flipping a flag. Or maybe you start publishing keys with encoded data center affinity bytes, but the schema changed, and the library that is supposed to handle routing did an unwrap() against a topology it doesn't understand - oops! Maybe the new version handles it, but you have older versions deployed that won't handle it gracefully.

These failures tend to sneak up on you, then happen all at once, across the entire service, leaving you with no redundancy. If you ingest a message that causes every instance to death spiral, you're screwed. Then you've got to hope your logging can help you find it quickly. And maybe it's not a simple roll back to resolve. And we know how long Rust takes to build...

The best tool for this surely can't be just a lint? In a supposedly "safe" language? And with no way to screen dependencies?

Just because somebody's use case for Rust is okay with this behavior doesn't mean everyone's tolerates this. Distributed systems folks would greatly appreciate some control over this.

All I'm asking for is tools to help us minimize the surface area for panics. We need as much control over this as we can get.

Dylan16807

11 hours ago

If you replace panic with a bespoke fallback or retry, have you really gained anything? You can still have all your services die at the same time, and you'll have even less of a smoking gun since you won't have a thousand stack traces pointing at the same line.

The core issue is that resilience to errors is hard, and you can't avoid that via choice of panic versus non-panic equivalents.

bigstrat2003

6 hours ago

Unwrap is not only fine, it's a valuable part of the language. Getting rid of it would be a horrible change. What needs to happen is not using an assert (which is really what unwrap is) if an application can't afford to crash.

jacquesm

16 hours ago

I'd say the equivalent of Erlang's supervisor trees is what is needed but once you go that route you might as well use Erlang.

zbentley

7 hours ago

I’m not sure that panic (speaking generally about the majority of its uses and the spirit of the law; obviously 100% of code does not obey this) is the equivalent of an Erlang process crash in most cases. Rather, I think unwrap()/panic are usually used in ways more similar to erlang:halt/1.

ViewTrick1002

16 hours ago

Or just deploy containers with an orchestrator restarting them when failing?

It is not like an Erlang service would be able to make progress with an invalid config either.

jacquesm

16 hours ago

That's fair, but even there the roll-back would be a lot smoother, besides the supervisor trees are a lot more fine grained than restarting entire containers when they fail.

lenkite

16 hours ago

What happens when they "keep" failing ? You never get to know what is causing your nightmare.

burntsushi

15 hours ago

I'm on libs-api. We will never get rid of unwrap(). It is absolutely okay to use unwrap(). It's just an assertion. Assertions appear in critical code all the time, including the standard library. Just like it's okay to use `slice[i]`.

echelon

15 hours ago

This is the Hundred Billion Dollar unwrap() Bug.

You can keep unwrap() and panics. I just want a static first class method to ensure it never winds up in our code or in the dependencies we consume.

I have personally been involved in nearly a billion dollars of outages myself and am telling you there are simple things the language can do to help users purge their code of this.

This is a Rust foot gun.

A simple annotation and compiler flag to disallow would suffice. It needs to handle both my code and my dependencies. We can build it ourselves as a hack, but it will never be 100% correct.

This is why I want it:

https://news.ycombinator.com/item?id=46060907

burntsushi

15 hours ago

You said:

> Rust needs to get rid of .unwrap() and its kin.

Now you say:

> You can keep unwrap() and panics.

So which is it?

> I just want a static first class method to ensure it never winds up in our code or in the dependencies we consume.

Now this is absolutely a reasonable request. But it's not an easy one to provide depending on how you go about it. For example, I'd expect your suggestion in your other comment to be a non-starter because of the impact it will have on language complexity. But that doesn't mean there isn't a better way. (I just don't know what it is.)

This is a classic motte and bailey. You come out with a bombastic claim like "remove unwrap and its ilk," but when confronted, you retreat to the far more reasonable, "I just want tools to detect and prevent panicking branches." If you had said the latter, I wouldn't have even responded to you. I wouldn't have even batted an eye.

> This is the Hundred Billion Dollar unwrap() Bug.

The Cloudflare bug wasn't even caused by unwrap(). unwrap() is just its manifestation. From a Cloudflare employee:

> In this case the unwrap() was only a symptom of an already bad state causing an error that the service couldn't recover from. This would have been as much of an unrecoverable error if it was reported in any other way. The mechanisms needed to either prevent it or recover are much more nuanced than just whether it's an unwrap or Result.

gishh

11 hours ago

> unwrap() was only a symptom of an already bad state causing an error that the service couldn't recover from. This would have been as much of an unrecoverable error if it was reported in any other way. The mechanisms needed to either prevent it or recover are much more nuanced than just whether it's an unwrap or Result.

This sounds like the kind of failure Bobby Tables warned about a long time ago. An entire new, safe language was developed to prevent these kinds of failures. “If it compiles it’s probably correct” seems to be the mantra of rust. Nuts.

burntsushi

11 hours ago

The fact that this wasn't RCE or anything other than denial of service is a raging success of Rust.

“If it compiles it’s probably correct” has always been a tongue-in-cheek pithy exaggeration. I heard it among Haskell programmers long before I heard it in the context of Rust. And guess what? Haskell programs have bugs too.

gishh

8 hours ago

> “If it compiles it’s probably correct” has always been a tongue-in-cheek pithy exaggeration.

If you say so, I believe you. That isn’t how it comes across in daily, granted pithy, discourse around here.

I have a lot of respect for you Andrew, not meaning to attack you per se. You surely can see the irony in the internet falling over because of an app written in rust, and all that comes with this whole story, no?

burntsushi

7 hours ago

Nope. Because you've completely mischaracterized not only the actual problem here, but the value proposition of Rust. You're tilting at windmills.

Nobody credible has ever said that Rust will fix all your problems 100% of the time. If that's what you inferred was being sold based on random HN commentary, then you probably want to revisit how you absorb information.

Rust has always been about reducing bugs, with a specific focus on bugs as a result of undefined behavior. It has never, isn't and will never be able to eliminate all bugs. At minimum, you need formal methods for that.

Rust programs can and will have bugs as a result of undefined behavior. The value proposition is that their incidence should be markedly lower than programs written in C or C++ (i.e., implementations of languages that are memory unsafe by default).

gishh

7 hours ago

> If that's what you inferred was being sold based on random HN commentary, then you probably want to revisit how you absorb information.

Heard, chef.

Dylan16807

11 hours ago

In a local sense, "quit out safely when the config is corrupt" is pretty correct.

Coordinated systems that test and rollback are way beyond the scope of what a compiler can check.

gishh

10 hours ago

What about “detect when the content isn’t correct and take protective measures so that a core service of the global internet _doesn’t_ crash?” Wasn’t that the whole point of rust? I’ll repeat again “if it compiles it is almost absolutely correct” is a mantra I see on hn daily.

Apparently that isn’t true.

Edit: isn’t the whole idea of C/C++ being flawed pivoted around memory management and how flawed the languages are? Wasn’t the whole point of rust to eliminate that whole class of errors? XSS and buffer overflows are almost always caused by “malformed” outside input. Rust apparently doesn’t protect against that.

Dylan16807

9 hours ago

If you corrupt memory, a huge variety of unpredictable bad things can happen.

If you exit, a known bad thing happens.

No language can protect you from a program's instructions being broken. What protective measures do you have in mind? Do they still result in the service ceasing to process data and reporting a problem to the central controller? The difference between "stops working and waits" and "stops working and calls abort()" is not much, and usually the latter is preferred because it sets off the alarms faster.

Tell me what specifically you want as correct behavior in this situation.

jacquesm

an hour ago

Ok, I'll take a stab at that:

I would expect such a critical piece of code to be able to hot-load and validate a new configuration before it is put into action. I would expect such a change to be rolled out gradually, or at least as gradually as required to ensure that it functions properly before it is able to crash the system wholesale.

I can't say without a lot more knowledge about the implementation and the context what the best tools would be to achieve this but I can say that crashing a presently working system because of a config fuckup should not be in the range of possible expected outcomes.

Because config fuckups are a fact of life so config validation before release is normal.

saati

11 hours ago

Who lost a hundred billion dollars?

ViewTrick1002

16 hours ago

I’ve been seeing you blazing this trail since the incident and it feels short sighted and reductive.

Rust is built on forcing the developer to acknowledge the complexity of reality. Unwrap acknowledges said complexity with a perfectly valid decision.

There are a few warts from early days like indexing and the ”as” operator where the easy path is doing the wrong thing.

But unwraps or expects are where Rust shines. Throwing up your hands is a perfectly reasonable response.

With your approach, what should Cloudflare have done?

Return an error, log it and return a 500 result due to invalid config? They could fail open, but then that opens another enormous can of worms.

There simply are no good options.

The issue rests upstream where deployments and effects between disparate services needs to be mapped and managed.

Which is a truly hard problem, rather than blaming the final piece throwing up its hand when given an invalid config.

echelon

16 hours ago

> I’ve been seeing you blazing this trail since the incident and it feels a short sighted and reductive.

Why is it inappropriate to be able to statically label the behavior?

Maybe I don't want my failure behavior dictated by a downstream dependency or distracted engineer.

The subject of how to fail is a big topic and is completely orthogonal to the topic of how can we know about this and shape our outcomes.

I would rather the policy be encoded with first class tools rather than engineering guidelines and runbooks. Let me have some additional control at what looks like to me not a great expense.

It doesn't feel "safe" to me to assume the engineer meant to do exactly this and all of the upstream systems accounted for it. I would rather the code explicitly declare this in a policy we can enforce, in an AST we can shallowly reason about.

ViewTrick1002

16 hours ago

How deep do you go? Being forced to label any function that allocates memory with ”panic”?

Right now you all the instances where the code can panic are labeled. Grep for unwrap, panic, expect etc.

In all my years of professional Rust development I’ve never seen a potential panic pass code review without a discussion. Unless it was trivial like trying to build an invalid Regex from a static string.

echelon

16 hours ago

Malloc is fair game.

Unwrap, slice access, etc. are not.

asa400

10 hours ago

You probably know about these, but for the benefit of folks who don't, you can forbid slice access and direct unwraps with clippy. Obviously this only lints your own code and not dependencies.

  - https://rust-lang.github.io/rust-clippy/master/#unwrap_used
  - https://rust-lang.github.io/rust-clippy/master/#indexing_slicing
  - https://rust-lang.github.io/rust-clippy/master/#string_slice

dpark

16 hours ago

So slicing is forbidden in this scheme? But not malloc?

This doesn’t seem to be a principled stance on making the language safer. It feels a bit whack-a-mole. “Unwrap is pretty easy to give up. I could live without slicing. Malloc seems hard though. I don’t want to give that up.”

echelon

16 hours ago

I posted about why this is important for distributed systems engineering:

https://news.ycombinator.com/item?id=46060907

Malloc is fine. We can and do monitor that. It's these undetectable runtime logic problems that are land mines.

In distributed systems, these can cause contagion and broad outages. Recovering can be very difficult and involve hours of complex steps across dozens of teams. Meanwhile you're losing millions, or even hundreds of billions, of dollars for you and your customers.

Someone unwrapping() a Serde wire message or incorrectly indexing a payload should not cause an entire fleet to crash. The tools should require the engineer handle these problems with language features such as Result<>.

Presently, who knows if your downstream library dependency unwrap()s under the hood?

This is a big deal and there could be a very simple and effective fix.

The Cloudflare outage was a multi-billion dollar outage. I have personally been involved in multiple hundred million dollar outages at fintechs, so forgive me for being passionate about this.

dpark

15 hours ago

I don’t actually work in Rust. I think I understand what you’re going for, though. The choice to use panic as a way of propagating errors is fundamentally problematic when it can arise from code you don’t control and potentially cannot even inspect.

I don’t necessarily agree that malloc should be okay (buggy code could try to allocate a TB of memory and OOMKiller won’t fix it) but I can understand that it’s probably workable in most cases.

Unfortunately I think the fix here would require a compatibility break.

ViewTrick1002

16 hours ago

And now the endless bikeshedding has begun.

Thanks for making abundantly clear how such a feature wouldn’t solve a thing.

echelon

16 hours ago

https://news.ycombinator.com/item?id=46060907

Copying this so you see it too -

The Cloudflare outage was a multi-billion dollar outage. I have personally been involved in multiple hundred million dollar outages at fintechs, so forgive me for being passionate about this.

Several of the outages I've been involved in were the result of NPEs or incorrectly processing runtime data. Rust has tools to enforce safety here, but it doesn't have tools to enforce your use of them. If also doesn't have a way to safeguard you from others deciding the behavior for you.

There is potentially a very easy set of non-onerous features we could build that allow us to prevent this.

sfink

4 hours ago

Except that the outage would still have happened without that .unwrap(). So go ahead and build those features, they sound useful, but don't think that they'd save you from a failure like this.

As the poster here said, the place to build in features that would have prevented this from happening is the DB schema and queries. 5NF would be onerous overkill here, but it seems reasonable to have some degree of forced normalization for something that could affect this much.

(Requiring formal verification of everything involved here would be overkilling the overkill, otoh.)

ottah

17 hours ago

I would argue the largest CDN provider in the world is a critical path.

rdtsc

17 hours ago

I would guess at the individual team level they probably still behave like any other tech shop. When the end of the year comes the higher-ups still expect fancy features and accomplishments and saying "well, we spent months writing a page of TLA+ code" is not going to look as "flashy" as another team who delivered 20 new features. It would take someone from above to push and ask that other team who delivered 20 features, where is their TLA+ code verifying their correctness. But, how many people in the middle management chain would do that?

lenkite

16 hours ago

We need modern programming languages with formal verification built-in - should be applicable to specially demarcated functions/modules. It is a headache to write TLA+ and keep the independent spec up2date with the productive code.

FloorEgg

16 hours ago

I agree with you.

I would just add that I've noticed organizations tend to calcify as they get bigger and older. Kind of like trees, they start out as flexible saplings, and over time develop hard trunks and branches. The rigidity gives them stability.

You're right that there's no way they could have gotten to where they are if they had prioritized data integrity and formal verification in all their practices. Now that they have so much market share, they might collapse under their own weight if their trunk isn't solid. Maybe investing in data integrity and strongly typed, functional programming that's formally verifiable is what will help them keep their market share.

Cultures are hard to change and I'm not suggesting an expectation for them to change beyond what is feasible or practical. I don't lead an engineering organization like it so I'm definitely armchairing here. I just see some of the logic of the argument that them adopting some of these methods would probably benefit everyone using their services.

ihaveajob

17 hours ago

Thank you for putting this in such clear terms. It really is a Catch-22 problem for startups. Most of the time, you can't reach scale unless you cut some corners along the way, and when you reach scale, you benefit from NOT cutting those corners.

SoftTalker

17 hours ago

Why is being able to "capture the market" something we want to encourage? This leads to monopolies or oligopolies and makes possible various types of abuse that a free competitive market would normally correct.

If you're going to step into the role of managing a large percentage of public internet traffic, maybe you need to be held to a different standard and set of rules than a startup trying to get a foothold among dozens or hundreds of other competitors. Something more like a public utility than a private enterprise.

xoa

17 hours ago

The three other replies you've gotten so far have given some generically applicable though still good answers, but I want to address something regard Cloudflare specifically: a major part of their entire core goal and value proposition revolves around being able to defend their customers from continuously scaling ever larger hostile attacks. This isn't merely a case of "natural selection" or what a company/VCs might desire, but that it's hard to see how under the current (depressing, shitty) state of the Internet it'd be possible to cheaply defend against terabit-plus class DDOS and the like without Cloudflare level scale in turn. And "cheaply" is in fact critical too because the whole point of resource exhaustion attacks is that they're purely economic, if it costs many times more to mitigate them then to launch and profit from them then the attackers are going to win in the end. Ideally we'd be solving this collective action problem collectively with standards amongst nations and ISPs to mitigate or eliminate botnets at the source, but we have to trundle along as best we can in the mean time right? I'm not sure there is room for a large number of players in Cloudflare's role, and they've been a pretty dang decent one so far.

immibis

17 hours ago

It doesn't matter what "we" "encourage". This is a natural selection process: all sorts of teams exist, and then the market decides to be captured by certain ones. We do not prescribe which attributes capture the market; we discover them.

xeromal

17 hours ago

I assume wanting a company to succeed is fundamental to hacker news. The world is better of with CF being around for sure

dzikimarian

17 hours ago

You would have to completely flip how funding works. As of now most VCs have abysmal returns, so heightening the bar is last thing on their mind.

swiftcoder

16 hours ago

> It limits, radically, the percent of engineers you can hire (to those who understand this and are willing to work this way), and it slows deployment radically.

We could also invest in tooling to make this kind of thing easier. Unclear why humans need to hand-normalise the database schema - isn't this exactly the kind of thing compilers are good at?

TYPE_FASTER

16 hours ago

What I have seen work in the past is testing using a production backup as a final step prior to releasing, including applying database scripts. In this case, the permissions change would have been executed, the query would have run, and the failure would have been observed.

PunchyHamster

17 hours ago

I'd not be surprised if root of the issue was some engineer who didn't add DB selector because in other SQL engines SELECT like that would select from currently connected database vs all of them

femiagbabiaka

9 hours ago

database normalization and formal verification aren't on the same level of difficulty in terms of implementation, and we all could do the former from the beginning, if we choose to (nobody ever chooses to)

Dylan16807

11 hours ago

Formally verifying code is an enormous endeavor.

But a normalized database without NULL should not be a significant burden.

no_wizard

16 hours ago

Why is this inherently slower?

There’s for example, languages or features of languages that work entirely on not allowing these things.

I ask because I feel like I’m missing something

yearolinuxdsktp

16 hours ago

Not to mention that perfectly normalizing a database always incurs join overhead that limits horizontal scalability. In fact, denormalization is required to achieve scale (with a trade-off).

I’m not sure how formal verification would’ve prevented this issue from happening. In my experience, it’s unusual to have to specify a database name in the query. How could have formal verification covered this outcome?

The recommendations don’t make sense saying that the query needed DISTINCT and LIMIT. Don’t forget that the incoming data was different (r0 and default did not return the same exact data, this is why the config files more than doubled in size), so using DISTINCT would have led to uncertain blending of data, producing neither result and hiding the double-database read altogether. Secondly, LIMIT only makes sense to use in conjunction with a failure circuit breaker (if LIMIT items is returned, fail the query). When does it make business-logic sense to LIMIT the query-in-question’s result? And do you think the authors would have known how to set the LIMIT to not exceed the configuration file consumers’ limitations?

The article says: > “You can’t reliably catch that with more tests or rollouts or flags. You prevent it by construction—through analytical design.”

That’s the big design up front fallacy. Of course you can catch it reliably with more tests, and limit the damage with flags and rollouts. There’s zero guarantee that the analytical design would’ve caught this up front.

aishsh

17 hours ago

I’d be with you except that cloudflare prioritizes profit over doing a good job (layoffs, offshoring, etc). You don’t get to make excuses when you willingly reduced quality to keep your profits high.

uoaei

9 hours ago

> It limits, radically, the percent of engineers you can hire (to those who understand this and are willing to work this way), and it slows deployment radically.

Never seen the amoral-capitalist argument to stunting progress at the expense of profit put so succinctly!

locknitpicker

17 hours ago

This sort of Monday morning quarterbacking is pointless and only serves as a way for random bloggers to try to grab credit without actually doing or creating any value.

nmoura

17 hours ago

I disagree. I learnt good stuff from this article and it’s enough.

locknitpicker

17 hours ago

> I disagree. I learnt good stuff from this article and it’s enough.

That's perfectly fine. It's also besides the point though. You can learn without reading random people online cynically shit talking others as a self promotion strategy. This is junior dev energy manifesting junior level understanding of the whole problem domain.

There's not a lot to learn from claims that boil down to "don't have bugs".

engineeringwoke

16 hours ago

I laughed out loud when he said Cloudflare should have formally verified its systems.

DrSusanCalvin

16 hours ago

Not to single you out in particular, but I see this sentiment among programmers a lot and to me it's akin to a structural engineer saying "I laughed out loud when he said they should analyze the forces in the bridge".

rcxdude

2 hours ago

more like "I laughed out loud when he said they should FEM the whole structure, down to the last bolt and strand of cable".

(More seriously, 'formal verification' is not a single thing, more a class of techniques which allow you to statically guarantee some properties of the system you are working with. When you propose it, you should have a clear idea of what properties you care about and how you intend to prove them, as well as a strong concern about whether those properties are actually going to capture enough of what you care about for it to be worthwhile)

engineeringwoke

16 hours ago

You can't formally verify anything that uses consensus, which is the backbone of the entire web. It's a complete non-starter.

DrSusanCalvin

16 hours ago

Care to elaborate? Perhaps the tools to do this in practice aren't there (which just shows how young the field of software "engineering" really is), but what consensus are you talking about and how is it an obstacle to verifying code? Most of the web follows standards and protocols, which actually sort of a prerequisite for communications across different systems...

engineeringwoke

16 hours ago

Basically the modern web uses orchestration, for pretty much everything. Usually Kubernetes is doing that. Theoretically protocols like RAFT are formally verifiable, but their implementations in orchestration tools like etcd have not been, and I would go so far as to say that that is an impossible task. Therefore, the entire exercise is kind of silly.

DrSusanCalvin

15 hours ago

Thanks, interesting. However, that just seems like a protocol like any other, with no real reason why you "can't" formally verify it. Is there something special about a consensus algorithm / protocol that makes it any more difficult to verify than any other algorithm which doesn't yet have a formally verified implementation?

Edit: https://link.springer.com/chapter/10.1007/978-3-319-48989-6_...

engineeringwoke

15 hours ago

That would be like saying that you can verify the software that CERN uses to measure the Higgs Boson because we verified general relativity.

DrSusanCalvin

14 hours ago

> You can't formally verify anything that uses consensus

What did you mean by this then? There certainly seems to be nothing special about consensus that makes it any harder to verify than anything else. It's not fundamentally impossible to verify the software that CERN uses, it just takes some work.

orionometer

16 hours ago

A bridge failing is a high likelihood of death or serious injury. How many people died or were seriously injured in the latest Cloudflare outage?

For life or death systems, I agree that we should be looking to implement analogous processes/systems to a structural engineer or doctor, etc. Cloudflare is not a life or death system. If you operate a life or death system and you have Cloudflare as a single point of failure, for some reason, that should not be Cloudflare's problem.

DrSusanCalvin

2 hours ago

Sure, but this sentiment is why software "engineering" isn't really. You can justify it by not being important enough for actual engineering practices I guess, but to me it's a lack of pride in and care of your product.

hatthew

9 hours ago

> How many people died or were seriously injured in the latest Cloudflare outage?

I would not be surprised if the answer is "several". The average impact per human is obviously pretty small, but across billions of humans, there will be outliers.

Maybe a fire department uses a coordination system that relies on cloudflare, and with cloudflare down they have to resort to their backup system, and their backup system works but is slightly worse and causes one engine to be delayed in their response, and because they're 3 minutes late, they just miss being able to save someone from the fire.

Maybe someone's running a scientific study on nutrition, and the cloudflare outage means their data collection system is goes down for a bit, so their data flawed, and they end up just barely not passing a some necessary threshold, and they have to rerun their study, and that takes an extra week, and then they miss that quarter's deadline, and then the resulting adjustment to a product/procedure is delayed, and that 3 month delay causes 100,000 people to be slightly more malnourished than they would be otherwise, and one of those people ends up just barely too unhealthy to survive an unrelated deadly illness.

Sure, these scenarios are far-fetched. The chance of if it happening is one-in-a-million.

There are 10000 one-in-a-million people on the earth.

rvnx

17 hours ago

It's very similar to LinkedIn posts, where everybody seems to know better than the people actually running the platforms.

DrSusanCalvin

16 hours ago

This article actually explains how this bug in particular could have been avoided. Sure you may not consider his approach realistic, but it's not at all saying "don't have bugs". In fact, not having formal verification or similar tooling in place, would be more like saying "just don't write buggy code".

locknitpicker

15 hours ago

> This article actually explains how this bug in particular could have been avoided.

Not really. The article is a textbook example of hindsight bias. It's a simplistic analysis of a far more complex problem that goes over the blogger's head, and results in a string of simplistic assertions that fail to address any of the issues. Read up on the definition of monday morning quarterback.

DrSusanCalvin

2 hours ago

Read up on the value of snarky and dismissive comments spouting simplistic cliches.

galleywest200

17 hours ago

> You can learn without reading random people online

Somebody has to write something in the first place for one to learn from it, even if the writing is disagreeable.

locknitpicker

16 hours ago

You failed to cite the comment you were replying to.

The comment is:

> You can learn without reading random people online cynically shit talking others as a self promotion strategy.

alpinisme

17 hours ago

Not commenting on the quality of this post but occasional writing that responds to an event provides a good opportunity to share thoughts that wouldn’t otherwise reach an audience. If you post advice without a concrete scenario you’re responding to, it’s both less tangible for your audience and less likely to find an audience when it’s easier to shrug off (or put off).

oivey

17 hours ago

What did you learn? The suggestions in the post seem pretty shallow and non-actionable.

udev4096

16 hours ago

Backdooring the internet is certainly a productive venture!

notepad0x90

11 hours ago

You may have missed the point of HN.

echelon

17 hours ago

Like your comment? j/k :)

I'm using this incident to draw attention to Rust's panic behavior.

Rust could use additional language features to help us write mostly panic-free* code and statically catch even transitive dependencies that might subject us to unnecessary panics.

We've been talking about it on our team and to other Rust folks, and I think it's worth building a proposal around. Rust should have a way to statically guarantee this never happens. Opt-in at first, but eventually the default.

* with the exception of malloc failures, etc.

tracker1

17 hours ago

It's already in the box... there's a bunch of options from unwrap_or, etc... to actually checking the error result and dealing with it cleanly... that's not what happened.

Not to mention the possibility of just bumping up through Result<> chaining with an app specific error model. The author chose neither... likely because they want the app to crash/reload from an external service. This is often the best approach to an indeterminate or unusable state/configuration.

echelon

16 hours ago

> This is often the best approach to an indeterminate or unusable state/configuration.

The engineers had more semantic tools at their disposal for this than a bare `unwrap()`.

This was a systems failure. A better set of tools in Rust would have helped mitigate some of the blow.

`unwrap()` is from pre-1.0 Rust, before many of the type system-enabled error safety features existed. And certainly before many of the idiomatic syntactic sugars were put into place.

I posted in another thread that Rust should grow annotation features to allow us to statically rid or minimize our codebase of panic behavior. Outside of malloc failures, we should be able to constrain or rid large classes of them with something like this:

    panic fn my_panicky_function() {
      None.unwrap(); // NB: `unwrap()` is also marked `panic` in stdlib 
    }

    fn my_safe_function() {
      // with a certain compiler or Crates flag, this would fail to compile
      // as my_safe_function isn't annotated as `panic`
      my_panicky_function() 
    }

Obviously just an idea, but something like this would be nice. We should be able to do more than just linting, and we should have tools that guarantee transitive dependencies can't blow off our feet with panic shotguns.

In any case, until something is done, this is not the last time we'll hear unwrap() horror stories.

tracker1

16 hours ago

What you're suggesting is perfectly reasonable, I wouldn't object to labeling methods that can panic via bare unwrap...

I'm just saying that having a program immediately exit (via panic or not) could very well be the appropriate behavior.

hodgesrm

16 hours ago

> I base my paragraph on their choice of abandoning PostgreSQL and adopting ClickHouse(Bocharov 2018). The whole post is a great overview on trying to process data fast, without a single line on how to garantee its logical correctness/consistency in the face of changes.

I'm completely mystified how the author concludes that the switch from PostgreSQL to ClickHouse shows the root of this problem.

1. If the point is that PostgreSQL is somehow more less prone to error, it's not in this case. You can make the same mistake if you leave off the table_schema in information_schema.columns queries.

2. If the point is that Cloudflare should have somehow discovered this error through normalization and/or formal methods, perhaps he could demonstrate exactly how this would have (a) worked, (b) been less costly than finding and fixing the query through a better review process or testing, and (c) avoided generating other errors as a side effect.

I'm particularly mystified how lack of normalization is at fault. ClickHouse system.columns is normalized. And if you normalized the query result to remove duplicates that would just result in other kinds of bugs as in 2c above.

Edit: fix typo

cmckn

17 hours ago

I agree it should not have happened, but I don’t agree that the database schema is the core problem. The “logical single point of failure” here was created by the rapid, global deployment process. If you don’t want to take down all of prod, you can’t update all of prod at the same time. Gradual deployments are a more reliable defense against bugs than careful programming.

btown

17 hours ago

One of the things I find fascinating about this is that we don't blink twice about the idea that an update to a "hot" cache entry that's "just data" should propagate rapidly across caches... but we do have change management and gradual deployments for code updates and meaningful configuration changes.

Machine learning feature updates live somewhere in the middle. Large amounts of data, a need for unsupervised deployment that can react in seconds, somewhat opaque. But incredibly impactful if something bad rolls out.

I do agree with the OP that the remediation steps in https://blog.cloudflare.com/18-november-2025-outage/#remedia... seem undercooked. But I'd focus on something entirely different than trying to verify the creation of configuration files. There should be real attention to: "how can we take blue/green approaches to allowing our system to revert to old ML feature data and other autogenerated local caches, self-healing the same way we would when rolling out code updates?"

Of course, this has some risk in Cloudflare's context, because attackers may very well be overjoyed by a slower rollout of ML features that are used to detect their DDoS attacks (or a rollout that they can trigger to rollback by crafting DDoS attacks).

But I very much hope they find a happy medium. This won't be the last time that a behavior-modifying configuration file gets corrupted. And formal verification, as espoused by the OP, doesn't help if the problem is due to a bad business assumption, encoded in a verified way.

yodon

17 hours ago

>Gradual deployments are a more reliable defense against bugs than careful programming

The challenge, as I understand it, is that the feature in question had an explicit requirement of fast, wide deployment because of the need to react in real time to changing external attacker behaviors.

cmckn

16 hours ago

Yeah, I don’t know how fast “fast” needs to be in this system; but my understanding is this particular failure would have been seen immediately on the first replica. The progression could still be aggressive after verifying the first wave.

packetslave

12 hours ago

yep, and it was this exact requirement that also caused the exact same outage back in 2013 or so. DDoS rules were pushed to the GFE (edge proxy) every 15 seconds, and a bad release got out. Every single GFE worldwide crashed within 15 seconds. That outage is in the SRE book.

wnevets

17 hours ago

> Gradual deployments are a more reliable defense against bugs than careful programming.

Wasn't this one of the key takeaways from the crowdstrike outage?

tptacek

17 hours ago

Cloudflare doesn't seem to have called it a "Root Cause Analysis" and, in fact, the term "root cause" doesn't appear to occur in Prince's report. I bring this up because there's a school of thought that says "root cause analysis" is counterproductive: complex systems are always balanced on the precipice of multicausal failure.

Analemma_

17 hours ago

When I was at AWS, when we did postmortems on incidents we called it "root cause analysis", but it was understood by everyone that most incidents are multicausal and the actual analyses always ended up being fishbone diagrams.

Probably there are some teams which don't do this and really do treat RCA as trying to find a sole root cause, but I think a lot of "getting mad at RCA" is bikeshedding the terminology, and nothing to do with the actual practice.

tptacek

17 hours ago

Right, I'm not a semantic zealot on this point, but the post we're commenting on really does suggest that the Cloudflare incident had a root cause in basic database management failures, which is the substantive issue the root-cause-haters have with the term.

otterley

9 hours ago

The layered-swiss-cheese model of understanding incidents tends to map to the real world better than the alternatives.

otterley

9 hours ago

These days we tend to spend more time thinking about the "5 whys" (which often turn into more than 5) than the root cause itself. It's much more productive and useful.

cyberax

17 hours ago

> to find a sole root cause

"Six billion years ago the dust around the young Sun coalesced into planets"

luhn

17 hours ago

"Workaround: If we wait long enough, the earth will eventually be consumed by the sun."

https://xkcd.com/1822/

parados

17 hours ago

True, and I agree, but from their report they do seem to be doing Root Cause Analysis (RCA) even if they don't call it that.

RCA is a really bad way of investigating a failure. Simply put; if you show me your RCA I know exactly where you couldn't be bothered to look any further.

I think most software engineers using RCA confuse the "cause" ("Why did this happen") with the solution ("We have changed this line of code and it's fixed"). These are quite different problem domains.

Using RCA to determine "Why did this happen" is only useful for explaining the last stages of an accident. It focuses on cause->effect relationships and tells a relatively simple story but one that is easy to communicate - Hi there managers and media! But RCA only encourages simple countermeasures which will probably be ineffective and will be easily outrun by the complexity of real systems

However one thing RCA is really good at is allocating blame. If your organisation is using RCA then, what ever you pretend, your organisation has a culture of blame. With a blame culture (rather than a reporting culture) your organisation is much more likely to fail again. You will lack operational resilience.

PunchyHamster

17 hours ago

then rename it to "root causes analysis"

nine_k

17 hours ago

* The unwrap() in production code should have never passed code review. Damn, it should have been flagged by a linter.

* The deployment should have followed the blue/green pattern, limiting the blast radius of a bad change to a subset of nodes.

* In general, a company so much at the foundational level of internet connectivity should not follow the "move fast, break things" pattern. They did not have an overwhelming reason to hurry and take risks. This has burned a lot of trust, no matter the nature of the actual bug.

kalkin

16 hours ago

Unless you work at Cloudflare it seems very unlikely that you have enough information about systems and tradeoffs there to make these flat assertions about what "should have" happened. Systems can do worse things than crashing in response to unexpected states. Blue/green deployment isn't always possible (eg due to constrained compute resources) or practical (perhaps requiring greatly increased complexity), and is by no means the only approach to reducing deploy risk. We don't know that any of the related code was shipped with a "move fast, break things" mindset; the most careful developers still write bugs.

Actually learning from incidents and making systems more reliable requires curiosity and a willingness to start with questions rather than mechanically applying patterns. This is standard systems-safety stuff. The sort of false confidence involved in making prescriptions from afar suggests a mindset I don't want anywhere near the operation of anything critical.

nine_k

14 hours ago

Indeed, I never worked at Cloudflare. Still I have some nebulous idea about Cloudflare, and especially their scale.

Systems can do worse things than crashing in response to unexpected states, but they can also do better to report them and terminate gracefully. Especially if the code runs on so many nodes, and the crash renders them unresponsive.

Blue/green deployment isn't always possible, but my imagination is a bit weak, and I cannot suggest a way to synchronously update so many nodes literally all over the internet. A blue/green deployment happens in large distributed systems willy-nilly. It's better when it happens in a controlled way, and the safety of a change that affects basically the entire fleet is tested under real load before applying it everywhere.

I do not even assume that any of Cloudflare's code was ever shipped with the "move fast, break things" mindset; I only posit that such a mindset is not optimal for a company in the Cloudflare's position. Their motto might rather be "move smooth, never break anything"; I suppose that most of their customers value their stability higher than their speed of releasing features, or whatnot.

Starting with questions is a very right way, I agree. My first question: why calling unwrap() might ever be a good idea in production code, and especially in some config-loading code, which, to my mind, should be resilient, and ready to handle variations in the config data gracefully? Certain mechanical patterns, like "don't hit your finger with a hammer", are best applied universally by default, with the rare exceptional cases carefully documented and explained, not the other way around.

kalkin

11 hours ago

I appreciate that this comment is much less prescriptive. I don't think I disagree with you about any general best practices here (although I do think unwrap can be fine when you can locally verify the error or nil case is unreachable but proving that to the compiler is impractical.)

whazor

17 hours ago

The scale of the outage was so big and global, that the biggest failure was indeed the blast radius.

hnthrowaway0328

17 hours ago

I wish they do burn a lot of trust to show up in their financial reports. Otherwise it is like "we do not like it but gonna use it anyway".

spwa4

15 hours ago

* The step in front of this query created updates to policies. It should have been limited in the number of changes it would do at once (and ideally per hour and per day and so on), and if it goes over that limit, stop updating, alert and wait until explicitly unblocked. DO NOT generate invalid config and start using that invalid config, use the previous one that worked and alert.

If this happens during startup use a default one.

That would still create impact (customers and developers would not see updates propagate), but would avoid destroying the service. When it comes to outages, people need to learn to go over what happens in the case of violating an invariant and look at what gets sacrificed in those cases, to make sure the answer isn't "the whole service".

If I get to be impolite, you do this because software architects, as seems to be the case here, often choose "crash and destroy the service" when their invariants are violated instead of "stop doing shit and alert" when faced with an unknown problem, or a problem they can't deal with.

This also requires test-crashing. You introduce an assert? Great! The more the merrier, seriously, you should have lots of them. BUT you will be including a test that the world doesn't end when your assert is hit.

zahlman

17 hours ago

> the blue/green pattern

EvanAnderson

16 hours ago

This specific terminology was new to me, too: https://en.wikipedia.org/wiki/Blue%E2%80%93green_deployment

echelon

17 hours ago

unwrap() and the family of methods like it are a Rust anti-pattern from the early days of Rust. It dates back to before many of the modern error-handling and safety-conscious features of the language and type system.

Rust is being pulled in so many different directions from new users that the language perhaps never originally intended. Some engineers will be fine with panicky behavior, but a lot of others want to be able to statically guarantee most panics (outside of perhaps memory allocation failures) cannot occur.

We need more than just a linter on this. A new language feature that poisons, marks, or annotates methods that can potentially panic (for reasons other than allocation) would be amazing. If you then call a method that can panic, you'll have to mark your own method as potentially panicky. The ideal future would be that in time, as more standard library and 3rd party library code adopts this, we can then statically assert our code cannot possibly panic.

As it stands, I'm pretty mortified that some transitive dependency might use unwrap() deep in its internals.

asa400

9 hours ago

> unwrap() and the family of methods like it are a Rust anti-pattern from the early days of Rust. It dates back to before many of the modern error-handling and safety-conscious features of the language and type system.

I think your argument would be more effective if you dropped this angle.

Unwrap is assert. No more, no less. It's tremendously useful to have it in the language for situations when the cost of encoding some invariant in your program's types is far larger than the benefit you'd gain from doing so. It's not some vestigial thing from way back before anyone received enlightenment that they could use sum types to discriminate errors. It's just a different tool.

I completely agree there are systems and situations where you want to be able to statically verify an absence of panics, but the way you're describing the situation sounds similar to when I hear folks describe anything that came before as "legacy" with a denigrating inflection.

burntsushi

11 hours ago

> As it stands, I'm pretty mortified that some transitive dependency might use unwrap() deep in its internals.

You'll have to go without std and even the `core` library then.

Ysraes

11 hours ago

This is starting to sound a lot like checked exceptions in Java.

bigstrat2003

5 hours ago

Checked exceptions were a great idea which are still, to this day, unfairly maligned.

Ysraes

4 hours ago

I don't necessarily disagree, I wasn't making a judgement positively or negatively.

fwip

11 hours ago

Does #[no_panic] do it for you? https://docs.rs/no-panic/latest/no_panic/

1a527dd5

12 hours ago

Unless you work at Cloudflare or have worked at Cloudflare I'm not sure opinions like this help.

You don't know the context, you don't know _anything_ except for what Cloudflare chooses to share.

There are very few companies who deal with the kind of load that Clouldflare does, I dread to think what weird edges cases they've run into because of their sheer scale.

IshKebab

12 hours ago

Casually suggesting formally verifying the software too.

this_user

17 hours ago

Of course it shouldn't have happened. But if you run infrastructure as complex as this on the scale that they do, and with the agility that they need, then it was bound to happen eventually. No matter how good you are, there is always some extremely unlikely chain of events that will lead to a catastrophic out. Given enough time, that chain will eventually happen.

hvb2

17 hours ago

> A central database query didn’t have the right constraints to express business rules. Not only it missed the database name, but it clearly needs a distinct and a limit, since these seem to be crucial business rules.

In a database, you wouldn't solve this with a distinct or a limit? You would make the schema guarantee uniqueness?

And yes, that wouldn't deal with cross database queries. But the solution here is just the filter by db name, the rest is table design.

jrm4

17 hours ago

Nothing in this thread about "this should not have happened because Cloudflare is too centralized?"

We have far better ideas and working prototypes in terms of how to prevent this from happening again to be up here trying to "fix Cloudflare."

Think bigger, y'all.

PunchyHamster

17 hours ago

> but it clearly needs a distinct and a limit, since these seem to be crucial business rules.

Isn't that just... wrong ? Throwing arbitrary limit (vs maybe having some alert when the table is too long) would just silently truncate the list

Anybody can be backseat engineer by throwing out industry's best practices like they were gospel but you have to look at entire system, not just the database part

ku1ik

2 hours ago

Also please appreciate how fast this site is. The average website bloat is imperceptible until you open a page like this.

kjuulh

17 hours ago

It did happen, and cloudflare should learn from it, but not just the technical reasons.

Instead of focusing on the technical reasons why, they should answer how such a change bubbled out to cause such a massive impact instead.

Why: Proxy fails requests

Why: Handlers crashed because of OOM

Why: Clickhouse returns too much data

Why: A change was introduced causing double the amount of data

Why: A central change was rolled out immediately to all cluster (single point of failure)

Why: There are exemptions or standard operating procedure (gate) for releasing changes to the hot path for cloudflares network infra.

While the Clickhouse change is important, I personally think it is crucial that Cloudflare tackles the processes, and possibly gates / controls rollout for hot path system, no matter what kind of change they are when they're at that scale it should be possible. But that is probably enough co-driving. It to me seems like a process issue more than a technical one.

lysace

17 hours ago

Very quick rollout is crucial for this kind of service. On top of what you wrote, institutionalizing rollback by default if something catastrophically breaks should be the norm.

Been there in those calls, begging to people in charge who perhaps shouldn't have been, "eh, maybe we should attempt a rollback to the last known good state? cause, it, you know.... worked". But investigating further before making any change always seems to be the preferred action to these people. Can't be faulted for being cautious and doing things properly, right? I kid you not - this is their instinct.

If I recall correctly it took CF 2 hours to roll back the broken changes.

So if I were in charge of Cloudflare (4-5k employees) I'd both look at the processes and the people in charge.

vablings

12 hours ago

It does seem insane to me that there isnt a process to catch the panic, unwind back to a reasonable place in the call stack, load the last known good configuration and continue execution as normal. You would go from having a global 2 hour outage to a warning on a dashboard that can be investigated in a timely manner rather than blowing up half the internet

mvkel

6 hours ago

This piece feels a lot like someone criticizing an umpire's call after watching the slo-mo fifteen times and concluding the ball was actually a strike.

Way different from the umpire's pov

notepad0x90

11 hours ago

The real RCA (IMHO) is not simulating outages in production as part of reliability engineering.

Whatever process was stuck in a loop, crashed, or whatever service (db, dns,etc..) was unavailable, that outage scenario can be simulated. Changes can have an automated rollback requirement.

My take away is that CF has single points of failure they're aware of, and for business reasons, they've decided to not have a redundancy/failover.

> ...and formally verified code, this bug would not have happened.

That's what I mean, "we should have caught the bug" , yeah, but that isn't reliability engineering. You assume there will be bugs/outages and prepare for them instead. What happens if the entire DB entered a weird state and was spitting out valid results with incorrect values? What happens if it accepts connections and just stalls?

You prepare for bugs that don't yet exist, you fix bugs that do exist.

jmull

16 hours ago

I think the author is trying to apply a preconceived cause on to the cloudflare outage, but there’s not a fit.

E.g., they should try to work through how their own suggested fix would actually ensure the problem couldn’t happen. I don’t believe it would… lack of nullable fields and normalization typically simplify relational logic, but hardly prevent logical errors. Formal verification can prove your code satisfies a certain formal specification, but doesn’t prove your specification solves your business problem (or makes sense at all, in fact).

testemailfordg2

11 hours ago

https://blog.cloudflare.com/18-november-2025-outage/

"Customers deployed on the new FL2 proxy engine, observed HTTP 5xx errors. Customers on our old proxy engine, known as FL, did not see errors, but bot scores were not generated correctly, resulting in all traffic receiving a bot score of zero."

This simply means, the exception handling quality of your new FL2 is non-existent and is not at par / code logic wise similar to FL.

I hope it was not because of AI driven efficiency gains.

2d8a875f-39a2-4

16 hours ago

TFA has a point that it should never have happened, and that CF software engineering practices are likely to blame.

But a BCNF (or 5NF or whatever) database without nullable columns wouldn't have prevented it. Formally verified code might have but that remains a pipe dream for any significant code base.

The proposed cure is worse than the disease.

necovek

16 hours ago

I have to disagree on the tests not potentially helping here. Finding the right abstraction layer is hard, but there was obviously no integration test that tested wherever the original query was being constructed and where the output was being used. A single smoke test would have failed the same way their actual infra failed when the change was introduced.

Obviously, that's not to say that writing normalized database schemas and formal specification won't reduce the number of problems you will introduce. But people make mistakes anywhere, which could have been the case here with the query even if the DB was in a NF (and it still could have been in their case), or in the formal spec as well.

There is no magic bullet for correctness, unfortunately.

avereveard

11 hours ago

"If only the world was perfect the world would be perfect"

Author fails to mention how to actually formally verify this asynchronous globally replicated product. He may have solved the delivery theorem and if that's so I encourage him sharing the results.

> No nullable fiels.

Author appears to have not formally verified his post's grammar.

juujian

17 hours ago

They are not going as far as to blame PostgreSQL, but their switch to ClickHouse seems to suggest that they see PostgreSQL as part of the equation. Would ClickHouse really prevent this type of error from occurring? PostgreSQL already has so many options for setting up solid constrains for data entry. Or do they not have anyone on the team anymore (or never had) who could set up a robust PostgreSQL database? Or are they just piggybacking on the latest trend?

mediumsmart

4 hours ago

Cloudflare is actually an internet outage waiting to happen.

yakovsi

16 hours ago

Adding distinct or group by to a query is not some advanced technic comments are suggesting. It does not slow down development one bit, if you expect distinct result you put explicit distinct in the query, it's not a "safety measure for insulin pumps". Scratching my head what I've missed here, please enlighten me.

anonymars

11 hours ago

DISTINCT would just be masking the query bug

Random DISTINCT is usually a code smell that indicates an incorrect join / filter

linsomniac

16 hours ago

I'd be wanting to have some sort of a "dry run" on the produced artifact by the rust code consuming it, or a deploy to some sort of a test environment before letting it roll out to production. I've been surprised that no mention of that sort of thing in the Cloudflare after-action or here.

pizlonator

17 hours ago

> No nullable fiels.

If you take away nullability, you eventually get something like a special state that denotes absence and either:

- Assertions that the absence never happens.

- Untested half-baked code paths that try (and fail) to handle absence.

> formally verified

Yeah, this does prevent most bugs.

But it's horrendously expensive. Probably more expensive than the occasional Cloudflare incident

k3vinw

17 hours ago

I was expecting a critique on the centralized nature of the infrastructure and the fragility that comes with it.

foresto

12 hours ago

Do you mean Cloudflare's design, or the widespread reliance on Cloudflare?

I was hoping for a critique of the latter.

9cb14c1ec0

13 hours ago

As an aside, I find it really interesting how Cloudflare has morphed from CDN/DDOS protection into a services conglomerate that many startups could use for every compute need they have.

ruuda

17 hours ago

Sure, a different database schema may have helped, but there are going to be bugs either way. In my view a more productive approach is to think about how to limit the blast radius when things inevitably do go wrong.

nyrikki

17 hours ago

Hindsight bias is always easier but:

> FAANG-style companies are unlikely to adopt formal methods or relational rigor wholesale. But for their most critical systems, they should. It’s the only way to make failures like this impossible by design, rather than just less likely.

That relational rigor imposes what one chooses to be true, it isn’t a universal truth.

The frame problem and the qualification problem apply here.

The open domain frame problem == HALT.

When you can for a problem into the relational model things are nice but not everything can be reduced to a trivial property.

That is why Codd had to as nulls etc..

You can choose to decide that the queen is rich OR pigs can fly; but a poor queen doesn’t result in flying pigs.

Choice over finite sets == finite indexes over sets == PEM

If you can restrict your problems to where the Entscheidungsproblem is solvable you can gain many benefits

But it is horses for courses and sub TC.

nyrikki

16 hours ago

I expect the downvotes here but it is important.

It doesn’t matter if you get there through Trakhtenbrot or Rice.

Codd’s normal form is a projection, it will turn your fancy model logic into classic logic.

IMHO it is always something to look for to use as a default, but fails if it is a hard requirement.

One classic way to describe the problem is the White king and Alice.

> ‘I see nobody on the road,’ said Alice.

> ‘I only wish I had such eyes,’ the King remarked in a fretful tone. ‘To be able to see Nobody! And at that distance, too! Why, it’s as much as I can do to see real people, by this light!’

Codd added nulls to handle unknowns or missing data.

The proper use of them is a complex subject. But they are required if you care about semantic correctness and not just logical validity in many cases.

Diaconescu-Goodman-Myhill theorem[0] will show the equivalence between PEM, finite indexes, and choice

[0] https://ncatlab.org/nlab/show/Diaconescu-Goodman-Myhill+theo...

ramon156

16 hours ago

While this blog post is pretty useless, it's a hell of a lot better than the LinkedIn posts about the outage... my god, I wish the "Not interested" button worked.

block_dagger

16 hours ago

I initially read the title as "Cloudflare outrage.." and I was thinking how nice someone is thinking of the poor engineers who crashed the Internet.

RenThraysk

17 hours ago

Would be interesting to see the DDL of the table, to see if it had unique constraints.

The query not utilising an unique constraint/index should have raised a red flag.

mikece

17 hours ago

Yes, pretty basic looking mistakes that, from the outside, make many wonder how this got through. Though analyzing the post-mortem makes me think of the MV Dali crashing into the Francis Scott Key bridge in Baltimore: the whole thing started with a single loose wire which set off a cascading failure. CF's situation was similar in a few ways though finding a bad query (and .unwrap() in production code rather than test code) should have been a lot easier to spot.

Have any of the post-mortems addressed if any of the code that led to CloudFlare's outage was generated by AI?

bell-cot

17 hours ago

> ...makes me think of the MV Dali crashing...

Yes. Though compared to Cloudflare's infrastructure, the Dali is a wooden rowboat. And CF doesn't have the "...or people will die" safety criticality.

jacquesm

17 hours ago

> And CF doesn't have the "...or people will die" safety criticality.

I disagree with that. Just because you can't point to people falling off a bridge into the water doesn't mean that outages of the web at this scale will not lead to fatalities.

bell-cot

15 hours ago

Technically true.

OTOH...whether you describe it as regulations, an SLA, or otherwise - "150,000 ton freighter destroys a major bridge and kills people" is a far worse violation of expected behavior than "lots of web sites went down".

jacquesm

14 hours ago

I see where people use CF and I actually think that 'lots of websites went down' has the potential these days to in aggregate kill far more people than were killed by the Dali losing control over their helm. The Dali accident could also have been avoided by simply requiring ships with the gross tonnage to do damage to the bridge to have mandatory tugs, and I'm not so sure there is a clean and effective solution for the kind of issues that CF can create.

They're more like 'the shipping industry' than they are like 'a single out of control vessel'. Keep in mind that half of the health care industry or more uses CF to protect their assets.

aforwardslash

17 hours ago

rolls eyes

No, their error was that they shouldn't be querying system tables to perform field discovery; the same method in postgresql (pg_class or whatever its called) would have had the same result. The simple alternative is to use "describe table <table_name>".

On top of that, they shouldn't be writing ad-hoc code to query system tables, but having a separate library instead to perform those kind of task mixed with business logic (crappy application design).

Also, this should never have passed code review in the first place, but lets assume it did because errors happen, and this kind of atrocious code and flaky design is not uncommon.

As an example, they could be reading this data from CSV files *and* have made the same mistake. Conflating this with "database design errors" is just stupid - this is not a schema design error, this is a programmer error.

devy

12 hours ago

Author's real cause prevention notes.

> 1. No nullable fiels.

Is that a typo there? fiels should be fields?

See for example https://danluu.com/algorithms-interviews/. This sort of thing happens constantly.

knorker

12 hours ago

No, this is nonsense and look like university student naivety.

What caused it was rolling out a change and moving on to the next recipient without checking if the previous task instantly died.

You can't prevent all crash bugs, but you can check if you are lasering your whole prod.