ojosilva
11 hours ago
This is the multi-million dollar .unwrap() story. In a critical path of infrastructure serving a significant chunk of the internet, calling .unwrap() on a Result means you're saying "this can never fail, and if it does, crash the thread immediately." The Rust compiler forced them to acknowledge this could fail (that's what Result is for), but they explicitly chose to panic instead of handling it gracefully. This is a textbook violation of "parse, don't validate".
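For readers less familiar with Rust, a sketch of what that call amounts to (`parse_feature_file` is a made-up stand-in, not Cloudflare's code):

    fn parse_feature_file(bytes: &[u8]) -> Result<Vec<String>, String> {
        // stand-in for the real deserializer
        std::str::from_utf8(bytes)
            .map(|s| s.lines().map(str::to_owned).collect())
            .map_err(|e| e.to_string())
    }

    fn main() {
        let bytes = b"feature-a\nfeature-b";
        // What the code did, in effect:
        let features = parse_feature_file(bytes).unwrap(); // panics on Err, killing the thread
        // What `.unwrap()` expands to, roughly:
        let same = match parse_feature_file(bytes) {
            Ok(f) => f,
            Err(e) => panic!("called `Result::unwrap()` on an `Err` value: {e:?}"),
        };
        assert_eq!(features, same);
    }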
I know, this is "Monday morning quarterbacking", but that's what you get for an outage this big that had me tied up for half a day.
abalone
6 hours ago
I’ve led multiple incident responses at a FAANG; here’s my take. The fundamental problem here is not Rust or the coding error. The problem is:
1. Their bot management system is designed to push a configuration out to their entire network rapidly. This is necessary so they can rapidly respond to attacks, but it creates risk as compared to systems that roll out changes gradually.
2. Despite the elevated risk of system wide rapid config propagation, it took them 2 hours to identify the config as the proximate cause, and another hour to roll it back.
SOP for stuff breaking is you roll back to a known good state. If you roll out gradually and your canaries break, you have a clear signal to roll back. Here was a special case where they needed their system to rapidly propagate changes everywhere, which is a huge risk, but didn’t quite have the visibility and rapid rollback capability in place to match that risk.
While it’s certainly useful to examine the root cause in the code, you’re never going to have defect free code. Reliability isn’t just about avoiding bugs. It’s about understanding how to give yourself clear visibility into the relationship between changes and behavior and the rollback capability to quickly revert to a known good state.
Cloudflare has done an amazing job with availability for many years and their Rust code now powers 20% of internet traffic. Truly a great team.
polack
2 hours ago
They failed on so many levels here.
How can you write the proxy without handling a config that contains more than the maximum feature limit you set yourself?
How can the database export query not have a limit set if there is a hard limit on number of features?
Why do they do non-critical changes in production before testing in a stage environment?
Why did they think this was a cyberattack and only after two hours realize it was the config file?
Why are they that afraid of a botnet? Does not leave me confident that they will handle the next Aisuru attack.
I'm migrating my customers off Cloudflare. I don't think they can swallow the next botnet attacks, and everyone on Cloudflare will go down with the ship, so it will be safer not to be behind Cloudflare when it hits.
michaelt
9 minutes ago
> Why did they think this was a cyberattack
Isn’t getting cyberattacked their core business?
huijzer
26 minutes ago
> They failed on so many levels here.
That's often the case with human error, as aviation safety experts in particular know: https://en.wikipedia.org/wiki/Swiss_cheese_model
cowsandmilk
2 hours ago
> Why do they do non-critical changes in production before testing in a stage environment?
I guess the noncritical change here was the change to the database? My experience has been that a lot of teams do a poor job of keeping a faithful replica of their databases in stage environments, which would expose this type of issue.
vsl
an hour ago
> Why did they think this was a cyberattack and only after two hours realize it was the config file?
They explain that at some length in TFA.
raxxorraxor
2 hours ago
I don't think these are realistic requirements for any engineered system, to be honest. What is realistic is to have contingencies for such cases, which are simply errors.
But the case for Cloudflare here is complicated. Every engineer is very free to make a better system though.
jve
an hour ago
> I'm migrating my customers off Cloudflare.
Is that an overreaction?
Name me global, redundant systems that have not (yet) failed.
And if you used Cloudflare to protect against botnets and now move off Cloudflare... you are vulnerable and may experience more downtime if you cannot swallow the traffic.
I mean, no service has 100% uptime - just that some have more nines than others.
Carriethebest
an hour ago
There are many self-hosted alternatives to protect against botnets. We don't have to use Cloudflare. Everything is under their control!
KronisLV
an hour ago
> There are many self-hosted alternatives to protect against botnet.
What would some good examples of those be? I think something like Anubis is mostly against bot scraping, not sure how you'd mitigate a DDoS attack well with self-hosted infra if you don't have a lot of resources?
On that note, what would be a good self-hosted WAF? I recall using mod_security with Apache and the OWASP ruleset, apparently the Nginx version worked a bit slower (e.g. https://www.litespeedtech.com/benchmarks/modsecurity-apache-... ), there was also the Coraza project but I haven't heard much about it https://coraza.io/ or maybe the people who say that running a WAF isn't strictly necessary also have a point (depending on the particular attack surface).
Genuine questions.
jve
31 minutes ago
Well, if you self-host a DDoS protection service, that would be VERY expensive. You would need to rent rack space along with a very fast internet connection at multiple data centers to host this service.
sofixa
an hour ago
> There are many self-hosted alternatives to protect against botnet
Whatever you do, unless you have their bandwidth capacity, at some point those "self-hosted" will get flooded with traffic.
purple_turtle
an hour ago
Can you name three of these many alternatives?
How do they magically manage a DDoS larger than their bandwidth?
If the plan is to have more bandwidth than any DDoS, it is going to get expensive, quickly.
rkachowski
an hour ago
Classic @devops_borat
"To make error is human. To propagate error to all server in automatic way is #devops"
matternot
2 hours ago
I don't understand why they didn't validate and sanitize the new config file revision. If it's bad (whatever the reason is), throw an error and revert to the previous version. You don't need to take down the whole internet for that.
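A minimal sketch of that validate-then-swap, keep-last-known-good pattern (all names and limits hypothetical, not Cloudflare's actual code):

    const MAX_FEATURES: usize = 200;

    struct BotConfig {
        features: Vec<String>,
    }

    #[derive(Debug)]
    enum ConfigError {
        Malformed,
        TooManyFeatures(usize),
    }

    impl BotConfig {
        fn parse(raw: &str) -> Result<Self, ConfigError> {
            // stand-in for the real deserializer
            let features: Vec<String> = raw.lines().map(str::to_owned).collect();
            if features.is_empty() {
                return Err(ConfigError::Malformed);
            }
            Ok(BotConfig { features })
        }
    }

    // Validate first; only swap once every check passes. On Err the caller
    // logs and keeps serving with the last-known-good config instead of crashing.
    fn apply_config(current: &mut BotConfig, raw: &str) -> Result<(), ConfigError> {
        let candidate = BotConfig::parse(raw)?;
        if candidate.features.len() > MAX_FEATURES {
            return Err(ConfigError::TooManyFeatures(candidate.features.len()));
        }
        *current = candidate;
        Ok(())
    }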
WJW
an hour ago
Same as for almost every bug I think: the dev in question hadn't considered that the input could be bad in the way that it turned out to be. Maybe they were new, or maybe they hadn't slept much because of a newborn baby, or maybe they thought it was a reasonable assumption that there would never be more than 200 ML features in the array in question. I don't think this developer will ever make the same mistake again at least.
Let those who have never written a bug before cast the first stone.
adriand
5 minutes ago
> Maybe they were new, or maybe they hadn't slept much because of a newborn baby
Reminds me of House of Dynamite, the movie about nuclear apocalypse that really revolves around these very human factors. This outage is a perfect example of why relying on anything humans have built is risky, which includes the entire nuclear apparatus. “I don’t understand why X wasn’t built in such a way that wouldn’t mean we live in an underground bunker now” is the sentence that comes to mind.
JB_Dev
5 hours ago
Does their ring-based rollout really, truly have to be 0->100% in a few seconds?
I don’t really buy this requirement. At least make it configurable with a more reasonable default for “routine” changes. E.g. ramping to 100% over 1 hour.
As long as that ramp rate is configurable, you can retain the ability to respond fast to attacks by setting the ramp time to a few seconds if you truly think it’s needed in that moment.
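A sketch of how small that knob could be (hypothetical helper, assuming each node hashes itself to a stable point in [0,1)):

    use std::time::Duration;

    /// Fraction of the fleet that should be on the new config `elapsed` into a
    /// rollout. `ramp` is the configurable knob: a few seconds for an active
    /// attack, an hour for routine changes.
    fn rollout_fraction(elapsed: Duration, ramp: Duration) -> f64 {
        if ramp.is_zero() {
            return 1.0; // emergency mode: everyone, immediately
        }
        (elapsed.as_secs_f64() / ramp.as_secs_f64()).min(1.0)
    }

    // A node whose stable hash h is in [0,1) applies the change once
    // rollout_fraction(...) >= h, so the upgraded set grows monotonically
    // instead of jumping 0 -> 100%.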
cowsandmilk
an hour ago
The configuration file is updated every five minutes, so clearly they have some past experience that made them decide an hour is too long. That said, even a rollout over five minutes can be helpful.
NicoJuicy
3 hours ago
I think defence against a DDoS on your network is the best reason for a quick rollout.
matteocontrini
an hour ago
This was not about DDoS defense but the Bot Management feature, which is a paid, Enterprise-only feature, not enabled by default, that blocks automated requests regardless of whether an attack is going on.
https://developers.cloudflare.com/bots/get-started/bot-manag...
inemesitaffia
an hour ago
So if you didn't enable it your stuff would work?
matteocontrini
an hour ago
It would still fail if you were unluckily on the new proxy (it's not very clear why, indeed, if the feature was not enabled):
> Unrelated to this incident, we were and are currently migrating our customer traffic to a new version of our proxy service, internally known as FL2. Both versions were affected by the issue, although the impact observed was different.
> Customers deployed on the new FL2 proxy engine, observed HTTP 5xx errors. Customers on our old proxy engine, known as FL, did not see errors, but bot scores were not generated correctly, resulting in all traffic receiving a bot score of zero. Customers that had rules deployed to block bots would have seen large numbers of false positives. Customers who were not using our bot score in their rules did not see any impact.
jabl
3 hours ago
Maybe, but in that case maybe have some special-casing logic: detect that yes, indeed, we're under a massive DDoS at this very moment, and do a rapid rollout of the thing that will mitigate said DDoS. Otherwise use the default slower one?
Of course, this is all so easy to say after the fact..
xp84
2 hours ago
Isn’t CF under a ‘massive DDOS’ 24/7 pretty much by definition? When does malicious traffic rest, and how many targets of same aren’t using CF?
NicoJuicy
2 hours ago
It's literally in the blog post as well
> In the internal incident chat room, we were concerned that this might be the continuation of the recent spate of high volume Aisuru DDoS attacks:
dev_l1x_be
4 hours ago
Exactly the right take. Even when you want to have rapid changes on your infra, do it at least by region. You can start with the region where the least amount of users are impacted and if everything is fine, there is no elevated number of crashes for example, you can move forward. It was a standard practice at $RANDOM_FAANG when we had such deployments.
abalone
2 hours ago
Thank you. I am sympathetic to CF’s need to deploy these configs globally fast and don’t think slowing down their DDoS mitigation is necessarily a good trade-off. What I am saying is that this presents a bigger reliability risk and needs correspondingly finely crafted observability around such config changes, plus a rollback runbook. Greater risk -> greater attention.
HelloNurse
2 hours ago
The "coding error" is a somewhat deliberate choice to fail eagerly that is usually safe but doesn't align with the need to do something (propagation of the configuration file) without failing.
I'm sure that there are misapplied guidelines to do that instead of being nice to incoming bot management configuration files, and someone might have been scolded (or worse) for proposing or attempting to handle them more safely.
pas
4 hours ago
Rolling out new code should be done differently than rolling out new data to fight bots.
If every time there's a new bot someone needs to write code that can blow up their whole service, maybe they need to iterate a bit on this design?
watchful_moose
2 hours ago
This isn't what they do, though. This is a data/config push - original article says _a “feature file” used by our Bot Management system_
jbs789
5 hours ago
Thanks for this assessment.
In a productive way, this view also shifts the focus to improving the system (visibility etc.) and empowering the team, rather than focusing on the code which broke (which probably strikes fear into individuals about doing anything at all!).
zelphirkalt
2 hours ago
These are just two different layers. Of course the code is also a problem, if it is in fact as the GP describes it. You are taking the higher-level view, which is the second layer: dealing not only with this specific mistake, but also with other mistakes that can come from arbitrary code paths.
Both are important, and I am pretty sure that someone is going to fix that line of code pretty soon.
tormeh
3 hours ago
Partial disagree. There should be lints against 'unwrap's. An 'expect' at least forces you to write down why you are so certain it can't fail. An unwrap is not just hubris, it's also laziness, and has no place in sensitive code.
And yes, there is a lint you can use against slicing ('indexing_slicing') and it's absolutely wild that it's not on by default in clippy.
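For reference, gating all three at the crate root looks like this; these are real clippy restriction lints, enforced when running cargo clippy:

    // In lib.rs / main.rs; checked by `cargo clippy`, not plain rustc.
    #![deny(clippy::unwrap_used)]      // every unwrap becomes an error
    #![warn(clippy::expect_used)]      // expect still allowed, but flagged for review
    #![deny(clippy::indexing_slicing)] // foo[i] and &foo[a..b] must justify themselves

    fn main() {}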
ignoramous
5 hours ago
> Their bot management system is designed to push a configuration out to their entire network rapidly.
Once every 5m is not "rapidly". It isn't uncommon for configuration systems to do it every few seconds [0].
> While it’s certainly useful to examine the root cause in the code.
I believe the issue is as much that an output from a periodic run (a ClickHouse query), triggered by an on-the-surface unrelated change, caused this failure. That is, the system that validated the configuration (FL2) was different from the one that generated it (the ML Bot Management DB).
Ideally, the system that vends a complex configuration also vends & tests the library that consumes it, or the system that consumes it does so as if it were "tasting" the configuration first before devouring it unconditionally [1].
Of course, as with all distributed system failures, this is all easier said than done in hindsight.
[0] Avoiding overload in distributed systems by putting the smaller service in control (pg 4), https://d1.awsstatic.com/builderslibrary/pdfs/Avoiding%20ove...
[1] Lessons from CloudFront (2016), https://youtube.com/watch?v=n8qQGLJeUYA&t=1050
abalone
5 hours ago
By rapid I mean a rapid rollout of changes to 100% of the fleet, not how often changes are made.
Hamuko
5 hours ago
>Once every 5m is not "rapidly".
Isn't "rapidly" more about how long it takes to get from A to Z rather than how often it is performed? You can push out a configuration update every fortnight, but if it goes through all of your global servers in three seconds, I'd call it quite rapid.
nrhrjrjrjtntbt
4 hours ago
This guy SREs
ithkuil
2 hours ago
Back when it meant Site Reliability Engineer and not Sysadmin Really Expensive
wrs
11 hours ago
It seems people have a blind spot for unwrap, perhaps because it's so often used in example code. In production code an unwrap or expect should be reviewed exactly like a panic.
It's not necessarily invalid to use unwrap in production code if you would just call panic anyway. But just like every unsafe block needs a SAFETY comment, every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.
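For example (a trivial sketch; here the string literal is the infallibility argument):

    #![deny(clippy::unwrap_used)]

    fn loopback() -> std::net::IpAddr {
        // INFALLIBILITY: "127.0.0.1" is a compile-time literal that always
        // parses; reviewed like a panic would be.
        #[allow(clippy::unwrap_used)]
        let ip = "127.0.0.1".parse().unwrap();
        ip
    }

    fn main() {
        println!("{}", loopback());
    }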
brabel
4 hours ago
Yes, I always thought it was wrong to use unwrap in examples. I know, people want to keep examples simple, but it trains developers to use unwrap() as they see it everywhere. Yes, there are places where it's OK, as that blog post explains so well: https://burntsushi.net/unwrap/ But most devs IMHO don't have the time to make the call correctly most of the time... so it's better to do something safer, like handle the error and try to recover, or if that's impossible, at least do `expect("damn it, how did this happen")`.
gwd
2 hours ago
> Yes, I always thought it was wrong to use unwrap in examples.
And because it gets picked up by LLMs. It would be interesting to know if this particular .unwrap() was written by a human.
wongarsu
2 hours ago
> at least do `expect("damn it, how did this happen")`
That gives you the same behavior as unwrap with a less useful error message, though. In theory you can write useful messages, but in practice (and in your example) expect is rarely better than unwrap in modern Rust.
frumplestlatz
2 hours ago
I have to disagree that unwrap is ever OK. If you have to use unwrap, your types do not match your problem. Fix them. You have encoded invariants in your types that do not match reality.
Change your API boundary, surface the discrepancy between your requirements and the potential failing case at the edges where it can be handled.
If you need the value, you need to handle the case that it's not available explicitly. You need to define your error path(s).
Anything else leads to, well, this.
inferiorhuman
3 hours ago
Dunno, I think the alternatives have their own pretty significant downsides. All would require front-loading a more in-depth understanding of error handling, and some would just be quite a bit more verbose.
IMO making unwrap a clippy lint (or perhaps a warning) would be a decent start. Or maybe renaming unwrap.
dist1ll
10 hours ago
> every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.
How about indexing into a slice/map/vec? Should every `foo[i]` have an infallibility comment? Because they're essentially `get(i).unwrap()`.
10000truths
10 hours ago
Yes? Funnily enough, I don't often use indexed access in Rust. Either I'm looping over elements of a data structure (in which case I use iterators), or I'm using an untrusted index value (in which case I explicitly handle the error case). In the rare case where I'm using an index value that I can guarantee is never invalid (e.g. graph traversal where the indices are never exposed outside the scope of the traversal), then I create a safe wrapper around the unsafe access and document the invariant.
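In sketch form, the first two cases (hypothetical helpers):

    // Untrusted index: surface the miss as a value rather than a panic.
    fn lookup(scores: &[f64], i: usize) -> Option<f64> {
        scores.get(i).copied() // None when i is out of bounds
    }

    // Looping: iterators leave no index to get wrong in the first place.
    fn total(scores: &[f64]) -> f64 {
        scores.iter().sum()
    }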
dist1ll
9 hours ago
If that's the case then hats off. What you're describing is definitely not what I've seen in practice. In fact, I don't think I've ever seen a crate or production codebase that documents infallibility of every single slice access. Even security-critical cryptography crates that passed audits don't do that. Personally, I found it quite hard to avoid indexing for graph-heavy code, so I'm always on the lookout for interesting ways to enforce access safety. If you have some code to share that would be very interesting.
koito17
4 hours ago
> I don't think I've ever seen a crate or production codebase that documents infallibility of every single slice access.
The smoltcp crate typically uses runtime checks to ensure slice accesses made by the library do not cause a panic. It's not exactly equivalent to GP's assertion, since it doesn't cover "every single slice access", but it at least covers slice accesses triggered by the library's public API. (i.e. none of the public API functions should cause a panic, assuming that the runtime validation after the most recent mutation succeeds).
Example: https://docs.rs/smoltcp/latest/src/smoltcp/wire/ipv4.rs.html...
zelphirkalt
2 hours ago
I think this goes against Rust's goals in terms of performance. Good for safe code, of course, but usually Rust users like to have compile-time safety that makes runtime safety checks unnecessary.
10000truths
7 hours ago
My rule of thumb is that unchecked access is okay in scenarios where both the array/map and the indices/keys are private implementation details of a function or struct, since an invariant is easy to manually verify when it is tightly scoped as such (a sketch of the binary-search case follows the list). I've seen it used in:
* Graph/tree traversal functions that take a visitor function as a parameter
* Binary search on sorted arrays
* Binary heap operations
* Probing buckets in open-addressed hash tables
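For instance, the binary-search case, where the invariant justifying the index lives entirely inside the function (a sketch, not production code):

    // lo < hi <= sorted.len() is maintained by the loop, so `mid` is always
    // in bounds; the invariant is local to the function, which is what makes
    // the indexed access easy to review.
    fn find(sorted: &[u32], target: u32) -> Option<usize> {
        let (mut lo, mut hi) = (0usize, sorted.len());
        while lo < hi {
            let mid = lo + (hi - lo) / 2; // lo <= mid < hi <= len
            match sorted[mid].cmp(&target) {
                std::cmp::Ordering::Equal => return Some(mid),
                std::cmp::Ordering::Less => lo = mid + 1,
                std::cmp::Ordering::Greater => hi = mid,
            }
        }
        None
    }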
hansvm
9 hours ago
> graph-heavy code
Could you share some more details, maybe one fully concrete scenario? There are lots of techniques, but there's no one-size-fits-all solution.
dist1ll
8 hours ago
Sure, these days I'm mostly working on a few compilers. Let's say I want to make a fixed-size SSA IR. Each instruction has an opcode and two operands (which are essentially pointers to other instructions). The IR is populated in one phase, and then lowered in the next. During lowering I run a few peephole and code motion optimizations on the IR, and then do regalloc + asm codegen. During that pass the IR is mutated and indices are invalidated/updated. The important thing is that this phase is extremely performance-critical.
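For context, the index+arena shape in question looks roughly like this (all names made up):

    // Instruction handles are arena indices, not pointers; a newtype keeps
    // them from being mixed up with other integers.
    #[derive(Clone, Copy, PartialEq, Eq)]
    struct InstrId(u32);

    #[derive(Clone, Copy)]
    enum Op {
        Const(i64),
        Add,
        Mul,
    }

    #[derive(Clone, Copy)]
    struct Instr {
        op: Op,
        operands: [InstrId; 2], // "pointers" to other instructions
    }

    struct Ir {
        instrs: Vec<Instr>, // the arena; compacted/reordered between passes
    }

    impl Ir {
        fn get(&self, id: InstrId) -> &Instr {
            &self.instrs[id.0 as usize] // every hot-path access funnels through here
        }
    }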
wrs
7 hours ago
And it's fine for a compiler to panic when it violates an assumption. Not so with the Cloudflare code under discussion.
tux3
10 hours ago
Usually you'd want to write almost all your slice or other container iterations with iterators, in a functional style.
For the 5% of cases that are too complex for standard iterators? I never bother justifying why my indexes are correct, but I don't see why not.
You very rarely need SAFETY comments in Rust because almost all the code you write is safe in the first place. The language also gives you the tools to avoid manual iteration (not just for safety, but because it lets the compiler eliminate bounds checks), so it would actually be quite viable to write these comments, since you only need them when you're doing something unusual.
wrs
9 hours ago
I didn't restate the context from the code we're discussing: it must not panic. If you don't care if the code panics, then go ahead and unwrap/expect/index, because that conforms to your chosen error handling scheme. This is fine for lots of things like CLI tools or isolated subprocesses, and makes review a lot easier.
So: first, identify code that cannot be allowed to panic. Within that code, yes, in the rare case that you use [i], you need to at least try to justify why you think it'll be in bounds. But it would be better not to.
There are a couple of attempts at getting the compiler to prove that code can't panic (e.g., the no-panic crate).
lenkite
6 hours ago
What about memory allocation - how will you stop that from panicking? `Vec::resize` can panic on allocation failure, and that's just one example out of thousands in the Rust stdlib.
Unless the language addresses no-panic in its governing design or allows try-catch, I'm not sure how you go about this.
wrs
6 hours ago
That is slowly being addressed, but meanwhile it’s likely you have a reliable upper bound on how much heap your service needs, so it’s a much smaller worry. There are also techniques like up-front or static allocation if you want to make more certain.
NewJazz
6 hours ago
Yep and this postmortem details how their proxy modules use static allocation.
assbuttbuttass
6 hours ago
In TFA they mentioned they preallocate all the memory up front
kibwen
9 hours ago
Indexing is comparatively rare given the existence of iterators, IMO. If your goal is to avoid any potential for panicking, I think you'd have a harder time with arithmetic overflow.
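The panic-free spellings do exist, they're just verbose; a small sketch (Cargo's `overflow-checks = true` profile setting is also there if you'd rather panic loudly in release than wrap silently):

    // Panic-free arithmetic: make the overflow case a value, not a
    // debug-build panic.
    fn add_quota(used: u32, extra: u32) -> Option<u32> {
        used.checked_add(extra) // None on overflow
    }

    fn main() {
        assert_eq!(add_quota(2, 3), Some(5));
        assert_eq!(add_quota(u32::MAX, 1), None);
        assert_eq!(250u8.saturating_add(10), 255); // clamp instead of wrap/panic
    }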
echelon
8 hours ago
Cargo needs to grow a label for crates that provably do not panic. (Neverminding allocations and things outside our control flow.)
I want to ban crates that panic from my dependency chain.
The language could really use an extra set of static guarantees around this. I would opt in.
lenkite
6 hours ago
> I want to ban crates that panic from my dependency chain.
Which means banning anything that allocates memory and thousands of stdlib functions/methods.
echelon
6 hours ago
See the immediately preceding sentence.
I'm fine with allocation failures. I don't want stupid unwrap()s, improper slice access, or other stupid and totally preventable behavior.
There are things inside the engineer's control. I want that to not panic.
throwaway2037
4 hours ago
Your pair of posts is very interesting to me. Can you share with me: What is your programming environment such that you are "fine with allocation failures"? I'm not doubting you, but for me, if I am doing systems programming with C or C++, my program is doomed if a malloc fails! When I saw your post, I immediately thought: Am I doing it wrong? If I get a NULL back from malloc(), I just terminate with an error message.
PhilipRoman
15 minutes ago
Not to mention overcommit has become standard behavior on many systems, so you wouldn't even get a NULL unless you really tried.
deredede
4 hours ago
Not GP but I read "I'm fine with allocation failures" as "I'm OK with my program terminating if it can't allocate (but not for other errors)".
johnisgood
2 hours ago
I mean, yeah, if I am using a library, then as a user of that library I would like to be able to handle the error myself. Having the library decide to panic, for example, is the opposite of that.
phire
7 hours ago
I think I'd prefer a compile-time guarantee.
Something that allows me to annotate a function (or my whole crate) as "no panic", and get a compile error if the function or anything it calls has a reachable panic.
This would allow it to work with many unmodified crates, as long as constant propagation can prove that any panics are unreachable. This approach would also allow crates to provide panicking and non-panicking versions of their API (which many already do).
aw1621107
6 hours ago
I think the most common solution at the moment is dtolnay's no_panic [0]. That has a bunch of caveats, though, and the ergonomics leave something to be desired, so a first-party solution would probably be preferable.
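Usage is a single attribute; the main caveat is that violations surface as link errors (a sketch assuming the no-panic crate's documented behavior):

    use no_panic::no_panic;

    // The build fails to link if the compiler can't prove this never panics.
    #[no_panic]
    fn head(s: &str) -> Option<char> {
        s.chars().next() // fine: panic-free
        // `Some(s.chars().next().unwrap())` here would break the build instead
    }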
echelon
6 hours ago
Yes, I want that. I also want to be able to (1) statically apply a badge on every crate that makes and meets these guarantees (including transitively with that crate's own dependencies) so I can search crates.io for stronger guarantees and (2) annotate my Cargo.toml to not import crates that violate this, so time isn't wasted compiling - we know it'll fail in advance.
On the subject of this, I want more ability to filter out crates in our Cargo.toml. Such as a max dependency depth. Or a frozen set of dependencies that is guaranteed not to change so audits are easier. (Obviously we could vendor the code in and be in charge of our own destiny, but this feels like something we can let crate authors police.)
yakshaving_jgt
2 hours ago
This sounds a little bit like Safe Haskell, which never really took off.
dist1ll
10 hours ago
For iteration, yes. But there's other cases, like any time you have to deal with lots of linked data structures. If you need high performance, chances are that you'll have to use an index+arena strategy. They're also common in mathematical codebases.
danielheath
10 hours ago
I mean... yeah, in general. That's what iterators are for.
dehrmann
9 hours ago
It's the same blind spot people have with Java's checked exceptions. People commonly resort to Pokemon exception handling, either blindly ignoring or rethrowing as a runtime exception. When Rust got popular, I was a bit confused by people talking about how great Result is: it's essentially a checked exception without a stack trace.
Terr_
9 hours ago
"Checked Exceptions Are Actually Good" gang, rise up! :p
I think adoption would have played out very differently if there had only been some more syntactic sugar. For example, an easy syntax for saying: "In this method, any (checked) DeepException e that bubbles up should immediately be replaced by a new (checked) MylayerException(e) that contains the original one as a cause."
We might still get lazy programmers making systems where every damn thing goes into a generic MylayerException, but that mess would still be way easier to fix later than a hundred scattered RuntimeExceptions.
twhitmore
an hour ago
Exception handling would be better than what we're seeing here.
The problem is that any non-trivial software is composition, and encapsulation means most errors aren't recoverable.
We just need easy ways to propagate exceptions out to the appropriate reliability boundary, i.e. the transaction/request/config loading, and fail it sensibly, with an easily diagnosable message and without crashing the whole process.
C# or unchecked Java exceptions are actually fairly close to ideal for this.
The correct paradigm is "prefer throw to catch" -- requiring devs to check every ret-val just created thousands of opportunities for mistakes to be made.
By contrast, a reliable C# or Java version might have just 3 catch clauses and handle errors arising below sensibly without any developer effort.
https://literatejava.com/exceptions/ten-practices-for-perfec...
Skeime
2 hours ago
There is also the problem that they decided to make all references nullable, so `NullPointerException`s could appear everywhere. This "forced" them to introduce the escape hatch of `RuntimeException`, which of course was way overused immediately, normalizing it.
bigstrat2003
8 hours ago
I'm with you! Checked exceptions are actually good and the hate for them is super short sighted. The exact same criticisms levied at checked exceptions apply to static typing in general, but people acknowledge the great value static types have for preventing errors at compile time. Checked exceptions have that same value, but are dunked on for some reason.
never_inline
7 hours ago
The dislike is probably because of a few reasons.
1. in most cases they don't want to handle `InterruptedException` or `IOException` and yet need to bubble them up. In that case the code is very verbose.
2. It makes lambdas and functions incompatible. So e.g. if you're passing a function to forEach, you're forced to wrap checked exceptions in a runtime exception.
3. Due to (1) and (2), most people become lazy and do `throws Exception` which negates most advantages of having exceptions in the first place.
In line-of-business apps (where Java is used the most), an uncaught exception is not a big deal. It will bubble up and get handled somewhere far up the stack (e.g. by the server logger) without disrupting other parts of the application. This reduces the utility of having every function throw InterruptedException / IOException when those hardly ever happen.
dehrmann
5 hours ago
> 2. it makes lambdas and functions incompatible.
This is true, but the hate predated lambdas in Java.
ErikCorry
2 hours ago
You could always manually build the same thing as a lambda with a class, and you had the same problem.
frumplestlatz
2 hours ago
> an uncaught exception is not a big deal
In my experience, it actually is a big deal, leaving a wake of indeterminate state behind after stack unwinding. The app then fails with heisenbugs later, raising more exceptions that get ignored, compounding the problem.
People just shrug off that unreliability as an unavoidable cost of doing business.
Terr_
7 hours ago
Yeah, in both cases it's a layering situation, where it's the duty of your code to decide what layers of abstraction need to be bridged, and to execute on that decision. Translating/wrapping exception types from deeper functions is the same as translating/wrapping return types in the same places.
I think it comes down to a psychological or use-case issue: People hate thinking about errors and handling them, because it's that hard stuff that always consumes more time than we'd like to think. Not just digitally, but in physical machines too. It's also easier to put off "for later."
lenkite
6 hours ago
Checked exceptions were good in theory, but Java simply did not add facilities to handle or support them well in many APIs. Even the newer APIs in Java (Streams, etc.) do not support checked exceptions.
gwbas1c
9 hours ago
It's a lot lighter: a stack trace takes a lot of overhead to generate; a result has no overhead for a failure. The overhead (panic) only comes once the failure can't be handled. (Most books on Java/C# don't explain that throwing exceptions has high performance overhead.)
Exceptions force a panic on all errors, which is why they're supposed to be used in "exceptional" situations. To avoid exceptions when an error is expected, (eof, broken socket, file not found,) you either have to use an unnatural return type or accept the performance penalty of the panic that happens when you "throw."
In Rust, the stack trace happens at panic (unwrap), which is when the error isn't handled. I.e., it's not when the file isn't found, it's when the error isn't handled.
branko_d
4 hours ago
> Exceptions force a panic on all errors
What do you mean?
Exceptions do not force panic at all. In most practical situations, an exception unhandled close to where it was thrown will eventually get logged. It's kind of a "local" panic, if you will, that will terminate the specific function, but the rest of the program will remain unaffected. For example, a web server might throw an exception while processing a specific HTTP request, but other HTTP requests are unaffected.
Throwing an exception does not necessarily mean that your program is suddenly in an unsupported state, and therefore does not require terminating the entire program.
frumplestlatz
2 hours ago
> Throwing an exception does not necessarily mean that your program is suddenly in an unsupported state
When everyone uses runtime exceptions and doesn't account for exception handling in every possible code path, that's exactly what it means.
dehrmann
5 hours ago
> a stack trace takes a lot of overhead to generate
Can't HotSpot skip generating the stack trace when it knows the exception will be caught and the stack trace ignored?
BlackFly
3 hours ago
> it's essentially a checked exception without a stack trace
In theory, theory and practice are the same. In practice...
You can't throw a checked exception in a stream, and this fact underlines the key difference between an exception and a Result: a Result is in return position, while exceptions are a sort of side effect with their own control flow. Because of that, once your method throws an Exception, or you are writing code in a try block that catches an exception, you become blind to further exceptions of that type, even if you might be able to (or be required to) fix those errors. Results are required to be handled individually, and you get syntactic sugar to easily back-propagate.
It is trivial to include a stack trace, but stack traces are really only useful for identifying where something occurred; what is generally superior is attaching context as you back-propagate, which happens trivially with judicious use of custom error types with From impls. Doing this means that the error message uniquely defines the origin and the paths it passed through, without intermediate unimportant stack noise. With exceptions, you would always need to catch each exception and rethrow a new exception containing the old one to add contextual information; then, to avoid catching too much, you need variables that will be initialized inside the try block to be defined outside of it. So stack traces are basically only useful when you are doing Pokemon exception handling.
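In Rust terms, that From-impl back-propagation looks something like this (hypothetical error type):

    #[derive(Debug)]
    enum LoadError {
        Parse(std::num::ParseIntError),
        TooManyFeatures(usize),
    }

    // The From impl is what lets `?` convert and propagate without a catch block.
    impl From<std::num::ParseIntError> for LoadError {
        fn from(e: std::num::ParseIntError) -> Self {
            LoadError::Parse(e)
        }
    }

    fn load_limit(raw: &str) -> Result<usize, LoadError> {
        let n: usize = raw.trim().parse()?; // ParseIntError -> LoadError, automatically
        if n > 200 {
            return Err(LoadError::TooManyFeatures(n));
        }
        Ok(n)
    }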
baq
3 hours ago
Checked exceptions failed because, when used properly, they fossilize method signatures. They're fine if your code will never be changed, and they're fine when you control 100% of the users of the throwing code. If you're distributing a library... no bueno.
frumplestlatz
2 hours ago
That’s just not true. They required that you use hierarchical exception types and define your own library exception type that you declare at the boundary.
The same is required for any principled error handling.
Ygg2
2 hours ago
> When Rust got popular, I was a bit confused by people talking about how great Result it's essentially a checked exception without a stack trace.
It's not a checked exception without a stack trace.
Rust doesn't have Java's checked or unchecked exception semantics at the moment. Panics are more like Java's Errors (e.g. OOM error). Results are just error codes on steroids.
speed_spread
9 hours ago
Pet peeve: unwrap() should be deprecated and renamed or_panic(). More consistent with the rest of stdlib methods and appropriately scarier.
wrs
7 hours ago
That's kind of what I'm saying with the blind spot comment. The words "unwrap" and "expect" should be just as much a scary red flag as the word "panic", but for some reason it seems a lot of people don't see them that way.
shadowmatter
6 hours ago
Even in lowly Java, they later added to Optional the orElseThrow() method since the name of the get() method did not connote the impact of unwrapping an empty Optional.
vbezhenar
an hour ago
I've found both methods very useful. I'm using `get()` when I've checked that the value is present and I don't expect any exceptions. I'm using `orElseThrow()` when I actually expect that the value can be absent and throwing is fine. Something like:

    if (userOpt.isPresent()) {
        var user = userOpt.get();
        var accountOpt = accountRepository.selectAccountOpt(user.getId());
        var account = accountOpt.orElseThrow();
    }

IDEA checks it by default and highlights if I've used `get()` without a previous check. It's not forced at compiler level, but it's good enough for me.
echelon
8 hours ago
A lot of stuff should be done about the awful unwrap family of methods.
A few ideas:
- It should not compile in production Rust code
- It should only be usable within unsafe blocks
- It should require explicit "safe" annotation from the engineer. Though this is subject to drift and become erroneous.
- It should be possible to ban the use of unsafe in dependencies and transitive dependencies within Cargo.
kibwen
8 hours ago
The `unsafe` keyword means something specific in Rust, and panicking isn't unsafe by Rust's definition. Sometimes avoiding partial functions just isn't feasible, and an unwrap (or whatever you want to call the method) is a way of providing a (runtime-checked) proof to the compiler that the function is actually total.
echelon
8 hours ago
Panics should be explicit, not implicit.
unwrap() should effectively work as a Result<> where the user must manually invoke a panic in the failure branch. Make special syntax if a match and panic is too much boilerplate.
This is like an implicit null pointer exception that cannot be statically guarded against.
I want a way to statically block any crates doing this from my dependency chain.
bigstrat2003
7 hours ago
unwrap is explicit.
speedgoose
4 hours ago
Not explicit enough, apparently.
wrs
7 hours ago
Not sure what you're saying with the "work as a Result<>" part...unwrap is a method on Result. I think you're just saying the unwrap/expect methods should be eliminated?
dev_l1x_be
4 hours ago
Then they are going to write None | Err => yolo(), which has the same impact. It is not the syntax or the semantics that are the problem here, but the fact that there is no monitoring around elevated error counts after a deployment.
Software engineers tend to get stuck in software problems, thinking that everything should be fixed in code. In reality there are many things outside of the code that you can do to operate unreliable components safely.
bombela
4 hours ago
This thread warms my heart. Rust has set a new baseline that many and myself now take for granted.
We are now discussing what can be done to improve code correctness beyond memory and thread safety. I am excited for what is to come.
StopDisinfo910
3 hours ago
Alternatively you can look at actually innovative programming languages to peek at the next 20 years of innovation.
I am not sure that watching the trendy forefront successfully reach the 1990s and discuss how unwrapping Option is potentially dangerous really warms my heart. I can’t wait for the complete meltdown when they discover effect systems in 2040.
To be more serious, this kind of incident is yet another reminder that software development remains miles away from proper engineering, and even key providers like Cloudflare utterly fail at proper risk management.
Celebrating because there is now one popular language using static analysis for memory safety feels to me like being happy we now teach people to swim before a transatlantic boat crossing while we refuse to actually install life boats.
To me the situation has barely changed. The industry has been refusing to put in place strong reliability practices for decades, keeps significantly under investing in tools mitigating errors outside of a few fields where safety was already taken seriously before software was a thing and keeps hiding behind the excuse that we need to move fast and safety is too complex and costly while regulation remains extremely lenient.
I mean, this Cloudflare outage probably cost millions of dollars of damage in aggregate between lost revenue and lost productivity. How much of that will they actually have to pay?
JuniperMesos
2 hours ago
Let's try to make effect systems happen quicker than that.
> I mean this Cloudfare outage probably cost millions of dollars of damage in aggregate between lost revenue and lost productivity. How much of that will they actually have to pay?
Probably nothing, because most paying customers of cloudflare are probably signing away their rights to sue Cloudflare for damages by being down for a while when they purchase Cloudflare's services (maybe some customers have SLAs with monetary values attached, I dunno). I honestly have a hard time suggesting that those customers are individually wrong to do so - Cloudflare isn't down that often, and whatever amount it cost any individual customer by being down today might be more than offset by the DDOS protection they're buying.
Anyway if you want Cloudflare regulated to prevent this, name the specific regulations you want to see. Should it be illegal under US law to use `unwrap` in Rust code? Should it be illegal for any single internet services company to have more than X number of customers? A lot of the internet also breaks when AWS goes down because many people like to use AWS, so maybe they should be included in this regulatory framework too.
StopDisinfo910
an hour ago
> I honestly have a hard time suggesting that those customers are individually wrong to do so - Cloudflare isn't down that often, and whatever amount it cost any individual customer by being down today might be more than offset by the DDOS protection they're buying.
We have collectively agreed to a world where software service providers have no incentive to be reliable, as they are shielded from the consequences of their mistakes, and somehow we see it as acceptable that software has a ton of issues and defects. The side effect is that research on actually lowering the cost of safety has little return on investment. It doesn't have to be so.
> Anyway if you want Cloudflare regulated to prevent this, name the specific regulations you want to see.
I want software providers to be liable for the damage they cause, and minimum quality regulation on par with an actual engineering discipline. I have always been astounded that nearly all software licences start with extremely broad limitation-of-liability provisions, and that people somehow feel fine with it. Try to extend that to any other product you regularly use in your life and see how that makes you feel.
How to do proper testing, formal methods, and resilient design has been known for decades. I would personally be more than okay with "let's move less fast and stop breaking things."
frumplestlatz
2 hours ago
I find myself in the odd position of agreeing with you both.
That we’re even having this discussion is a major step forward. That we’re still having this discussion is a depressing testament to how slowly the mainstream has adopted better ideas.
littlestymaar
6 hours ago
> In production code an unwrap or expect should be reviewed exactly like a panic.
An unwrap should never make it to production, IMHO. It's fine while prototyping, but once the project gets closer to production it's necessary to just grep for `unwrap` in your code, replace those that can happen with proper error handling, and replace those that cannot happen with `expect`, with a clear justification of why they cannot happen unless there's a bug somewhere else.
wrs
6 hours ago
I would say, sure, if you feel the same way about panic calls making it to production. In other words, review all of them the same way. Because writing unwrap/expect is exactly the same as writing "if error, panic".
anonnon
8 hours ago
> It seems people have a blind spot for unwrap
Not unlike people having a blind spot for Rust in general, no?
smj-edison
10 hours ago
Isn't the point of this article that pieces of infrastructure don't go down due to single root causes, but due to bad combinations of components that are individually correct? After reading "Engineering a Safer World", I find root cause analysis rather reductionist, because it wasn't just an unwrap: it was that the payload was larger than normal, because of a query that didn't select by database, because a ClickHouse change made more databases visible. Hard to say "it was just due to an unwrap" IMO, especially in terms of how to fix the issue going forward. I think the article lists a lot of good ideas that aren't just "don't unwrap", like enabling more global kill switches for features, or eliminating the ability for core dumps or other error reports to overwhelm system resources.
brianpan
4 hours ago
You're right. A good postmortem/root cause analysis would START from "unwrap" and continue from there.
You might start with a basic timeline of what happened, then you'd start exploring: why did this change affect so many customers (this would be a line of questioning to find a potential root cause), why did it take so long to discover or recover (this might be multiple lines of questioning), etc.
sphericalkat
6 minutes ago
Handling the error still would've returned a 5xx in this case, since the config file was still over the limit of features the service could handle.
jcalvinowens
9 hours ago
> This is the multi-million dollar .unwrap() story.
That's too semantic IMHO. The failure mode was "enforced invariant stopped being true". If they'd written explicit code to fail the request when that happened, the end result would have been exactly the same.
sznio
4 hours ago
>If they'd written explicit code to fail the request when that happened, the end result would have been exactly the same.
If the `.unwrap()` was replaced with `.expect("Feature config is too large!")` it would certainly make the outage shorter.
YetAnotherNick
3 hours ago
In general, for unexpected errors like these, the internal function should log the error, and I assume it either was logged or they could quickly deduce the reason based on the line number.
Ygg2
2 hours ago
> If the `.unwrap()` was replaced with `.expect("Feature config is too large!")` it would certainly make the outage shorter.
It wouldn't, not meaningfully. The outage was caused by a change in how they processed the queries. They had no way to observe the changes, nor canaries to see that the change was killing them. Plus, they would still need to manually feed and restart the services that had ingested bad configs.
`expect` would shave a few minutes; you would still spend hours figuring out and fixing it.
Granted, using expect is better, but it's not a silver bullet.
AdieuToLogic
8 hours ago
>> This is the multi-million dollar .unwrap() story.
> That's too semantic IMHO. The failure mode was "enforced invariant stopped being true". If they'd written explicit code to fail the request when that happened, the end result would have been exactly the same.
Problem is, the enclosing function (`fetch_features`) returns a `Result`, so the `unwrap` on line #82 only serves as a shortcut a developer took, assuming `features.append_with_names` would never fail. Instead, the routine likely should have worked within `Result`.
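A sketch of what "working within `Result`" could look like; the signatures are guesses reconstructed from the names quoted above, not Cloudflare's actual code:

    #[derive(Default)]
    struct Features {
        names: Vec<String>,
    }

    #[derive(Debug)]
    struct FeatureError(String);

    impl Features {
        fn append_with_names(&mut self, names: &[String]) -> Result<(), FeatureError> {
            if self.names.len() + names.len() > 200 {
                return Err(FeatureError("feature limit exceeded".into()));
            }
            self.names.extend_from_slice(names);
            Ok(())
        }
    }

    fn fetch_features(names: &[String]) -> Result<Features, FeatureError> {
        let mut features = Features::default();
        features.append_with_names(names)?; // propagate the Err; no unwrap, no panic
        Ok(features)
    }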
jcalvinowens
8 hours ago
> Instead, the routine likely should have worked within `Result`.
But it's a fatal error. It doesn't matter whether it's implicit or explicit, the result is the same.
Maybe you're saying "it's better to be explicit", as a broad generalization I don't disagree with that.
But that has nothing to do with the actual bug here, which was that the invariant failed. How they choose to implement checking and failing the invariant in the semantics of the chosen language is irrelevant.
fredoralive
4 hours ago
A failed config load probably shouldn't be a fatal error if a valid config is already loaded?
ergocoder
4 hours ago
Hard to say. Why would you load a new config if a valid config is already loaded?
Maybe the new config has an update. Who knows? Do we want to keep operating on the old config? Maybe, maybe not.
But operating on an old config when you don't want to is definitely worse.
navidmafi
8 hours ago
And dare I say, an exhibition of hindsight bias.
chrisweekly
8 hours ago
semantic? or pedantic?
twhitmore
an hour ago
Interesting to see Rust error handling flunk out in practice.
It may be that forcing handling at every call tends to make code verbose, and devs desensitized to bad practice. And the diagnostic Rust provided here seems pretty poor.
There is bad practice here too -- config failure manifesting as request failure, lack of failing to a safe state, unsafe rollout, lack of observability.
Back to language design & error handling. My informed view is that robustness is best when only major reliability boundaries need to be coded.
This the "throw, don't catch" principle with the addition of catches on key reliability boundaries -- typically high-level interactions where you can meaningfully answer a failure.
For example, this system could have a total of three catch clauses "Error Loading Config" which fails to safe, "Error Handling Request" which answers 5xx, and "Socket Error" which closes the HTTP connection.
Ciantic
22 minutes ago
> It may be that forcing handling at every call tends to makes code verbose
Rust has a lot of helpers to make it less verbose; even the error they demonstrate could've been written in some form as `...code()?`, with the `?` operator propagating the error forwards.
However, I do acknowledge that writing error types is sometimes boring, so people don't bother to change their error types and just unwrap. But even in my dinky little apps for personal use, I do a simple search for `unwrap` and make sure I have as few as possible.
slanterns
7 hours ago
> Today, many friends pinged me saying Cloudflare was down. As a core developer of the first generation of Cloudflare FL, I'd like to share some thoughts.
> This wasn't an attack, but a classic chain reaction triggered by “hidden assumptions + configuration chains” — permission changes exposed underlying tables, doubling the number of lines in the generated feature file. This exceeded FL2's memory preset, ultimately pushing the core proxy into panic.
> Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.
> Hats off to Cloudflare's engineers—those on the front lines putting out fires bear the brunt of such incidents.
> Technical details: Even handling the unwrap correctly, an OOM would still occur. The primary issue was the lack of contract validation in feature ingest. The configuration system requires “bad → reject, keep last-known-good” logic.
> Why did it persist so long? The global kill switch was inadequate, preventing rapid circuit-breaking. Early suspicion of an attack also caused delays.
> Why not roll back software versions or restart?
> Rollback isn't feasible because this isn't a code issue—it's a continuously propagating bad configuration. Without version control or a kill switch, restarting would only cause all nodes to load the bad config faster and accelerate crashes.
> Why not roll back the configuration?
> Configuration lacks versioning and functions more like a continuously updated feed. As long as the ClickHouse pipeline remains active, manually rolling back would result in new corrupted files being regenerated within minutes, overwriting any fixes.
anonymous908213
6 hours ago
This tweet thread invokes genuine despair in me. Do we really have to outsource even our tweets to LLMs? Really? I mean, I get spambots and the like tweeting mass-produced slop. But what compels a former engineer of the company in question to offer LLM-generated "insight" to the outage? Why? For what purpose?
* For clarity, I am aware that the original tweets are written in Chinese, and they still have the stench of LLM writing all over them; it's not just the translation provided in the above comment.
deadcore
3 hours ago
Out of interest... apart from the em dash, how else can you tell it's an LLM response? What are the telltale signs?
anonymous908213
an hour ago
This particular excerpt reeks of it in pretty much every line. I'll point out the patterns in the English translation, but all of these patterns apply cross-language.
> classic chain reaction triggered by “hidden assumptions + configuration chains”
"Classic/typical "x + y"", particularly when diagnosing an issue. This one is a really easy tell because humans, on aggregate, do not use quotation marks like this. There is absolutely no reason to quote these words here, and yet LLMs will do a combined quoted "x + y" where a human would simply write something natural like "hidden assumptions and configuration chains" without extraneous quotes.
> The configuration system requires “bad → reject, keep last-known-good” logic.
Another pattern with overeager usage of quotes is this ""x → y, z"" construct with very terse wording.
> This wasn't an attack, but a classic chain reaction
LLMs aggressively use "Not X, but Y". This is also a construct commonly used by humans, of course, but aside from often being paired with an em-dash, another tell is whether it actually contributes anything to the sentence. "Not X, but Y" is strongly contrasting and can add a dramatic flair to the thing being constrasted, but LLMs overuse it on things that really really don't need to be dramatised or contrasted.
> Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.
Two lists of three concepts back-to-back. LLMs enjoy, love, and adore this construct.
> Hats off to Cloudflare's engineers—those on the front lines putting out fires bear the brunt of such incidents.
This kind of completely vapid, feel-good word soup utilising a heroic analogy for something relatively mundane is another tell.
And more broadly speaking, there's a sort of verbosity and emptiness of actual meaning that permeates through most LLM writing. This reads absolutely nothing like what an engineer breaking down an outage looks like. Like, the aforementioned line of... "Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.". What is that actually communicating to you? It piles on technical lingo and high-level concepts in a way that is grammatically correct but contains no useful information for the reader.
Bad writing exists, of course. There's plenty of bad writing out there on the internet, and some of it will suffer from flaws like these even when written by a human, and some humans do like their em-dashes. But it's generally pretty obvious when the writing is taken on aggregate and you see recognisable pattern after pattern combined with em-dashes combined with shallowness of meaning combined with unnecessary overdramatisations.
wonnage
6 hours ago
It’s kinda funny that the “not …, but…” + em dash slop signifiers translate directly to Chinese “不是。。。而是。。。” and double-width em dash
vlovich123
10 hours ago
To be fair, this failed in the non-Rust path too, because the bot management returned that all traffic was a bot. But yes, FL2 needs to catch panics from individual components, though I'm not sure failing open is necessarily that much better (it was in this case, but the next incident could easily be the result of failing open).
But more generally you could catch the panic at the FL2 layer to make that decision intentional - missing logic at that layer IMHO.
hedora
10 hours ago
Catching panic probably isn’t a great idea if there’s any unsafe code in the system. (Do the unsafe blocks really maintain heap invariants across panics?)
vlovich123
9 hours ago
Unsafe blocks have nothing to do with it. Yes, they maintain all the same invariants as safe blocks, or those unsafe blocks are unsound regardless of panics. But there are millions of ways to architect this (e.g. a supervisor process that notices which layer in FL2 is crashing and just completely disables that layer when it starts up the proxy again). There are challenges here because then you have to figure out what constitutes perma-crashing (e.g. what if it's just 20% of all sites? Do you disable?). And in the general case you have the fail-open/fail-closed decision anyway, which you should just annotate individual layers with.
But the bigger change is to make sure that config changes roll out gradually instead of all at once. That's the source of 99% of all widespread outages.
Feathercrown
8 hours ago
Incremental config changes sound like they could lead to a LOT of bugs.
vlovich123
6 hours ago
Incremental in terms of 1% of the fleet using it, then 5%, etc. This is standard practice.
Another option is to make sure that config changes that fail to parse continue using the old config instead of resulting in an unusable service.
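A minimal sketch of that fallback, in Rust (the parser, types, and where the limit lives are illustrative, not Cloudflare's actual code; the 200-entry limit is the one from the postmortem):

  struct Config { features: Vec<String> }

  // Hypothetical parser that enforces the hard limit up front.
  fn parse_config(s: &str) -> Result<Config, String> {
      let features: Vec<String> = s.lines().map(str::to_owned).collect();
      if features.len() > 200 {
          return Err(format!("{} features exceeds the limit of 200", features.len()));
      }
      Ok(Config { features })
  }

  // On any failure, log loudly and keep serving with the old config.
  fn reload_config(path: &str, current: Config) -> Config {
      std::fs::read_to_string(path)
          .map_err(|e| e.to_string())
          .and_then(|s| parse_config(&s))
          .unwrap_or_else(|e| {
              eprintln!("rejecting new config, keeping old one: {e}");
              current
          })
  }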
kibwen
9 hours ago
I think the parent is implying that the panic should be "caught" via a supervisor process, Erlang-style, rather than implying the literal use of `catch_unwind` to resume within the same process.
vlovich123
6 hours ago
Supervisor is the brutalist way. But catch_unwind may be needed for perf and other reasons.
But ultimately it's not the panic that's the problem but a failure to specify how panics within FL2 layers should be handled; each layer is at least one team, and FL2's job is providing a safe playground where everyone can coexist regardless of the misbehavior of any single component.
But as always, such failures are emblematic of multiple things going wrong at once. You probably want to end up using both catch_unwind for the typical case and the supervisor for the case where there's a segfault in some unsafe code you call or native library you invoke.
I also mentioned the fundamental tension of whether you want to fail open or closed. Most layers should probably fail open. For some layers (eg auth) it's safer to fail closed.
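A rough sketch of that per-layer annotation, assuming each layer boils down to an allow/deny decision (this is not FL2's actual architecture; names are made up):

  use std::panic::{catch_unwind, AssertUnwindSafe};

  enum FailMode { Open, Closed }

  // Run one proxy layer; if it panics, apply that layer's declared
  // fail mode instead of taking down the whole proxy.
  fn run_layer(name: &str, mode: FailMode, layer: impl FnOnce() -> bool) -> bool {
      match catch_unwind(AssertUnwindSafe(layer)) {
          Ok(allow) => allow,
          Err(_) => {
              eprintln!("layer {name} panicked");
              match mode {
                  FailMode::Open => true,    // e.g. bot scoring: let traffic through
                  FailMode::Closed => false, // e.g. auth: safer to deny
              }
          }
      }
  }

Here run_layer("bot_management", FailMode::Open, || panic!("bad feature file")) would log and fail open instead of 5xxing every request.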
AgentME
10 hours ago
This is assuming that the process could have done anything sensible while it had the malformed feature file. It might be in this case that this was one configuration file of several and maybe the program could have been built to run with some defaults when it finds this specific configuration invalid, but in the general case, if a program expects a configuration file and can't do anything without it, panicking is a normal thing to do. There's no graceful handling (beyond a nice error message) a program like Nginx could do on a syntax error in its config.
The real issue is further up the chain where the malformed feature file got created and deployed without better checks.
aloha2436
9 hours ago
> panicking is a normal thing to do
I do not think that if the bot detection model inside your big web proxy has a configuration error it should panic and kill the entire proxy and take 20% of the internet with it. This is a system that should fail gracefully and it didn't.
> The real issue
Are there single "real issues" with systems this large? There are issues being created constantly (say, unwraps where there shouldn't be, assumptions about the consumers of the database schema) that only become apparent when they line up.
bobbylarrybobby
8 hours ago
I don't know too much about how the feature file distribution works but in the event of failure to read a new file, wouldn't logging the failure and sticking with the previous version of the file be preferable?
guiriduro
4 hours ago
That's exactly the point (ie just prior to distribution) where a simple sanity check should have been run, and the config replacement/update pipeline stopped on failure. When they introduced the memory-optimised feature loader with its 200-entry limit, it should have been a no-brainer to insert that sanity check into the config production pipeline.
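Something like this at the end of the pipeline would have done it (the 200-entry limit is from the postmortem; the function and names are hypothetical):

  const MAX_FEATURES: usize = 200;

  // Pre-distribution sanity check: refuse to publish a feature file
  // that the consumers are known to reject.
  fn check_feature_file(features: &[String]) -> Result<(), String> {
      if features.len() > MAX_FEATURES {
          return Err(format!(
              "feature file has {} entries, limit is {MAX_FEATURES}; aborting publish",
              features.len()
          ));
      }
      Ok(())
  }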
WD-42
9 hours ago
Yea, Rust is safe but it's not magic. However, Nginx doesn't panic on malformed config. It exits with a hopefully helpful error code and message. The question, then, is whether the Cloudflare code could have exited cleanly in a way that made recovery easier, instead of just straight panicking.
KronisLV
42 minutes ago
> However Nginx doesn’t panic on malformed config. It exits with hopefully a helpful error code and message.
The thing I dislike most about Nginx is that if you are using it as a reverse proxy for like 20 containers and one of them is down, the whole web server will refuse to start up:

  nginx: [emerg] host not found in upstream "my-app"

Obviously making the other 19 sites also unavailable just because one of them is caught in a crash loop isn't ideal. There is a workaround involving specifying variables, like so (non-Kubernetes example: a regular Nginx web server running in a container, talking to other containers over an internal network, like Docker Compose or Docker Swarm):

  location / {
      resolver 127.0.0.11 valid=30s;  # Docker DNS
      set $proxy_server my-app;
      proxy_pass http://$proxy_server:8080/;
      proxy_redirect default;
  }

Sadly, if you try to use that approach, then you just get:

  nginx: [emerg] "proxy_redirect default" cannot be used with "proxy_pass" directive with variables

And switching the redirect configuration away from the default makes some apps go into a redirect loop and fail to load: mostly legacy ones, where Firefox shows something along the lines of "The page isn't redirecting properly". It sucks especially badly if you can't change the software that you just need to run, and suddenly your whole Nginx setup is brittle. Apache2 and Caddy don't have such an issue.

That's to say that all software out there has some really annoying failure modes, even if Nginx is pretty cool otherwise.
tempay
7 hours ago
Wouldn't exiting with a message meet that criterion of a more helpful error message? From the postmortem it seems to me like they just didn't know it was even panicking.
kondro
9 hours ago
One feature failing like this should probably log the error and fail closed. It shouldn't take down everything else in your big proxy that sits in front of your entire business.
JeremyNT
9 hours ago
Exactly! Sometimes exploding is simply the least bad option, and is an entirely sensible approach.
diath
5 hours ago
Falling back to a generic base configuration in the presence of an incoming invalid config file would probably be a sensible thing to do.
NoboruWataya
9 minutes ago
They should link this article in the docs for `unwrap()`.
andy_ppp
an hour ago
This is why the Erlang/Elixir methodology of having supervision and letting things crash gracefully is so useful. You can either handle every single error gracefully or handle crashing gracefully - it's much easier and more realistic in large codebases to do the latter.
tuetuopay
16 minutes ago
This would not have helped: the code would crash before doing anything useful at all.
If anything, the "crash early" mentality may even be harmful: instead of handling the error and keeping the old config, you would spin on trying to load a broken config on startup.
ironman1478
9 hours ago
I'm not a fan of rust, but I don't think that is the only takeaway. All systems have assumptions about their input and if the assumption is violated, it has to be caught somewhere. It seems like it was caught too deep in the system.
Maybe the validation code should've handled the larger size, but also the db query produced something invalid. That shouldn't have ever happened in the first place.
asa400
6 hours ago
> It seems like it was caught too deep in the system.
Agreed, that's also my takeaway.
I don't see the problem being "lazy programmers shouldn't have called .unwrap()". That's reductive. This is a complex system and complex system failures aren't monocausal.
The function in question could have returned a smarter error rather than panicking, but what then? An invariant was violated, and maybe this system, at this layer, isn't equipped to take any reasonable action in response to that invariant violation and dying _is_ the correct thing to do.
But maybe it could take smarter action. Maybe it could be restarted into a known good state. Maybe this service could be supervised by another system that would have propagated its failure back to the source of the problem, alerting operators that a file was being generated in such a way that violated consumer invariants. Basically, I'm describing a more Erlang model of failure.
Regardless, a system like this should be able to tolerate (or at least correctly propagate) a panic in response to an invariant violation.
9rx
2 hours ago
The takeaway here isn't about Rust itself, but that the Rust marketing crew's claims, which we constantly read on HN and elsewhere, that the Result type magically saves you from making mistakes are not a good message to send.
tuetuopay
4 minutes ago
They would also tell you that .unwrap() has no place in production code, and should receive as much scrutiny as an `unsafe` block in code review :)
The point of Option is that the crash path is more verbose and explicit than the crash-free path. It takes more code to check for NULL in C or nil in Go; it takes more code in Rust to not check for Err.
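For example (a toy function, not Cloudflare's code):

  // The crash-free path is one character; the crash path is nine
  // and stands out in review.
  fn parse_port(s: &str) -> Result<u16, std::num::ParseIntError> {
      Ok(s.parse::<u16>()?)       // propagate the error with `?`
  }

  fn parse_port_or_die(s: &str) -> u16 {
      s.parse::<u16>().unwrap()   // explicit decision to crash on bad input
  }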
j-krieger
2 hours ago
While this is true, I wish that Rust had more of a first-class support for `no_panic`. Every solution we do have is hacky. I wish that I could guarantee that there were no panic calls anywhere in a code path.
gwd
2 hours ago
> This is the multi-million dollar .unwrap() story.
While there are certainly many things to admire about Rust, this is why I prefer Golang's "noisy" error handling. In Go that would be either:

  feature_values, err := features.append_with_names(...)

and the compiler would have complained that this value of `err` was unused; or you'd write:

  feature_values, _ := features.append_with_names(...)

and it would be far more obvious that an error is being ignored.
(Renaming `unwrap` to `unwrapOrPanic` would probably help too.)
jbreckmckye
42 minutes ago
But I could screw it up in Go, if I made the same assumptions
fvs, err := features.AppendWithNames(..)
if err != nil {
// this will NEVER break
panic(err)
}
Ultimately I don't think language design can be the sole line of defence against system failures; it can only guide developers to think about error cases.
mamp
2 hours ago
I haven't been writing Rust for that long (about 2 years) but every time I see .unwrap() I read it as 'panic in production'. Clippy needs to have harder checks on unwrap.
butvacuum
9 hours ago
To me, it rang more as "A/B deployments are pointless if you can't tell whether a downstream failure is related."
ChrisMarshallNY
10 hours ago
Swift has implicitly unwrapped optionals (!), and regular optionals (?) that you unwrap explicitly.
I don't like to use implicit unwrap. Even things that are guaranteed to be there, I treat as explicit (For example, (self.view?.isEnabled ?? false), in a view controller, instead of self.view.isEnabled).
I always redefine @IBOutlets from:

  @IBOutlet weak var someView: UIView!

to:

  @IBOutlet weak var someView: UIView?

I'm kind of a "belt & suspenders" type of guy.
monocularvision
7 hours ago
So what happens if it ends up being nil? How does your app react?
In this particular case, I would rather crash. It’s easier to spot in a crash report and you get a nice stack trace.
Silent failure is ultimately terrible for users.
Note: for the things I control I try to very explicitly model state in such a way as I never need to force unwrap at all. But for things beyond my control like this situation, I would rather end the program than continue with a state of the world I don’t understand.
Pulcinella
7 hours ago
Yeah, @IBOutlets are generally the one thing that are allowed to be implicitly-unwrapped optionals. They go along with using storyboards & xib files with Interface Builder. I agree that you really should just crash if you are attempting to access one and it is nil. Either you have done something completely incorrect with regards to initializing and accessing parts of your UI and want to catch that in development, or something has gone horribly, horribly wrong with UIKit/AppKit and storyboard/xib files are not being loaded properly by the system.
ChrisMarshallNY
4 hours ago
> … you really should just crash if …
See my above/below comment.
A good tool for catching stuff during development is the humble assert()[0]. We can use precondition()[1] to do the same thing in ship code.
The main thing is to remain in control, as much as possible. Rather than let the PC leave the stack frame, throw the error immediately when it happens.
[0] https://docs.swift.org/swift-book/documentation/the-swift-pr...
[1] https://docs.swift.org/swift-book/documentation/the-swift-pr...
ChrisMarshallNY
5 hours ago
> Silent failure is ultimately terrible for users.
Agreed.
Unfortunately, crashes in iOS are “silent failures,” and are a loss of control.
What this practice does, is give me the option to handle the failure “noisily,” and in a controlled manner; even if just emitting a log entry, before calling a system failure. That can be quite helpful, in threading. Also, it gives me the option to have a valid value applied, if there’s a structural failure.
But the main reason that I do that with @IBOutlets, is that it forces me to acknowledge, throughout the rest of the code, that it’s an optional. I could always treat implicit optionals as if they were explicit, anyway. This just forces me to.
I have a bunch of practices that folks can laugh at, but my stuff works pretty effectively, and I sleep well.
ozgrakkurt
5 hours ago
Not panicking code is tedious to write. It is not realistic to expect everything to be non panic. There is a reason that panicking exists in the first place.
Them calling unwrap on a limit check is the real issue imo. Everything that takes in external input should assume it is bad input and should be fuzz tested imo.
In the end, what is the point of having a limit check if you are just unwrapping on it?
cube00
3 hours ago
> Not panicking code is tedious to write.
Using the question mark operator [1] and even adding in some anyhow::Context goes a long way towards being able to fail fast and return an Err rather than panicking.
Sure you need to handle Results all the way up the stack but it forces you to think about how those nested parts of your app will fail as you travel back up the stack.
[1]: https://doc.rust-lang.org/rust-by-example/std/result/questio...
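A small sketch of that combination (using the anyhow crate; the file name is hypothetical):

  use anyhow::{Context, Result};

  fn load_features() -> Result<Vec<String>> {
      let raw = std::fs::read_to_string("features.conf")
          .context("reading feature file")?;   // `?` returns an Err with context; no panic
      Ok(raw.lines().map(str::to_owned).collect())
  }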
branko_d
4 hours ago
Safe things should be easy, dangerous things should be hard.
This .unwrap() sounds too easy for what it does, certainly much easier than having an entire try..catch block with an explicit panic. Full disclosure: I don't actually know Rust.
kettlecorn
4 hours ago
I don't think 'unwrap' is inherently the problem.
Any project has to reason about what sort of errors can be tolerated gracefully and which cannot. Unwrap is reasonable in scenarios you expect to never be reached, because otherwise your code will be full of all sorts of possible permutations and paths that are harder to reason about and may cascade into extremely nuanced or subtle errors.
Rust also has a version of unwrap called "expect" where you provide a string that logs why the unwrap occurred. It's similar, but for pieces of code that are crucial it could be a good idea to require all 'unwraps' to instead be 'expects' so that people at least are forced to write down a reason why they believe the unwrap can never be reached.
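For instance (the path and message here are invented):

  fn main() {
      // Crashes exactly like unwrap(), but the reviewer and the crash
      // log both get the author's reasoning:
      let config = std::fs::read_to_string("/etc/app/base.conf")
          .expect("base config ships with the binary; a missing file means a broken install");
      println!("{} bytes of config", config.len());
  }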
rafaelmn
7 hours ago
That's such a bad take after reading the article. If you're going to write a system that preallocates and is based on hard assumptions about max size - the panic/unwrap approach is reasonable.
The config bug reaching prod without this being caught and pinpointed immediately is the strange part.
kevin_thibedeau
7 hours ago
It's reasonable when testing protocols exercise the panic scenario. This is the problem with punting on error recovery. Nobody checks faults that propagate across domains of responsibility.
thatoneengineer
6 hours ago
I agree there's no way to soft-error this, though "truncate and raise an alert" is arguably the better pattern.
AtNightWeCode
6 hours ago
Exactly. The newbie mistake in SQL is also way worse than this. But the whole design is also bad - clearly things are implemented in the wrong place.
And it took, like, over an hour between when the problem started and when my sites went down. That is just crazy.
cvhc
11 hours ago
Some languages and style guides simply forbid throwing exceptions without catching / proper recovery. Google C++ bans exceptions, and the main mechanism for propagating errors is `absl::Status`, which the caller has to check. Not familiar with Rust, but it seems unwrap is such a thing that would be banned.
kibwen
9 hours ago
> Not familiar with Rust but it seems unwrap is such a thing that would be banned.
Panics aren't exceptions, any "panic" in Rust can be thought of as an abort of the process (Rust binaries have the explicit option to implement panics as aborts). Companies like Dropbox do exactly this in their similar Rust-based systems, so it wouldn't surprise me if Cloudflare does the same.
"Banning exceptions" wouldn't have done anything here, what you're looking for is "banning partial functions" (in the Haskell sense).
cvhc
7 hours ago
Yeah I know but isn't unwrap basically a trivial way to: (1) give up catching the exception/error (the E part in `Result<T, E>`) that the callee throws; and also (2) escalate it to the point that nothing can catch it. It has such a benign name.
gpm
10 hours ago
Unwrap is used in places where in C++ you would just have undefined behavior. It wouldn't make any more sense to blanket ban it than it would to ban ever dereferencing a pointer just in case it's null - even if you just checked that it wasn't null.
cherryteastain
10 hours ago
Rust's Result is the same thing as C++'s std::expected. How is calling std::expected::value undefined behaviour?
gpm
10 hours ago
Rust's foo: Option<&T> is rust's rough equivalent to C++'s const T* foo. The C++ *foo is equivalent to the rust unsafe{ *foo.unwrap_unchecked() }, or in safe code *foo.unwrap() (which changes the undefined behavior to a panic).
Rust's unwrap isn't the same as std::expected::value. The former panics - i.e. either aborts the program or unwinds depending on context and is generally not meant to be handled. The latter just throws an exception that is generally expected to be handled. Panics and exceptions use similar machinery (at least they can depending on compiler options) but they are not equivalent - for example nested panics in destructors always abort the program.
In code that isn't meant to crash, `unwrap` should be treated as a sign saying "I'm promising that this will never happen", but just like in C++, where you promise that pointers you dereference are valid and signed integers you add don't overflow, making promises like that is a necessary part of productive programming.
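Spelled out (a toy illustration of the equivalence above):

  fn main() {
      let foo: Option<&i32> = Some(&42);
      let safe = *foo.unwrap();                      // None here => a defined panic
      let fast = unsafe { *foo.unwrap_unchecked() }; // None here => UB, like C++'s *p on null
      assert_eq!(safe, fast);
  }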
9029
9 hours ago
Tangential but funnily enough calling std::expected::error is ub if there is no error :D
pdimitar
10 hours ago
There are even lints for this but people get impatient and just override them or fight for them to no longer be the default.
As usual: a people problem, not a tech problem. In recent years a lot of strides have been made. But people will be people.
lenkite
6 hours ago
Linting is not good enough. The compiler should refuse to compile the code unless it's marked with an explicit annotation. Too much Rust code is panic-happy, since casual use of `unwrap` is perma-etched into everyone's minds by the amount of demo code out there using it.
pdimitar
5 hours ago
I completely agree. But IMO Rust would have not gained traction if it was as strict. It would be branded as an academic toy language.
But now after we are past that and it has a lot of mind share, I'd say it's time to start tightening the bolts.
drexlspivey
3 hours ago
#![deny(clippy::unwrap_used)] at the top of the file would take care of it.
tonyhart7
10 hours ago
and people make mistakes
at some point machines will be better at coding because, well, code is a machine-instruction task
same as chess: the engine is better than the human grandmaster because it's a solvable math field
coding is no different
aw1621107
8 hours ago
> same like chess, engine is better than human grandmaster because its solvable math field
Might be worth noting that your description of chess is slightly incorrect. Chess technically isn't solved in the sense that the optimal move is known for any arbitrary position; it's just that chess engines are using what amounts to a fancy brute force for most of the game, and the combination of hardware and search algorithm produces a better result than the human brain does. As such, chess engines are still capable of making mistakes, even if actually exploiting them is a challenge.
tonyhart7
7 hours ago
No?????? Because there are these things called BEST MOVE and BAD MOVE in chess.
"chess engines are still capable of making mistakes" - I'm sorry, no.
Inaccurate, yes, but not a mistake.
aw1621107
6 hours ago
> because these thing called BEST MOVE and BAD MOVE there in chess
The thing is that there is no known general objective criteria for "best" and "bad" moves. The best we have so far is based on engine evaluations, but as I said before that is because chess engines are better at searching the board's state space than humans, not because chess engines have solved chess in the mathematical sense. Engines are quite capable of misevaluating positions, as demonstrated quite well by the Top Chess Engine Championship [0] where one engine thinks it made a good move while the other thinks that move is bad, and this is especially the case when resources are limited.
The closest we are to solving chess are via tablebases, which are far from covering the entire state space and are basically as much of an exemplar of pure brute force as you can get.
> "chess engines are still capable of making mistakes", I'm sorry no
If you think chess engines are infallible, then why does the Top Chess Engine Championship exist? Surely if chess engines could not make mistakes they would always agree on a position's evaluation and what move should be made, and therefore such an exercise would be pointless?
> inaccurate yes but not mistake
From the perspective to attaining perfect play an inaccuracy is a mistake.
[0]: https://en.wikipedia.org/wiki/Top_Chess_Engine_Championship
tonyhart7
5 hours ago
"The thing is that there is no known general objective criteria for "best" and "bad" moves."
are you playing chess or not?????? If you play chess then it's obvious how to differentiate a bad move from the best move.
Yes, it is objective; these things are called best moves not without reason.
"If you think chess engines are infallible, then why does the Top Chess Engine Championship exist?"
To create better chess engines - like, what are we even talking about here???? Are you saying that just because there are older, bad engines this whole thing is pointless????
If you play chess up to a decent level, 1700+ (like me), you know that this argument is wrong, and I'd urge you to learn chess to a decent level.
Up until the point that you know high-level chess is a brute-force game and therefore solvable math.
eertami
41 minutes ago
> if you playing chess up to a decent level 1700+ (like me), you know that these argument its wrong and I assure you to learn chess to a decent level
In a fascinating coincidence, there is a tonyhart7 on both chess.com and lichess, and they have been banned for cheating on both websites.
aw1621107
5 hours ago
> if you playing chess then its oblivious how to differentiate bad move and best move
The key words in what I said are "general" and "objective". Yes, it's possible to determine "good" or "bad" moves in specific positions. There's no known method to determine "good" or "bad" moves in arbitrary positions, as would be required for chess to be considered strongly solved.
Furthermore, if it's "obvious" how to differentiate good and bad moves then we should never see engines blundering, right?
So (for example) how do you explain this game between Stockfish and Leela where Stockfish blunders a seemingly winning position [0]? After 37... Rdd8 both Stockfish and Leela think white is clearly winning (Stockfish's evaluation is +4.00, while Leela's evaluation is +3.81), but after 38. Nxb5 Leela's evaluation plummets to +0.34 while Stockfish's evaluation remains at +4.00. In the end, it turns out Leela was correct after 40... Rxc6 Stockfish's evaluation also drops from +4.28 to 0.00 as it realizes that Leela has a forced stalemate.
Or this game also between Stockfish and Leela where Leela blunders into a forced mating sequence and doesn't even realize it for a few moves [1]?
Engines will presumably always play what they think is the "best" move, but clearly sometimes this "best" move is wrong. Evidently, this means differentiating "good" and "bad" moves is not always obvious.
> Yes it is objective, these thing called best move not without reason
If it's objective, then why is it possible for engines to disagree on whether a move is good or bad, as they do in the above example and others?
> to create better chess engine like what do even talking about here????
The ability to create better chess engines necessarily implies that chess engines can and do make mistakes, contrary to what you asserted.
> are you saying just because there are older bad engine that mean this thing is pointless ????
No. What I'm saying is that your explanation for why chess engines are better than humans is wrong. Chess engines are not better than humans because they have solved chess in the mathematical sense; chess engines are better than humans because they search the state space faster and more efficiently than humans (at least until you reach 7 pieces on the board).
> up until that point that you know high level chess is brute force games and therefore solvable math
"Solvable" and "solved" are two very different things. Chess is solvable, in theory. Chess is very far from being solved.
[0]: https://www.chess.com/computer-chess-championship#event=309&...
[1]: https://www.chess.com/computer-chess-championship#event=309&...
ajross
10 hours ago
I'm not completely sure I agree. I mean, I do agree about the .unwrap() culture being a bug trap. But I don't think this example qualifies.
The root cause here was that a file was mildly corrupt (with duplicate entries, I guess). And there was a validation check elsewhere that said "THIS FILE IS TOO BIG".
But if that's a validation failure, well, failing is correct? What wasn't correct was that the failure reached production. What should have happened is that the validation should have been a unified thing and whatever generated the file should have flagged it before it entered production.
And that's not an issue with function return value API management. The software that should have bailed was somewhere else entirely, and even there an unwrap explosion (in a smoke test or pre-release pass or whatever) would have been fine.
crote
9 hours ago
It sounds to me like there was validation, but the system wasn't designed for the validation to ever fail - at which point crashing is the only remaining option. You've essentially turned it into an assertion error rather than a parsing/validation error.
Ideally every validation should have a well-defined failure path. In the case of a config file rotation, validation failure of the new config could mean keeping the old config and logging a high-priority error message. In the case of malformed user-provided data, it might mean dropping the request and maybe logging it for security analysis reasons. In the case of "pi suddenly equals 4" checks, the most logical approach might be to intentionally crash, as there's obviously something seriously wrong and application state has been corrupted in such a way that any attempt to continue is only going to make things worse.
But in all cases there's a reason behind the post-validation-failure behavior. At a certain point leaving it up to "whatever happens on .unwrap() failure" isn't good enough anymore.
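As a sketch of that idea (the enum and names are hypothetical):

  // Each validation site declares what failure means, instead of
  // leaving it to whatever .unwrap() happens to do.
  enum FailurePolicy {
      KeepOldConfig, // config rotation: log loudly, keep serving
      DropRequest,   // malformed user data: reject, maybe log for security
      Crash,         // corrupted state: continuing would make things worse
  }

  fn handle_invalid(policy: FailurePolicy, err: String) {
      match policy {
          FailurePolicy::KeepOldConfig => eprintln!("bad new config, keeping old: {err}"),
          FailurePolicy::DropRequest => eprintln!("dropping malformed request: {err}"),
          FailurePolicy::Crash => panic!("invariant violated: {err}"),
      }
  }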
karel-3d
4 hours ago
As a gopher, I never understand why there are so many unwraps in average Rust code.
Average Go code has much less panics than Rust has unwraps, which are functionally equivalent.
j-krieger
2 hours ago
I love Go and write a ton of it. I've had real segfaults quite a lot.
speedgoose
4 hours ago
The average golang code segfaults by design.
richardwhiuk
4 hours ago
Because Go silently gives you zero/null instead
karel-3d
3 hours ago
Idiomatically, it gives you `err` and you do `if err != nil {return err}`. While in rust you mostly do `.unwrap` and panic.
It's not in the type system, but it's idiomatic
ergocoder
4 hours ago
which means unexpected behavior could go unnoticed for a long time.
I'd prefer a loud crash over that.
karel-3d
3 hours ago
Well look at the failure modes in the original article.
In the original PHP code, all worked, only it didn't properly check for bots.
The new Rust code did a loud crash and took off half of the internet.
throwaway38294
5 hours ago
This is a bummer. The unwrap()'ing function already returned a result and should have just propagated the error. Presumably the caller could have handled more sensibly than just panic'ing.
pjmlp
5 hours ago
Which is something I will bookmark for the usual "Rust doesn't do exceptions" discussions; except it kind of does, even if it's called something different.
antonvs
8 hours ago
> This is textbook "parse, don't validate" anti-pattern.
How so? “Parse, don’t validate” implies converting input into typed values that prevent representation of invalid state. But the parsing still needs to be done correctly. An unchecked unwrap really has nothing to do with this.
kccqzy
6 hours ago
GP completely misunderstands “parse, don’t validate” and also calls it an anti-pattern. GP clearly has no idea what this is.
nrhrjrjrjtntbt
9 hours ago
I wonder what would have happened if they had handled it gracefully? Sounds like performance degradation (better than reliability degradation!).
Also, with a sharded system, I wonder why they are not rolling changes out slowly and monitoring.
shadowgovt
10 hours ago
In addition, it looks like this system wasn't on any kind of 1%/10%/50%/100% rollout gating. Such a rollout would trivially have shown the poison input killing tasks.
penteract
9 hours ago
To me it reads like there was a gradual rollout of the faulty software responsible for generating the config files, but those files are generated on approximately one machine, then propagated across the whole network every 5 minutes.
> Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.
helloericsf
9 hours ago
Not a DBA, how do you do DB permission rollout gating?
shadowgovt
6 hours ago
It looks like changing the permissions triggered creation of a new feature file, and it was ingestion of that file leading to blowing a size limit that crashed the systems.
The file should be versioned and rollout of new versions should be staged.
(There is definitely a trade-off; often times in the security critical path, you want to go as fast as possible because changes may be blocking a malicious actor. But if you move too fast, you break things. Here, they had a potential poison input in the pathway for synchronizing this state and Murphy's Law suggests it was going to break eventually, so the question becomes "How much damage can we tolerate when it does?")
dwattttt
4 hours ago
> It looks like changing the permissions triggered creation of a new feature file, and it was ingestion of that file leading to blowing a size limit that crashed the systems.
That feature file is generated every 5 minutes at all times; the change to permissions was rolled out gradually over the clickhouse cluster, and whether a bad version of that file was generated depended on whether the part of the cluster that had the bad permissions generated the file.
__bax
2 hours ago
git blame on .unwrap() line
arccy
11 hours ago
if you make it easy to be lazy and panic vs properly handling the error, you've designed a poor language
nine_k
10 hours ago
At Facebook they name certain "escape hatch" functions in a way that inescapably makes them look like a GIANT EYESORE. Stuff like DANGEROUSLY_CAST_THIS_TO_THAT, or INVOKE_SUPER_EXPENSIVE_ACTION_SEE_YOU_ON_CODE_REVIEW. This really drives home the point that such things must not be used except in rare extraordinary cases.
If unwrap() were named UNWRAP_OR_PANIC(), it would be used much less glibly. Even more, I wish there existed a super strict mode where all places that can panic are treated as compile-time errors, except those specifically wrapped in some may_panic_intentionally!() or similar.
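In Rust that idea could look like this hypothetical extension trait (not a real crate):

  trait Eyesore<T> {
      #[allow(non_snake_case)]
      fn UNWRAP_OR_PANIC(self) -> T;
  }

  impl<T, E: std::fmt::Debug> Eyesore<T> for Result<T, E> {
      #[allow(non_snake_case)]
      fn UNWRAP_OR_PANIC(self) -> T {
          self.expect("deliberate crash point; see code review")
      }
  }

A call like "7".parse::<u8>().UNWRAP_OR_PANIC() reads very differently from .unwrap() in a diff.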
adzm
10 hours ago
> make them look like a GIANT EYESORE
React.__SECRET_INTERNALS_DO_NOT_USE_OR_YOU_WILL_BE_FIRED comes to mind. I did have to reach for this before, but the name certainly works for keeping it out of example code, and it makes the danger very apparent when reading other implementations.
At some point it was renamed to __CLIENT_INTERNALS_DO_NOT_USE_OR_WARN_USERS_THEY_CANNOT_UPGRADE which is much less fun.
porridgeraisin
5 hours ago
Nathanba
10 hours ago
right and if the language designers named it UNWRAP_OR_PANIC() then people would rightfully be asking why on earth we can't just use a try-catch around code and have an easier life
nine_k
10 hours ago
But a panic can be caught and handled safely (e.g. via std::panic tools). I'd say that this is the correct use case for exceptions (ask Martin Fowler, of all people).
There is already a try/catch around that code, which produces the Result type, which you can presumptuously .unwrap() without checking if it contains an error.
Instead, one should use the question mark operator, that immediately returns the error from the current function if a Result is an error. This is exactly similar to rethrowing an exception, but only requires typing one character, the "?".
yoyohello13
10 hours ago
Probably not, since errors as values are way better than exceptions.
nomel
10 hours ago
How so? An exception is a value that's delivered to the closest conceptually appropriate point that was chosen to handle it, allowing you to keep your "happy path" as clean code, and your "exceptional circumstances" path at the level of abstraction that makes sense.
It's way less book-keeping with exceptions, since you intentionally don't have to write code for that exceptional behavior except where it makes sense to. The return-by-value method necessarily implements the same behavior, where handling is bubbled up to the conceptually appropriate place through returns, but with much more typing involved. Care is required for either, since failing to properly bubble up an error can happen in both cases (no re-raise for exceptions, no return after handling for return values).
yoyohello13
10 hours ago
There are many, many pages of text discussing this topic, but having programmed in both styles, exceptions make it too easy for programmers to simply ignore them. Errors as values force you to explicitly handle them there, or toss them up the stack. Maybe some other languages have better exception handling, but in Python it's god awful. In big projects you can basically never know when or how something can fail.
nomel
10 hours ago
I would claim the opposite. If you don't catch an exception, you'll get a halt.
With return values, you can trivially ignore an error:

  let _ = fs::remove_file("file_doesn't_exist");

or

  value, error = some_function()
  // carry on without doing anything with error

In the wild, I've seen far more ignoring of returned errors, because of the mechanical burden of having type handling at every function call. This is backed by decades of writing libraries. I've tried to implement libraries without exceptions, which was admittedly my cargo-cult preference long ago, but ignoring errors was so prevalent among the users of all the libraries that I now always include a "raise"-type boolean that defaults to True for any function that returns an error value, to force exceptions, and their handling, as the default behavior.
> In big projects you can basically never know when or how something can fail.
How is this fundamentally different from return values? Looking at a high-level function, you can't know how it will fail, you just know it did fail, from the error being bubbled up through the returns. The only difference is the mechanism for bubbling up the error.
Maybe some water is required for this flame war. ;)
yoyohello13
8 hours ago
I can agree to disagree :)
pyrolistical
10 hours ago
Exceptions are hidden control flow, whereas error values are not.
That is the main reason why zig doesn’t have exceptions.
nomel
10 hours ago
I'd categorize them more as "event handlers" than "hidden". You can't know where the execution will go at a lower level, but that's the entire point: you don't care. You put the handlers at the points where you care.
bigstrat2003
8 hours ago
Correction: unchecked exceptions are hidden control flow. Checked exceptions are quite visible, and I think that more languages should use them as a result.
sfink
10 hours ago
...and you can? try-catch is usually less ergonomic than the various ways you can inspect a Result.
try {
data = some_sketchy_function();
} catch (e) {
handle the error;
}
vs

result = some_sketchy_function();
if let Err(e) = result {
handle the error;
}
Or better yet, compare the problematic cases where the error isn't handled:

data = some_sketchy_function();

vs

data = some_sketchy_function().UNWRAP_OR_PANIC();

In the former (the try-catch version that doesn't try or catch), the lack of handling is silent. It might be fine! You might just depend on your caller using `try`. In the latter, the compiler forces you to use UNWRAP_OR_PANIC (or, in reality, just unwrap) or `data` won't be the expected type and you will quickly get a compile failure.
What I suspect you mean, because it's a better argument, is:
try {
sketchy_function1();
sketchy_function2();
sketchy_function3();
sketchy_function4();
} catch (e) {
...
}
which is fair, although how often is it really the right thing to let all the errors from 4 independent sources flow together and then get picked apart after the fact by inspecting `e`? It's an easier life, but it's also one where subtle problems constantly creep in without the compiler having any visibility into them at all.
SchwKatze
11 hours ago
Unwrap isn't a synonym for laziness; it's just like an assertion. When you do unwrap() you're saying the Result should NEVER fail, and if it does, it should abort the whole process. What was wrong was the developer assumption, not the use of unwrap.
SchemaLoad
10 hours ago
It also makes it very obvious in the code, something very dangerous is happening here. As a code reviewer you should see an unwrap() and have alarm bells going off. While in other languages, critical errors are a lot more hidden.
dietr1ch
11 hours ago
> What was wrong was the developer assumption, not the use of unwrap.
How many times can you truly prove that an `unwrap()` is correct and that you also need that performance edge?
Ignoring the performance aspect, which is often pulled out of a hat, to prove such a thing you need to be wary of the inner workings of the call giving you a `Result`. That knowledge is only valid at the time of writing your `unwrap()`, but won't necessarily hold later.
Also, aren't you implicitly forcing whoever changes the function to check for every smartass dev that decided to `unwrap` at their callsite? That's bonkers.
JuniperMesos
10 hours ago
I doubt that this unwrap was added for performance reasons; I suspect it was rather added because the developer temporarily didn't want to deal with what they thought was an unlikely error case while they were working on something else; and no other system recognized that the unwrap was left in and flagged it before it was deployed on production servers.
If I were Cloudflare I would immediately audit the codebase for all uses of unwrap (or similar rust panic idioms like expect), ensure that they are either removed or clearly documented as to why it's worth crashing the program there, and then add a linter to their CI system that will fire if anyone tries to check in a new commit with unwrap in it.
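Clippy ships lints for exactly this; one way to wire it in is a crate-wide deny plus reviewable per-site opt-outs (the function below is a made-up example):

  #![deny(clippy::unwrap_used, clippy::expect_used)]

  #[allow(clippy::unwrap_used)] // justified: "8080" is a compile-time constant
  fn default_port() -> u16 {
      "8080".parse().unwrap()
  }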
duped
9 hours ago
Panics are for unexpected error conditions, like your caller passed you garbage. Results are for expected errors, like your caller passed you something but it's your job to tell if it's garbage.
So the point of unwrap() is not to prove anything. Like an assertion it indicates a precondition of the function that the implementer cannot uphold. That's not to say unwrap() can't be used incorrectly. Just that it's a valid thing to do in your code.
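Sketched out (a toy example of that split):

  // The caller's garbage is an expected error: return a Result.
  fn parse_percentage(input: &str) -> Result<u8, String> {
      let n: u8 = input.parse().map_err(|e| format!("bad percentage: {e}"))?;
      if n > 100 {
          return Err(format!("{n} is out of range"));
      }
      Ok(n)
  }

  // Our own broken invariant is unexpected: assert and panic.
  fn apply_percentage(total: u32, pct: u8) -> u32 {
      assert!(pct <= 100, "precondition violated: pct must be 0..=100");
      total * u32::from(pct) / 100
  }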
Note that none of this is about performance.
Rohansi
10 hours ago
> when you do unwrap() you're saying the Result should NEVER fail
Returning a Result by definition means the method can fail.
Dylan16807
10 hours ago
> Returning a Result by definition means the method can fail.
No more than returning an int by definition means the method can return -2.
bigstrat2003
8 hours ago
> No more than returning an int by definition means the method can return -2.
What? Returning an int does in fact mean that the method can return -2. I have no idea what your argument is with this, because you seem to be disagreeing with the person while actually agreeing with them.
yoyohello13
10 hours ago
What? Results have a limited number of possible error states that are well defined.
Dylan16807
10 hours ago
Some call points to a function that returns a Result will never return an Error.
Some call points to a function that returns an int will never return -2.
Sometimes you know things the type system does not know.
Rohansi
10 hours ago
The difference is functions which return Result have explicitly chosen to return a Result because they can fail. Sure, it might not fail in the current implementation and/or configuration, but that could change later and you might not know until it causes problems. The type system is there to help you - why ignore it?
Dylan16807
10 hours ago
Because it would be a huge hassle to go into that library and write an alternate version that doesn't return a Result. So you're stuck with the type system being wrong in some way. You can add error-handling code upfront but it will be dead code at that point in time, which is also not good.
SchwKatze
10 hours ago
Yeah, I think I expressed it wrongly here. A more correct version would be: "when you do unwrap() you're saying that an error on this particular path isn't recoverable and we should fail fast."
Buttons840
9 hours ago
Division can fail, but I can write a program where I'm sure division will not fail.
JuniperMesos
10 hours ago
It's a little subtler than this. You want it to be easy to not handle an error while developing, so you can focus on getting the core logic correct before error-handling; but you want it to be hard to deploy or release the software without fully handling these checks. Some kind of debug vs release mode with different lints seems like a reasonable approach.
j-krieger
2 hours ago
This is untrue. A `?` operator would have done just fine here. I agree with you, though, that it should be possible to explicitly forbid unwraps.
yoyohello13
11 hours ago
So… basically every language ever?
Except maybe Haskell.
yakshaving_jgt
2 hours ago
It's easy to cause this kind of failure in Haskell also.
dkersten
11 hours ago
And Gleam
kibwen
8 hours ago
In Rust, `.unwrap()` is nine characters, whereas propagating the Result via `?` is one.
leshenka
10 hours ago
All languages with few exceptions have these kinds of escape hatches like unwrap
otterley
11 hours ago
nine_k
11 hours ago
Works when you have the Erlang system that does graceful handling for you: reporting, restarting.
guluarte
9 hours ago
It's usually because of "fail fast and fail hard"; in theory, critical bugs will be caught in dev/test.
hoppp
6 hours ago
You write so much Rust that you casually apply unwrap to everything now?
Rust compiler is a god of sorts, or at least a law of nature haha
Way to comment and go instantly off topic