jrochkind1
8 months ago
I don't know how long that failure mode has been in place or if this is relevant, but it makes me think of analogous situations I've encountered:
When automated systems are first put in place for something high risk, "just shut down if you see something that may be an error" is a totally reasonable plan. After all, literally yesterday everyone was functioning without the automated system; if it doesn't seem to be working right, better to switch back to the manual process we were all using yesterday than to risk a catastrophe.
In that situation, switching back to yesterday's workflow is something that won't interrupt much.
A couple of decades -- or honestly even just a couple of years -- later, that same fault handling, left in place without much consideration because it is rarely triggered, is itself catastrophic: switching back to a rarely used and much more inefficient manual process is extremely disruptive, and even itself raises the risk of catastrophic mistakes.
The general engineering challenge is how we deal with little-used, little-seen functionality (definitely thinking of fault handling, but there may be other cases) that is totally reasonable when put in place but has not aged well, where nobody has noticed or realized it, and even if they did it might be hard to convince anyone it's a priority to improve -- and the longer you wait, the more expensive it gets.
ronsor
8 months ago
> The general engineering challenge is how we deal with little-used, little-seen functionality (definitely thinking of fault handling, but there may be other cases) that is totally reasonable when put in place but has not aged well, where nobody has noticed or realized it, and even if they did it might be hard to convince anyone it's a priority to improve -- and the longer you wait, the more expensive it gets.
The solution to this is to trigger all functionality periodically and randomly to ensure it remains tested. If you don't test your backups, you don't have any.
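As a rough illustration of what "trigger it periodically" can look like for backups, a minimal sketch (the verify step and the alerting are placeholders, not any particular tool's API):

    import logging
    import random
    import time

    def verify_backup(snapshot: str) -> bool:
        """Placeholder: restore the snapshot to a scratch area and compare a
        sample of records against the live system."""
        raise NotImplementedError

    def restore_drill(snapshots: list[str], interval_s: float = 7 * 24 * 3600) -> None:
        """Periodically pick a random backup and actually restore it. The point
        is that the rarely used restore path gets exercised, not just the
        backup-writing path."""
        while True:
            try:
                ok = verify_backup(random.choice(snapshots))
            except Exception:
                ok = False
            if not ok:
                logging.error("restore drill failed; backups may not be usable")
            time.sleep(interval_s)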
jvanderbot
8 months ago
That is "a" solution.
Another solution that is very foreign to us in sweng, but is common practice in, say, aviation, is to have that fallback plan in a big thick book, and to have a light that says "Oh it's time to use the fallback plan", rather than require users to diagnose the issue and remember the fallback.
This was one of the key ideas in the design of critical systems*: Instead of automating the execution of a big branching plan, it is often preferable to automate just the detection of the next desirable state, then let the users execute the transition. This is because, if there is time, it allows all users to be fully cognizant of the inner state of the system and the reasons for that state, in case they need to take over.
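A toy sketch of that split, with all signals, thresholds, and checklist references invented: the automation only detects that a transition is desirable and points at the procedure; performing it stays with the operator.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Advisory:
        """What the automation recommends; a human decides whether to act."""
        condition: str
        recommended_state: str
        checklist_ref: str  # e.g. a section of the printed fallback manual

    def advise(primary_feed_ok: bool, data_age_s: float) -> Optional[Advisory]:
        """Detect the next desirable state, but never execute the transition."""
        if not primary_feed_ok:
            return Advisory("primary feed lost", "MANUAL_HANDOFF", "fallback book 3.2")
        if data_age_s > 120:
            return Advisory("stale data", "REDUCED_RATE_OPS", "fallback book 3.5")
        return None  # no light; keep running automated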
The worst of both worlds is to automate yourself into a corner, gunk everything up, and then require the user to come in and do a back-breaking cleanup just to get to the point where they can diagnose this. My factorio experiences mirror this last case perfectly.
* "Joint Cognitive Systems" - Hollnagel & Woods
Thorrez
8 months ago
Isn't what you're proposing exactly what led to this being a major problem? The automated systems disabled themselves, so people had to use the manual way, which was much less efficient, and 1,500 flights had to be cancelled.
thereddaikon
8 months ago
They are referring to air crew procedures, not ATC. When the crew of an aircraft encounter a failure that doesn't have a common simple response, they consult a procedure book. This is something professional crews are well acquainted with and used to. The problem in the article was with the air traffic control system. They did not have a proper fallback procedure and it caused major disruptions.
jrochkind1
8 months ago
Their fallback procedure was "do it the manual way". Just like for the pilots. They thought it was a proper one...
jvanderbot
8 months ago
This entire subthread is more a response to the suggestion that "The" solution is fuzzing your entire stack to death.
ericjmorey
8 months ago
Which company deployed a chaos monkey daemon on their systems? It seemed to improve resiliency when I read about it.
theolivenbaum
8 months ago
Netflix did that many years ago, interesting idea even if a bit disruptive in the beginning https://netflix.github.io/chaosmonkey/
e28eta
8 months ago
See also, the rest of the simian army: https://netflixtechblog.com/the-netflix-simian-army-16e57fba...
philsnow
8 months ago
At Google, the global Chubby cell had gone so long without any downtime that people were starting to assume it was just always available, leading to some kind of outage or other when the global cell finally did have some organic downtime.
Chubby-SRE added quarterly synthetic downtime of the global cell (iff the downtime SLA had not already been exceeded).
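The gating logic is roughly this (numbers invented; the real SLA and drill length were whatever Chubby-SRE set):

    QUARTER_S = 90 * 24 * 3600
    DOWNTIME_BUDGET_S = (1 - 0.9995) * QUARTER_S  # e.g. a 99.95% quarterly target

    def should_run_synthetic_outage(organic_downtime_s: float,
                                    drill_length_s: float = 15 * 60) -> bool:
        """Only inject downtime if the drill still fits in the remaining
        downtime budget, i.e. the SLA hasn't already been spent organically."""
        return organic_downtime_s + drill_length_s <= DOWNTIME_BUDGET_S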
kelnos
8 months ago
For those of us who haven't worked at Google, what's "Chubby" and what's a "cell"?
philsnow
8 months ago
Ah, chubby is a distributed lock service. Think “zookeeper” and you won’t be far off.
https://static.googleusercontent.com/media/research.google.c... [pdf]
Some random blog post: https://medium.com/coinmonks/chubby-a-centralized-lock-servi...
You can run multiple copies/instances of chubby at the same time (like you could run two separate zookeepers). You usually run an odd number of them, typically 5. A group of chubby processes all managing the same namespace is a “cell”.
A while ago, nearly everything at Google had at least an indirect dependency on chubby being available (for service discovery etc), so part of the standard bringup for a datacenter was setting up a dc-specific chubby cell. You could have multiple SRE-managed chubby cells per datacenter/cluster if there was some reason for it. Anybody could run their own, but chubby-sre wasn’t responsible for anybody else’s, I think.
Finally, there was a global cell. It was both distributed across multiple datacenters and also contained endpoint information for the per-dc chubby cells, so if a brand new process woke up somewhere and all it knew how to access was the global chubby cell, it could bootstrap from that to talking to chubby in any datacenter and thence to any other process anywhere, more or less.
^ there’s a lot in there that I’m fuzzy about, maybe processes wake up and only know how to access local chubby, but that cell has endpoint info for the global one? I don’t think any part of this process used dns; service discovery (including how to discover the service discovery service) was done through chubby.
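Chubby itself isn't public, but the ZooKeeper analogy makes the usage pattern concrete. A minimal sketch using the kazoo client (addresses, paths, and the callback are made up):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # the "cell" you talk to
    zk.start()

    # Coarse-grained lock, the kind of thing Chubby gets used for: whoever
    # holds the lock node is the leader; everyone else blocks here.
    lock = zk.Lock("/services/frontend/leader", identifier="replica-3")
    with lock:
        serve_as_leader()  # hypothetical application callback

    zk.stop()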
starspangled
8 months ago
Not trying to "challenge" your story, and it's an interesting anecdote in context. But if you have time to indulge me (and I'm not a real expert at distributed systems, which might be obvious) -
Why would you have a distributed lock service that (if I read right) has multiple redundant processes that can tolerate failures... and then require clients to tolerate outages? Isn't the purpose of this kind of architecture so that each client doesn't have to deal with outages?
saalweachter
8 months ago
Because you want the failure modes to be graceful and recovery to be automatic.
When the foundation of a technology stack has a failure, there are two different axes of failure.
1. How well do things keep working without the root service? Does every service that can be provided without it still keep going?
2. How automatically does the system recover when the root service is restored? Do you need to bring down the entire system and restore it in a precise order of dependencies?
It's nice if your system can tolerate the missing service and keep chugging along, but it is essential that your system not deadlock on the root service disappearing and stay deadlocked after the service is restored. At best, that turns a downtime of minutes into a downtime of hours, as you carefully turn down every service and bring them back up in a carefully prescribed order. At worst, you discover that your system that hasn't gone down in three years has acquired circular dependencies among its services, and you need to devise new fixes and work-arounds to allow it to be brought back up at all.
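A common shape for the second property is to make every dependent keep retrying with backoff instead of failing hard, so recovery needs no restart choreography. A sketch (the resolver is a stand-in, not a real API):

    import random
    import time

    def resolve_endpoint(name: str) -> str:
        """Stand-in for a lookup against the root discovery service; assume it
        raises ConnectionError while that service is down."""
        raise ConnectionError(name)

    def resolve_with_backoff(name: str, max_sleep_s: float = 60.0) -> str:
        """Retry with jittered exponential backoff rather than crashing or
        deadlocking. When the root service returns, every caller recovers on
        its own and nothing has to be restarted in dependency order."""
        sleep_s = 1.0
        while True:
            try:
                return resolve_endpoint(name)
            except ConnectionError:
                time.sleep(sleep_s * random.random())  # jitter avoids a stampede
                sleep_s = min(sleep_s * 2, max_sleep_s)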
praptak
8 months ago
First, a global system with no outages (say the gold standard of 99.999% availability) is a promise which is basically impossible to keep.
Second, a global system being always available definitely doesn't mean it is always available everywhere. A single datacenter or even a larger region will experience both outages and network splits. It means that whatever you design on top of the super-available global system will have to deal with the global system being unavailable anyway.
TLDR is that the clients will have to tolerate outages (or at least frequent cut-offs from the "global" state) anyway, so it's better not to give them false promises.
chili6426
8 months ago
I don't work at Google but I do live in a country where I have access to google.com. Chubby is a lock service that is used internally at Google and a cell is referring to an instance of Chubby. You can read more here: https://static.googleusercontent.com/media/research.google.c...
nine_k
8 months ago
Replace this with "API gateway cluster", or basically any simple enough, very widely used service.
RcouF1uZ4gsC
8 months ago
The same company that was in the news recently for screwing up a livestream of a boxing match.
chrisweekly
8 months ago
True, but it's the exception that proves the rule; it's also the same company responsible for delivering a staggeringly high percentage of internet video, typically without a hitch.
tovej
8 months ago
That's not what an exception proving a rule means. It has a technical meaning: a sign that says "free parking on sundays" implies parking is not free as a rule.
When used like this it just confuses the reader with rhetoric. In this case Netflix is just bad at live streaming; they clearly haven't done the necessary engineering work on it.
nine_k
8 months ago
The fact that Netflix surprised so many people by an exceptional technical issue implies that as a rule Netflix delivers video smoothly and at any scale necessary.
chrisweekly
8 months ago
Yes! THIS is precisely what I meant in my comment.
jjk166
8 months ago
That's also not what "an exception proving the rule" is either. The term comes from a now mostly obsolete* meaning of "prove": to test or trial something. So the idiom properly means "the exception puts the rule to the test." If there is an exception, it means the rule was broken. The idiom has taken on the opposite meaning due to its frequent misuse, which may have started out tongue in cheek but is now used unironically. It's much like using "literally" to describe something which is figurative.
* This is also where we get terms like bulletproof - in the early days of firearms people wanted armor that would stop bullets from the relatively weak weapons, so armor smiths would shoot their work to prove them against bullets, and those that passed the test were bullet proof. Likewise alcohol proof rating comes from a test used to prove alcohol in the 1500s.
lelanthran
8 months ago
> That's not what an exception proving a rule means. It has a technical meaning: a sign that says "free parking on sundays" implies parking is not free as a rule.
So the rule is "Free parking on Sundays", and the exception that proves it is "Free parking on Sundays"? That's a post-hoc (circular) argument that does not convince me at all.
I read a different explanation of this phrase on HN recently: the "prove" in "exception proves the rule" has the same meaning as the "prove" (or "proof") in "50% proof alcohol".
AIUI, in this context "proof" means "tests". The exception that tests the rule simply shows where the limits of the rules actually are.
Well, that's how I understood it, anyway. Made sense to me at the time I read the explanation, but I'm open to being convinced otherwise with sufficiently persuasive logic :-)
taejo
8 months ago
The rule is non-free parking. The exception is Sundays.
lucianbr
8 months ago
The meaning of a word or expression is not a matter of persuasive logic. It just means what people think it means. (Otherwise using it would not work to communicate.) That is why a dictionary is not a collection of theorems. Can you provide a persuasive logic for the meaning of the word "yes"?
https://en.wikipedia.org/wiki/Exception_that_proves_the_rule
Seems like both interpretations are used widely.
tsimionescu
8 months ago
The origin of the phrase is the aphorism that "all rules have an exception". So, when someone claims something is a rule and you find an exception, that's just the exception that proves it's a real rule. It's a joke, essentially, based on the common-sense meaning of the word "rule" (which is much less strict than the mathematical word "rule").
seaal
8 months ago
50% proof alcohol? That isn’t how that works. It’s 50% ABV aka 100 proof.
lelanthran
8 months ago
> 50% proof alcohol? That isn’t how that works. It’s 50% ABV aka 100 proof.
50% proof wouldn't be 25% ABV?
oofabz
8 months ago
Since 50% = 0.5, and proof doesn't take a percentage, I believe "50% proof" would be 0.25% ABV.
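Spelling out the arithmetic under the US convention that proof is twice ABV:

    def abv_from_proof(proof: float) -> float:
        return proof / 2  # US convention: proof = 2 x ABV

    abv_from_proof(100)  # 50.0  -> "100 proof" is 50% ABV
    abv_from_proof(0.5)  # 0.25  -> "50% proof", read as 0.5 proof, is 0.25% ABV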
lelanthran
8 months ago
Good catch :-)
eru
8 months ago
Yes, though serving static files is easier than streaming live.
bitwize
8 months ago
The chaos monkey is there to remind you to always mount a scratch monkey.
amelius
8 months ago
"Your flight has been delayed due to Chaos Monkey."
nine_k
8 months ago
This means a major system problem. The point of the Chaos Monkey is that the system should function without interruptions or problems despite the activity of the Chaos Monkey. That is, it is there to keep the system in such a shape that it can absorb and overcome some rate of failure higher than the "naturally occurring" rate.
bigiain
8 months ago
"My name is Susie and I'll be the purser on your flight today, and on behalf of the Captain Chaos Monkey and the First Officer Chaos Monkey... Oh. Shit..."
telgareith
8 months ago
Dig into the OpenZFS 2.2.0 data loss bug story. There was at least one ticket (in FreeBSD) where it cropped up almost a year prior and got labeled "look into later," but it got closed.
I'm aware that closing tickets for "future investigation" tasks when they no longer seem to be an issue is common. But it shouldn't be.
Arainach
8 months ago
>it shouldn't be
Software can (maybe) be perfect, or it can be relevant to a large user base. It cannot be both.
With an enormous budget and a strictly controlled scope (spacecraft) it may be possible to achieve defect-free software.
In most cases it is not. There are always finite resources, and almost always more ideas than there is time to implement.
If you are trying to make money, is it worth chasing down issues that affect a minuscule fraction of users and take eng time which could be spent on architectural improvements, features, or bugs affecting more people?
If you are an open source or passion project, is it worth your contributors' limited hours, and will trying to insist people chase down everything drive your contributors away?
The reality in any sufficiently large project is that the bug database will only grow over time. If you leave open every old request and report at P3, users will grow just as disillusioned as if you were honest and closed them as "won't fix". Having thousands of open issues that will never be worked on pollutes the database and makes it harder to keep track of the issues which DO matter.
Shorel
8 months ago
I'm in total disagreement with your last paragraph.
In fact, I can't see how it follows from the rest.
Software can have defects, true. There are finite resources, true. So keep the tickets open. Eventually someone will fix them.
Closing something for spurious psychological reasons seems detrimental to actual engineering and it doesn't actually avoid any real problem.
Let me repeat that: ignoring a problem doesn't make it disappear.
Keep the tickets open.
Anything else is supporting a lie.
lmm
8 months ago
> There are finite resources, true. So keep the tickets open. Eventually someone will fix them.
Realistically, no, they won't. If the rate of new P0-P2 bugs is higher than the rate of fixing being done, then the P3 bugs will never be fixed. Certainly by the time someone gets around to trying to fix the bug, the ticket will be far enough out of date that that person will not be able to trust it. There is zero value in keeping the ticket around.
> Anything else is supporting a lie.
Now who's prioritising "spurious psychological reasons" over the things that actually matter? Closing the ticket as wontfix isn't denying that the bug exists, it's acknowledging that the bug won't be fixed. Which is much less of a lie than leaving it open.
YokoZar
8 months ago
Every once in a while I get an email about a ten plus year old bug finally getting fixed in some open source project. If it's a good bug accurately describing a real thing, there's no reason to throw that work away rather than just marking it lower priority.
lmm
8 months ago
> If it's a good bug accurately describing a real thing, there's no reason to throw that work away rather than just marking it lower priority.
Perhaps. But the triage to separate the "good bugs accurately describing real things" from the chaff isn't free either.
mithametacs
8 months ago
Storage is cheap, database indexes work. Just add a `TimesChecked` counter to your bug tracker.
Now when considering priorities, consider not just impact and age, but also times checked.
It's less expensive than going and deleting stuff. Unless you're automating deletions? In which case... I don't think I can continue this discussion.
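For what it's worth, the mechanism being suggested is small; a toy sketch (field names and weights invented):

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Bug:
        title: str
        impact: int             # e.g. 1 (cosmetic) .. 5 (data loss)
        opened: datetime
        times_checked: int = 0  # bumped whenever triage looks at it and punts

    def triage_score(bug: Bug, now: datetime) -> float:
        """Impact and age push a bug up; repeated 'looked at it, still not
        worth it' verdicts push it down instead of closing it."""
        age_years = (now - bug.opened).days / 365
        return 2 * bug.impact + age_years - 0.5 * bug.times_checked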
lmm
8 months ago
In what world is adding a custom field to a bug tracker and maintaining it cheaper than anything? If someone proves out this workflow and releases a bug tracker that has the functionality built in then I'll consider adopting it, but I'm certainly not going to be the first.
mithametacs
8 months ago
Wild.
Okay, keep deleting bug tickets.
Arainach
8 months ago
We already have a TimesChecked counter:
If (status != Active) { /* timesChecked > 0 */ }
lesuorac
8 months ago
But then when somebody else has the issue they make a new bug and any data/investigation from the old one is basically lost.
Like what's wrong with having 1000 open bugs?
Arainach
8 months ago
It becomes functionally impossible to measure and track tech debt. Not all of those issues are tech debt - things which will never be fixed don't matter.
Put another way: you're working on a new version of your product. There are 900 issues in the tracker. Is this an urgent emergency where you need to shut down feature work and stabilize?
If you keep a clean work tracker where things that are open mean work that should get done: absolutely
If you just track everything ever and started this release with 1100 issues: no, not necessarily.
But wait, of those 900 issues are there any that should block the release? Now you have 900 to go through and determine. And unless you won't fix some of them you'll have the same thing in a few months.
Work planning, triage, and other team tasks are not magic and voodoo, but in my experience the same engineers who object to the idea of "won't fix" are the ones who want to just code all day and never have to deal with the impact of a huge messy issue database on team and product.
lesuorac
8 months ago
> Not all of those issues are tech debt - things which will never be fixed don't matter.
Except that I've spent a good amount of time fixing bugs originally marked as `won't fix` because they actually became uh "will fix" (a decade later; lol).
> Put another way: you're working on a new version of your product. There are 900 issues in the tracker. Is this an urgent emergency where you need to shut down feature work and stabilize?
Do you not prioritize your bugs?
If the tracker is full of low priority bugs then it doesn't block the release. One thing we do: even if the bug would be high priority, if it's not new (as in, it occurs in older releases), it doesn't (by default) block the next release.
> But wait, of those 900 issues are there any that should block the release? Now you have 900 to go through and determine. And unless you won't fix some of them you'll have the same thing in a few months.
You should only need to triage the bug once. It should be the same amount of work to triage a bug into low-priority as it is to mark it as `won't fix`. With (again) the big difference being that if a user searches for the bug they can find it and ideally keep updating the original bug instead of making a dozen new ones that need to be de-duplicated and triaged, which is _more work_ not _less work_ for triagers.
> Work planning, triage, and other team tasks are not magic and voodoo, but on my experience the same engineers who object to the idea of "won't fix" are the ones who want to just code all day and never have to deal with the impact of a huge messy issue database on team and product.
If your idea of "the product is ready for release" is zero bugs filed, then that's something you're going to want to change. Every piece of software gets released with bugs; often known bugs.
I will concede that if you "stop the count" or "stop testing" then yeah you'll have no issues reported. Doesn't make it the truth.
kelnos
8 months ago
That's absurd. Closing those issues doesn't make them go away, it just causes you to forget them (until someone else reports them again and someone creates a new issue, losing all the previous context). If you just leave them open, some will eventually get fixed, many will not, and that's fine.
The decision between feature work vs. maintenance work in a company is driven by business needs, not by the number of bugs open in the issue tracker. If anything, keeping real bugs open helps business leaders actually determine their business needs more effectively. Closing them unfixed is the equivalent of putting your head in the sand.
lmm
8 months ago
> But then when somebody else has the issue they make a new bug and any data/investigation from the old one is basically lost.
You keep the record of the bug, someone searching for the symptoms can find the wontfix bug. Ideally you put it in the program documentation as a known issue. You just don't keep it open, because it's never going to be worked on.
> Like what's wrong with having 1000 open bugs?
Noise, and creating misleading expectations.
kelnos
8 months ago
> If the rate of new P0-P2 bugs is higher than the rate of fixing being done, then the P3 bugs will never be fixed.
That's quite a big assumption. Every company I've worked at where that was the case had terrible culture and constantly shipped buggy crap. Not really the kind of environment that I'd use to set policy or best practices.
lmm
8 months ago
If you're in the kind of environment where you fix all your bugs then you don't have a ballooning bug backlog and the problem never arises. I've worked in places that fixed all their bugs, but to my mind that was more because they didn't produce the kind of product that has P3 bugs than because they had better culture or something.
Arainach
8 months ago
It's not "spurious psychological reasons". It is being honest that issues will never, ever meet the bar to be fixed. Pretending otherwise by leaving them open and ranking them in the backlog is a waste of time and attention.
ryandrake
8 months ago
I've seen both types of organizations:
1. The bug tracker is there to document and prioritize the list of bugs that we know about, whether or not they will ever be fixed. In this world, if it's a real issue, it's tracked and kept while it exists in the software, even though it might be trivial, difficult, or just not worth fixing. There's no such thing as closing the bug as "Won't Fix" or "Too Old". Further, there's no expectation that any particular bug is being worked on or will ever be fixed. Teams might run through the bug list periodically to close issues that no longer reproduce.
2. The bug tracker tracks engineering load: the working set of bugs that are worthy of being fixed and have a chance to be fixed. Just because the issue is real, doesn't mean it's going to be fixed. So file the bug, but it may be closed if it is not going to be worked on. It also may be closed if it gets old and it's obvious it will never get worked on. In this model, every bug in the tracker is expected to be resolved at some point. Teams will run through the bug list periodically to close issues that we've lived with for a long time and just won't be fixed ever.
I think both are valid, but as a software organization, you need to agree on which model you're using.
gbear605
8 months ago
There have been a couple times in the past where I’ve run into an issue marked as WONT FIX and then resolved it on my end (because it was luckily an open source project). If the ticket were still open, it would have been trivial to put up a fix, but instead it was a lot more annoying (and in one of the cases, I just didn’t bother). Sure, maybe the issue is so low priority that it wouldn’t even be worth reviewing a fix, and this doesn’t apply for closed source projects, but otherwise you’re just losing out on other people doing free fixes for you.
exe34
8 months ago
It's more fun/creative/CV-worthy to write new shiny features than to fix old problems.
astrange
8 months ago
I think a more subtle issue is that fixing old bugs can cause new bugs. It's easier to fix something new, for instance because you understand it better. At some point it can be safest to just not touch something old.
Also, old bugs can get fixed by accident / the environment changing / the whole subsystem getting replaced, and if most of your long tail of bugs is already fixed then it wastes people's time triaging it.
wolrah
8 months ago
> I think a more subtle issue is that fixing old bugs can cause new bugs.
Maybe it's years of reading The Old New Thing and similar, maybe it's a career spent supporting "enterprise" software, but my personal experience is that fixing old bugs causing new bugs happens occasionally; far more often, fixing old bugs reveals many more old bugs that always existed but were never previously triggered, because the software was "bug compatible" with the host OS, assumptions were made that because old versions never went outside of a certain range no newer versions ever would, and/or software just straight up tinkered with internal structures it never should have been touching which were legitimately changed.
Over my career I have chased down dozens of compatibility issues between software packages my clients used and new versions of their respective operating systems. Literally 100% of those, in the end, were the software vendor doing something that was not only wrong for the new OS but was well documented as wrong for multiple previous releases. A lot of blatant wrongness was unfortunately tolerated for far too long by far too many operating systems, browsers, and other software platforms.
Windows Vista came out in 2006 and every single thing that triggered a UAC prompt was a thing that normal user-level applications were NEVER supposed to be doing on a NT system and for the most part shouldn't have been doing on a 9x system either. As recently as 2022 I have had a software vendor (I forget the name but it was a trucking load board app) tell me that I needed to disable UAC during installs and upgrades for their software to work properly. In reality, I just needed to mount the appropriate network drive from an admin command prompt so the admin session saw it the same way as the user session. I had been telling the vendor the actual solution for years, but they refused to acknowledge it and fix their installer. That client got bought out so I haven't seen how it works in 2024 but I'd be shocked if anything had changed. I have multiple other clients using a popular dental software package where the vendor (famous for suing security researchers) still insists that everyone needs local admin to run it properly. Obviously I'm not an idiot and they have NEVER had local admin in decades of me supporting this package but the vendor's support still gets annoyed about it half the time we report problems.
As you might guess, I am not particularly favorable on Postel's Law w/r/t anything "big picture". I don't necessarily want XHTML style "a single missing close tag means the entire document is invalid" but I also don't want bad data or bad software to persist without everyone being aware of its badness. There is a middle ground where warnings are issued that make it clear that something is wrong and who's at fault without preventing the rest of the system from working. Call out the broken software aggressively.
tl;dr: If software B depends on a bug or unenforced boundary in software A, and software A fixing that bug or enforcing that boundary causes software B to stop working, that is 100% software B's problem and software A should in no way ever be expected to care about it. Place the blame where it belongs, software B was broken from the beginning we just hadn't been able to notice it yet.
mithametacs
8 months ago
Everything is finite including bugs. They aren’t magic or spooky.
If you are superstitious about bugs, it's time to triage. Absolute, full-on disagreement with your direction.
jfactorial
8 months ago
> Everything is finite including bugs.
Everything dies including (probably) the universe, and shortly before that, our software. So you're right, the number of bugs in a specific application is ultimately finite. But most of even the oldest software still in use is still getting regular revisions, and if app code is still being written, it's safe to assume bugs are still being created by the fallible minds that conceived it. So practically speaking, for an application still in-development, the number of bugs, number of features, number of lines of code, etc. are dynamic, not finite, and mostly ever-increasing.
acacar
8 months ago
No, uh-uh. You can't sweep a data loss bug under the rug, under any circumstances, especially in a filesystem. Curdle someone's data just once and they'll never trust you again.
gpderetta
8 months ago
the CADT model of software engineering.
pj_mukh
8 months ago
"When automated systems are first put in place, for something high risk, "just shut down if you see something that may be an error" is a totally reasonable plan"
Pretty sure this is exactly what happened with Cruise in San Francisco: cars would just stop and await instructions, causing traffic jams. The city got mad, so they added a "pullover" mechanism. Except then the "pullover" mechanism ended up dragging someone who had been flung into the car's path by a hit-and-run driver.
The real world will break all your test cases.
sameoldtune
8 months ago
> switching back to a rarely used and much more inefficient manual process is extremely disruptive, and even itself raises the risk of catastrophic mistakes.
Catastrophe is most likely to strike when you try to fix a small mistake: pushing a hot-fix that takes down the server; burning yourself trying to take overdone cookies from the oven; offending someone you are trying to apologize to.
crtified
8 months ago
Also, as codebases and systems get more (not less) complex over time, the potential for technical debt multiplies. There are more processing and outcome vectors, more (and different) branching paths. New logic maps. Every day/month/year/decade is a new operating environment.
mithametacs
8 months ago
I don’t think it is exponential. In fact, one of the things that surprises me about software engineering is that it’s possible at all.
Bugs seem to scale log-linearly with code complexity. If it’s exponential you’re doing it wrong.
InDubioProRubio
8 months ago
Reminds me of the switches we used to put into production machines so they could shut themselves down:

    if (well_defined_case)
        proceed();
    else {
        scream();           // alert a human
        while (true)
            sleep(forever); // halt until someone intervenes
    }
    // same in a switch: the default branch screams and halts

Basically, for every known unknown it's better to halt and let humans drive the fragile machine back into safe parameters -- or expand the program.
PS: Yes, the else -- you know what the else is: it's the set of !(well defined conditions), and it's ever changing if the well-defined if condition changes.
agos
8 months ago
Erlang was born out of a similar problem with telephone switches: if a terminal sends bogus data or otherwise crashes, it should not bring everything down, because the subsequent reconnection storm would be catastrophic. So "let it fail" would be a very reasonable approach to this challenge, at least for fault handling.
akavel
8 months ago
Take a look at the book: "Systemantics: How Systems Work and Especially How They Fail" - a classic, has more observations like this.
wkat4242
8 months ago
It's been in place for a while, it happens every few months.