muglug
3 months ago
I'm a fan of more formal methods in progam analysis, but this particular excercise is very hindsight-is-20/20
> In this case, we can set up an invariant stating that the DNS should never be deleted once a newer plan has been applied
If that invariant had been expressed in the original code — as I'm sure it now is — it wouldn't have broken in the first place. The invariant is obvious in hindsight, but it's hardly axiomatic.
nyrikki
3 months ago
John McCarthy‘s qualification problem[0] relates to this.
While one can and will add invariants, as they are discovered, they cannot all be found.
Entscheidungsproblem and Trakhtenbrot's theorem apply here, counterintuitively that the validity of finite models is in co-re but not in re.
Validity in this case is not dependent by the truth of the premise or the truth of the conclusion.
Basically we have to use tools like systems thinking to construct robust systems, we cannot universally use formal methods across frames.
It is one way race conditions are complex.
Hindsight bias makes it seem easy but that is because that is in the co-re side.
Well intended actions with hindsight can often result in brittle systems as their composition tends to set systems in stone, with the belief that axioms are the end solution.
The fact that Gödels completeness theorem may not apply for finite systems when it works so well for infinite ones is hard for me to remember.
Remembering that axiomatization is a powerful tool but not a silver bullet has actually helped me more than I can count.
[0] http://jmc.stanford.edu/articles/circumscription/circumscrip...
pas
3 months ago
not deleting the active plan seems like a basic fail-safe design choice, and this isn't AWS people's first rodeo. likely there was some rationale for not going with a built-in fallback.
kqr
3 months ago
If it was, they would have mentioned it in their summary report, the way they justified other deliberate design decisions. I find it more likely they thought of 25 different ways this system could fail, fixed the ones that needed fixing (some of them hinted in the summary report), and then they forgot about that one way it was actually going to fail. Happens all the time.
I agree this article is very hindsight biased though. We do need a way to model the failure modes we can think of, but we also need a method that helps us think of what the failure modes are, in a systematic manner that doesn't suffer from "oops we forgot the one way it was actually going to fail".
lala_lala
3 months ago
Yes, any analysis after an incident has the benefit, and bias, of hindsight.
But I see this post less as an incident analysis and more as an experiment in learning from hindsight. The goal, it seems, isn’t to replay what happened, but to show how formal methods let us model a complex system at a conceptual level, without access to every internal detail, and still reason about where races or inconsistencies could emerge.
user
3 months ago
ignoramous
3 months ago
> but it's hardly axiomatic.
Agree, but Time-of-Check to Time-of-Use is a pretty well established failure mode.
nsatirini
3 months ago
Every such analysis will have some hindsight bias. Still, it’s a great post that shows how to model such behavior. And I agree with the another reply that not deleting an active plan seems like a basic fail safe choice which the post also covered