arjie
13 hours ago
The biggest mistake I made was high uptime. arjie.com was up for 10 years plus on a Hetzner VPS so that by the time they wanted to sunset the machine underlying I had no idea what my teenage self had set up. I have the backups but the site hasn’t been up in a decade…
Nowadays I build things so that they move and I have moved things about a bit so I know they work.
gerdesj
13 hours ago
"The biggest mistake I made was high uptime"
Quite. I'm old enough to remember machine uptime being a badge of honour.
However, being older and not really wiser, I look for service uptime these days. Yes, we did have something similar back in the day; that's why MX and similar DNS records exist.
Old school clusters were pretty esoteric but the lessons were learned (split brain n that) and that's why we still argue the toss with kiddies about why a Proxmox cluster with two nodes is fucked and why we recommend an additional "witness".
I don't care that VMware glossed over the whole two node HA cluster thing years ago with a massive bodge. They were wrong then and they are probably still wrong because that nonsense is probably still baked in.
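For anyone who wants the arithmetic behind the two-node complaint, here is a minimal sketch. It assumes plain majority voting with one vote per node; the function names are made up for illustration and this is not the actual Proxmox or VMware logic.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition may keep running only with a strict majority of all votes."""
    return reachable_votes > total_votes // 2


def surviving_sides(total_votes: int, split: tuple[int, int]) -> list[bool]:
    """For a network split into two sides, report which sides keep quorum."""
    assert sum(split) == total_votes
    return [has_quorum(side, total_votes) for side in split]


# Two nodes, the link between them fails: each side sees 1 of 2 votes.
two_node = surviving_sides(2, (1, 1))

# Two nodes plus a witness vote: one side sees 2 of 3, the other 1 of 3.
with_witness = surviving_sides(3, (2, 1))

print(two_node)      # [False, False]
print(with_witness)  # [True, False]
```

With two votes, a partition leaves neither side with a majority, so the cluster either stops entirely or, if you lower the threshold to one vote, lets both halves run at once (split brain). The third vote is what makes exactly one side the winner.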
Sorry, slight digression.
High uptime implies no patching. We all love patching.
andai
12 hours ago
https://en.wikipedia.org/wiki/Split-brain_(computing)
The more you know!
>a Proxmox cluster with two nodes is fucked and why we recommend an additional "witness".
Reminds me of the three Magi from Evangelion: https://magi.kinta.ma/
pjmlp
6 hours ago
There is such a thing as live patching.
One reason mainframes and micros are still around us, is that you can change almost everything between hardware and software without downtime.
It is also available in the surviving commercial UNIXes, and as a paid-for feature in some Linux distros, although not to the extent that those grandparent systems are capable of.
da_chicken
4 hours ago
The problem with live patching is twofold.
First, you might not reload everything in memory, so it will be patched on disk but not in process.
Second, you have not tested that the system can boot to a functional system. Say you have done live patching for 5 years and never rebooted, and then you have a power loss or hardware failure/upgrade that takes the system down. When you try to bring it back up, it doesn't work. Which configuration change in the past 5 years caused that? Which backup do you use?
And, yeah, everything is hot swappable on VAX. Those machines also cost 6+ figures, and often require a service contract that includes a permanent on site tech.
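The first point (patched on disk but not in process) can be checked from userspace on Linux, where `/proc/<pid>/maps` marks a mapped file that has since been removed or replaced with a trailing ` (deleted)`. A rough sketch, assuming that format; `stale_mappings` is a hypothetical helper and the sample text is invented, not output from a real machine:

```python
import re

# Matches a /proc/<pid>/maps line whose backing file is gone from disk.
# Fields: address, perms, offset, dev, inode, then the pathname.
_DELETED = re.compile(
    r"^\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+(?P<path>/.*) \(deleted\)$"
)


def stale_mappings(maps_text: str) -> set[str]:
    """Return paths of files still mapped by a process but deleted on disk."""
    stale = set()
    for line in maps_text.splitlines():
        match = _DELETED.match(line.strip())
        if match:
            stale.add(match.group("path"))
    return stale


# Hypothetical sample in the /proc/<pid>/maps format.
sample = """\
7f1c2a000000-7f1c2a1b0000 r-xp 00000000 08:01 131 /usr/lib/libssl.so.3 (deleted)
7f1c2a400000-7f1c2a5c0000 r-xp 00000000 08:01 140 /usr/lib/libc.so.6
7ffd3b1e0000-7ffd3b201000 rw-p 00000000 00:00 0 [stack]
"""

print(stale_mappings(sample))
```

On a real box you would feed it `open(f"/proc/{pid}/maps").read()` for each process. Expect some false positives (deleted temp files, shared memory segments), and note it says nothing about the second point: only an actual reboot tests whether the system boots.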
silvestrov
4 hours ago
A Danish bank found out that this can bite you in the ass.
When you hotpatch a system for years, you have no idea whether it can still boot or will fail somewhere in the boot process.
I.e., you can only trust what you regularly test.
pjmlp
4 hours ago
Interesting, is there any public info on the case?
Not doubting it, only curious about some kind of postmortem.
silvestrov
an hour ago
In Danish: https://danskebank.com/da/news-og-insights/nyhedsarkiv/press...
or translated: https://danskebank-com.translate.goog/da/news-og-insights/ny...
TLDR: power supply failed completely and DB2 failed running recovery operations due to multiple old/existing software bugs.
pjmlp
42 minutes ago
Thanks for hunting it down.
Scramblejams
6 hours ago
I’ve long wanted that amazing uptime and virtualization and huge I/O and all that cool stuff mainframes offered, but on the desktop or in the closet, with modern CPUs.
I think I’m gonna hafta keep waiting...
ErroneousBosh
4 hours ago
> One reason mainframes and micros are still around us, is that you can change almost everything between hardware and software without downtime.
We have some Sun V880s at work and I'm fairly sure the only part you cannot change with the power on and system running is the motherboard itself.
And I would not be surprised if some ex-Sun Gandalf Beard "well akshully"s this comment.
AdamN
4 hours ago
Two is the right minimum number for a high availability dataplane but three is the right minimum number for an HA control plane.
With that said, if high availability is not a concern then 1 can be just fine.
j45
an hour ago
It's pretty easy to abstract away a Proxmox node into a Terraform or other type of code-based recipe for easy backup / reconstruction / upgrading.
niel
7 hours ago
This reminds me of Ise Shrine in Japan, which is completely dismantled then rebuilt every 20 years.
This is top of mind because I recently read Breakneck by Dan Wang. He makes the case that this practice of rebuilding the shrine preserves knowledge that would otherwise have been lost to time. Wang contrasts Ise Shrine with Notre Dame, where rebuilding the roof is apparently quite difficult, perhaps in part due to the loss of knowledge. I'm not familiar enough with either structure to judge whether this is a fair comparison, but I like the principle.
(Edit to add: This is only a minor analogy from the book, which I highly recommend overall.)
arjie
7 hours ago
Thank you for the recommendation! I love that reference, and particularly because I am fond of the story of the shrine for a different reason https://wiki.roshangeorge.dev/w/Constancy_Preference#Concept...
nine_k
13 hours ago
Indeed, for a VM, high uptime makes little sense, because a reboot takes a few seconds, and an upgrade requires no downtime, just switching the DNS to a new instance.
For a physical machine which you can't easily copy, it's a different story.
bfivyvysj
13 hours ago
I started putting things in a big Ansible playbook repo. It doesn't need to be fully managed by Ansible either; I mostly just keep the setup configured there and still do lots of management by hand.
arjie
12 hours ago
I have the same. The infra management is in one place, the apps hold their own, and there’s a docs folder on the server where each guy puts his stuff. The install is idempotent deploy scripts. But back then my stuff was more ramshackle.
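For readers unfamiliar with the term: an idempotent deploy step checks the current state and changes only what is missing, so rerunning the whole script is safe. A minimal sketch of one such step follows; it is not the poster's actual scripts, and `ensure_line`, the file name, and the setting are all made up for illustration.

```python
from pathlib import Path
import tempfile


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed, False if it was already correct.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(existing + [line]) + "\n")
    return True


# Demo against a throwaway directory; the file and setting are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "app" / "site.conf"
    first = ensure_line(conf, "listen 8080")   # creates the file
    second = ensure_line(conf, "listen 8080")  # nothing left to do
    contents = conf.read_text()

print(first, second)  # True False
```

The returned flag is handy for the usual "restart the service only if something changed" pattern, which is the same idea tools like Ansible build on.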
culi
7 hours ago
Sometimes I leave Architectural Decision Records for personal projects. It feels silly but it honestly comes in handy more often than expected.
gofreddygo
7 hours ago
I keep them embedded in the codebase or an artifact right next to the source.
And the key thing is that I don't need too many details at all. A few cues and it's all back in my head.
walletdrainer
6 hours ago
> The biggest mistake I made was high uptime. arjie.com was up for 10 years plus on a Hetzner VPS so that by the time they wanted to sunset the machine underlying I had no idea what my teenage self had set up. I have the backups but the site hasn’t been up in a decade
LLMs have solved this problem; they’ll happily deal with the software archaeology on your behalf. This is the kind of task they really excel at.
arjie
6 hours ago
You're right, of course. At this point it's inertia. It's been dead a decade.