Animats
9 months ago
Tandem was interesting. They had a lot of good ideas, many unusual today.
* Databases reside on raw disks. There is no file system underneath the databases. If you want a flat file, it has to be in the database. Why? Because databases can be made with good reliability properties and made distributed and redundant.
* Processes can be moved from one machine to another. Much like the Xen hypervisor, which was a high point in that sort of thing.
* Hardware must have built-in fault detection. Everything had ECC, parity, or duplication. It's OK to fail, but not to make mistakes. IBM mainframes still have this, but few microprocessors do, even though the necessary transistors would not be a high cost today. (It's still hard to get ECC RAM on the desktop, even.)
* Most things are transactions. All persistent state is in the database. Think REST with CGI programs, but more efficient. That's what makes this work. A transaction either runs to successful completion, or fails and has no lasting effect. Database transactions roll back on failures.
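To make the all-or-nothing point concrete, here's a toy sketch using Python's sqlite3 as a stand-in for a transaction facility. Purely illustrative; it has nothing to do with Tandem's actual transaction machinery, and the schema and names are made up.

    import sqlite3

    # Toy stand-in for "all persistent state is in the database":
    # either both legs of the transfer commit, or neither does.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    db.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    db.commit()

    def transfer(amount):
        try:
            with db:  # opens a transaction; commits on success, rolls back on exception
                db.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
                if amount > 100:
                    raise ValueError("insufficient funds")  # simulated mid-transaction failure
                db.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))
        except ValueError:
            pass  # the failed transaction leaves no lasting effect

    transfer(250)  # fails partway through; rolled back, no effect
    transfer(40)   # succeeds
    print(db.execute("SELECT * FROM accounts ORDER BY name").fetchall())
    # [('alice', 60), ('bob', 40)]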
The Tandem concept lived on through several changes of ownership and hardware. Unfortunately, it ended up at HP in the Itanium era, where it seems to have died off.
It's a good architecture. The back ends of banks still look much like that, because that's where the money is. But not many programmers think that way.
sillywalk
9 months ago
> Databases reside on raw disks. There is no file system underneath the databases.
The terminology of "filesystem" here is confusing. The original database system was/is called Enscribe, and was/is similar to VMS Record Management Services - it had different types of structured files, in addition to unstructured unix/dos/windows stream-of-bytes "flat" files. Around 1987 Tandem added NonStop SQL files. They're accessed through a path: Volume.SubVolume.Filename, but depending on the file type, there are different things you can do with them.
> If you want a flat file, it has to be in the database.
You could create unstructured files as well.
> Processes can be moved from one machine to another
Critical system processes are process-pairs, where a Primary process does the work, but sends checkpoint messages to a Backup process on another processor. If the Primary process fails, the Backup process transparently takes over and becomes the Primary. Any messages to the process-pair are automatically re-routed.
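A very rough sketch of the checkpointing idea, if it helps. The real mechanism lives in the OS and the message system, not in application threads, and every name below is made up:

    import queue, threading

    # Toy model of a process pair: the primary checkpoints its state to the
    # backup after each unit of work; if the primary dies, the backup takes
    # over from the last checkpoint instead of starting from scratch.
    checkpoints = queue.Queue()

    def primary(fail_after):
        state = 0
        for i in range(10):
            if i == fail_after:
                return                   # simulated primary failure
            state += i                   # one unit of work
            checkpoints.put(state)       # checkpoint message to the backup

    def backup():
        state = 0
        while True:
            try:
                state = checkpoints.get(timeout=0.5)
            except queue.Empty:
                break                    # primary has gone silent: take over
        print("backup takes over with state", state)

    t = threading.Thread(target=backup)
    t.start()
    primary(fail_after=5)                # primary fails partway through
    t.join()                             # backup resumes with the work done so far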
> Unfortunately, it ended up at HP in the Itanium era, where it seems to have died off.
It did get ported to Xeon processors around 10 years ago, and is still around. Unlike OpenVMS, HPE still works on it, but I don't think there is even a link to it on the HPE website*. It still runs on (standard?) HPE x86 servers connected to HPE servers running Linux to provide storage/networking/etc. Apparently it also runs, supported, under VMware of some kind.
* Something something Greenlake?
Animats
9 months ago
> Critical system processes are process-pairs, where a Primary process does the work, but sends checkpoint messages to a Backup process on another processor. If the Primary process fails, the Backup process transparently takes over and becomes the Primary. Any messages to the process-pair are automatically re-routed.
Right. Process migration was possible, but you're right in that it didn't work like Xen.
> It still runs on (standard?) HPE x86 servers connected to HPE servers running Linux to provide storage/networking/etc.
HPE is apparently still selling some of this gear. But it looks like all that stuff transitions to "mature support" at the end of 2025.[1] "Standard support for Integrity servers will end December 31, 2025. Beyond Standard support, HPE Services may provide HPE Mature Hardware Onsite Support, Service dependent on HW spares availability." The end is near.
[1] https://www.hpe.com/psnow/doc/4aa3-9071enw?jumpid=in_hpesite...
sillywalk
9 months ago
It looks like that Mature Support stuff is all for Integrity i.e. Itanium servers. As long as HPE still makes x86 servers for Linux/Windows, I assume NonStop can tag along.
Animats
9 months ago
Right, that's just the Itanium machines. I'm not current on HP buzzwords.
The HP NonStop systems, Xeon versions, are here.[1] The not-very-informative white paper is here.[2] Not much about how they do it. Especially since they talk about running "modern" software, like Java and Apache.
[1] https://www.hpe.com/us/en/compute/nonstop-servers.html
[2] https://www.hpe.com/psnow/doc/4aa6-5326enw?jumpid=in_pdfview...
lazide
9 months ago
As a side point - that is some amazing lock in.
MichaelZuo
9 months ago
They were pretty much the only game in town, other than IBM and smaller mainframe vendors, if you wanted actual written, binding, guarantees of performance with penalty clauses. (e.g. with real consequences for system failure, such as being credited back X millions of dollars after Y failure)
At least from what I heard pre-HP acquisition, so it’s not ‘amazing lock in’, just that, if you didn’t want a mainframe and needed such guarantees, there was literally no other choice.
lazide
9 months ago
Notably, that is amazing lock in. What else would it look like?
MichaelZuo
9 months ago
Well, if price/performance alone is enough to qualify (viz. IBM), then the moment another mainframe vendor decided to undercut them by, say, 20%, the lock in would evaporate. Of course no mainframe vendor would likely do so, but the latent possibility is always there.
Facebook is an example of ‘amazing lock in’ where it’s not theoretically possible for any potential competitor to just negate it with the stroke of a pen.
lazide
9 months ago
The reason they are locked in is because they are the only game in town for this use case, done this way. That’s why I’m saying it, yeah?
It isn’t a price point thing.
MichaelZuo
9 months ago
In that sense IBM offers a better ‘game’ in every way, but at 10x the price point… because they are playing a different, more advanced, ‘game’ that so happens to include Tandem’s ‘game’ as a subset.
lazide
9 months ago
That just means they’re locking in a different segment.
Do you think folks locked into MS Access are the same people locked into Oracle databases?
MichaelZuo
9 months ago
Yes, but it's something that can disappear with the stroke of a pen, which is the critical difference: the durability.
adastra22
9 months ago
> Unfortunately, it ended up at HP in the Itanium era, where it seems to have died off.
My dad continues to maintain NonStop systems under the umbrella of DXC. (Which is a spinoff of HP? Or something? Idk the details.) He worked at Tandem back in the day, and has stayed with it ever since. I think he'd love to retire, but he never ends up as part of the layoffs that get sweet severance packages, because he's literally irreplaceable.
The whole stack got moved to run on top of Linux, IIRC, with all these features being emulated. It still exists though, for the handful of customers that use it.
Sylamore
9 months ago
Kinda the other way around: the NonStop kernel can present a Guardian personality or an OSS (Open System Services) personality that is Linux/Unix-compatible. The OSS layer basically runs on top of the NSK/Guardian native layer, but allows you to compile most Linux software.
adastra22
9 months ago
No, I meant the other way around. I don’t know to what degree it ever got released, but he spent years getting it to work on “commodity” mainframe hardware running Linux, as HP wanted to get out of the business of maintaining special equipment and OS just for this customer.
kev009
9 months ago
Yes, IBM mainframes employ all of these concepts, or analogous ones, which may be one of many reasons they haven't disappeared. A lot of it was built up over time, whereas Tandem started from an HA specification, so the concepts and marketing are clearer.
Stratus was another interesting HA vendor, particularly the earlier VOS systems as their modern systems are a bit more pedestrian. http://www.teamfoster.com/stratus-computer
sillywalk
9 months ago
I present to you "Commercial Fault Tolerance: A Tale of Two Systems" [2004][0] - a paper comparing the similarities and differences in the approaches to reliability/availability/integrity of Tandem NonStop and IBM mainframe systems,
and the book "Reliable Computer Systems - Design and Evaluation"[1] which has general info on reliability, and specific looks at IBM Mainframe, Tandem, and Stratus, plus AT&T switches and spaceflight computers.
[0] https://pages.cs.wisc.edu/~remzi/Classes/838/Fall2001/Papers...
[1] https://archive.org/download/reliablecomputer00siew/reliable...
mech422
9 months ago
Yeah - Stratus rocked :-) The 'big battle' used to be between NonStop's more 'software-based' fault tolerance and Stratus's fully hardware-level high availability. I used to love demoing our Stratus systems to clients and letting them pull boards while the machine was running... Just don't pull 2 next to each other :-)
Also, I think Stratus was the first (only?) computer IBM re-badged at the time - IBM sold Stratus machines as the Model 88, IIRC
spockz
9 months ago
Not to take away from your main point: the only reason it is hard to get ECC in a desktop is that it is used for customer segmentation, not because it is technically hard or because it would drive up the actual cost of the hardware.
sitkack
9 months ago
ECC should be mandatory in consumer CPUs and memory. In the future this will be seen like cars with fins and no seatbelts.
Animats
9 months ago
I have a desktop where the CPU, OS and motherboard all support it. But ECC memory was hard to find. Memory with useless LEDs, though, is easily available.
spockz
9 months ago
That is because it doesn't make sense to produce a product that cannot be used at all. It just doesn't work in consumer boards due to lack of support for it in consumer CPUs. Again due to artificial customer segmentation.
c0balt
9 months ago
Most Ryzen CPUs have supported some ECC RAM for multiple years by now. The HEDT platforms, like Threadripper, did too. It just hasn't really been advertised as much because most consumers don't appear to be willing to pay the higher cost.
PhilipRoman
9 months ago
Ok, I'll bite - what tangible benefit would ECC give to the average consumer? I'd wager in the real world 1000x more data loss/corruption happens due to HDD/SSD failure with no backups.
Personally I genuinely don't care about ECC RAM, and I would not pay more than an additional $10 to get it.
adastra22
9 months ago
Most users experience data loss these days due to memory errors that ECC would have caught. They just might not attribute it to cosmic rays. It's kinda hard to tell bit-flip data loss apart from intermittent hardware failure. It can be just as catastrophic though, if the bit flip hits a critical bit of information and ends up corrupting the disk entirely.
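To illustrate the failure mode, a toy example (nothing hardware-specific, all values made up): one flipped bit in an in-memory size field becomes a wildly wrong value, and if that value is later written into an on-disk structure, the corruption persists.

    import struct

    block_size = 4096
    raw = bytearray(struct.pack("<I", block_size))  # in-memory 32-bit field

    raw[3] ^= 0x80                        # a single flipped high-order bit
    corrupted = struct.unpack("<I", raw)[0]

    print(block_size, "->", corrupted)    # 4096 -> 2147487744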
immibis
9 months ago
My Threadripper 7000 system with ECC DDR5 and MCE logging reports a corrected bit error every few hours, but I've got no idea if that's normal. I assume it was a tradeoff for memory density.
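If anyone wants to compare: assuming the EDAC driver for your memory controller is loaded, the Linux kernel exposes corrected/uncorrected error counters under sysfs. A rough way to read them (paths can vary by kernel and platform):

    from pathlib import Path

    edac = Path("/sys/devices/system/edac/mc")
    for mc in sorted(edac.glob("mc*")):
        ce, ue = mc / "ce_count", mc / "ue_count"
        if ce.exists() and ue.exists():
            print(f"{mc.name}: corrected={ce.read_text().strip()} "
                  f"uncorrected={ue.read_text().strip()}")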
MichaelZuo
9 months ago
This - memory densities are so high nowadays that it's almost guaranteed a new computer bought in 2024 will hard fault with actual consequences (crashing, corrupted data, etc…) at least once a year due to lack of ECC.
Sylamore
9 months ago
Speaking of Tandem databases, HP had released the SQL engine behind SQL/MX[0] as open source (Trafodion), running in front of Hadoop, to the Apache Software Foundation, but it appears they have shut down the project[1].
[0]: https://thenewstack.io/sql-hadoop-database-trafodion-bridges...
mannyv
9 months ago
Oracle has had raw disk support for a long time. I'm pretty sure it's the last 'mainstream' database that does.