Why do systems fail? Tandem NonStop system and fault tolerance

62 points, posted 11 hours ago
by PaulHoule

18 Comments

Animats

7 hours ago

Tandem was interesting. They had a lot of good ideas, many unusual today.

* Databases reside on raw disks. There is no file system underneath the databases. If you want a flat file, it has to be in the database. Why? Because databases can be made with good reliability properties and made distributed and redundant.

* Processes can be moved from one machine to another. Much like the Xen hypervisor, which was a high point in that sort of thing.

* Hardware must have built in fault detection. Everything had ECC, parity, or duplication. It's OK to fail, but not make mistakes. IBM mainframes still have this, but few microprocessors do, even though the necessary transistors would not be a high cost today. (It's still hard to get ECC RAM on the desktop, even.)

* Most things are transactions. All persistent state is in the database. Think REST with CGI programs, but more efficient. That's what makes this work. A transaction either runs to successful completion, or fails and has no lasting effect. Database transactions roll back on failures.
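
The all-or-nothing behavior in the last bullet can be sketched with any transactional database. What follows is a minimal illustration using Python's standard `sqlite3` module, not anything Tandem-specific; the `accounts` table and the `transfer` and `balances` helpers are made up for the example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates land or neither does."""
    try:
        # `with conn` commits if the block finishes and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                # Failing here undoes the debit above.
                raise ValueError("insufficient funds or unknown account")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical starting state for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()
```

A transfer that overdraws the source account returns `False` and leaves both balances exactly as they were, which is the "fails and has no lasting effect" property in miniature.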

The Tandem concept lived on through several changes of ownership and hardware. Unfortunately, it ended up at HP in the Itanium era, where it seems to have died off.

It's a good architecture. The back ends of banks still look much like that, because that's where the money is. But not many programmers think that way.

sillywalk

6 hours ago

> Databases reside on raw disks. There is no file system underneath the databases.

The terminology of "filesystem" here is confusing. The original database system was/is called Enscribe, and was/is similar to VMS Record Management Services - it had different types of structured file types, in addition to unstructured Unix/DOS/Windows stream-of-bytes "flat" files. Around 1987 Tandem added NonStop SQL files. They're accessed through a PATH: Volume.SubVolume.Filename, but depending on the file type, there are different things you can do with them.

> If you want a flat file, it has to be in the database.

You could create unstructured files as well.

> Processes can be moved from one machine to another

Critical system processes are process-pairs, where a Primary process does the work, but sends checkpoint messages to a Backup process on another processor. If the Primary process fails, the Backup process transparently takes over and becomes the Primary. Any messages to the process-pair are automatically re-routed.
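
As a rough sketch of the process-pair idea, here is a toy in-process simulation in Python. It is not the Guardian API: the `Worker` and `ProcessPair` classes, the counter "service", and the sequence-number check for retried requests are all assumptions made for illustration.

```python
class Worker:
    """One half of a process pair, holding the state of a toy counter service."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.state = {"total": 0, "last_seq": 0}

    def apply_checkpoint(self, snapshot):
        # The backup does no work itself; it only records the primary's state.
        self.state = dict(snapshot)


class ProcessPair:
    """Routes requests to the primary and fails over to the backup."""

    def __init__(self):
        self.primary = Worker("cpu0")
        self.backup = Worker("cpu1")

    def send(self, seq, amount):
        if not self.primary.alive:
            self._take_over()
        state = self.primary.state
        if seq <= state["last_seq"]:
            # A retried request that was already applied: answer, don't redo.
            return state["total"]
        new_state = {"total": state["total"] + amount, "last_seq": seq}
        # Checkpoint to the backup before the result becomes visible.
        if self.backup is not None:
            self.backup.apply_checkpoint(new_state)
        self.primary.state = new_state
        return new_state["total"]

    def _take_over(self):
        if self.backup is None:
            raise RuntimeError("both halves of the pair have failed")
        # The backup becomes the primary; callers keep using the same pair.
        self.primary, self.backup = self.backup, None


pair = ProcessPair()
pair.send(1, 10)
pair.send(2, 5)
pair.primary.alive = False   # simulate a processor failure
print(pair.send(2, 5))       # retry of request 2 after the failure
print(pair.primary.name)     # which worker is answering now
```

Because the checkpoint is sent before the primary answers, the backup already knows about request 2 when it takes over, so the retry is answered from the checkpointed state instead of being applied twice.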

> Unfortunately, it ended up at HP in the Itanium era, where it seems to have died off.

It did get ported to Xeon processors around 10 years ago, and is still around. Unlike OpenVMS, HPE still works on it, but I don't think there is even a link to it on the HPE website*. It still runs on (standard?) HPE x86 servers connected to HPE servers running Linux to provide storage/networking/etc. Apparently it is also supported running under VMware of some kind.

* Something something Greenlake?

Animats

5 hours ago

> Critical system processes are process-pairs, where a Primary process does the work, but sends checkpoint messages to a Backup process on another processor. If the Primary process fails, the Backup process transparently takes over and becomes the Primary. Any messages to the process-pair are automatically re-routed.

Right. Process migration was possible, but you're right in that it didn't work like Xen.

> It still runs on (standard?) HPE x86 servers connected to HPE servers running Linux to provide storage/networking/etc.

HP is apparently still selling some HPE gear. But it looks like all that stuff transitions to "mature support" at the end of 2025.[1] "Standard support for Integrity servers will end December 31, 2025. Beyond Standard support, HPE Services may provide HPE Mature Hardware Onsite Support, Service dependent on HW spares availability." The end is near.

[1] https://www.hpe.com/psnow/doc/4aa3-9071enw?jumpid=in_hpesite...

sillywalk

4 hours ago

It looks like that Mature Support stuff is all for Integrity i.e. Itanium servers. As long as HPE still makes x86 servers for Linux/Windows, I assume NonStop can tag along.

spockz

7 hours ago

Not to take away from your main point: The only reason it is hard to get ECC in a desktop is because it is used as customer segmentation, not because it is technically hard or because it would drive the actual cost of the hardware up.

sitkack

4 hours ago

ECC should be mandatory in consumer CPUs and memory. In the future, doing without it will be seen like cars with fins and no seatbelts.

Animats

2 hours ago

I have a desktop where CPU, OS and motherboard all support it. But ECC memory was hard to find. Memory with useless LEDs, though, is easily available.

kev009

6 hours ago

Yes, IBM mainframes employ or have analogous concepts to all of this which may be one of many reasons they haven't disappeared. A lot of it was built up over time whereas Tandem started from the HA specification so the concepts and marketing are clearer.

Stratus was another interesting HA vendor, particularly the earlier VOS systems as their modern systems are a bit more pedestrian. http://www.teamfoster.com/stratus-computer

sillywalk

6 hours ago

I present to you "Commercial Fault Tolerance: A Tale of Two Systems" [2004][0] - a paper comparing the similarities and differences in approach to reliability/availability/integrity between Tandem NonStop and IBM Mainframe systems,

and the book "Reliable Computer Systems - Design and Evaluation"[1] which has general info on reliability, and specific looks at IBM Mainframe, Tandem, and Stratus, plus AT&T switches and spaceflight computers.

[0] https://pages.cs.wisc.edu/~remzi/Classes/838/Fall2001/Papers...

[1] https://archive.org/download/reliablecomputer00siew/reliable...

mech422

5 hours ago

Yeah - Stratus rocked :-) The 'big battle' used to be between NonStop's more 'software based' fault tolerance vs. Stratus's fully hardware-level high availability. I used to love demo'ing our Stratus systems to clients and letting them pull boards while the machine was running... Just don't pull 2 next to each other :-)

Also, I think Stratus was the first (only?) computer IBM re-badged at the time - IBM sold Stratus's as the Model 88, IIRC

vivzkestrel

12 minutes ago

Completely unrelated to the topic, but I wanted to point it out: there is an accessibility issue with this page. The up and down arrow keys do not scroll the page on Firefox 131.0.2 on an M1 Mac.

macintux

8 hours ago

10 years ago I used Jim Gray's piece about Tandem fault tolerance in a talk about Erlang at Midwest.io (RIP, was a great conference).

https://youtu.be/E18shi1qIHU

Because it's a small world, a former Tandem employee was attending the talk. Unfortunately it's been long enough that I don't remember much of our conversation, but it was impressive to hear how they moved a computer between data centers; IIRC, they simply turned it off, and when they powered it back on, the CPU resumed precisely where it had been executing before.

(I have no idea how they handled the system clock.)

Jim Gray's paper:

https://jimgray.azurewebsites.net/papers/TandemTR86.2_FaultT...

sillywalk

8 hours ago

> (I have no idea how they handled the system clock.)

It is or was on the Internet Archive and probably elsewhere -

Tandem Systems Review, Volume 2, Number 1 (February 1986) - "Managing System Time Under Guardian 90"

macintux

8 hours ago

Nice, thanks, will have to look that up.

abrookewood

2 hours ago

That is crazy! I assume that all the RAM was battery backed? What about the CPU cache, the OS state etc? I'm struggling to see how this was possible.

sillywalk

9 hours ago

I'm still hoping to find a more detailed article about modern X86-64 NonStop, complete with Mackie Diagrams.

The last one I can find is for the NonStop Advanced Architecture (on Itanium), with ServerNet. I gather that this was replaced with the NonStop Multicore Architecture (also on Itanium), with InfiniBand, and I assume the x86-64 version is basically the same design, but running in pseudo big-endian.