Porting OpenVMS to the Itanium Processor Family (2003)[pdf]

40 points | posted 9 months ago by naves | 28 comments

twoodfin

9 months ago

The Apache numbers pretty much give the game away: an Itanium clocked 50% higher was losing to a two-year-old Alpha by about 20% on peak throughput.

VLIW made sense when Intel wanted to win the FP-heavy workstation market. But while it was in development, integer-heavy web workloads became dominant and that was basically the ballgame.

jcranmer

9 months ago

The Itanium architecture is a weird weird architecture. It's not weird in the sense of Alpha's weirdness (e.g., the super weak memory model), which can be fairly easily compensated for, but it's weird in several ways that make me just stare at the manual and go "how are you supposed to write a compiler for this?" It's something that requires a sufficiently smart compiler to get good performance, while at the same time being so designed as to make writing that sufficiently smart compiler effectively impossible.

It wouldn't surprise me if Itanium actually had pretty compelling SPECint numbers. But a lot of those compelling numbers would have come from massive overtuning of the compiler to the benchmark specifically. Something that's going to be especially painful for I/O-heavy workloads is that the gargantuan register files make any context switch painfully slow.

giantrobot

9 months ago

> It wouldn't surprise me if Itanium actually had pretty compelling SPECint numbers.

This was one of the more confusing things about Itanium. Its specifications on paper were really impressive. It had crazy high benchmark results (SPEC, Linpack, etc.) compared to similarly clocked competing chips. If real-world code behaved like those benchmarks, Itanium would have blown other chips out of the water.

I don't know if I've ever seen real-world tests showing the Itanium getting anywhere near the performance of its benchmarks. Intel was also charging a huge premium for the privilege.

bcrl

9 months ago

Some of those results were due to the substantially better memory subsystem that itanics had compared to x86 systems of the time.

hapless

9 months ago

By the time this PDF deck dropped, AMD K8 / Opteron was out, and that ship had sailed.

By 2004 the only thing Itanium had going for it vs a similarly sized Opteron system was the enormous L3 cache. Real good for SPECfp, I guess.

bcrl

9 months ago

Context switches and interrupts were also pretty quick on the Itanic. Modern x86 has so much cruft that it would be nice if those parts could be re-architected, but it's more likely to just become more legacy baggage carried into future CPU generations.

darkhelmet

9 months ago

> "how are you supposed to write a compiler for this?"

It's been a while so I probably am misremembering the terminology, but I was always amused with the dynamic profile/feedback system that I always imagined would be more useful for a JIT code generator or JVM style runtime than a traditional compiler.

aleph_minus_one

9 months ago

There exists a co-evolution between compilers, programming languages and CPUs (or more generally ASICs). I consider it to be very plausible that it is quite possible to develop a programming language that makes it sufficiently easy for a programmer to write performant code for an Itanium, but such a programming language would look different from C or C++.

johndoe0815

9 months ago

The world would be much nicer if we still had new Alpha CPUs. It was intended to be a CPU architecture that would last 25 years, and Digital designed it to support a 1000x increase in performance over that time.

Now we have RISC-V reinventing the wheel. Not the worst outcome, but we could have had it so much better...

formerly_proven

9 months ago

DEC designed StrongARM pretty much immediately after Alpha shipped because Alpha chips ran hot as frick and DEC engineers didn’t see a path to low-power Alpha.

aardvark179

9 months ago

Much as some aspects of the Alpha were great, its weak memory model would have resulted in even more concurrency issues than we have now, and way more explicit fences.

ahoka

9 months ago

It couldn't even handle unaligned access in its original form. Surely an architecture to last for 25 years.

fredoralive

9 months ago

Not handling unaligned access gracefully is a classic RISC "feature", as part of the general simplification of a processor to its basics. I'm not sure if it's really an Alpha specific thing. Plus they added some instructions to ease the pain in 1996.

The main issue people tend to bring up with Alpha is the very loose memory model, of the "things happen, but different processors may not really agree on the order they happened in" type of thing (plus, isn't it rude to want to know what other cores have in their cache?). Which would be a pain in our modern multicore world.

Of course we don't know how things would've evolved over time. ARM (at least on big cores[1]) shifted towards the forgiving model for unaligned access; it's possible that over time Alpha would've similarly moved to a more forgiving environment for low-level programmers.

[1] On embedded stuff, you're going to the Hardfault handler.

formerly_proven

9 months ago

Alpha had a super loosey-goosey memory model because, IIRC, the cache size they wanted couldn't be built with the performance they needed on the process they had, so they made it from two wholly independent cache banks, both serving the same core through a shared queue.

dfox

9 months ago

BWX does not help with unaligned accesses, but it solves the fact that the original Alpha did not even have instructions for memory accesses smaller than 32 bits. Which kind of becomes an issue when you start building the systems on PC-like hardware (another related “feature” is that EV5 does not have an equivalent to MTRRs, but dealing with the weirdness of VGA framebuffer accesses is part of the architecture specification by means of a hardcoded uncached memory region).

fredoralive

9 months ago

TBH, I'm not an expert on Alpha, and wow, as an embedded programmer by trade, that's a really wacky way of handling memory access. I guess it made more sense in the minicomputer world where you control the whole stack, but as a more general-purpose architecture it's, well, not the greatest, is it?

dfox

9 months ago

There are a lot of good things to be said about Alpha, and it is probably the most RISC of all the 90's RISC ISAs, but the actual hardware is full of weirdness, and the real CPUs were deeply pipelined "speed-demon" designs (think Intel NetBurst) that prioritized high clock rates and huge straight-line throughput above all else (which is also why they ran really hot and could not really be scaled down for embedded use). Taking the good ideas from that while discarding the speed-demon approach is part of why AMD became relevant again in the 00's, with amd64 cementing that position. (And the K7 is very much an Alpha-related design, to the extent that chipsets are interchangeable between K7 and EV6. The interesting part is that these CPUs do not have an FSB in the "bus" sense; there is a point-to-point link between CPU and chipset.)

fredoralive

9 months ago

AMD using an Alpha bus for early Athlons feels like a weird lost opportunity. Cheap x86-aimed motherboards that could also run Alpha chips, with Windows 2000 + FX!32 for compatibility — it might've had a chance to shine, albeit a slight one. Sadly, by then Compaq had already boarded the Itanic…

cbmuser

9 months ago

If your code contains unaligned access, you’re doing something wrong in the first place.

formerly_proven

9 months ago

Itanium was primarily developed by Intel, Itanium 2 primarily by the HP team that was also responsible for the competitive PA-RISC chips. (Or so they say.) In any case, Itanium 2 still outperformed much later AMD Opterons and Intel Xeons running at twice the clock in numerical workloads. That's pretty impressive.

twoodfin

9 months ago

That’s my point: If the demand for high-end compute at the turn of the millennium had looked the same as the demand for high-end compute in 1992, Itanium probably would have conquered the world.

But Tim Berners-Lee had a NeXT and some good ideas…

pjmlp

9 months ago

As a CERN alumnus, I can say it is quite easy to get distracted by side quests...

hapless

9 months ago

You have it backwards. Itanium 1, which had x86 and some limited hppa compatibility, had a ton of HP input.

Itanium 2 dropped a lot of the broken ideas from Itanium 1 in favor of... slightly better than the worst performance in the industry.

re: numerical workloads, they achieved high performance on SPECfp exactly the same way as late-generation HPPA chips -- just throw more L3 cache at it until the number looks good. Not exactly engineering genius.

aleph_minus_one

9 months ago

> The Apache #’s pretty much give the game away: An Itanium clocked 50% higher was losing to a 2yo Alpha by about 20% on throughput at peak.

This is not just a benchmark of the CPUs, but also of the compilers involved. It is well-known that it was very hard to write a compiler that generates programs that could harness the optimization potential of Itanium's instruction set.