They shouldn't have killed an excellent processor (the Alpha) which already had tons of software and history and was already being used in the fastest supercomputers in the world for a product that was never (and still isn't) proven. The Itanic was never best in its class at anything.
> They shouldn't have killed an excellent processor (the Alpha)
Parallel Alpha systems are a pain to deal with, because they lack a form of expected synchronization that every other processor has: automatic data dependency barriers. On every other platform, if you initialize or otherwise write to a value, then make a pointer point to that value, you can expect that anyone reading through that pointer gets the initialized/new value. But on Alpha, another CPU can get the new value of the pointer and then the uninitialized/old value of what it points to.
Alpha is the sole reason why the Linux kernel "smp_read_barrier_depends" barrier exists and code has to use it; on every other platform, that barrier is a no-op.
Is there any evidence that not handling read-read dependencies in hw is crucial to alpha performance?
I'd guess that back when the Alpha memory model was designed, multiprocessors were quite rare, and the designers didn't have as clear a picture of the tradeoffs as we do today (not saying today's understanding is perfect, just that it's better than what we had 30 years ago), so they chose the weakest possible model they could come up with in order not to constrain future designers.
agreed, Itanium investment should have gone to the Alpha.
Itanium was really good at raw performance as long as you could write hand tuned math kernels or kept working with the compiler team to optimize code for your kernel. Took me a while, but I got 97% efficiency with single core DGEMM.
Hand-written code for Itanium was always smoking fast. One-clock microkernel message passes and other insanity. But nobody ever figured out how to write a compiler that could generate code like that for that machine.
Most of it depended on the problem: for a subset of problems it worked well but once you had branchy code and less than very consistent memory access it was dismal. I supported a computational science group during that period and Itanium (and Cell) kept being tested but never made sense since you’d be looking at person-years of work hoping you could beat the current systems (or even previous generation) instead of spending that time on improved application functionality.
> for a subset of problems it worked well but once you had branchy code and less than very consistent memory access it was dismal.
So, a lot like coding for the GPU. Makes sense, given that the low-level architecture is so similar... And it might explain why VLIW itself is not so widely used anymore. AIUI, even the Mill proposed architecture (which boils down to VLIW + lots of tricks to cheaply improve performance on typical workloads) has a hardware-dependent, low-level "compilation" step that's quite reminiscent of what a GPU driver has to do.
The GPU comparison is common and I think it hits the main problem: Intel/HP needed to solve two hard problems to succeed. GPU computing had only one because gamers provided a reliable market for the chips in the meantime.
I’m also curious how this could have gone a generation later: Itanium performance was critically dependent on compilers in an era where they were expensive and every vendor made their own, and the open source movement was just taking off. It seems like things could have gone much better if that’d been, say, an LLVM backend and tools, and higher-level libraries where someone could get updates without licensing costs and wouldn’t be in the common 90s situation of needing to choose between the faster compiler and the more correct one.
There were people trying, but there are some real fundamental issues with the approach for general purpose computing. It's extremely hard for a compiler to know if some data is in cache, in memory, or way out in swap. Without this information it's very hard to know how long any memory fetch is going to take. If you're trying to run a lot of computation in parallel that has some interdependencies then this information is paramount.
It's kind of like trying to use a GPU for general purpose computation. Itanium should have been a coprocessor.
> Took me a while, but I got 97% efficiency with single core DGEMM.
In my experience, it's pretty widely accepted that VLIW (and EPIC) can achieve high performance and efficiency on highly regular tasks such as GEMM and FFT. That's why VLIW has been and continues to be popular for DSPs. The struggle for VLIW is general purpose code that doesn't necessarily have that same kind of regularity.
I have to admit that as a DEC alpha user starting in 1993 (using OSF/1), and also one of the main FreeBSD/alpha port authors, the itanium being phased out fills me with joy.
Alpha was excellent. Definitely a missed opportunity. The 164LX here running Tru64 is a great system, proving the chip could really work in all kinds of settings.
Around the time that the Itanium project was announced (and had presumably been being worked on behind the scenes for quite a while) was when Compaq bought DEC. HP wouldn't buy Compaq until about 5 years later. So Itanium was already well underway by the time switching to Alpha would have been a remote possibility.
The people who built the Alpha went on to work at AMD and built the Athlon. Many of the innovations in that platform ended up filtering into the Intel Core 2 and Core i7 architectures as well.