Itanium is a direct result of the hardware people and the software people refusing to rub elbows in the same room.
Itanium's designers basically declared war on their software peers: our beautiful machine would run fast, if only your crappy software didn't expose so many execution hazards.
Thus Intel threw down a grand gauntlet for the compiler writers to finally prove their mettle: write an Itanium compiler that didn't suck.
We all know how that went.
I've always thought they made the critical error on the first step after connecting with the baseball: the bundle should have been a collection of highly dependent instructions that wrote back to the register file only once fully executed. A bundle would be dispatched to a single execution unit's queue and sit there until complete.
Because the bundle has internal dependencies, each bundle would carry a significant internal latency, so each execution unit's queue would need a fairly deep dispatch buffer. "Waste of silicon!" cry Intel's virtuous hardware engineers. "Latency is a thing in the world!" whimper Intel's browbeaten compiler writers.
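To make the shape concrete, here's a minimal sketch in C of what such a dependent-instruction bundle might look like. This is my own hypothetical encoding, not anything Intel shipped, and every field width is invented; the point is just that each op can source either an architectural register or the result of an earlier op in the same bundle, and only ops flagged for write-back ever touch the register file:

    /* Hypothetical dependent-instruction bundle; nothing here is real IA-64. */
    #include <stdint.h>

    #define OPS_PER_BUNDLE 8

    enum src_kind { SRC_REG, SRC_PRIOR_OP };  /* architectural reg vs. earlier in-bundle result */

    struct bundle_op {
        uint8_t       opcode;
        enum src_kind kind_a, kind_b;
        uint8_t       src_a, src_b;  /* register number, or index of an earlier op */
        uint8_t       writes_back;   /* only these results ever reach the register file */
        uint8_t       dest;          /* architectural destination, if writes_back */
    };

    /* The whole bundle sits in one execution unit's queue and retires as a
     * unit: intermediates live only in the unit's forwarding path, and the
     * register file sees a single burst of write-backs at completion. */
    struct bundle {
        uint8_t          n_ops;
        struct bundle_op ops[OPS_PER_BUNDLE];
    };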
I also think that, for compact instruction packing, not every input argument to a bundle should have been able to name any old register out of the giant register file (256 registers, IIRC). Maybe a few global references, then plenty of 4-bit register selectors, accessed out of register file shards. The base shard could be selected by any number of mechanisms, up to and including the program location from which the instruction bundle was fetched; so maybe the code only ends up relocatable modulo 64 or modulo 256. What a horrible tragedy. Plus, the compiler writers already had great register colouring algorithms, so we had something of a proof of concept in hand for the compiler complexity.
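Here's a toy version of that shard selection, again in C with made-up widths (16 shards of 16 registers, base shard pulled from fetch-address bits; the exact bit positions are invented, and they're what set the relocation modulus):

    #include <stdint.h>

    #define REGS_PER_SHARD 16   /* addressed by the 4-bit selector */
    #define NUM_SHARDS     16   /* 256 architectural registers total */

    /* Hypothetical: the base shard falls out of the bundle's fetch address,
     * so short operands cost 4 bits instead of 8, and code is relocatable
     * only in steps that preserve the sampled address bits. */
    static inline uint8_t effective_reg(uint64_t fetch_pc, uint8_t sel4)
    {
        uint8_t base_shard = (fetch_pc >> 6) & (NUM_SHARDS - 1);
        return base_shard * REGS_PER_SHARD + (sel4 & 0xF);
    }

Move the code to a fetch address with different shard bits and every short operand silently re-colours, which is exactly the relocation wart the parenthetical above is shrugging off.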
Because you're bypassing the register file for many bundle-internal arguments (the intermediate result ax in the expression ax + b never hits the register file), fewer register file (and memory) reads satisfy more total instructions. I would have liked to see a bundle of about the same size able to specify up to eight simple instructions, if sufficiently chained together.
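As a back-of-the-envelope illustration (my arithmetic, not a measurement), take a*x + b + c:

    Conventional three-instruction sequence:
      t1 = a * x     reads a, x     writes t1
      t2 = t1 + b    reads t1, b    writes t2
      r  = t2 + c    reads t2, c    writes r
      => 6 register-file reads, 3 write-backs (2 of them dead intermediates)

    One internally chained bundle:
      op0 = a * x    reads a, x
      op1 = op0 + b  reads b        (op0 forwarded inside the unit)
      op2 = op1 + c  reads c        only op2 writes back
      => 4 register-file reads, 1 write-back

Stretch that over an eight-op chain and the register-file port savings compound.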
To a certain degree, this opposite-George approach kicks determinism to the curb. Well, I say better sooner than later. Hey, Intel engineers, look at your vaunted determinism now: dead with a bottle on skid row, after a long, losing battle.
(The other thing Intel liked about determinism was its first five letters. Sometimes stupid ideas present a broader field for patent lock-up land-grabs. Moral of the story: greed carefully.)
It's easy to come up with hundreds of good reasons why my opposite-George approach wouldn't have panned out any better, but a smart group of engineers is paid to find clever solutions to most or all superficial obstacles. Whether any counterfactual designs might have proved viable is permanently lost to history. I'm just relating my own instinct at the time, FWIW.
Another thing: I would have endorsed a big/little design for interrupt handling (of the asynchronous variety), with only a small set of agile (a.k.a. bundle-free) little cores able to handle interrupts. Then you can really afford to thin out checkpoint writes back to the register file (which is always a hot spot to begin with). The magical, invisible forwarding mesh needed to support this illusion seamlessly would still be extremely complex, but that's equally true of every other modern design.
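For flavour, a minimal sketch of the steering policy I mean, in C with every name invented: asynchronous interrupts are only ever delivered to the little, bundle-free cores, so the big bundled cores never have to materialise a precise mid-bundle state on demand:

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical big/little interrupt steering; all names invented. */
    enum core_kind { CORE_BIG_BUNDLED, CORE_LITTLE_AGILE };

    struct core {
        enum core_kind kind;
        bool           idle;
    };

    /* Only agile little cores are eligible for async IRQs, which lets the
     * bundled big cores checkpoint the register file lazily, at bundle
     * boundaries, instead of keeping precise state at every cycle. */
    static struct core *route_async_irq(struct core *cores, size_t n)
    {
        struct core *fallback = NULL;
        for (size_t i = 0; i < n; i++) {
            if (cores[i].kind != CORE_LITTLE_AGILE)
                continue;                 /* big cores never take async IRQs */
            if (cores[i].idle)
                return &cores[i];
            if (!fallback)
                fallback = &cores[i];     /* busy little core: queue behind it */
        }
        return fallback;
    }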
From my perspective, Itanium was plenty innovative; unfortunately, it was mainly innovative in pure stubbornness and greed.