The "n" in FMAn refers to the number of register arguments:
FMA3: r1 = r1 + r2 * r3 (the destination is also one of the sources) vs. FMA4: r1 = r2 + r3 * r4 (the destination is a separate register)
FMA4 can save you a move or two, depending on where you've got the accumulator now, and where you want it to be.
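To make the "save you a move" point concrete, here is a toy register-file model in Python. It is only a sketch: the function names and the dict-based register file are made up for illustration, it models just the one FMA3 form written above (real FMA3 encodings come in several operand orderings), and it uses integers, so the single-rounding behaviour of a real fused multiply-add doesn't come into it.

```python
def fma3(regs, dst, a, b):
    """Three-operand form: dst is also the addend, so its old value is lost."""
    regs[dst] = regs[dst] + regs[a] * regs[b]


def fma4(regs, dst, acc, a, b):
    """Four-operand form: the addend comes from its own register."""
    regs[dst] = regs[acc] + regs[a] * regs[b]


def keep_acc_with_fma3(regs):
    """Compute r0 = r1 + r2*r3 while leaving r1 intact, using only fma3."""
    regs["r0"] = regs["r1"]          # the extra move FMA4 would avoid
    fma3(regs, "r0", "r2", "r3")
    return 2                         # instructions issued: move + fma3


def keep_acc_with_fma4(regs):
    """Same computation with fma4: no copy needed."""
    fma4(regs, "r0", "r1", "r2", "r3")
    return 1                         # instructions issued: fma4 only


if __name__ == "__main__":
    start = {"r0": 0, "r1": 10, "r2": 3, "r3": 4}
    for fn in (keep_acc_with_fma3, keep_acc_with_fma4):
        regs = dict(start)
        count = fn(regs)
        print(fn.__name__, "r0 =", regs["r0"], "r1 =", regs["r1"], "instructions =", count)
```

If you don't care about keeping the old accumulator, the two forms cost the same, which is the "depending on where you want it to be" part.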
TOOK HIS JERB
You, sir, win one internet. Use it wisely.
Uops on P6 were 118 bits: http://www.eecg.toronto.edu/~moshovos/ACA05/read/ppro1.pdf
That would have a slight impact on code density.
Individually they aren't too bad. Taken all together they create real problems.
64 predicate registers (which is way too many) cost 6 bits per syllable (the Itanium term for instruction). Combine that with 128 int regs (7 bits each) and 3 register operands, and you've got 27 bits spent before specifying any instruction bits.
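The 27-bit figure is easy to check. A quick sketch in Python follows; the constant and function names are just illustrative. The 41-bit slot width is not from the comment above; it is the IA-64 instruction slot size (three 41-bit slots plus a 5-bit template in a 128-bit bundle), included to show what is left over for everything else.

```python
PREDICATE_REGS = 64     # from the comment above
INT_REGS = 128          # from the comment above
REG_OPERANDS = 3        # from the comment above
SLOT_BITS = 41          # IA-64 instruction slot width (added here, not in the comment)


def index_bits(count):
    """Bits needed to name one of `count` registers."""
    return (count - 1).bit_length()


def register_field_bits():
    """Bits spent on the predicate field plus all register operand fields."""
    return index_bits(PREDICATE_REGS) + REG_OPERANDS * index_bits(INT_REGS)


def remaining_bits():
    """Bits left in one slot for opcode, immediates, and so on."""
    return SLOT_BITS - register_field_bits()


if __name__ == "__main__":
    print("predicate field:", index_bits(PREDICATE_REGS), "bits")
    print("each int register field:", index_bits(INT_REGS), "bits")
    print("register fields total:", register_field_bits(), "bits")
    print("left in a 41-bit slot:", remaining_bits(), "bits")
```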
The impact of the middle one (instruction steering) was also not seen until late in the design cycle. Instruction decode information got mixed in there, so that not every instruction could go to every position. This led to a large number of NOPs inserted into the instruction stream. The final code density for Itanium was significantly lower than RISC (and way under x86).
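For a rough feel of what NOP padding does to density, here is a back-of-the-envelope sketch in Python. The 128-bit, three-slot bundle is the IA-64 format; the function name and the choice of how many slots hold NOPs are purely hypothetical inputs, not measured figures for any real Itanium code.

```python
BUNDLE_BITS = 128       # IA-64 bundle size
SLOTS_PER_BUNDLE = 3    # instruction slots per bundle


def bits_per_useful_instruction(useful_slots):
    """Bundle bits divided by the slots doing real work (the rest are NOPs)."""
    if not 1 <= useful_slots <= SLOTS_PER_BUNDLE:
        raise ValueError("useful_slots must be between 1 and 3")
    return BUNDLE_BITS / useful_slots


if __name__ == "__main__":
    for useful in (3, 2, 1):
        cost = bits_per_useful_instruction(useful)
        print(f"{useful} useful slot(s): {cost:.1f} bits per useful instruction")
```

Even a fully packed bundle spends more per instruction than a fixed 32-bit RISC encoding, and every NOP slot makes it worse.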
These factors also work against out-of-order implementations - but there were organizational impediments to that happening anyway...