Actually, there is much more parallelism (more than 4 ops/cycle) available in many of these applications, but you correctly observe that ancillary effects (branch mispredictions, cache misses, etc.) chip away at the achieved parallelism. The TRIPS ISA and microarchitecture (which is, as you correctly point out, a variant of an OOO "superscalar" processor) have numerous features to try to mitigate these effects:
up to 64 outstanding cache misses from the 1,024-entry window, aggressive predication
to eliminate many branches, a memory dependence predictor, and direct ALU-ALU communication for
making data dependences more efficient.
The most important difference is in the ISA, which allows the compiler to express dataflow graphs
directly to the hardware. This will work best (compared to conventional designs) in ultra-small technologies where
the wires are quite slow. To get a similar dependence graph in a RISC or CISC ISA, a superscalar processor
must reconstruct it on the fly, instruction by instruction, using register renaming and issue window tag
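
To make the "reconstruct it on the fly" point concrete, here is a toy sketch in Python. It is not TRIPS code and does not model any real pipeline; the `Instr` type, the register names, and `rebuild_dataflow_edges` are all made up for illustration. It walks a sequential instruction stream and recovers producer-to-consumer edges by tracking the most recent writer of each register, which is roughly the bookkeeping that renaming does one instruction at a time.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    op: str
    dest: str      # architectural destination register (hypothetical names)
    srcs: tuple    # architectural source registers

def rebuild_dataflow_edges(program):
    """Recover (producer_index, consumer_index) edges from a sequential stream."""
    last_writer = {}   # architectural register -> index of its most recent producer
    edges = []
    for i, ins in enumerate(program):
        for src in ins.srcs:
            if src in last_writer:
                # The source value is produced by an earlier instruction in the stream.
                edges.append((last_writer[src], i))
        # Later readers of ins.dest now depend on instruction i, not on older writers.
        last_writer[ins.dest] = i
    return edges

program = [
    Instr("load", "r1", ("r9",)),
    Instr("add",  "r2", ("r1", "r7")),
    Instr("mul",  "r1", ("r2", "r8")),   # reuses r1; readers after this see instr 2
    Instr("sub",  "r3", ("r1", "r2")),
]

edges = rebuild_dataflow_edges(program)
print(edges)
```

In an ISA where the compiler encodes the graph directly, something like the `edges` list would already be present in the instructions themselves, so the hardware would not need to rediscover it from register names.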
Thanks for reading.