BTW, this kind of architecture makes it easy to add multiple execution units. With parallel execution and careful use of shared and private functional units (FUs) and memories you can build a pretty damn powerful special-purpose processor without a lot of hardware complexity.
All the execution units (aka co-processors in modern parlance) are still attached to a single bus, so theoretical max throughput is still one instruction per cycle. So this only makes sense if the CPs perform complex operations - memory management, floating point, mul/div, or something of similar complexity. For the simple integer instructions that dominate typical code it's no better than a microcoded processor, since each ordinary instruction now takes several cycles on the bus: dispatch the operands, wait for the CP, collect the result.
Compare this to a pipeline, where each stage is what you'd consider an FU, but each one only needs to interface to the previous one, not counting the occasional clever bypass and special linkage. In a pipeline each FU can accept an input in one cycle and produce an output in the next - something "FU"s (CPs, really) on a shared bus can't do.
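To put rough numbers on the difference, here's a toy cycle-count model - my own sketch, not measurements of any real design. I'm assuming each shared-bus instruction costs three bus cycles (dispatch, execute, read back) and that the pipeline runs stall-free after an initial fill:

```python
# Toy throughput model: shared-bus co-processors vs. a classic pipeline.
# All cycle counts are illustrative assumptions.

def bus_cycles(n_instructions, cycles_per_op=3):
    # Shared bus: every instruction serializes on the bus -
    # dispatch operands, execute in the CP, read back the result.
    return n_instructions * cycles_per_op

def pipeline_cycles(n_instructions, depth=5):
    # Pipeline: pay the fill latency once, then retire one
    # instruction per cycle (no stalls or hazards assumed).
    return depth + (n_instructions - 1)

if __name__ == "__main__":
    n = 1000
    print("bus:     ", bus_cycles(n), "cycles")       # 3000
    print("pipeline:", pipeline_cycles(n), "cycles")  # 1004
```

Even with these generous assumptions for the bus model, the pipeline wins by roughly the per-op bus cost on straight-line integer code.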
The other drawback is that anything that sits on the bus needs to implement a generic bus interface and its own internal sequencing logic, and that doesn't come free. This is really just a CISC in silicon with exposed microcode, and it's pretty clear the author is thinking CISC all the way, as witnessed by the stack operations. The typical RISC approach is to have a link register where the return address is placed on subroutine calls, so leaf functions never have to push/pop it to the stack. Non-leaf functions save the link register to the stack frame in the prologue, once space has been allocated for it.
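As a sketch of why the link register pays off, here's a toy count of memory accesses for return-address handling - the conventions are standard RISC practice, but the accounting model is my own assumption (one access per push/save, one per pop/restore):

```python
# Toy count of memory accesses spent on the return address per call.
# Assumption: one memory access to push/save, one to pop/restore.

def stack_convention(is_leaf):
    # CISC-style: call pushes the return address, ret pops it,
    # whether or not the callee makes further calls.
    return 2  # push + pop, always

def link_register_convention(is_leaf):
    # RISC-style: call drops the return address in a link register.
    # A leaf function returns through it with zero memory traffic;
    # a non-leaf saves it to the stack frame once in the prologue
    # and restores it in the epilogue.
    return 0 if is_leaf else 2

if __name__ == "__main__":
    for leaf in (True, False):
        kind = "leaf" if leaf else "non-leaf"
        print(kind, "- stack:", stack_convention(leaf),
              "link reg:", link_register_convention(leaf))
```

Since leaf calls are common in real code, the link-register convention removes a large fraction of the stack traffic the CISC-style design pays on every call.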