Just curious, what's so wrong with branches with delay slots? Isn't that a more native way to look at a branch?
They're a pain for people on both sides of the ISA.
The compiler has to find an instruction that can run after the branch. This is normally trivial for calls, but for conditional branches within a function it's often difficult to find one. It has to come either from before the branch or from both basic blocks that can follow it, and the branch must not depend on it (because it's executed after the branch instruction). This means that you quite often end up padding the delay slots with nops, which bloats your instruction cache usage. On a superscalar implementation this is the only cost, but on a simple in-order pipeline it's also a completely wasted cycle.
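To make the compiler's side concrete, here is a minimal sketch of a delay-slot filler for a single basic block. It is not how any particular compiler does it: the `Instr` type and `fill_delay_slot` function are made up for illustration, and the sketch assumes every instruction writes at most one register and ignores memory ordering and side effects entirely.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: text plus register def/use sets."""
    text: str
    dest: Optional[str] = None      # register written, if any
    srcs: Tuple[str, ...] = ()      # registers read


NOP = Instr("nop")


def fill_delay_slot(block, branch):
    """Return block + branch + delay slot, hoisting one earlier instruction
    into the slot if that is safe, and padding with a nop otherwise.

    Assumption: only register dependencies matter (no loads/stores, no traps).
    """
    # Prefer the instruction closest to the branch, then walk backwards.
    for i in range(len(block) - 1, -1, -1):
        cand = block[i]
        # Everything the candidate would be moved past, including the branch.
        passed = block[i + 1:] + [branch]
        safe = all(
            cand.dest not in later.srcs                           # nobody reads its result
            and (cand.dest is None or cand.dest != later.dest)    # nobody overwrites it
            and later.dest not in cand.srcs                       # nobody clobbers its inputs
            for later in passed
        )
        if safe:
            return block[:i] + block[i + 1:] + [branch, cand]
    # Nothing movable: waste the slot.
    return block + [branch, NOP]
```

With a block of `add t0, a0, a1` followed by `slt t1, a2, a3` and a branch on `t1`, the `slt` cannot move because the branch reads its result, but the `add` can. If the block contains only the `slt`, the slot ends up holding a nop, which is the padding case described above.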
On the other side, it's a pain to implement. It made sense for the short pipeline in the original MIPS, where the branch was resolved early enough that you always knew the next instruction to fetch. A modern simple pipeline is 5-7 stages though, so your branch is still in register fetch (if it has even got that far) by the time the delay slot is needed. The delay slot doesn't buy you anything, and if you're doing any kind of speculative execution (even simple branch prediction, which you really need for moderately good performance) you have an extra dependency to track: you can't just use the branch as the marker and flush everything after it, you need to do some reordering. In a superscalar implementation, you need to do even more complex things in register renaming to make it work.
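A toy model of the flush problem, as a rough sketch only. It assumes in-flight instructions carry a program-order sequence number, that there is exactly one delay slot, and that the slot is never annulled. The `InFlight` type and both function names are hypothetical, and real hardware does this with pipeline tags rather than lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InFlight:
    """Hypothetical in-flight instruction tagged with program order."""
    seq: int
    text: str


def squash_no_delay_slot(pipeline, branch_seq):
    """No delay slots: on a mispredict, everything younger than the branch
    is wrong-path, so the branch itself is the flush marker."""
    return [op for op in pipeline if op.seq <= branch_seq]


def squash_with_delay_slot(pipeline, branch_seq):
    """One delay slot: the instruction right after the branch is executed
    whichever way the branch goes, so it has to survive the flush.

    Also reports whether the slot was already in flight. If it was not,
    the front end would still have to fetch it before redirecting.
    """
    slot_seq = branch_seq + 1
    kept = [op for op in pipeline if op.seq <= slot_seq]
    slot_in_flight = any(op.seq == slot_seq for op in kept)
    return kept, slot_in_flight
```

The second function is only one comparison longer, but that comparison is the extra dependency: the flush boundary is no longer "the branch" but "the branch plus whatever is in its slot", and every structure that tracks speculation has to agree on that.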