However, if you are aiming for high-end systems, conditional move may not be that big of a deal. See for example the following analysis from Linus Torvalds regarding cmov.
The problem with Torvalds' analysis (which is otherwise pretty good and worth reading) is that it only looks at local effects. The problem with branches is not that they're individually expensive, it's that each one makes all of them slightly more expensive. A toy branch predictor is basically a record of what happened at each branch, to hint what to do next time. Modern predictors use a variety of different strategies (normally in parallel) with local state stored in something like a hash table and global state shared with all branch locations. If the branch that you've added happens to hit the same table entry as another, then it may cause mispredictions elsewhere. This is horrible to try to model, because you have a performance cliff that moves depending on code layout (which is part of the reason why randomising the order of functions in a program can have around a 20% impact on performance).
The probability of mispredictions increases with branch density. Short branches have another issue, which is that typical superscalar processors don't make per-instruction predictions, they make predictions per fetch granule. If you have a 4-way superscalar processor, then you get one prediction for each 4 instructions. If you have two branches in there, then they'll be predicted together. This means that you really don't want a short branch immediately followed by another branch (or following a not-taken branch) unless you're really careful about alignment.
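To make the trade-off concrete, here is a minimal C sketch (my own illustration, not from Torvalds' analysis) of the two ways a selection can be compiled: as a conditional branch or conditional move, where the compiler decides, and as a hand-written branchless blend that consumes no predictor state at all.

```c
#include <stdint.h>

/* A compiler may lower this ternary either to a conditional branch or to a
 * conditional move (cmov on x86, csel on AArch64), depending on the target
 * and its cost model.  Compiled as a branch, it occupies an entry in the
 * predictor tables and can alias with other branches. */
int64_t select_ternary(int64_t cond, int64_t a, int64_t b)
{
    return cond ? a : b;
}

/* The same selection written branchlessly by hand: build an all-ones or
 * all-zeroes mask from the condition and blend.  This consumes no branch
 * predictor state, at the cost of always evaluating both inputs. */
int64_t select_branchless(int64_t cond, int64_t a, int64_t b)
{
    int64_t mask = -(int64_t)(cond != 0);   /* 0 or ~0 */
    return (a & mask) | (b & ~mask);
}
```

Both functions compute the same value; the difference is purely in which microarchitectural resources the compiled code leans on.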
Note that some processors have spectacularly bad implementations of either branch prediction or conditional moves. PowerPC is notorious for this, where the performance difference between the two representations varies hugely between different iterations by the same vendor (and more between vendors), so compiler tuning is almost impossible.
If the ARM instruction set was designed today, however, it is likely that the designers would not go crazy with conditional execution, since the bits could be better used for something else.
You can see this with ARMv8. Most of the predication is gone; there are basically just conditional moves (conditional select) and conditional branches. There was apparently a lot of argument about whether to allow conditional loads. These would be very useful, because most other conditional patterns can be reduced to a conditional move (calculate both sides and select the one that you wanted), but if the condition is 'pointer is not null' then you can't speculatively load and then decide not to take the trap. Another proposed alternative is a non-trapping load (i.e. one that returns zero if the load would trap), though this is difficult in the presence of swapping (the processor doesn't know the difference between a not-present page and a not-present-now-but-can-be-swapped-in page).
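A small C example of why the trap matters (my own illustration): the "compute both sides and select" rewrite is only legal when both sides are safe to execute, which a guarded load is not.

```c
#include <stddef.h>

/* Safe: the load happens only on the non-NULL path.  A compiler cannot
 * rewrite this as "int v = *p; return p ? v : dflt;" because the
 * speculative load would fault when p == NULL.  This is exactly the
 * pattern that a conditional (or non-trapping) load instruction would
 * let hardware express without a branch. */
int load_or_default(const int *p, int dflt)
{
    return p ? *p : dflt;
}
```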
Nokia killed themselves long before Microsoft got involved. Going with Windows Phone was a long shot that might have worked. Keeping their existing strategy of 6 independent teams all trying to sabotage each other and ignoring the fact that there was actual competition in the smartphone market was an even worse alternative.
Nokia had a well-designed kernel (Symbian EKA2) with a horribly dated set of userspace APIs, designed for a time when 2MB of RAM was a lot. Their solution? Replace the kernel with Linux and then fight internally about what the userspace would look like, so each line of phones needed apps specially written for it and had a shelf life of about 2 years before it was replaced with an entirely incompatible set of APIs that required large parts of the apps to be rewritten.
Outsourcing OS design to Microsoft was no worse than this, and if Microsoft had managed to persuade anyone to write Windows Phone apps it might have worked. There were a lot more third-party apps for Windows Phone than for the last couple of attempts by Nokia to build their own ecosystem, after a decade of pissing off everyone who used to like their stuff.
Exactly how do you expect conditional moves to be executed at the renaming stage?
The conventional way is to enqueue the operation just as you do any other operation that has not-yet-ready dependencies. When the condition is known, the rename logic collapses the two candidate rename registers into a single one and forwards this to the pipeline. Variations of this technique are used in most mainstream superscalar cores. The rename engine is already one of the most complex bits of logic in your CPU; supporting conditional moves adds very little extra complexity and gives a huge boost to code density.
This is a disadvantage if one expects that all processors are the same and that code optimised for one ISA (and likely microarchitecture) should run well on other ISAs. Really bad.
If you come along with a new ISA and say 'oh, so you've spent the last 30 years working out how to optimise this category of languages? That's nice, but those techniques won't work with our ISA' then you'd better have a compelling alternative.
That isn't the only way to solve that problem, in fact that sounds like a very bad design.
It is on RISC-V. For the J extension, we'll probably mandate coherent i-caches, because that's the only sane way of solving this problem. Lazy updates or indirection don't help, unless you want to add a conditional i-cache flush on every branch, and even that would break on-stack replacement (deoptimisation), where there is not always a branch in the old code but there is in the new code, and it is essential for correctness that you run the new code and not the old.
MIPS was killed?
Yes. It's still hanging on a bit at the low end, mostly in routers, where some vendors have ancient licenses and don't care that power and performance both suck in comparison to newer cores. It's dead at the high end - Cavium was the last vendor doing decent new designs and they've moved entirely to ARMv8. ImgTec tried to get people interested in MIPSr6, but the only thing that MIPS had going for it was the ability to run legacy MIPS code, and MIPSr6 wasn't backwards compatible.
Custom instruction support is a requirement for a subset of the market and it doesn't cause any problem
Really? ARM seems to be doing very well without it. And ARM partners seem to do very well being able to put their own specialised cores in SoCs, but have a common ARM ISA driving them. ARM was just bought by Softbank for $32bn; meanwhile, all of the surviving bits of MIPS were just sold off by a dying ImgTec for $65m. Which strategy do you think worked better?
Can't run the code from a microcontroller interfacing a custom LIDAR on the desktop computer? Who the fuck cares? Really?
How much does it cost you to validate the toolchain for that custom LIDAR? If it's the same toolchain that every other vendor's chip uses, not much. If it's a custom one that handles your special magic instructions, that cost goes up. And now your vendor can't upstream the changes to the compiler, because they break other implementations (as happened with almost all of the MIPS vendor GCC forks), so how much is it going to cost you if you have to back-port the library that you want to use in your LIDAR system from C++20 or C21 to the dialect supported by your vendor's compiler? All of these are the reasons that people abandoned MIPS.
How do the J2/3/4 open source SuperH designs compare?
I've not looked at SuperH in detail, so I can't really compare.
I seem to remember there were other pitfalls to their architecture, but getting a processor that is Management Engine (Aka Clipper+Palladium+TPM) free is a huge boon to the future of computer security
I disagree. A TPM, secure enclave, or equivalent, is increasingly vital for computer security. It is absolutely essential that you have some write-only storage for encryption keys in a coprocessor that will perform signing / signature verification / encryption / decryption, but which does not allow the keys to be exfiltrated. Anything less than this and a single OS-level compromise means that you need to reset every password and revoke every key that you've used on that machine.
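The property being argued for can be sketched in C (all names here are hypothetical, and the "signature" is a placeholder XOR rather than real cryptography): keys flow into the coprocessor and opaque handles come back out, but no API path returns key material, so an OS-level compromise can use keys while it lasts but cannot steal them.

```c
#include <stdint.h>
#include <string.h>

#define KEY_LEN  32
#define MAX_KEYS 4

/* Private to the "enclave": nothing below ever copies these bytes out. */
static uint8_t key_store[MAX_KEYS][KEY_LEN];
static int     next_slot;

/* Write-only key loading: the caller gets back an opaque handle. */
int enclave_load_key(const uint8_t key[KEY_LEN])
{
    if (next_slot >= MAX_KEYS)
        return -1;
    memcpy(key_store[next_slot], key, KEY_LEN);
    return next_slot++;
}

/* Placeholder "signing" primitive (XOR stands in for a real signature
 * scheme).  The key is used inside the enclave but never leaves it. */
int enclave_sign(int handle, const uint8_t msg[KEY_LEN], uint8_t sig[KEY_LEN])
{
    if (handle < 0 || handle >= next_slot)
        return -1;
    for (int i = 0; i < KEY_LEN; i++)
        sig[i] = msg[i] ^ key_store[handle][i];
    return 0;
}

/* Deliberately absent: any enclave_read_key() operation. */
```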
Having said all this: Is it perhaps time for a different CPU project, or a fork of RISC-V with these missing features added, at the risk of binary incompatibility, but to the benefit of performance and perhaps security?
There are lots of extensions to RISC-V, but the problem there is fragmentation. You need the A extension if you want to run a real OS. You probably need the C extension, because compilers are starting to default to using it. The M extension is useful, so people will probably start using it soon. Hardware floating point is expensive on low-end parts, so you're going to end up with some having F, some having D, and some having neither (this was a pain for OS support for ARM until recently - now ARM basically mandates floating point on anything that is likely to run an OS), and a few will support Q. L is unlikely to be used outside of COBOL and Java, so isn't too much of an issue (one is niche, the other is typically JIT'd so it doesn't matter too much if only some targets support it). And that's before there's any widely deployed silicon. Expect vendors to add their own special RISC-V instructions, making their own versions of toolchains and operating systems incompatible.
RISC-V isn't the first project to try this. OpenRISC has been around for a lot longer, but RISC-V managed to get a lot more momentum. I don't think that a competing project would find it easy to get any of this. It remains to be seen whether this momentum can translate to a viable ecosystem.
This is a rather tricky problem. It would take an organization like DARPA to solve it.
DARPA's SSITH program is attempting to address this, but I suspect that a complete solution is much bigger than a single DARPA program.
Fewer instructions make assemblers and compilers easier to implement.
I'll give you assemblers (though assemblers are so trivial that there's little benefit from this), but not compilers. A big motivation for the original RISC revolution was that compilers were only using a tiny fraction of the microcoded instructions added to CISC chips and you could make the hardware a lot faster by throwing away all of the decoder logic required to support them. Compilers can always restrict themselves to a Turing-complete subset of any ISA.
RISC-V is very simple, but that's not always a good thing. For example, most modern architectures have a way of checking the carry flag for integer addition, which is important for things like constant-time crypto (or anything that uses big integer arithmetic) and also for automatic boxing for dynamic languages. RISC-V doesn't, which makes these operations a lot harder to implement. On x86 or ARM, you have direct access to the carry bit as a condition code.
Similarly, RISC-V lacks a conditional move / select instruction. Krste and I have had some very long arguments about this. Two years ago, I had a student add a conditional move to RISC-V and demonstrate that, for an in-order pipeline, you get around a 20% speedup from an area overhead of under 1%. You can get the same speedup by (roughly) quadrupling the amount of branch predictor state. Krste's objection to conditional move comes from the Alpha, where the conditional move was the only instruction requiring three read ports on the register file. On in-order systems, this is very cheap. On superscalar out-of-order implementations, you effectively get it for free from your register rename engine (executing a conditional move is a register rename operation). On in-order superscalar designs without register renaming, it's a bit painful, but that's a weird space (no ARM chips are in this window anymore, for example). Krste's counter argument is that you can do micro-op fusion on the high-end parts to spot the conditional-branch-move sequence, but that complicates decoder logic (ARM avoids micro-op fusion because of the power cost).
Most of the other instructions in modern ISAs are there for a reason. For example, ARMv7 and ARMv8 have a rich set of bitfield insert and extract instructions. These are rarely used, but they are used in a few critical paths that have a big impact on overall performance. The scaled addressing modes on RISC-V initially look like a good way of saving opcode space, but unfortunately they preclude a common optimisation in dynamic languages, where you use the low bit to differentiate pointers from integers. If you set the low bit in valid pointers, then you can fold the -1 into your conventional loads. For example, if you want to load the field at offset 8 in an object, you do a load with an immediate offset 7. In RISC-V, a 32-bit load must have an immediate that's a multiple of 4, so this is not possible and you end up requiring an extra arithmetic instruction (and, often, an extra register) for each object / method pair.
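The tagging trick can be sketched in C (my own illustration; the object layout is hypothetical). The point is that untag-then-load is a single load at (tagged pointer + offset - 1) on ISAs whose load immediates can be odd, whereas an ISA that requires aligned immediates forces the -1 to be a separate arithmetic instruction.

```c
#include <stdint.h>

typedef uintptr_t value_t;

/* Low-bit tagging: integers have the low bit clear, object pointers have
 * it set, so the two can be distinguished in a single machine word. */
static inline value_t tag_pointer(void *p)     { return (uintptr_t)p | 1; }
static inline int     is_pointer(value_t v)    { return (int)(v & 1); }
static inline void   *untag_pointer(value_t v) { return (void *)(v - 1); }

struct obj {
    uint64_t header;
    uint64_t field;    /* at offset 8 */
};

/* On x86 or ARM this compiles to a single load at offset 7 from the
 * tagged pointer (the -1 folds into the load's immediate); with only
 * aligned load immediates, the -1 stays as a separate instruction. */
uint64_t load_field(value_t v)
{
    return ((struct obj *)untag_pointer(v))->field;
}
```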
At a higher level, the lack of instruction cache coherency between cores makes JITs very inefficient on multicore RISC-V. Every time you generate code, you must do a system call, the OS must send an IPI to every core, and then run the i-cache invalidate instruction. All other modern instruction sets piggyback this on the normal cache coherency logic (where it's a few orders of magnitude cheaper). SPARC was the last holdout, but Java running far faster on x86 than on SPARC put pressure on them to change.
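From the JIT's point of view the publish step looks the same everywhere, because GCC and Clang hide it behind a builtin; what differs enormously is what that builtin expands to on each target. A minimal sketch:

```c
#include <stddef.h>

/* Make freshly written machine code in [buf, buf + len) visible to the
 * instruction fetch path.  The builtin is portable; its cost is not:
 *   x86    - nothing to do (i-cache is coherent with the d-cache)
 *   ARMv8  - a short userspace cache-maintenance sequence
 *   RISC-V - a system call, an IPI to every core, and fence.i on each */
void publish_jitted_code(char *buf, size_t len)
{
    __builtin___clear_cache(buf, buf + len);
}
```

A JIT calls this once per compiled method, so the per-call cost is on its hot path.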
Licensing also matters a lot
This is true, but not in the way that you think. Companies don't pay an ARM license because they like giving ARM money, they pay an ARM license because it buys them entry into the ARM ecosystem. Apple spends a lot of money developing ARM compilers, but they spend a lot less money developing ARM compilers than the rest of the ARM ecosystem combined. Lots of companies benefit from sharing Linux and FreeBSD ARM development costs. This works because ARM severely restricts what you can do with an ARM core. You can't add custom instructions and you're increasingly required to use the same PIC interface and so on. This means that code written for vendor A's ARM cores will work with vendor Q's ARM cores. One of the big things that killed MIPS was their failure to enforce this. Every MIPS core had custom extensions and porting Linux + GCC from one MIPS vendor's core to another's was a fairly painful experience. Not only does RISC-V make this kind of fragmentation easier, the current structure of the ecosystem makes it inevitable.
What we anticipate seldom occurs; what we least expect generally happens. -- Benjamin Disraeli