Before a typical workstation class CPUs had evolved complex instruction sets. There had not been enough focus on measuring the frequency of use of the various features of the instruction set. When people started analyzing this (ie Patterson and Hennessy) they showed that the vast majority of software spent its time executing code from a tiny fraction of the instruction set. Obviously if you make those common instructions execute much faster, you can afford to remove the rarely used instructions and make the compiler generate a few simpler instructions in their place. Once these complex instructions were removed, it became easier to implement a well balanced instruction pipeline on a single chip. This was a big win. The ARM2 achieved 4 MIPS @ 8 MHz. Compared to a Motorola 68000 which was about 0.6 MIPS at 8 MHz. The chips were a similar cost in 1987. You could have got comparable performance from a 386 at that time, but it would have been much more expensive.
I'm not entirely sure why contemporary CISC designs failed to achieve good pipelining. I suspect that _correctly_ implementing a CISC instruction set back then was difficult even without considering performance. The digital design tools and methods of the time were very hard to use. Removing most of the instruction set freed up the digital designer's head so they could concentrate on performance.
By the 2000s though, it was perfectly possible to implement a pipelined CISC processor. One way to do this is to implement a RISC core with a front end that translates CISC instructions into RISC ones. This is what Intel did. The number of gates in the translation logic is significant, but nothing like as large as the number in the L1 and L2 caches that are integrated onto the die these days. The code density in x86 instructions is probably 25% better than a typical RISC instruction set. Therefore you can make the program caches 25% smaller. You probably save enough gates doing that as it cost to implement the translation logic. Another nice advantage of the translation layer is you can change the design of the RISC core whenever you like and no software needs to be ported to the new design.
My day job is R&D on the Kalimba DSP core used in various SOCs designed by Cambridge Silicon Radio. We've just added a translation layer front end to the core to implement a more CISC like instruction set. This improves code density by over 30% and therefore reduces the program ROM on the SOC by 30%. This reduces the overall cost of the SOC. And there's no performance penalty. For DSP like tasks our core is 2-10x higher performance per dollar and per watt than competing ARM designs.
My prediction is that ARM will hold on to the mobile market no matter how hard Intel try. Intel's fabs cost too much to run. TSMC do a much better job. I predict that ARM will gradually take the server and desktop market away from Intel.
> RISC is a superior instruction set. x86 only beat RISC because it was really the only game in town if you want to run Windows
Modern ARM processors aren't pure RISC processors. Most ARM code is written in Thumb-2, which is a variable length instruction code just like x86. Back in the 90s when transistor budgets were tiny, RISC was a win. When you only have a hundred thousand gates to play with, you're best off spending them on a simple pipelined execution unit. The downsides of RISC has always been the increased size of the program code and reduced freedom to access data efficiently (ie with unaligned accesses, byte addressing and powerful address offset instructions). With modern transistor budgets it is worth spending some gates to make the processor understand a compact and powerful instruction set. That way you save more gates in the rest of the computer than you spend doing this (ie in the caches, databuses and RAMs).
As a result of all this, in some ways, ARM chips are evolving to look more and more like an Intel x86 design. I'm still a big fan of ARM though. Intel will have a long way to go to compete on price, even if they can compete on power.