Comment Re:Dubious (Score 2) 164
Before RISC, typical workstation-class CPUs had evolved complex instruction sets, and there had not been enough focus on measuring how often the various features of the instruction set were actually used. When people started analyzing this (e.g. Patterson and Hennessy), they showed that the vast majority of software spent its time executing a tiny fraction of the instruction set. Obviously, if you make those common instructions execute much faster, you can afford to remove the rarely used instructions and have the compiler generate a few simpler instructions in their place. Once the complex instructions were removed, it became easier to implement a well-balanced instruction pipeline on a single chip. This was a big win: the ARM2 achieved 4 MIPS at 8 MHz, compared to about 0.6 MIPS at 8 MHz for a Motorola 68000, and the two chips cost about the same in 1987. You could have got comparable performance from a 386 at the time, but it would have been much more expensive.
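The kind of dynamic-frequency analysis described above can be sketched in a few lines. This is a toy illustration with an invented instruction trace, not Patterson and Hennessy's actual methodology or data:

```python
from collections import Counter

# Toy dynamic instruction trace (hypothetical opcodes and frequencies,
# not measurements from any real workload).
trace = (["mov"] * 500 + ["add"] * 300 + ["cmp"] * 120 + ["jmp"] * 60
         + ["mul"] * 15 + ["div"] * 4 + ["bcd_adjust"] * 1)

counts = Counter(trace)
total = len(trace)

def coverage(n):
    """Fraction of dynamic execution covered by the n most common opcodes."""
    return sum(c for _, c in counts.most_common(n)) / total

print(coverage(3))  # -> 0.92: three opcodes cover 92% of the trace
```

Even with made-up numbers, the shape of the result matches the RISC argument: a handful of simple operations dominate execution, so those are the ones worth making fast.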
I'm not entirely sure why contemporary CISC designs failed to achieve good pipelining. I suspect that _correctly_ implementing a CISC instruction set back then was difficult even before you considered performance: the digital design tools and methods of the time were very hard to use. Removing most of the instruction set freed up the digital designer's head, so they could concentrate on performance.
By the 2000s, though, it was perfectly possible to implement a pipelined CISC processor. One way to do it is to implement a RISC core with a front end that translates CISC instructions into RISC ones, which is what Intel did. The number of gates in the translation logic is significant, but nothing like as large as the number in the L1 and L2 caches that are integrated onto the die these days. The code density of x86 instructions is probably 25% better than that of a typical RISC instruction set, so you can make the program caches correspondingly smaller (around 20%), and doing that probably saves about as many gates as it costs to implement the translation logic. Another nice advantage of the translation layer is that you can change the design of the RISC core whenever you like, and no software needs to be ported to the new design.
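A translation front end of the sort described above can be sketched schematically. This is a hypothetical micro-op expansion, not Intel's actual decoder: a CISC-style instruction with a memory operand gets split into the load/operate/store sequence a RISC core can pipeline, while register-to-register instructions pass straight through:

```python
# Hypothetical translation front end: expand CISC-style memory-operand
# instructions into simple micro-ops for a RISC back end.
# "tmp0" is an invented scratch register, not a real microarchitectural name.
def translate(insn):
    op, dst, src = insn
    if dst.startswith("["):              # memory destination: load/op/store
        addr = dst.strip("[]")
        return [("load", "tmp0", addr),
                (op, "tmp0", src),
                ("store", addr, "tmp0")]
    return [(op, dst, src)]              # register form passes through

print(translate(("add", "[r1]", "r2")))
# -> [('load', 'tmp0', 'r1'), ('add', 'tmp0', 'r2'), ('store', 'r1', 'tmp0')]
```

The point of the pass-through case is that the common, simple instructions cost almost nothing to translate; only the complex forms pay the expansion cost.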
My day job is R&D on the Kalimba DSP core used in various SoCs designed by Cambridge Silicon Radio. We've just added a translation-layer front end to the core to implement a more CISC-like instruction set. This improves code density by over 30% and therefore shrinks the program ROM on the SoC by 30%, which reduces the overall cost of the SoC, and there's no performance penalty. For DSP-like tasks, our core gets 2-10x higher performance per dollar and per watt than competing ARM designs.
My prediction is that ARM will hold on to the mobile market no matter how hard Intel try. Intel's fabs cost too much to run. TSMC do a much better job. I predict that ARM will gradually take the server and desktop market away from Intel.