RickHGeek - Slashdot User

Comment Re:An intelligent comment on the subject (Score 2, Informative)

by RickHGeek on Friday October 11, 2002 @11:56PM (#4435720) Attached to: Revolutionizing x86 CPU Performance

By forcing a lookup table and its associated logic into the mix, you potentially are significantly reducing a processor's speed and/or scalability.

The added logic would primarily exist in the decode phase. Provided the decoders could be pumped with enough data to overcome the increase in code size such a model could potentially introduce, it would not be a problem. The internal logic units would have to be modified to deal with that kind of reference.

I posted a reply to the ChipGeek blurb on this subject (www.chipgeek.com) where I describe the type of engine required to execute this RM/RMC model. I visualize it like a round waterfall viewed from above. In the pool area leading up to the waterfall, all of the processing required to prepare the data for the logic units takes place. Data is pulled from the correct location in register space (a very simple process). It is resized to the appropriate operand size during the pull. It is tagged with an indicator that instructs a rapid-process retirement unit to write the contents back to register space after execution.
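To make the pull / resize / tag / write-back flow concrete, here is a minimal software sketch. Everything in it is an assumption made for illustration: the flat 64-byte register space, the little-endian layout, and the names Operand, pull_operand and retire are hypothetical and are not taken from the RM/RMC write-up.

```python
from dataclasses import dataclass

REGISTER_SPACE_BYTES = 64  # hypothetical size of the flat register space


@dataclass
class Operand:
    value: int       # data already resized to the operand width
    offset: int      # byte offset it was pulled from in register space
    width: int       # operand size in bytes
    writeback: bool  # tag that tells the retirement step to store the result


def pull_operand(space: bytearray, offset: int, width: int, writeback: bool) -> Operand:
    """Pull `width` bytes at `offset`, resizing the data during the pull."""
    raw = space[offset:offset + width]
    return Operand(int.from_bytes(raw, "little"), offset, width, writeback)


def retire(space: bytearray, op: Operand, result: int) -> None:
    """Write the result back to register space only if the operand was tagged."""
    if op.writeback:
        mask = (1 << (8 * op.width)) - 1  # truncate to the operand width
        space[op.offset:op.offset + op.width] = (result & mask).to_bytes(op.width, "little")


space = bytearray(REGISTER_SPACE_BYTES)
space[8:12] = (0x11223344).to_bytes(4, "little")

op = pull_operand(space, offset=8, width=2, writeback=True)  # 16-bit view of a wider register
retire(space, op, op.value + 1)                              # stand-in for a logic unit result
```

The point of the sketch is only that the operand carries its own location, size and write-back tag, so the retirement step needs no further lookup. Real hardware would do this with wiring, not byte slices.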

One thing that many people seem to be confusing is the concept of internal register renaming with what I'm doing. While it is arguable that what I've essentially done is introduce programmer-assigned register renaming, there is a distinct component to that renaming that most people seem to overlook completely (I've seen a few responders who nailed it). That is the fact that I, as the assembly programmer, or the compiler would be able to determine which registers propagate to which locations throughout the program. We have access to knowledge that a statistical runtime execution model does not. The x86 architecture provides almost no way of conveying known-at-compile-time information to the processor (except through the overall code design, following the rules dictated by the processor architecture), so the processor has to rely on statistical algorithms to arrive at an appropriate register renaming.

My proposal would allow that decision to be made by the programmer. After all, Intel's current modus operandi with IA-64 seems to be "let the compiler or assembly programmer dictate everything". They are no longer interested in employing all of the OOO execution models that the P6 core has provided. That's why Itanium performs so poorly on x86 code. It has a P5 engine which doesn't employ any of those hardware speedups. The same code executed in x86 mode on an Itanium, then recompiled in IA-64 mode, will run much faster after the recompile. Why? Because rather than executing the instructions one after another, the compiler has positioned the code in a manner which conveys as much parallelism as possible. The compiler made those decisions, not the CPU, and the performance benefits are there (see Itanium 2 numbers on a recent Ace's Hardware article: http://www.aceshardware.com/#60000436).

What I propose would require a modest redesign of the hardware. It would require a minor extension to the instruction set. I can visualize about 40 different ways to implement the broad strokes I painted with my feature (I didn't specifically name or assign opcode sequences; there are 3 unused bits in RMC which could be utilized to help in some way, etc.). There are several ways of arriving at the same final result in hardware. In my opinion it's up to people to explore the possibilities rather than criticize the idea. Personally, I like what AMD did with x86-64 and the REX override prefixes. In 64-bit long mode they threw out redundant one-byte opcode instructions that were duplicated by other multi-byte opcode sequences and reused them as a series of overrides which provide additional information about each instruction, and did so with a single byte.
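For reference, the REX scheme packs four flag bits into one prefix byte. In long mode the byte values 0x40 through 0x4F (the old one-byte INC/DEC register opcodes) are read as a prefix of the form 0100WRXB. The sketch below decodes just that byte; the function name decode_rex is made up for this example, and it does not attempt to decode the rest of the instruction.

```python
def decode_rex(byte: int):
    """Return the REX flag bits if `byte` is a REX prefix (0x40-0x4F), else None."""
    if byte & 0xF0 != 0x40:
        return None
    return {
        "W": bool(byte & 0x08),  # 64-bit operand size
        "R": bool(byte & 0x04),  # extends the ModRM reg field
        "X": bool(byte & 0x02),  # extends the SIB index field
        "B": bool(byte & 0x01),  # extends ModRM r/m, SIB base, or opcode reg field
    }


rex_w = decode_rex(0x48)     # the common REX.W prefix
not_rex = decode_rex(0x90)   # NOP is not a REX prefix
```

An override for the RM/RMC model could follow the same pattern, one byte carrying a few extra bits per instruction, though which byte values would be free for that is left open here.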

If that method were employed then the code size increase would be minimal. The only design point left to hit is how to redesign the core so the registers are in a central-access location rather than in remote locations on the chip. I'm not saying it wouldn't be difficult. But it would only have to be designed once, and all software written from that point forward would have the potential of benefiting from it.

- Rick C. Hodgin, geek.com
