Intel

P4 - The Art Of Compromise

Buckaroo writes: "Interesting article at EETimes on what Intel's architects originally had in mind for the P4 - 1 slow ALU, 2 fast ALUs, 2 FPUs, 16K of L1 cache, 128K of L2 cache, 1 MB of external L3 cache, etc. - It was all too big and hot, so a bunch of it got the chop." This article sheds new light on the reasons behind performance problems with the chip.
  • by Anonymous Coward
    why is it this way...
    Intel is deemed evil by Slashdot trolls; therefore, Intel is evil. AMD is in competition with Intel; therefore, AMD is better...

    I have yet to see AMD do anything worthwhile for the computer industry. They decreased the total cost of ownership for PCs, but in doing so forced a slowdown in technological advance. Think about it for a little while: AMD releases their first chip comparable in quality and performance to an Intel chip (the Athlon). Don't even talk to me about the K6 series; they were pathetic. Faced with this cheaper alternative, Intel responded with a 'marketing' release of a "new" processor, the PIII, thus delaying the release of the first processor to use a considerably new design since the PPro. PPro, PII, and PIII are all basically the same. Sure, a few added instructions, but the same basic design plan. The P4 is much different. I'd like to see something come of it instead of people belittling it because it is only 50% faster than anything else on the market. Shut up, people! The P4 is here, and it'll be here to stay. It's up to people to decide based on performance so that in the future, I can afford one...

  • by Anonymous Coward
    What are you going to connect it to? With L2 cache hit rates far above 90%, even SDRAM's jump from 66MHz to 133MHz offers pretty minimal performance gains, and faster RAM is dramatically more expensive. IIRC, PCI can keep up with IEEE-1394 and gigabit Ethernet (or if it can't, 66MHz PCI in servers can) and nothing else is even close to needing as much bandwidth as existing systems support.
  • Also, you talked about the SPEC benchmarks in your post. I'd like to stress that SPEC tests a platform, not just a CPU. It depends a lot on the memory subsystem and on the compiler. In the case of the P4, the good SPEC_FP scores are largely due to the large memory bandwidth and the use of some SSE2 code by the Intel compiler. I'd like to see scores using a more conventional compiler; it would show that these scores are due not only to Intel's architects, but also to their compiler team!

    You raise a very good point here. SPEC tests a system platform because that is what is relevant. No one cares about raw CPU speed, nor should they. The CPU is designed for the system in which it will function. Things like caches, instruction window size, branch prediction, etc. are all designed to tolerate the latencies of the expected system platforms.

    Using a "more conventional compiler" doesn't make any sense. What's a conventional compiler? The compiler and CPU designes should go hand-in-hand.

    --

  • However, it turns out that register renaming prevents a lot of the stalling that you'd expect with so few registers (write-after-write hazards vanish). Thus, while the small number of registers does degrade performance, it doesn't degrade it catastrophically.

    This is false. Lack of registers increases the number of required memory operations. Not only do you increase data cache bandwidth and occupancy, you do the same on the instruction cache as well. Instructions are not free, regardless of whatever "throw hardware at it" myth is popular today. Memory instructions also create headaches for the instruction scheduler. It's much easier to do dependency checking on static indices.

    There are more things to fetch, decode, schedule, execute and retire. What's good about that?

    It hurts. A lot. Check out some of our papers [umich.edu] on the subject, especially this one [umich.edu], which contains many references to other work.
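
    To make the point concrete, here's a small illustrative C sketch (hypothetical code, not taken from the papers above): a routine that keeps even a handful of values live at once already exceeds the eight x86 general-purpose registers, so the compiler has to spill some of them to the stack, turning what would be pure register traffic on a 32-register machine into extra loads and stores.

    /* Hypothetical illustration of register pressure: the pointer, the loop
     * counter, and six running values are all live across the loop. On an
     * 8-GPR x86 (with some registers reserved for stack/frame bookkeeping)
     * several of these end up spilled to memory; a 32-register machine can
     * keep them all in registers. */
    #include <limits.h>

    void stats(const int *a, int n,
               long *sum, long *sumsq, int *min, int *max)
    {
        long s = 0, sq = 0;
        int lo = INT_MAX, hi = INT_MIN;

        for (int i = 0; i < n; i++) {
            int v = a[i];
            s  += v;
            sq += (long)v * v;
            if (v < lo) lo = v;
            if (v > hi) hi = v;
        }
        *sum = s; *sumsq = sq; *min = lo; *max = hi;
    }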

    --

  • Great post! It's about time someone wrote it.
    The other negative of the x86 ISA, namely the paucity of compiler-visible registers, is indeed a problem, although one partially alleviated by rename registers and partially by evolutionary extensions to the x86 ISA, such as SSE2, which will eventually replace much of the god-awful stack-based x87 FPU ISA.

    Partially correct. Hardware register renaming does nothing to alleviate the low compiler-visible register count. The compiler still has to spill much, much more than it should on the x86. It is true however that SSE2 is a big improvement. Getting rid of the FP stack will really boost performance in that arena.

    Why not add more compiler-visible registers, one may ask? Well, the problem is the encoding size. The x86 ISA is very nice in the sense of code size. This matters when you consider ICache sizes. Intel can get away with smaller ICaches because more instructions fit into the cache for a given size than in a similar 32-register RISC machine. It's interesting that Intel essentially used the same trick with the trace cache. They compress the micro-ops so more of them will fit into the cache.

    Needless to say, "The designers thought of it, implemented it (which they did in this case), and it was a good feature (i.e. improved performance/cost on a majority of code), but then made a boneheaded decision not to use it," is *not* on the list.

    Amen! Intel engineers are not stupid. They removed these features for good reasons. Intel has a policy that a 1% area increase must result in at least 1% performance increase. This was obviously not the case with these structures.

    Frankly, I am stunned at what Intel has been able to do with the x86 ISA.

    --

  • However, I question how much of an _impact_ the bad side effects have in practice. It's non-negligible, but that leaves a lot of territory open.

    IMHO, many people underestimate the problems associated with lots of instructions. The effective window size of the processor is reduced and more memory operations wreak havoc on dependency checking, for example.

    However, most of your works focus on how the program use and physical performance of a register file of a given size may be improved.

    Really, we focus on how a large register file can be used, but you are essentially correct.

    The advantage of a relatively large register file (at least for the 64-vs-32 case) is found to be relatively modest (5%-20%).

    Is 20% modest? It's a heck of a lot more than most hardware optimization papers get. :) The paper you cite didn't include our more recent work on speculative register promotion and some windowing stuff we're looking at.

    You're right in that a 20% speed increase does not make a processor overwhelmingly more marketable. But remember that most of the speedup we see from generation to generation comes from the circuit technology.

    It is interesting as you point out that x86 seems to be holding its own against machines with larger register files. I suspect this may have something to do with circuit technology, better cache utilization and other such things. I only know from our own experiments that if we turn our register set down to 8 registers, things are really hosed. Not only is there lots of spilling, but the compiler has to throttle itself to avoid even more spilling.

    Granted, most of the spill code is going to be in the cache. One way to look at it is that this extra code forces the caches to be larger. Eliminating instructions (especially memory ops) reduces the size of the machine overall (caches, issue logic, etc.). This in turn will greatly reduce production costs.

    Ah the interplay...this is why I love sitting on the fence! :)

    --

  • OK, so you put in yet more parallel memory buses, and you get, say, 6GB/sec transfer. So what? Your latency is just as bad as with 3 GB/sec, and for most things you're likely to do, that's more important.

    That is to say, it's more important, at this stage, to reduce the time required to start reading something from memory. Basically, if you're reading in lots of relatively small data, a lower latency, lower bandwidth memory architecture will finish before the higher latency, higher bandwidth architecture.
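
    A rough sketch of that distinction (illustrative C; the structure and function names are made up): a pointer-chasing walk pays the full memory latency on every step no matter how wide the bus is, while a plain sequential scan is the access pattern that can actually soak up extra bandwidth.

    /* Sketch only: pointer chasing is latency-bound (each load depends on
     * the previous one, so extra bus bandwidth goes unused), while a
     * streaming sum is bandwidth-bound. */
    #include <stddef.h>

    struct node { struct node *next; long payload; };

    long chase(const struct node *p)          /* latency-bound */
    {
        long sum = 0;
        while (p) { sum += p->payload; p = p->next; }
        return sum;
    }

    long stream(const long *a, size_t n)      /* bandwidth-bound */
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }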
    --
    Change is inevitable.

  • Just kidding....
  • That seems logical, since the AMD processor is faster: you would want to run the slow software on the fast box and the fast software on the slow box to end up with two equally fast computers. :-)
  • Or, in short, IA32 itself has nothing to do with the P4's lack of oomph. Which should be obvious, since the things it's being compared unfavorably to are other x86 processors!

    Eeh? What about the fact that you're limited to 8 registers, then? Sure enough, a modern x86 CPU has many more internally (using register renaming), but wouldn't it be nice having the option to compile your code to, say 40 registers instead of 8? Or what about the lack of a properly laid out FP register system, instead of that quite stupid stack layout?

    /* Steinar */

  • Huh, how is that article more informative? It has exactly the same information as the original one. Well, yes, it has a better title ;-).
  • Let's face it: no matter how good a chip is, most people just assume that clock speed is what is important. The marketing people want higher clock speeds, so that's what they are going to get.
  • If he posted ilms it would have been a troll.
  • so you bash rambus and lust after them all in the same post? that's rather interesting, i must say....
  • they're already on at least a "divide by 7" circuit... at least for the 933 and 1.13 GHz versions...
  • Screeched to a halt at 400 MHz? It's barely even reached that milestone yet... The reason CPU MHz is racing past the system/memory bus speed is quite simple: the circuits are a lot shorter and more tightly defined/controlled within a CPU than across a motherboard. Frequencies that run okay inside a CPU would fry a motherboard, because so much more voltage would be needed to drive them at that speed over the additional distance.

    Quit griping about the low levels of the computer unless you actually have a reason to. And by that I mean: if you're running a Linux kernel with a debugger attached and it says that 90% of the time the cache is empty and the CPU is waiting on the memory bus, then you'll have a gripe. But most operations these days are taking place inside the CPU, and the chip is not spending a *great* deal of time waiting for new data, except in extreme cases...
  • Actually, the PII was the major setback. The PPro was optimized to run 32-bit code. Intel spent years developing it, thinking that 32-bit software would be the standard by then. With the half-assed Win9x release from MSFT, Intel was forced to redo the PPro as the PII to run MS junk.

    It was MSFT, with their crappy OS, that put Intel behind in development. AMD just took advantage of it.

    JON

  • > As it is, PReP/CHRP is moribund (so there are no commodity PPC chips or boards on the market) and in the end IA64 may finish it off.

    I very much doubt it - PPC is used a lot in embedded stuff because of its low power use, and a G4 makes a useful DSP. It could one day disappear from desktops and higher though. Despite any technical merits, the G4 has been a marketing disaster - people just see 500 MHz and think "slow".

    As for Apple's scorched earth policy - I agree that they handled it really badly, but I think overall they didn't have a lot of choice about the clones, although it still would have been nice to see CHRP, even if you needed more to run Apple software. PowerComputing was starting to undercut their hardware sales quite badly, and contrary to what I have seen suggested, Apple wouldn't effectively survive as just a software company.
  • The poster wasn't comparing RISC and CISC. He was comparing instruction throughput of one generation of chips to the previous and noting that P4 was going backwards, not forwards.

    john
  • No.

    A 2x die size saving results in only a 5% drop in performance because we're in the realm of diminishing returns. Chip designers are trying to squeeze the maximum performance out of chips, and have to resort to cleverer and cleverer ideas and bigger and bigger areas of silicon.

    john

  • works

    http://www.eetimes.com/story/industry/semiconduc to r_news/OEG20001213S0045
  • if you remove the space
  • Don't forget about the 23% who called it "Pentium Ivy", and got all excited about Drew Barrymore...
    --
  • If I'm right, the Motorola 68040 used an oscillator or crystal that ran at twice the CPU speed, i.e. a 33MHz proc had a 66MHz crystal.

    I'm too lazy to check the facts from the Motorola website, but that's how it's done on (at least) HP 9000/380 machines.
  • Put it this way: for the purposes of a simple explanation, you need to say how many Hz it is. That's as much detail as most people want or need to know. What is that number?

    No, when I'm being told the *technical details* about something, I demand accuracy! It infuriates me when some techie tells me that "I don't need to know" or that "I don't want to do that".
    --------

  • It doesn't make much engineering sense to make the bus speed faster than what can be accommodated by current or near-future DRAMs.
  • Yes. Every new architecture core that comes out (486, Pentium, P6, K6, Athlon, and now P4) represents a geometric increase in density and complexity over the generation before it.

    Wozniak had it easy, relative to microprocessor architects nowadays.

  • I'm curious to see what a manual like that looks like.

    Later,
    ErikZ
  • This is what I said, wasn't it? My posts must be confusing or something.
  • I get modded down for apologizing?
  • I always wondered why Intel didn't continue its 486 line after the Pentium came out. The higher clocked 486s would make for a good "Pentium SX" kind of thing. I mean, just shrinking the fab process alone would probably allow them to hit around 150MHz - 200 MHz (eventually) without significant core revisions. The 486s probably wouldn't cost much more to manufacture, and thus Intel could probably get similar margins to the Pentium and ensure rapid obsolescence. I also wonder that about the Pentium line: I'm pretty sure Intel could have taken it above the 233 mark if they really wanted to. I guess when you're #1 you play by your own rules.

    And what you say about buying cheaper CPUs to afford more RAM: thank you. I make this point to every associate who thinks buying some 733 MHz "essentials" computer that only has 32MB of RAM and a 2MB video card is a good idea. I tell them a Pentium Pro 200 with 128MB RAM and a decent video card will outperform that joke of a machine.
  • I never liked this chip and AMD's claims at how "fast" it was. Yes, it was seven megahertz faster than Intel's 33 MHz, but it yielded only about a 12% improvement, about half as much as its clock speed indicated. In addition, it ran absurdly hot and had many more assorted troubles than Intel's chips. I cannot remember how much this chip cost in relation to the Intel 33 MHz, but if it was cheap enough, it would have made a good personal computer. But this chip stunk for production use.

    Sorry, your post is absolutely correct, but I felt compelled to post this rant for some reason.
  • This almost always happens when Intel releases a new core. I think the 386 was worse than the 286 at 16 bit code, even.

    It's very difficult to build an architecture that is both more efficient and retains 100% compatibility (performance and otherwise) with previous generations. Sure, that Athlon TBird 1.2 GHz will kill a 286 in any 16-bit app, but I bet if you break down performance per megahertz, the 286 may very well come out on top.
  • It's the same difference as between the 8088, the 80186, and the 80286. Also the same difference as between the Pentium and the Pentium Pro (II/III/Celeron/Xeon, etc.). I don't define what a generation is (that's up to Intel's marketing), but I'm sure a "generation" means a new number in the x86 line. To be fair, though, each generation that has been given a new number has been worthy in my opinion. The only exception to this would be the original 186, but the souped-up low-power embedded clones that run at 40 MHz can compete with 386s! Likewise, the 486 ran approximately 100% - 150% faster than a 386 at the same clock speed, so I'd call that "generation worthy."

    As for your statements to the 386 and 486, they're a bit off. Yes, a 486SX had no math co-processor and the 486DX did, but the SX/DX in the 386 era referred to its data bus. The 386SX had only a 16 bit data bus while the DX had a 32 (I think) bit one. While the average 386DX did come with the 387 processor, you could not assume this from the DX name. Intel's naming standards are so clear, aren't they? I especially liked how they named the 486 DX/3 the 486 DX/4 because the 4 "reminded people of a 486." Right.
  • 80386: 1985 (this one is tricky because it was released well before it was actually used)
    80486: 1989
    80586: 1993
    80686: 1995

    I think since the Pentium 4 is out now, we can assume it's not to be released in 2001 :)
  • The article said that a new architecture came out about every five years. Not to get nitpicky, but seven archs in eighteen years is quite a far stretch from "every five". Maybe their latest change was five years (which it was), but certainly not the ones before it. Were this statement true, we'd be hitting the 486 in 2002.
  • I think that is what Intel wanted to imply also, yes. I can't back that quote of mine up, but I read it in a magazine a long, long time ago. It was one of those 486 FAQ deals, and one of the questions was something like "Where is the DX/3?" and the magazine said that Intel named it the DX/4 to "remind" people of the 486. I don't know exactly what kind of reminding those numbers serve; all I know is that the DX/4 had a 3x multiplier.
  • The FPU on the P4 is already quite large, much larger than the ALUs anyway. The original design would have indeed been very large, with twice that area.

    I couldn't get a real picture of the P4 die; best I could manage is the cutesy little colored rectangles on page 6 of this Intel PDF [intel.com]. Point is, assuming an overall colored rectangle size of 217mm^2, the "Enhanced Floating Point/Multi Media" section comes out to under 17mm^2 by my crude measurements. And I frankly doubt that when they say that adding another FPU would "double the floating point size", they actually mean double everything in that little teal box. Even assuming I'm wrong, 16.5mm^2, while certainly bigger than the ALUs (and don't forget, this "floating point" box includes integer SIMD execution as well) is a mere 7% of total die size. While this is somewhat significant, if they really wanted it in they certainly could have made room for it. As a percentage of overall die space it's much smaller than the P3's FPU.

    What I saw instead was the admission that adding the extra FPU would have added an extra stage to the pipeline (extra decoding step). It may be that the pipeline was not well balanced with this extra stage, or that it was still in the critical path even with its own pipeline stage, or just that they thought 19 stages (not including those outside the trace cache) was enough.

    In any case, I'm not at all convinced that this decision had anything to do with die size; rather, it was about rampability and overall IPC. Indeed, as I said, with properly compiled code, the P4's "crippled" FPU is able to scream along, keeping up just fine with its 3.2 GB/s memory bus. Considering most P4s will have higher clock speeds and less memory bandwidth, why add extra FPU units? About all the Athlon's two extra FPUs do for it is help in cache-constrained toy benchmarks. In the real world, FPU work increasingly means data sets too large to fit in on-chip cache, and a single FPU becomes more than adequate to keep up.
  • That is right, although it doesn't mean that. Of all the 4 word combinations starting with I L R H, you have to pick that one. =D

  • "not posting anonymously to not preserve my charma"

    =D

  • Any of the Live! series, almost anything made by Turtle Beach..
  • The Yamaha SW1000XG/DSP Factory combo is superb (but pricey).
    --
    Cheers
  • Here are the facts.
    Intel has been making processors for a very long time, and people have come to rely on them as a quality company. But as of late they have been having trouble, and people are looking for alternatives to using Intel products.
    Microsoft has been making operating systems for a while now, and they have been trusted as a good company. Microsoft makes other software that people also like. Lately people have been getting fed up with Microsoft and are looking for alternatives.

    AMD is somewhat of a surprise company. At first many people did not rely on them and did not think that they made quality processors. Now many people are starting to support AMD because other companies, mainly Intel, are going down.
    Linux is an upstart operating system that some people still know nothing about. Many people are learning about Linux every day and are accepting it as an alternative to Microsoft and its products.

    Look at the facts: that's a very uncanny resemblance.
    >neotope

  • I like that.... slag someone off about tech-ignorance, and then be talking about negative kelvin. Nice one. hehehehe....

    Buckets,

    pompomtom
  • That 400MHz figure is a fairy tale... it's a 100MHz bus that transmits 4 whatevers per clock cycle... They're flat-out lying... Some people might ask what the difference is, but if they didn't think there was a difference, they'd say "it's a 100MHz bus that transmits four words per clock" instead of "it's a 400MHz bus".
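
    For what it's worth, the arithmetic comes out the same under either label (assuming the usual 64-bit data path): 100 MHz x 4 transfers per clock x 8 bytes per transfer = 3.2 GB/s, which is the peak bandwidth figure quoted elsewhere in this discussion.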
  • It's a case of fir$t po$ter syndrome: press the button quick! This thread brought to you by lameness. If you have an opinion about it, you should Meta Moderate [slashdot.org] on a regular basis to be sure lame moderators get the axe.

    Hey is that why I haven't been a moderator in a while?

  • Three points:
    • my K6-III is performing fine, thank you. Got me much more bang for my bucks than anything else would because I didn't have to buy a new motherboard.
    • Uh... competition is now supposed to hamper innovation? Where did you get that from, the microsoft website?
    • Where the heck do you get the "50% faster than anything else" figure? Last time I checked, people dissed the P4 because it was slower than all the other recent chips at similar clock rates.
  • Sorry, but you can get good performance with few registers. As someone who worked in the aerospace industry on F15/F16 engine controls, I can attest to this. We had CPUs that didn't even have stacks and had 2 registers (yes, 2). We had 64K programs running 4 times a second, doing 255 A/D conversions. So don't blame bad performance on lack of registers. Blame it on bad programming/bad compilers.
  • It was obvious to me that the P4 was an unfinished product even before this statement:

    1) no multi-CPU support
    2) new SSE2 FPU, but the old x87 FPU is slower
    3) custom socket that will be replaced
    4) small caches
    5) very big heatsink

    It is like software: Intel engineers are making a new CPU for .13 micron technology, but the marketing people say "we need a new CPU now to make some money," so Intel put out an unfinished/unoptimized CPU because some "stupid" people will surely buy it.
  • Nope. I'm running Linux on my Intel box and Windows on my AMD box. Mwahahahahaa.
  • It appears that he stuck in a hyperlink with a very long, all lower case HREF to reduce the percent of the comment that was in caps to below the filter threshold. He avoided having the link actually work by giving it a zero length field to link to, so your browser doesn't display it. It appears that the lameness filter needs some work.

  • Fast, Stable, Cheap: Choose one.

  • Care to explain the difference between the 386 and the 486 to me? From what I heard, it was just a few tweaks. The math coprocessor wasn't added until the 486DX, and I think there were 386DXs as well. The SX of each generation being those without the coprocessor. Of course, correct me if I'm wrong.
  • The US-IIs in the Ultra Enterprise servers come with up to 8 MB of cache. Check out this [sun.com] for more info. Admittedly you're not going to be using a US$15,000 processor for your workstation, but still....
  • So, here is a mirror [johncglass.com]

  • For an IMO interesting technical discussion of this issue, you might want to see this thread over at Ace's Hardware. (See if you can guess who I am. :)
    Wally: "I think I see it!

    Alice: "It's not the magic eye, doofus."

  • HTTP/1.1 500 Internal Server Error Date: Fri, 15 Dec 2000 03:47:55 GMT Server: Apache/1.3.12 (Unix) mod_perl/1.24 Set-Cookie: Apache=172.16.45.1.15124976852076308; path=/ Last-Modified: Wed, 28 Oct 1998 17:52:37 GMT ETag: "335ac-189-363759e5" Accept-Ranges: bytes Content-Length: 393 Connection: close Content-Type: text/html We're sorry! Your request has generated an error of some kind. The error has been logged and will be examined promptly by our technical staff. We apologize for the inconvenience.

    Looks like they've been slashdotted.

  • Maybe not, but cost analysis is engineering. Don't believe me? Next time your manager asks you to outline your approach to a new problem, present him with something that requires 10,000 developers and a $60 billion equipment expenditure.

    Exactly, but cost analysis in this case would be the job of engineers. And the engineers are the ones who would have the best idea of the die size/cost in advance. So, if someone says "the engineers really wanted to build something 2x bigger but we wouldn't let them," (1) this is nothing new and (2) he's implying that his engineers are idealistic and out of touch with business reality. All engineers dream of the next bigger and better thing. So the remaining question is: what was the real reason they had to cut it down? Looking at the die photo, the L2 and FP units are not that big compared to all that pipeline logic.

    Yes boss, our next server should use the Hoover dam as a power supply, and hand-wound relays instead of transistors for the processor core. Actually, I'd kind of like to see that...

    I've seen more project proposals of the type "give me 3 engineers and some time" and we'll come back with something bigger than the Hoover dam.

  • Ah! Real discussion!

    The "Ideal RISC vs. real-life x86 CISC" argument is well founded. The paper-RISC ideals that people have in their heads are much better than the legacy x86 compatbile stuff, obviously. So don't be fooled by the blank-ISC is better. There's solid proof that Intel's IA32 cores would be far more powerful if they didn't tote legacy hw.

    Let's say you took Intel's IA-32 and pruned out the following:

    * 8/16-bit operands and addressing modes
    * x86 floating point instructions
    * prefixes/overrides

    90% of the decode hardware is simply there to handle this legacy crap, at least according to Intel at ISSCC'96.

    The decode pipeline alone on the P3 (don't know about P4) would be 4x smaller. That is a huge performance gain. So it isn't CISC vs. RISC, it is 20-year-old legacy CISC vs. clean-slate RISC. The fact that the former holds its own against the latter in its current state is a powerful message.

    Thing is, the NT kernel never even has to use the three bullets above, but the poor decoder has to be able to handle all of it anyway. Sad.

    As for loop unrolling, bundling and register renaming, try this: write some trivial C code with some loops and doubles and fp math. Compare the ASM code compiled by the MSVC 6 (barf) default compiler, and Intel's beta proton 5. You'll be surprised at how weird the resulting code looks, and how much faster it runs.
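
    Something along these lines would do as the "trivial C code" being suggested (the kernel itself is made up purely for illustration):

    /* A trivial FP kernel of the sort suggested above: enough loops and
     * doubles for two compilers to schedule (and possibly vectorize) quite
     * differently. Purely illustrative. */
    #include <stdio.h>

    int main(void)
    {
        double a[1000], b[1000], dot = 0.0;

        for (int i = 0; i < 1000; i++) {
            a[i] = i * 0.5;
            b[i] = 1000.0 - i;
        }
        for (int i = 0; i < 1000; i++)
            dot += a[i] * b[i];

        printf("%f\n", dot);
        return 0;
    }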


    ---
  • we can only crank out so much from today's motherboard/x386 technology

    oops...I meant x86 technology. Obviously the 386 can only crank out so much :)
  • Designers Dream

    Engineers try to make it work

    Bean counters fsck it all up

    Marketing says it's the great solution

    Customers buy it anyway

    Customer service is the last to know of any problems or design changes

    Even at AMD...

    --

  • But I gotta have my Everquest and other various computer games. :) The VirtuaPC software is a great x86 emulator and I would be content to use it for all Windows-only software, only it simply cannot handle games. (Trust me, I've often tried to find ways to rationalize a move to Apple for, among other reasons, the Cinema Display. :)
  • Am I the only one who remembers that, like 3 years ago or so, CPUs were starting to get ~32-64 KB of L1 cache and 512K of L2 cache? I don't know where the *fuck* things took a 180-degree turn, but they did, and I'm just wondering why I've never heard any commentary on it. It freaks me out big time. That L1 and L2 cache helps *hugely*.
  • Yes I'd like 1 slow ALU, 2 fast ALUs, 2 FPUs, 16K of L1 cache, 128K of L2 cache, 1 MB of external L3 cache and an order of fries and a coke...what?! You say it's gonna be $300 more than the place across the street?!
  • This is nothing new in my opinion. Intel obviously rushed through this project to try and contend with AMD. If Intel keeps this up, they'll end up like 3dfx.
  • by Anonymous Coward
    Also, the name of the chip was supposed to be "Pentium IV", but their studies indicated that 32% of the surveyed Americans pronounced it "Pentium Eve", and another 26% pronounced it "Pentium I've". So they decided to write it "Pentium 4" instead.
  • Your latency is just as bad as with 3 GB/sec, and for most things you're likely to do, that's more important.

    It's all a matter of cost. If cost isn't a problem, prefetch can be used along with cache to help minimize the latency issue, especially if there is a prefetch instruction so that the compiler (or even the programmer, through a pragma) can issue prefetches explicitly.

    I imagine the issue there is related to the extra hardware complexity needed to make that happen.
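
    A minimal sketch of the idea, using GCC's __builtin_prefetch as a stand-in for whatever hint the hardware and compiler actually provide (the lookahead distance of 16 elements is an arbitrary, illustrative choice):

    /* Sketch of software prefetching: ask for data a few iterations ahead
     * so the memory latency overlaps with useful work. The hint is advisory
     * and the code is correct even if the prefetch is dropped. */
    long sum_with_prefetch(const long *a, long n)
    {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16]);   /* hint only; safe to omit */
            sum += a[i];
        }
        return sum;
    }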

  • Sigh. What was that .sig about the label "insightful" saying more about the moderator than the moderated?

    Look, we've been over this before, but I'll say it again. Yes, Intel's ISA sucks. No, it isn't "slow". IA32 hasn't been directly implemented in a CPU core since the Pentium MMX. The only parts of the P6 and later designs that deal with x86 are the decoders. Internally it uses RISC-like micro-ops, which is convenient since most compilers only use the simpler x86 instructions. In the P4 the situation gets even better, since the trace cache holds decoded information -- the decoders aren't even in the critical path! Which is why x86 processors are able to compete successfully with RISC cores on performance.

    Or, in short, IA32 itself has nothing to do with the P4's lack of oomph. Which should be obvious, since the things it's being compared unfavorably to are other x86 processors!

    Ahem. As to the 'not much performance decrease'... Well, maybe a re-read is in order, or at least a re-think.

    The 5% was for cutting the FPU's area in half, not the whole chip! 5% is a huge effect on overall performance for a change in just one part of the architecture. For something that relied solely on FPU performance (the photoshop and 3dsmax benchmarks AMD does so well on), it would certainly be much more than 5%.

    And that was just one number for one change. That no other specific numbers were given doesn't imply that they were 0%. I'd say it's more likely that they are larger than 5%.

    Originally, the L1 data cache was supposed to be 16K, accessed in 1 cycle. That wouldn't work, so instead of increasing the access time they cut the size in half. I guarantee you that cutting the size of the L1 in half has a big impact on performance.

    An off-chip L3 would have been nice, too. Especially when paired with high-latency RDRAM. This would have had a huge impact on performance, especially in benchmarks that are sensitive to memory latencies. Doubling the size of the L2 (but increasing access time as a result) probably doesn't mitigate this much.

    The P4 of the Intel architects' dreams would have smoked. Instead we have what we have. x86 has nothing to do with it... economics and engineering reality do.

    Lastly, Itanium is going to suck. Intel has said as much themselves. It's neat technology, but not well designed. It's the Daikatana of the chip industry -- a running joke that some people hope will come off well anyway, but they will inevitably end up disappointed.
  • However, it turns out that register renaming prevents a lot of the stalling that you'd expect with so few registers (write-after-write hazards vanish). Thus, while the small number of registers does degrade performance, it doesn't degrade it catastrophically.

    [Emphasis added.]

    There are more things to fetch, decode, schedule, execute and retire. What's good about that?

    As clearly stated in my original post - nothing at all. However, I question how much of an _impact_ the bad side effects have in practice. It's non-negligible, but that leaves a lot of territory open.

    It hurts. A lot. Check out some of our papers on the subject, especially this one, which contains many references to other work.

    Done. I compliment you on your fascinating approaches to register use optimization. However, most of your works focus on how the program use and physical performance of a register file of a given size may be improved. The dependence of performance on register file size is only studied in one document ("The Need for Large Register Files in Integer Codes"), and the advantage of a relatively large register file (at least for the 64-vs-32 case) is found to be relatively modest (5%-20%).

    A factor of two speed difference makes a processor unmarketable. A 20% speed difference doesn't (witness the holy war still going on between Intel and AMD proponents).

    The effect of a small register file is undoubtedly more severe as size decreases, but I have yet to see evidence of truly earth-shattering performance impacts. Circumstantial evidence suggests that the effect is not earth-shattering (SPECmarks for high-end workstation chips fail to thoroughly trounce SPECmarks for x86 chips for comparable configurations, and the PowerPC architecture fails to blow x86 out of the water).

    Most certainly, a larger register file is nice, and causes a speed improvement - but the effect of a small register file does not seem to be as devastating in practice as you appear to be suggesting above.
  • You raise several good points; however, it turns out that there are a few mitigating factors.

    First, the lack of registers in the x86 architecture. Having a fast cache is great, but it's not as fast as a register, and it takes extra instructions to load and store

    This is true, and greatly hampers things like loop unrolling on the x86. However, it turns out that register renaming prevents a lot of the stalling that you'd expect with so few registers (write-after-write hazards vanish). Thus, while the small number of registers does degrade performance, it doesn't degrade it catastrophically.

    Second is the relatively finer granularity of the instructions available on a RISC architecture. Although there is some merit to making decisions based on information only available at runtime, that isn't a big factor with today's technology. What a modern x86 looks like is a microcode architecture with somewhat intelligent scheduling of the instructions. In most cases a compiler could do a better job.

    Actually, since the Pentium Pro, x86 processors have been fundamentally RISC-ian. x86 instructions are decoded into "micro-ops" (Intel's term), which are essentially RISC instructions. These can be scheduled by the processor as effectively as RISC instructions.

    The decoding adds latency, but that's what the P4's "trace cache" is for. Arguably, a compiler with access to the underlying RISC instruction set could do better scheduling, but in practice the gain is marginal (especially since most people don't seem to use really good compilers). I also have a sneaking suspicion that basic blocks in most code are small enough to fit inside the processor's scheduler window, which means that the compiler probably _wouldn't_ do a better job in most cases than the hardware scheduler. Higher-level transformations like loop unrolling have benefit even if done at a CISC level.
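
    For what it's worth, here is the kind of source/CISC-level transformation being referred to, in purely illustrative C: the 4-way unrolled form exposes independent multiply-adds to the scheduler, at the cost of more simultaneously live values (which circles back to the register-count problem discussed above).

    /* Hypothetical illustration of loop unrolling. Note that reassociating
     * FP additions changes rounding slightly, which is one reason compilers
     * won't do this on their own at default settings. */
    double dot_rolled(const double *a, const double *b, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }

    double dot_unrolled4(const double *a, const double *b, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)                      /* remainder */
            s0 += a[i] * b[i];
        return (s0 + s1) + (s2 + s3);
    }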

    In summary, I'm not sure there's a very big performance hit from the instruction granularity (just a silicon hit).

    I am impressed with your knowledge of the subject, though.
  • This is the best thing I've read on /. in weeks.

    (jfb)
  • Is there any generation of processor development in which this sort of thing hasn't happened?

    There seem to be two constants in processor generations that I can see:

    (1) The new generation always takes longer, runs slower, and has more things taken out than was originally suggested,
    (2) the old generation gets ramped up to clock speeds way beyond what was originally anticipated while we wait.

    I remember when the PowerPC G4 was going to have a lot of changes including multi-core and run at most of a GHz - what we eventually got was in some ways a smartened up (mostly fp & memory improvements) G3 + altivec. Meanwhile IBM keeps making the G3s faster and faster.

    Then of course there is the story of how the whole x86 architecture wasn't supposed to get this far before being replaced by something with less cruft.
  • Stoopid < character - have to wait for Slashdot to allow me to post, and then not tell me that the comment has been posted already... Knew I should have Previewed the post...

    especially servers (which is currently the only thing P4s are worthwhile in right now)

    Erm, riiiight. The lack of dual and quad processor capable P4s and P4 motherboards is a major reason why the P4 will not be used in servers in any company that thinks its policies through.

    The fact that the CPU and chipsets are also unproven will mean that corporates will hold back. They also realise that for servers, the FPU performance is not an issue, nor is the presence of amazing SIMD capabilities. Multi-processor capability, yes; CPUs with lots more cache than 256K, yes.

    The power requirements of the P4 are also staggering - up to 75W. Compare this with the "too hot" Athlon at 40-60W, and with the <20W Palomino 1.5GHz coming out in March/April next year... AMD760MP chipset... Foster not arriving for another year... PIIIs not getting any faster... AMD have to leap at the chance to get a foothold in the multiprocessor server market in the next 6 months.

    I for one will jump at the chance to build dual 1.5GHz Palomino-based servers (and home computers!) that use less power than ONE P4 at 1.5GHz, use DDR SDRAM rather than RIMMs, cost less, and have over a year's market life behind them so they can be seen as a proven solution.

    I am more interested in the Alpha though - 8 channel RAMBUS for 10GB/s bandwidth! Shame I can't afford them :-(

  • 1) The new generation always takes longer, runs slower, and has more things taken out than was originally suggested

    Guess you haven't seen the specs for the new Alphas then. Instruction throughput per clock is still higher than the previous version, whereas with the P4 it is less than the PIII. One of these processors is going in the right direction.

    Shame that Alphas cost so flippin' much. But considering the 150 million transistors on the 21364 and the 300-400 million transistors on the 21464, this can be understood, I suppose. The 1.5MB of on-die cache adds to this, of course. I want to read more about the SMT the 21464 is using, though.

    Oh, yes, the Alpha article can be found from the front page of AMDZone [amdzone.com].

  • When I saw the headline, I cringed and thought, "Oh no, so many of the messages are going to include the acronym 'AMD' in the first sentence." Ugh. And it turned out to be horribly true. I can barely wade through this stuff. Had I enough moderator points, I would tag them all as either "offtopic" or "troll."

    AMD zealots, I mean this seriously: You have moved past the realm of simply enjoying a product to becoming annoying zealots, like Jehovah's Witnesses. Please, please, please, consider taking a lower key "live and let live" approach. As it is, I think many companies shy away from anything involving the term "Linux" because they know what kind of people come swarming around when they hear that word.

    This is not a troll, nor a flame. It's a gentle suggestion that the rabid, juvenile AMD advocacy is doing harm in at least my particular case. I doubt I am alone.
  • It seems that this story is being greatly misinterpreted. It is not a story of failure, it is a story of engineering.

    Of course every geek would like a processor that has 500 integer units, 200 floating point units, and a gigabyte of on-chip RAM. But cost, development time, power consumption, heat, and reliability all come into the picture. The P4 team started with lofty goals and scaled them back to meet reality. That's how any hardware or software engineering project works. How often do you hear people say "We added tons of extra features, had better performance than projected, and finished six months early"?

    A good many consumer hardware junkies don't understand that "faster, faster, more, more, more" is not a worthy goal. The goal is "good performance given real-world constraints." I know that people who would willingly pay $500 for a video card don't understand this, but this is how engineering of commodity items works. AMD has exactly the same set of constraints. It's not like AMD engineers can magically solve all of these problems. If anything, perhaps AMD is keeping their sights lower, so they don't have to scale back as much in the end.
  • It merely needs recompiled code to perform well.

    This has been said often enough for so many different processors that it has become trite. From experience, extra bits of compiler optimization rarely pay off in a big way. Quite often, it is impossible to tell the difference between minimal and full optimization settings. I suspect that contrived examples are being used for benchmarks, such as an image filter that takes 10 seconds to run and spends all its time inside of a 16-instruction loop. Sure, one tweak to the scheduler will make it run in 8 seconds instead, but how realistic is this? It isn't a win in the general case.
  • This has been said often enough for so many different processors that it has become trite. From experience, extra bits of compiler optimization rarely pay off in a big way. Quite often, it is impossible to tell the difference between minimal and full optimization settings. I suspect that contrived examples are being used for benchmarks, such as an image filter that takes 10 seconds to run and spends all its time inside of a 16-instruction loop. Sure, one tweak to the scheduler will make it run in 8 seconds instead, but how realistic is this? It isn't a win in the general case.

    That's why I was talking about SPEC_CPU, the most comprehensive and well balanced CPU benchmark suite on the planet, and not some crappy toy benchmark. Indeed, the P4 does very well on recompiled toy benchmarks as well, but I didn't mention them because they don't tell us anything useful.

    FYI, SPEC_CPU is about as far from some "image filter that takes 10 seconds to run and spends all its time inside of a 16 instruction loop" as one can get. Indeed, it is a suite consisting of no less than 28 benchmarks, each designed to stress different algorithmic and data set size combinations, and each very non-trivial. It is the industry's only truly cross-platform benchmark, and it is designed and revised every few years by a committee consisting of some of the foremost experts on high-performance and scientific computing, and advised by every significant MPU vendor to assure fairness. It does not, as you imply, allow any hand-tweaking of assembly code, nor--like most benchmarks--does it come in the form of precompiled binaries which may favor one platform over another. Instead, it comes completely as source code, to be compiled by a vendor supplied compiler--which must be publicly available within a certain time frame--under very specific regulations. The "base" and "peak" categories refer to different levels of allowable customization in the compiler settings, and indeed all compiler flags used must be revealed along with the results. And rather than taking 10 seconds, a full SPEC_CPU run takes a couple hours even on a P4 or high-end Alpha; on the reference machine (i.e. a SPEC_CPU2000 score of 100) it would take something like 12 hours!

    So, nice try. But trust me, the only way to beat SPEC_CPU is to build a really fast CPU. It also helps to have an amazing compiler--which Intel does with its VTune 5.0 compilers--but that allows nowhere near the potential for unfair binaries that precompiled benchmarks do. Also, being aimed at the high-performance market rather than the PC market, SPECfp2000 has been criticized by some as "unfairly" rewarding the very large memory bandwidth of the P4 compared to the P3 and SDR SDRAM Athlon. For an IMO interesting technical discussion of this issue, you might want to see this thread [aceshardware.com] over at Ace's Hardware. (See if you can guess who I am. :)
  • Though I generally agree with you, the performance-per-clock point is not really representative as a test. If the PIII maxes out at 1GHz, and the Athlon is pretty close to its limit at 1.3GHz, and the P4 debuts at 1.5GHz, then the likelihood that total performance will be higher on a P4 half a year from now is pretty good.

    Additionally, apps that really need the horsepower are going to be recompiled to take advantage of the new pipelines and SIMD instructions. What is remarkable is that this new CPU design can even keep up with pre-existing apps without recompilation.

    As for having to buy new boards, cases, etc.: do you really think that the majority of CPU purchases are made by people who would even think about opening up their cases? Though I don't have numbers, it is my understanding that businesses are the primary purchasers of computers; especially servers (which is currently the only thing P4s are worthwhile in right now). They don't upgrade; they get all new machines. So the fact that the power supplies are different is irrelevant, even in light of the fact that they'll be marginally more expensive because of their newness.

    Personally I'm not satisfied with the P4. But that probably doesn't really interest Intel too much. They have the highest clock speed and will soon have some of the highest benchmarks.. And brilliant IT people will see these numbers and hurd.

    -Michael
  • And brilliant IT people will see these numbers and hurd.

    Now *that* is a sight to see: "Hurd" and "brilliant" in the same sentence! :)

  • Intel has discovered a physical limit on how small a chip can be, because there are only so many small people alive at any given time who can actually work on all those tiny transistors. So, unless they resort to massive cloning of the littlest of people, we will hit a limiting factor on both the size and number of chips created.

    Having said that, I am sure our newly elected president will delegate to some smart people the task of figuring out either cloning or chips, whichever comes first.

    Also, on the same subject, in Europe they finally figured out the Royal Family's behaviour over the last couple of centuries: it's a human strain of Mad Cow Disease.
  • OK - you're wrong. The 80486 was the first x86 from Intel to integrate a math coprocessor. The "original" 80486 did not have a suffix, ran at 25MHz, and produced enough heat to brew a decent cup of tea. Later, the 80486SX was introduced, being an 80486 without an FPU, and the original 80486 was renamed the 80486DX. The 80386 did not have an integrated FPU, but could work with an 80387 or 80287 running at an equal or greater speed. The 80386 was introduced at 16MHz. The 80386SX, introduced later, was an 80386 with the external interface necked down to 16 bits to allow for cheaper system designs. There was an 80386 variant from Intel with an integrated FPU: the FastCad386. The FastCad chipset (yes, chipSET) replaced both an 80386 and an 80387 with an integrated chip and a 'dummy' chip for the 80387 socket, and provided a modest performance boost over the stock 80386/80387 combination. There was also an 80386SL variant, which had advanced (for the time) power management features and was intended for mobile applications. 80386 chips were produced by other manufacturers, including IBM and AMD, under license from Intel, since multiple-sourcing of CPUs was common at that time. Eventually, some of these other companies produced yet more "386" chips, including the IBM BlueLightning and AMD's AM386-40, which was the fastest commercial 80386 ever produced to my knowledge. Enough?
  • Maybe not, but cost analysis is engineering. Don't believe me? Next time your manager asks you to outline your approach to a new problem, present him with something that requires 10,000 developers and a $60 billion equipment expenditure.

    Yes boss, our next server should use the Hoover dam as a power supply, and hand-wound relays instead of transistors for the processor core. Actually, I'd kind of like to see that...

  • That's why AMD had their DX/5 running at 133 and 160 MHz on 33 and 40 MHz busses respectively, to compete with Pentiums while the K5 (and K6, and K7) was in development. AMD had to differentiate somehow, so they followed something like the Intel pattern, which everyone was familiar with. My 486-133 wasn't the fastest chip on the market, but the money my dad saved on it allowed him to buy more RAM than your average P-120 had, so it performed better.
  • they did that because 256K of FULL PROCESSOR SPEED cache increases performance more than 512K running at HALF PROCESSOR SPEED
  • Compromise? Naww... I think I will just stick with AMD, thank you very much. I would rather not buy a new mobo, power supply, case, heatsink, memory, etc. If Intel wants to sell this puppy, they should have made it compatible with the hardware that is already around. SDRAM is fine by me for now. /me avoids aiding companies that are predatory and monopolistic.

  • Funny, I'm typing this on a PIII SpeedStep laptop. They make excellent chips for mobile machines.
  • I find this article rather uninformative and mostly marketing inclined.

    The first reason is: nearly all high tech projects start with rosy goals and then reconsider when they know exactly what is essential and what is feasible. So all this crap about "we wanted to do it better" is pure marketing. If they really could do it, they would, because they need something to fight Athlon.

    The second reason is: the article does not say anything about the compromises necessary to reach high MHz for the sake of marketing.

    And the third reason is: the article does not even hint at the possibility that P4 might have been castrated to not appear much better than Itanium/McKinley in floating point.

  • Why has the bus speed screeched to a halt at 400 MHz when the processor itself is beyond 1.5 GHz? I don't think these grandiose plans make any sense unless something is done about the very basics.
  • Your points are pretty good, but there are two things that take away from your argument that you do not mention.

    First, the lack of registers in the x86 architecture. Having a fast cache is great, but it's not as fast as a register, and it takes extra instructions to load and store, unless you go to the more complicated addressing modes, with the problems that you note.

    Second is the relatively finer granularity of the instructions available on a RISC architecture. Although there is some merit to making decisions based on information only available at runtime, that isn't a big factor with today's technology. What a modern x86 looks like is a microcode architecture with somewhat intelligent scheduling of the instructions. In most cases a compiler could do a better job.

    Where microcode might really be nice is in mitigating the effects of optimizing for one single processor. You could have write-once, run-optimally-on-any-x86 code, but as we see from the real world, that's working about as well as write-once, run-anywhere is with Java.

    I completely agree with you on the importance of higher level parallelism. In most cases, the instruction level parallelism in code is low. SIMD in particular seems a waste to me, since one of the few things less likely than code that doesn't have dependencies is identical code operating on different values that doesn't have dependencies. It has its places, but not that many of them. With all the indirections in neat object code, you get a lot of cache incoherency, so you don't even take as bad a hit as you might think from running threads in parallel on account of that.
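
    A tiny illustrative pair of loops (hypothetical code) showing the contrast being described: the first has no cross-iteration dependencies and is the rare SIMD-friendly case, while the second carries a value from one iteration to the next and can't simply be run in lockstep lanes.

    /* Hypothetical contrast: SIMD-friendly vs. dependence-carrying loops. */
    void axpy(float *y, const float *x, float a, int n)
    {
        for (int i = 0; i < n; i++)      /* independent lanes: SIMD-friendly */
            y[i] = a * x[i] + y[i];
    }

    void iir(float *y, const float *x, float a, int n)
    {
        for (int i = 1; i < n; i++)      /* y[i] needs y[i-1]: serial chain */
            y[i] = a * y[i - 1] + x[i];
    }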

    Of course, there are alternatives. Running multiple branches so that a wrong prediction won't stall is a decent, although less efficient, use of extra execution units also.

  • In designing the P5, Intel again tries to pack every possible feature onto their processor.

    The solution? Balancing the extreme power and cooling requirements of the processor by making the computer out of PURE DEATH!!!

    NeoNecroElectrical Engineering is the future...

  • You have just made the most fatal Slashdot error:

    The only thing that matters is Quake benchmarks.

    Keep telling yourself that, or we'll send you to the re-education camp.
  • by Fervent ( 178271 ) on Thursday December 14, 2000 @05:19PM (#557738)
    According to the article there's not much performance decrease that can be directly tied to the design changes (they mention 5% loss for cutting the chip's area nearly in half. I'll take that).

    The real reason for the chip's inherent "performance losses" is the running-string that's slowly being pulled to its breaking point -- that is, the x86 architecture. Hopefully Itanium will change all that.

  • by Tomcow2000 ( 189275 ) on Thursday December 14, 2000 @05:29PM (#557739) Homepage
    Once again, The Register [theregister.co.uk] has a story with not only more info, but a much better title :)
  • by Chris Johnson ( 580 ) on Thursday December 14, 2000 @09:53PM (#557740) Homepage Journal
    So go PPC. 512K cache is _small_ for current PPCs, 1M cache is typical and 2M of cache is possible with the G4s. You don't _have_ to cling to x86 just because an industry is desperately trying to keep it hobbling along. It's possible to not use x86. For that matter, UltraSPARC cache can be up to _four_ megs.
  • by Detritus ( 11846 ) on Thursday December 14, 2000 @09:22PM (#557741) Homepage
    I've always wondered why one of the vendors hasn't put a divide-by-two circuit on the clock input pin of the microprocessor. Then they could claim that their 800 MHz chip was actually a 1.6 GHz chip.

    A friend told me that in the early days of portable transistor radios, some manufacturers would intentionally add non-functional transistors to their radios, just so they could advertise them as N transistor radios, where larger values of N were "better".

  • by max99ted ( 192208 ) on Thursday December 14, 2000 @05:20PM (#557742)
    Watched the A&E Biography on Steve Wozniak last night. One of his designs (Apple 1 methinks) was revolutionary for its time - a reduction from about one thousand chips on board to sixty - way ahead of what anyone else was doing.

    Maybe it's me, but I can't think of a similar 'breakthrough' advance in recent years. I remember reading somewhere that computers are approaching the 'limits' of current architecture design - we can only crank so much out of today's motherboard/x86 technology. I know that optical computing is slated as the next wave, but I can't help thinking that bringing it to light will take a new "Apple I" breakthrough. Am I off base here?

  • by Christopher Thomas ( 11717 ) on Thursday December 14, 2000 @07:54PM (#557743)
    The real reason for the chip's inherent "performance losses" is the running-string that's slowly being pulled to its breaking point -- that is, the x86 architecture.

    Actually, while inconvenient, the x86 architecture isn't as horrible a limit to performance as a lot of people seem to be assuming. The main problem is the extra latency in the decode stage, which lengthens the pipeline somewhat, but the P4's trace cache takes care of that.

    The real problem with the P4 is that it has very weird optimization requirements (the whole "bundle" thing) and so needs a very smart compiler if code is to run quickly. Generally, even when compilers like this exist, they aren't used (remember the original Pentium?).

    The other problem with the P4 is the long pipeline, which exacerbates stall problems.

    As for architecture in general, heat issues are what's limiting clock speeds (for x86 and non-x86 processors alike). However, the main limit people are noticing is the limit to the number of instructions you can run in parallel. As long as you're executing only one thread, you're not going to be able to squeeze more operations per clock beyond a certain point. The "performance problem" isn't with clock speed - it's with people expecting new chips to do more, clock for clock, than old chips while running serial programs. This parallelism problem affects all chips - x86 and non-x86.

    This is why the major manufacturers are starting to look at SMT chips (Simultaneous Multi-Threading) seriously. Running multiple threads in parallel on one chip doesn't take much extra hardware, and makes it *much* easier to schedule concurrent instructions and to keep running when one instruction stalls (your "Instruction-Level Parallelism" goes up in proportion to the number of threads).
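
    A toy illustration of that last point, assuming C with POSIX threads (the program and its numbers are invented, not anything from the article): each thread below is one long serial dependency chain with essentially no instruction-level parallelism, but two such threads give an SMT core two independent instruction streams to interleave, so stalls in one can be filled with work from the other.

    #include <pthread.h>
    #include <stdio.h>

    /* One long serial dependency chain: every iteration needs the previous
     * value, so a single thread offers a wide core almost nothing to do in
     * parallel. */
    static void *chain(void *arg)
    {
        volatile long x = (long)arg;
        for (long i = 0; i < 100000000L; i++)
            x = x * 3 + 1;
        return NULL;
    }

    int main(void)
    {
        /* Two independent chains: an SMT core can issue from both streams
         * in the same cycle, where a single-threaded core could not. */
        pthread_t t1, t2;
        pthread_create(&t1, NULL, chain, (void *)1L);
        pthread_create(&t2, NULL, chain, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        puts("done");
        return 0;
    }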
  • It merely needs recompiled code to perform well.

    On what am I basing this apparently heretical statement? On SPEC_CPU2000 [spec.org], the most demanding, well balanced, most respected cross-platform CPU benchmark in the world. As you can see if you peruse these lists, the P4/1500 has the highest scores of any shipped CPU in the world, both in SPECint [spec.org] (base and peak) and in SPECfp [spec.org] (base only).

    Before any of you reply and think you've caught a mistake, the Alpha EV67/833 is *not* publicly available, and won't be until January, at which point it will take back leadership in SPECfp_base and SPECint_peak. Of course, the P4/1700 will probably take back the lead when it's released in March or so. Indeed, the P4 and Alpha will likely trade the top SPEC spot back and forth at least until the EV68 (EV67 moved to .18 um process and with on-die L2 cache) makes an appearance (Q2?), if not all the way until the EV7 (EV68 with integrated on-chip *8-channel* RDRAM controller) is released (Q4?).

    This is why all this banal talk about the P4 being a crappy chip or (in the wake of this article) a "crippled" chip is ignorant drivel. SPEC_CPU is an exceptionally well designed, balanced, and comprehensive benchmark stressing a CPU to its limits in all sorts of ways. Why then the P4's disappointing performance on all those other benchmarks? They are all on "legacy" code--code compiled with the P6 core in mind. Because the P4 represents the first chip with a new core architecture (the horribly misnamed "NetBurst" core) from Intel in 5 years, it has a lot of pretty radical design features which don't take well to code compiled for the P6 core. While this means the P4 is a pretty useless (or at least very overpriced) solution for running today's code--and indeed, most code released for at least the next year or so--it says nothing about how good a *design* it is, which is ostensibly the point of this discussion. Indeed, the PPro--the first P6-core chip--posted very "disappointing" benchmarks on legacy code when *it* was released 5 years ago; many observers wrote it and the P6 core off as underperforming, overdesigned wackiness from Intel. The P6 was arguably the most successful and innovative CPU core ever. Not so incidentally, this was strongly foreshadowed by its brief theft of the SPECint95 performance crown from the top Alpha of the time...

    Now to dispense with the most repeated "points" we've seen thus far.

    1) "This just goes to show that x86 is a dead ISA with no headroom to grow." Not the most unexpected statement to be found on /., but let's just say that the other 99.99% of the world that enjoys backwards compatability will make sure x86 stays alive for quite a long time to come thank you. On a technical (rather than marketing level), though, this is ridiculous bunk as well, as the fact that the P4 beats every released 64-bit 10-times-as-expensive RISC chip with 30-times-as-expensive platforms, on SPEC_CPU--a benchmark specifically designed to stress exactly those high-performance situations demanded of professional level workstation and server machines--demonstrates quite nicely.

    Yes, x86 is a bad ISA, and yes it presents a problem to be overcome by chip engineers. But it has been overcome and will continue to be overcome--today by tacking a decoding stage onto x86 processors that turns x86 instructions into RISC-like instructions for internal operations (taken out of the critical path by the P4's trace cache), and tomorrow perhaps by dynamic recompilation software a la Transmeta, IBM's DAISY, and HP's Dynamo, techniques which are still in their infancy and *may* end up providing better-than-compiled performance even without the benefit of converting to a more optimal ISA. The other negative of the x86 ISA, namely the paucity of compiler-visible registers, is indeed a problem, though one partially alleviated by rename registers and partially by evolutionary extensions to the x86 ISA, such as SSE2, which will eventually replace much of the god-awful stack-based x87 FPU ISA.

    The real question is, does the performance hit generated by sticking with x86 exceed the performance gain generated by having a much larger target market, and thus more money to spend keeping up with the latest process technology and thus getting faster clocked CPUs? The answer thus far has been a rather resounding "no"--that is, the economies of scale granted by staying x86 have meant processors which are outright faster and cost much much less.

    After all, there is no doubt that were the Alpha not around 18 months behind Intel in terms of process technology, the EV67 would be much faster than the P4. On the other hand, the EV67 gets to take advantage of resources that Intel could never dream of in a mainstream chip--like a 300+mm^2 die size, extra wide memory buses, and 4-8MB L2 caches--because of the tremendous added cost. And even with all that plus what is widely acknowledged as the best CPU design team on the planet, the Alpha only manages to keep up with the P4.

    Moreover, the rest of the 64-bit world--despite the same advantages as the Alpha (well, except their design team)--can barely keep up with the P3, and that's a 5 year old design. They may be available in multi-chip boxes scaling to kingdom come, but on the level of individual chips, the best that Sun, IBM, HP or MIPS has to offer is pretty lame, despite all the advantages of a RISC ISA. Of course, the same old folks will be claiming that x86 is an inherent dead end when the P4 (or whatever Intel is calling its current NetBurst core by then) scales past 4 GHz two years from now, well ahead of anyone in the RISC world. And we'll hear it again in 4 or 5 years, when Intel releases another all-new x86 core.

    2. "The P4 should have left in all those features this article talks about." Uhhuh. Sure. Um...now, who would know more about this? Would that be you, having read some article on the Internet? Or would that be Intel's engineers who maybe understand the P4 core and the issues involed with these features a bit better than you, and who had the benefit of cycle-perfect simulations on dozens if not hundreds of possible P4 variants running every concievable type of code??

    If there's a feature which doesn't make it into a finished CPU, it's because of one of two reasons:
    1) The designers didn't think of it;
    2) The designers couldn't figure a way to implement it and make it work with the rest of their design in such a way that it raised performance/cost.

    Needless to say, "The designers thought of it, implemented it (which they did in this case), and it was a good feature (i.e. improved performance/cost on a majority of code), but then made a boneheaded decision not to use it," is *not* on the list.

    IMO, the features listed here are all better off gone from the current P4. The only really intriguing one--another FPU--was *not* left off for die size considerations (i.e. cost): FPUs are not very big. It was left off for performance reasons. You see, while "more is better" sounds like a nice philosophy, adding an extra FPU would have meant extra decoding and routing logic in the FP section of the chip. Considering Intel actually went to the considerable trouble of implementing this feature and then decided against it, it is very likely that this extra logic was in the P4's critical path. Thus while including the extra FPU would have meant more performance per clock, it would have meant lower overall clock speeds. Obviously Intel felt the tradeoff worked better without the extra FPU than with it.

    If you "disagree" with their decision, please refer to the cycle-perfect simulators which Intel has and you don't, and the P4/1500's SPECfp2000 score which is a mere, oh, 68% better than the fastest P3. Also you might note that the P4 is scaling quite well with clock speed on SPECfp, that it will spend most of its life at speeds well above 2 GHz, and that it will likely sell most (at least for the next 2 years) in combination with a memory subsystem providing *less* bandwidth than the current dual-RDRAM i850 chipset--all of which point to this being a very smart decision on Intel's part. (The reasoning is this: if the P4's FPU can already keep up quite nicely with a larger memory bandwidth, then why increase FPU power/clock when most P4's will have higher clocks and lower bandwidth to keep them fed?)

    As for the features I'd like to see added to the P4 when it moves to its .13 um Northwood variant next summer: one of them was on the list, i.e. a 16K L1 data cache. The reason it was left off was clearly not die size but clock scalability--Intel decided that having a 2-cycle-latency L1 was more important than having a bigger one, and I totally agree. After the move to .13 um, though, perhaps a 16K 2-cycle L1 will no longer limit clock scalability, just as the PPro's 8K L1 caches were expanded to 16K each with the PII. The other, a 512K L2, would take up much too much die space at .18 um to be feasible; it too may make it to Northwood, depending on Intel's target die size. Needless to say, whatever they decide, it will be a much better informed decision than I or anyone here could presume to make.
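
    For what it's worth, here is a back-of-the-envelope version of the poster's bandwidth argument, written out in C so the arithmetic is explicit. The numbers are assumptions of mine, not figures from the article: roughly 3.2 GB/s peak for the dual-RDRAM i850 platform and a 1.5 GHz clock for the P4/1500.

    #include <stdio.h>

    /* If memory can stream about 3.2e9 bytes/s and a double is 8 bytes,
     * streaming FP code sees a new operand only every few cycles at 1.5 GHz,
     * so even a single FPU spends much of its time waiting on memory. */
    int main(void)
    {
        double mem_bw    = 3.2e9;        /* bytes/s, assumed platform peak */
        double clock_hz  = 1.5e9;        /* P4/1500 clock                  */
        double doubles_s = mem_bw / 8.0; /* operands streamed per second   */

        printf("Operands per second from memory: %.0f million\n",
               doubles_s / 1e6);
        printf("Cycles per operand at 1.5 GHz:   %.1f\n",
               clock_hz / doubles_s);
        return 0;
    }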

E = MC ** 2 +- 3db
