
P4 - The Art Of Compromise
Buckaroo writes: "Interesting article at EETimes on what Intel's architects originally had in mind for the P4 - 1 slow ALU, 2 fast ALUs, 2 FPUs, 16K of L1 cache, 128K of L2 cache, 1 MB of external L3 cache, etc. - It was all too big and hot, so a bunch of it got the chop." This article sheds new light on the reasons behind performance problems with the chip.
Re:herm... (Score:1)
Intel is deemed evil by Slashdot trolls; therefore, Intel is evil. AMD is in competition with Intel; therefore, AMD is better...
I have yet to see AMD do anything worthwhile for the computer industry. They decreased the total cost of ownership for PCs, but in doing so forced a slowdown in technological advance. Think about it for a little while: AMD releases their first chip to be comparable in quality and performance to an Intel chip (the Athlon). Don't even talk to me about the K6 series; they were pathetic. Faced with this cheaper alternative, Intel responded with a 'marketing' release of a "new" processor, the PIII, thus delaying the release of the first processor to use a considerably new design since the PPro. The PPro, PII, and PIII are all basically the same. Sure, a few added instructions, but the same basic design plan. The P4 is much different. I'd like to see something come of it instead of people belittling it because it is only 50% faster than anything else on the market. Shut up, people! The P4 is here, and it'll be here to stay. It's up to people to decide on performance so that in the future, I can afford one...
Re:A simple question... (Score:1)
Re:The P4 is the world's fastest microprocessor. (Score:1)
You raise a very good point here. SPEC tests a system platform because that is what is relevant. No one cares about raw CPU speed, nor should they. The CPU is designed for the system in which it will function. Things like caches, instruction window size, branch prediction, etc. are all designed to tolerate the latencies of the expected system platforms.
Using a "more conventional compiler" doesn't make any sense. What's a conventional compiler? The compiler and CPU designs should go hand-in-hand.
--
Re:Registers and scheduling. (Score:1)
This is false. Lack of registers increases the number of required memory operations. Not only do you increase data cache bandwidth and occupancy, you do the same on the instruction cache as well. Instructions are not free, regardless of whatever "throw hardware at it" myth is popular today. Memory instructions also create headaches for the instruction scheduler. It's much easier to do dependency checking on static indices.
There are more things to fetch, decode, schedule, execute and retire. What's good about that?
It hurts. A lot. Check out some of our papers [umich.edu] on the subject, especially this one [umich.edu], which contains many references to other work.
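To make the spill problem concrete, here's a minimal C sketch (an illustration of mine, not from the papers cited above). Eight simultaneously live temporaries already exceed x86's eight architectural registers:

    /* Register pressure forcing spills: the eight powers of x (plus a
       pointer and the running sum) are all live at the final add.  A
       compiler targeting x86's 8 architectural registers must spill
       some of them to the stack -- extra loads and stores that a
       32-register RISC target would not emit. */
    double poly8(const double *c, double x)
    {
        double x2 = x * x,  x3 = x2 * x,  x4 = x2 * x2;
        double x5 = x4 * x, x6 = x4 * x2, x7 = x4 * x3;
        return c[0] + c[1] * x  + c[2] * x2 + c[3] * x3
             + c[4] * x4 + c[5] * x5 + c[6] * x6 + c[7] * x7;
    }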
--
Re:The P4 is the world's fastest microprocessor. (Score:1)
Partially correct. Hardware register renaming does nothing to alleviate the low compiler-visible register count. The compiler still has to spill much, much more than it should on the x86. It is true however that SSE2 is a big improvement. Getting rid of the FP stack will really boost performance in that arena.
Why not add more compiler-visible registers, one may ask? Well, the problem is the encoding size. The x86 ISA is very nice in the sense of code size. This matters when you consider ICache sizes. Intel can get away with smaller ICaches because more instructions fit into the cache for a given size than in a similar 32-register RISC machine. It's interesting that Intel essentially used the same trick with the trace cache. They compress the micro-ops so more of them will fit into the cache.
Amen! Intel engineers are not stupid. They removed these features for good reasons. Intel has a policy that a 1% area increase must result in at least 1% performance increase. This was obviously not the case with these structures.
Frankly, I am stunned at what Intel has been able to do with the x86 ISA.
--
Re:Registers and scheduling. (Score:1)
IMHO, many people underestimate the problems associated with lots of instructions. The effective window size of the processor is reduced and more memory operations wreak havoc on dependency checking, for example.
Really, we focus on how a large register file can be used, but you are essentially correct.
Is 20% modest? It's a heck of a lot more than most hardware optimization papers get. :) The paper you cite didn't include our more recent work on speculative register promotion and some windowing stuff we're looking at.
You're right in that a 20% speed increase does not make a processor overwhelmingly more marketable. But remember that most of the speedup we see from generation to generation comes from the circuit technology.
It is interesting as you point out that x86 seems to be holding its own against machines with larger register files. I suspect this may have something to do with circuit technology, better cache utilization and other such things. I only know from our own experiments that if we turn our register set down to 8 registers, things are really hosed. Not only is there lots of spilling, but the compiler has to throttle itself to avoid even more spilling.
Granted, most of the spill code is going to be in the cache. One way to look at it is that this extra code forces the caches to be larger. Eliminating instructions (especially memory ops) reduces the size of the machine overall (caches, issue logic, etc.). This in turn will greatly reduce production costs.
Ah the interplay...this is why I love sitting on the fence! :)
--
Re:A simple question... (Score:1)
That is to say, it's more important, at this stage, to reduce the time required to start reading something from memory. Basically, if you're reading in lots of relatively small data, a lower latency, lower bandwidth memory architecture will finish before the higher latency, higher bandwidth architecture.
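A back-of-the-envelope sketch in C (the numbers are purely illustrative, mine rather than the parent's): total time for a read is roughly latency + size/bandwidth, so small transfers are dominated by the latency term.

    /* For a 64-byte cache line:
       (a) 100 ns latency at 3.2 GB/s: 100 + 64/3.2 = 120 ns
       (b)  50 ns latency at 1.0 GB/s:  50 + 64/1.0 = 114 ns
       The lower-latency, lower-bandwidth memory finishes first; for a
       1 MB transfer the bandwidth term takes over and (a) wins easily. */
    #include <stdio.h>

    int main(void)
    {
        double bytes = 64.0;
        double t_a = 100e-9 + bytes / 3.2e9;
        double t_b =  50e-9 + bytes / 1.0e9;
        printf("high-bw: %.0f ns, low-latency: %.0f ns\n",
               t_a * 1e9, t_b * 1e9);
        return 0;
    }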
--
Change is inevitable.
It Steave Jack!!! (Score:1)
Re:No (Score:1)
Re:Not again. Okay, one more time... (Score:1)
Eeh? What about the fact that you're limited to 8 registers, then? Sure enough, a modern x86 CPU has many more internally (using register renaming), but wouldn't it be nice to have the option to compile your code for, say, 40 registers instead of 8? Or what about the lack of a properly laid-out FP register system, instead of that quite stupid stack layout?
/* Steinar */
Re:More info (Score:1)
Clock (Score:1)
I Love Red Hat (Score:1)
Re:herm... (Score:1)
Re:Clock (Score:1)
Re:A simple question... (Score:1)
Quit griping about the low, low levels of the computer unless you actually have a reason to. And by that I mean: if you're running a Linux kernel with a debugger on and it says that 90% of the time the cache is empty and it's waiting on the memory bus, then you'll have a gripe. But most operations these days are taking place inside the CPU, and the chip is not spending a *great* deal of time waiting for new data, except in extreme cases...
Re:herm... (Score:1)
It was MSFT, with their crappy OS, that put Intel behind in development. AMD just took advantage of it.
JON
PPC vs IA64 (Score:1)
I very much doubt it - PPC is used a lot in embedded stuff because of its low power use, and a G4 makes a useful DSP. It could one day disappear from desktops and higher, though. Despite any technical merits, the G4 has been a marketing disaster - people just see 500 MHz and think "slow".
As for Apple's scorched earth policy - I agree that they handled it really badly, but I think overall they didn't have a lot of choice about the clones, although it still would have been nice to see CHRP, even if you needed more to run Apple software. PowerComputing was starting to undercut their hardware sales quite badly, and contrary to what I have seen suggested, Apple wouldn't effectively survive as just a software company.
Re:CISC-RISC--dumb comparison (Score:1)
john
Re:hardware eng = software eng = crap! (Score:1)
No.
A 2x die size saving results in only a 5% drop in performance because we're in the realm of diminishing returns. Chip designers are trying to squeeze the maximum performance out of chips, and have to resort to cleverer and cleverer ideas and bigger and bigger areas of silicon.
john
This UrL (Score:1)
http://www.eetimes.com/story/industry/semicondu
Re:This UrL (Score:1)
Re:Another compromise (Score:1)
--
Re:Clock (Score:1)
I'm too lazy to check the facts from the Motorola website, but that's how it's done on (at least) HP 9000/380 machines.
Re:A simple question... (Score:1)
No, when I'm being told the *technical details* about something, I demand accuracy! It infuriates me when some techie tells me that "I don't need to know" or that "I don't want to do that".
--------
Re:A simple question... (Score:1)
Re:Next breakthrough (Score:1)
Wozniak had it easy, relative to microprocessor architects nowadays.
Re:My design (Score:1)
Later,
ErikZ
Re:Hmmm (Score:1)
geez (Score:1)
Yeah (Score:1)
And what you say about buying cheaper CPUs to afford more RAM: thank you. I make this point to every associate who thinks buying some 733 MHz "essentials" computer that only has 32MB of RAM and a 2MB video card is a good idea. I tell them a Pentium Pro 200 with 128MB RAM and a decent video card will outperform that joke of a machine.
AMD386 40 (Score:1)
Sorry, your post is absolutely correct, but I felt compelled to post this rant for some reason.
Of course (Score:1)
It's very difficult to manufacture an architecture that is both more efficient and remains 100% compatible (performance-wise and otherwise) with previous generations. Sure, that Athlon T-Bird 1.2 GHz will kill that 286 in any 16-bit app, but I bet if you break down performance per megahertz, the 286 may very well come out on top.
Sure (Score:1)
As for your statements about the 386 and 486, they're a bit off. Yes, a 486SX had no math co-processor and the 486DX did, but the SX/DX in the 386 era referred to the data bus. The 386SX had only a 16-bit data bus while the DX had a 32-bit (I think) one. While the average 386DX did come with the 387 co-processor, you could not assume this from the DX name. Intel's naming standards are so clear, aren't they? I especially liked how they named the 486 DX/3 the 486 DX/4 because the 4 "reminded people of a 486." Right.
Dates are just slightly off (Score:1)
80486: 1989
80586: 1993
80686: 1995
I think since the Pentium 4 is out now, we can assume it's not to be released in 2001.
Hmmm (Score:1)
Re:Sure (Score:1)
Re:The P4 is the world's fastest microprocessor. (Score:1)
I couldn't get a real picture of the P4 die; the best I could manage was the cutesy little colored rectangles on page 6 of this Intel PDF [intel.com]. Point is, assuming an overall colored-rectangle size of 217mm^2, the "Enhanced Floating Point/Multi Media" section comes out to under 17mm^2 by my crude measurements. And I frankly doubt that when they say that adding another FPU would "double the floating point size", they actually mean double everything in that little teal box. Even assuming I'm wrong, 16.5mm^2, while certainly bigger than the ALUs (and don't forget, this "floating point" box includes integer SIMD execution as well), is a mere 7% of total die size. While this is somewhat significant, if they really wanted it in they certainly could have made room for it. As a percentage of overall die space it's much smaller than the P3's FPU.
What I saw instead was the admission that adding the extra FPU would have added an extra stage to the pipeline (an extra decoding step). It may be that the pipeline was not well balanced with this extra stage, or that it was still in the critical path even with its own pipeline stage, or just that they thought 19 stages (not including those outside the trace cache) was enough.
In any case, I'm not convinced that this decision had anything to do with die size, but rather with rampability and overall IPC. Indeed, as I said, with properly compiled code, the P4's "crippled" FPU is able to scream along, keeping up just fine with its 3.2 GB/s memory bus. Considering most P4s will have higher clock speeds and less memory bandwidth, why add extra FPU units? About all the Athlon's extra two FPUs do for it is help in cache-constrained toy benchmarks. In the real world, FPU work increasingly means data sets too large to fit in on-chip cache, and a single FPU becomes more than adequate to keep up.
Re:I Love Red Hat (Score:1)
Re:ilrh (Score:1)
=D
Re:Much emphasis on CPU and video. But what of sound? (Score:1)
Re:Much emphasis on CPU and video. But what of sound? (Score:1)
--
Cheers
Intel == Microsoft && AMD == Linux (Score:1)
Intel has been making processors for a very long time and people have come to rely on them as a quality company. But as of late they have been having trouble and people are looking for alternatives to using Intel products.
Microsoft has been making operating systems for a while now, and they have been trusted as a good company. Microsoft makes other software that people also like. Lately people have been getting fed up with Microsoft and are looking for alternatives.
AMD is somewhat of a surprise company. At first many people did not rely on them and did not think that they made quality processors. Now many people are starting to support AMD because other companies, mainly Intel, are going down.
Linux is an upstart operating system that some people still know nothing about. Many people are learning about Linux every day and are accepting it as an alternative to Microsoft and its products.
Look at the facts; that's a very uncanny resemblance.
>neotope
Re:Confused about ... tech (Score:1)
Buckets,
pompomtom
Re:A simple question... (Score:1)
Re:ilrh (Score:1)
It's the fir$t po$ter syndrome: press the button quick! This thread brought to you by lameness. If you have an opinion about it, you should Meta Moderate [slashdot.org] on a regular basis to be sure lame moderators get the axe.
Hey is that why I haven't been a moderator in a while?
Re:herm... (Score:1)
Re:Not again. Okay, one more time... (Score:1)
Now also hardware is beta! (Score:1)
1) no multi-CPU support
2) new SSE2 FPU, but the old FPU is slower
3) custom socket that will be replaced
4) small caches
5) very big heatsink
It is like software: Intel engineers are making a new CPU for
No (Score:1)
Re:DO PEOPLE STILL EVEN USE INTEL CHIPS THESE DAYS (Score:1)
It appears that he stuck in a hyperlink with a very long, all lower case HREF to reduce the percent of the comment that was in caps to below the filter threshold. He avoided having the link actually work by giving it a zero length field to link to, so your browser doesn't display it. It appears that the lameness filter needs some work.
The New Compromise: (Score:1)
Re:Hmmm (Score:1)
Re:PPC (Score:1)
One Slashdotted site coming up.. (Score:1)
Re:The P4 is the world's fastest microprocessor. (Score:1)
Alice: "It's not the magic eye, doofus."
Hummmm.... (Score:1)
Looks like they've been slashdotted.
Re:Marketing is not engineering (Score:1)
Exactly, but cost analysis in this case would be the job of engineers. And the engineers are the ones who would have the best idea of the die size/cost in advance. So, if someone says "the engineers really wanted to build something 2x bigger but we wouldn't let them", (1) this is nothing new and (2) he's implying that his engineers are idealistic and out of touch with business reality. All engineers dream of the next bigger and better thing. So the remaining question is: what is the real reason they had to cut down? Looking at the die photo, the L2 and FP units are not that big compared to all that pipeline logic.
Yes boss, our next server should use the Hoover dam as a power supply, and hand-wound relays instead of transistors for the processor core. Actually, I'd kind of like to see that...
I've seen more than enough project proposals of the type "give me 3 engineers and some time, and we'll come back with something bigger than the Hoover dam."
Re:Other limits exist within the architecture (Score:1)
The "ideal RISC vs. real-life x86 CISC" argument is well founded. The paper-RISC ideals that people have in their heads are much better than the legacy x86-compatible stuff, obviously. So don't be fooled by "blank-ISC is better" claims. There's solid proof that Intel's IA32 cores would be far more powerful if they didn't tote legacy hardware.
Let's say you took Intel's IA-32 and pruned out the following:
* 8/16-bit operands and addressing modes
* x86 floating point instructions
* prefixes/overrides
90% of the decode hardware is simply there to handle this legacy crap, at least according to Intel at ISSCC'96.
The decode pipeline alone on the P3 (I don't know about the P4) would be 4x smaller. That is a huge performance gain. So it isn't CISC vs. RISC, it is 20-year-old legacy CISC vs. clean-slate RISC. The fact that the former holds its own against the latter in its current state is a powerful message.
Thing is, the NT kernel never even has to use the three bullets above, but the poor decoder has to be ready to handle anything. Sad.
As for loop unrolling, bundling and register renaming, try this: write some trivial C code with some loops and doubles and FP math. Compare the ASM code produced by the MSVC 6 (barf) default compiler and by Intel's beta Proton 5. You'll be surprised at how weird the resulting code looks, and how much faster it runs.
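For instance, a trivial test along these lines (a sketch of mine; any similar FP loop will do):

    /* Trivial C with loops, doubles and FP math, as suggested above.
       Compile once with MSVC 6 and once with Intel's compiler, then
       compare the generated assembly and the run times. */
    #include <stdio.h>

    int main(void)
    {
        double a[1024], sum = 0.0;
        int i;

        for (i = 0; i < 1024; i++)
            a[i] = i * 0.5;
        for (i = 0; i < 1024; i++)
            sum += a[i] * a[i] + 1.0;   /* unrollable, schedulable */

        printf("%f\n", sum);
        return 0;
    }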
---
Re:Next breakthrough (Score:1)
oops...I meant x86 technology. Obviously the 386 can only crank out so much
The way of things (Score:1)
Designers Dream
Engineers try to make it work
Bean counters fsck it all up
Marketing says it's the great solution
Customers buy it anyway
Customer service is the last to know of any problems or design changes
Even at AMD...
--
Re:PPC (Score:1)
What is the DEAL with CPU cache!? (Score:1)
sounds like a fast food order (Score:1)
typical intel crap... (Score:1)
Another compromise (Score:2)
Re:A simple question... (Score:2)
Your latency is just as bad as with 3 GB/sec, and for most things you're likely to do, that's more important.
It's all a matter of cost. If cost isn't a problem, prefetch can be used along with cache to help minimize the latency issue. Especially if there is a prefetch instruction so the compiler (or even the programmer through a pragma) can issue a prefetch instruction.
I imagine the issue there is related to the extra hardware complexity needed to make that happen.
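As a sketch of what a compiler- or programmer-issued prefetch looks like (this uses GCC's __builtin_prefetch extension; the prefetch distance of 16 elements is an arbitrary illustrative choice, and the function name is mine):

    /* Software prefetch to hide memory latency: request a line about
       16 iterations ahead so it arrives by the time we need it.
       On x86 with SSE this compiles to a prefetch instruction. */
    #include <stddef.h>

    double sum_array(const double *a, size_t n)
    {
        double sum = 0.0;
        size_t i;
        for (i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0 /* read */, 1);
            sum += a[i];
        }
        return sum;
    }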
Not again. Okay, one more time... (Score:2)
Look, we've been over this before, but I'll say it again. Yes, Intel's ISA sucks. No, it isn't "slow". IA32 hasn't been directly implemented in a CPU core since the Pentium MMX. The only part of the P6 and later designs that deals with x86 is the decoders. Internally they use RISC-like micro-ops, which is convenient since most compilers only use the simpler x86 instructions. In the P4 the situation gets even better, since the trace cache holds decoded information -- the decoders aren't even in the critical path! Which is why x86 processors are able to compete successfully with RISC cores on performance.
Or, in short, IA32 itself has nothing to do with the P4's lack of oomph. Which should be obvious, since the things it's being compared unfavorably to are other x86 processors!
Ahem. As to the "not much performance decrease"... Well, maybe a re-read is in order, or at least a re-think.
The 5% was for cutting the FPU's area in half, not the whole chip! 5% is a huge effect on overall performance for a change in just one part of the architecture. For something that relied solely on FPU performance (the Photoshop and 3DSMax benchmarks AMD does so well on), it would certainly be much more than 5%.
And that was just one number for one change. That no other specific numbers were given doesn't imply that they were 0%. I'd say it's more likely that they are larger than 5%.
Originally, the L1 data cache was supposed to be 16K, accessed in 1 cycle. That wouldn't work, so instead of increasing the access time they cut the size in half. I guarantee you that cutting the size of the l1 in half has a big impact on performance.
An off-chip L3 would have been nice, too. Especially when paired with high-latency RDRAM. This would have had a huge impact on performance, especially in benchmarks that are sensitive to memory latencies. Doubling the size of the l2 (but increasing access time as a result) probably doesn't mitigate this much.
The P4 of the Intel architects' dreams would have smoked. Instead we have what we have. x86 has nothing to do with it... economics and engineering reality do.
Lastly, Itanium is going to suck. Intel has said as much themselves. It's neat technology, but not well designed. It's the Daikatana of the chip industry -- a running joke that some people hope will come off well anyway, but they will inevitably end up disappointed.
Re:Registers and scheduling. (Score:2)
[Emphasis added.]
There are more things to fetch, decode, schedule, execute and retire. What's good about that?
As clearly stated in my original post - nothing at all. However, I question how much of an _impact_ the bad side effects have in practice. It's non-negligible, but that leaves a lot of territory open.
It hurts. A lot. Check out some of our papers on the subject, especially this one, which contains many references to other work.
Done. I compliment you on your fascinating approaches to register use optimization. However, most of your papers focus on how the program's use, and the physical performance, of a register file of a given size may be improved. The dependence of performance on register file size is only studied in one document ("The Need for Large Register Files in Integer Codes"), and the advantage of a relatively large register file (at least for the 64-vs-32 case) is found to be relatively modest (5%-20%).
A factor of two speed difference makes a processor unmarketable. A 20% speed difference doesn't (witness the holy war still going on between Intel and AMD proponents).
The effect of a small register file is undoubtedly more severe as size decreases, but I have yet to see evidence of truly earth-shattering performance impacts. Circumstantial evidence suggests that the effect is not earth-shattering (SPECmarks for high-end workstation chips fail to thoroughly trounce SPECmarks for x86 chips for comparable configurations, and the PowerPC architecture fails to blow x86 out of the water).
Most certainly, a larger register file is nice, and causes a speed improvement - but the effect of a small register file does not seem to be as devastating in practice as you appear to be suggesting above.
Registers and scheduling. (Score:2)
First, the lack of registers in the x86 architecture. Having a fast cache is great, but it's not as fast as a register, and it takes extra instructions to load and store
This is true, and greatly hampers things like loop unrolling on the x86. However, it turns out that register renaming prevents a lot of the stalling that you'd expect with so few registers (write-after-write hazards vanish). Thus, while the small number of registers does degrade performance, it doesn't degrade it catastrophically.
Second is the relatively finer granularity of the instructions available on a RISC architecture. Although there is some merit to making decisions based on information only available at runtime, that isn't a big factor with today's technology. What a modern x86 looks like is a microcode architecture with somewhat intelligent scheduling of the instructions. In most cases a compiler could do a better job.
Actually, since the Pentium Pro, x86 processors have been fundamentally RISC-ian. x86 instructions are decoded into "micro-ops" (Intel's term), which are essentially RISC instructions. These can be scheduled by the processor as effectively as RISC instructions.
The decoding adds latency, but that's what the P4's "trace cache" is for. Arguably, a compiler with access to the underlying RISC instruction set could do better scheduling, but in practice the gain is marginal (especially since most people don't seem to use really-good compilers). I also have a sneaking suspicion that basic blocks in most code are small enough to fit inside the processor's scheduler window, which means that the compiler probably _wouldn't_ do a better job in most cases than the hardware scheduler. Higher-level transformations like loop unrolling have benefit even if done at a CISC level.
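For example (an illustrative sketch, not from the parent), a 4x-unrolled dot product keeps four partial sums live at once, which is exactly where eight architectural registers start to pinch:

    /* 4x unrolling with independent accumulators: s0..s3 break the
       serial dependence on a single sum, but all four (plus two
       pointers and the counter) must stay live across the loop --
       cheap with 32 registers, spill-prone with 8. */
    double dot4(const double *a, const double *b, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)
            s0 += a[i] * b[i];
        return (s0 + s1) + (s2 + s3);
    }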
In summary, I'm not sure there's a very big performance hit from the instruction granularity (just a silicon hit).
I am impressed with your knowledge of the subject, though.
Re:I don't know what to say (Score:2)
(jfb)
This is an all too familiar story (Score:2)
There seem to be two constants in processor generations that I can see:
(1) The new generation always takes longer, runs slower, and has more things taken out than was originally suggested,
(2) the old generation gets ramped up to clock speeds way beyond what was originally anticipated while we wait.
I remember when the PowerPC G4 was going to have a lot of changes including multi-core and run at most of a GHz - what we eventually got was in some ways a smartened up (mostly fp & memory improvements) G3 + altivec. Meanwhile IBM keeps making the G3s faster and faster.
Then of course there is the story of how the whole x86 architecture wasn't supposed to get this far before being replaced by something with less cruft.
P4 is not a suitable server processor ... yet (Score:2)
especially servers (which is currently the only thing P4s are worthwhile in right now)
Erm, riiiight. The lack of dual and quad processor capable P4s and P4 motherboards is a major reason why the P4 will not be used in servers in any company that thinks its policies through.
The fact that the CPU and chipsets are also unproven will mean that corporates will hold back. They also realise that for servers, FPU performance is not an issue, nor is the presence of amazing SIMD capabilities. Multi-processor capability, yes; CPUs with lots more cache than 256K, yes.
The power requirements of the P4 are also staggering - up to 75W. Compare this with the "too hot" Athlon at 40-60W, and with the <20W Palomino 1.5GHz coming out in March/April next year... the AMD760MP chipset... Foster not arriving for another year... PIIIs not getting any faster... AMD have to leap at the chance to get a foothold in the multiprocessor server market in the next 6 months.
I for one will jump at the chance to build dual 1.5GHz Palomino-based servers (and home computers!) that use less power than ONE P4 at 1.5GHz, use DDR SDRAM, not RIMMs, cost less, and have over a year's market life behind them so they can be seen as a proven solution.
I am more interested in the Alpha though - 8 channel RAMBUS for 10GB/s bandwidth! Shame I can't afford them :-(
Re:This is an all too familiar story (Score:2)
Guess you haven't seen the specs for the new Alphas then. Instruction throughput per clock is still higher than the previous version, whereas with the P4 it is less than the PIII. One of these processors is going in the right direction.
Shame that Alphas cost so flippin' much. But considering the 150 million transistors on the 21364 and the 300-400 million transistors on the 21464, this can be understood, I suppose. 1.5MB of on-die cache adds to it as well. I want to read more about the SMT the 21464 is using, though.
Oh, yes, the Alpha article can be found from the front page of AMDZone [amdzone.com].
Re:herm... (Score:2)
This is about the Pentium 4, not AMD, you know? (Score:2)
AMD zealots, I mean this seriously: You have moved past the realm of simply enjoying a product to becoming annoying zealots, like Jehovah's Witnesses. Please, please, please, consider taking a lower key "live and let live" approach. As it is, I think many companies shy away from anything involving the term "Linux" because they know what kind of people come swarming around when they hear that word.
This is not a troll, nor a flame. It's a gentle suggestion that the rabid, juvenile AMD advocacy is doing harm in at least my particular case. I doubt I am alone.
This is not a story of failure (Score:2)
Of course every geek would like a processor that has 500 integer units, 200 floating point units, and a gigabyte of on-chip RAM. But cost, development time, power consumption, heat, and reliability all come into the picture. The P4 team started with lofty goals and scaled them back to meet reality. That's how any hardware or software engineering project works. How often do you hear people say "We added tons of extra features, had better performance than projected, and finished six months early"?
A good many consumer hardware junkies don't understand that "faster, faster, more, more, more" is not a worthy goal. The goal is "good performance given real-world constraints." I know that people who would willingly pay $500 for a video card don't understand this, but this is how engineering of commodity items works. AMD has exactly the same set of constraints. It's not like AMD engineers can magically solve all of these problems. If anything, perhaps AMD is keeping their sights lower, so they don't have to scale back as much in the end.
Re:The P4 is the world's fastest microprocessor. (Score:2)
This has been said often enough for so many different processors that it has become trite. From experience, extra bits of compiler optimization rarely pay off in a big way. Quite often, it is impossible to tell the difference between minimal and full optimization settings. I suspect that contrived examples are being used for benchmarks, such as an image filter that takes 10 seconds to run and spends all its time inside a 16-instruction loop. Sure, one tweak to the scheduler will make it run in 8 seconds instead, but how realistic is this? It isn't a win in the general case.
Re:The P4 is the world's fastest microprocessor. (Score:2)
That's why I was talking about SPEC_CPU, the most comprehensive and well balanced CPU benchmark suite on the planet, and not some crappy toy benchmark. Indeed, the P4 does very well on recompiled toy benchmarks as well, but I didn't mention them because they don't tell us anything useful.
FYI, SPEC_CPU is about as far from some "image filter that takes 10 seconds to run and spends all its time inside of a 16 instruction loop" as one can get. Indeed, it is a suite consisting of no less than 28 benchmarks, each designed to stress different algorithmic and data set size combinations, and each very non-trivial. It is the industry's only truly cross-platform benchmark, and it is designed and revised every few years by a committee consisting of some of the foremost experts on high-performance and scientific computing, and advised by every significant MPU vendor to assure fairness. It does not, as you imply, allow any hand-tweaking of assembly code, nor--like most benchmarks--does it come in the form of precompiled binaries which may favor one platform over another. Instead, it comes completely as source code, to be compiled by a vendor supplied compiler--which must be publicly available within a certain time frame--under very specific regulations. The "base" and "peak" categories refer to different levels of allowable customization in the compiler settings, and indeed all compiler flags used must be revealed along with the results. And rather than taking 10 seconds, a full SPEC_CPU run takes a couple hours even on a P4 or high-end Alpha; on the reference machine (i.e. a SPEC_CPU2000 score of 100) it would take something like 12 hours!
So, nice try. But trust me, the only way to beat SPEC_CPU is to build a really fast CPU. It also helps to have an amazing compiler--which Intel does with its VTune 5.0 compilers--but that allows nowhere near the potential for unfair binaries that precompiled benchmarks do. Also, being aimed at the high-performance market rather than the PC market, SPECfp2000 has been criticized by some as "unfairly" rewarding the very large memory bandwidth of the P4 compared to the P3 and SDR SDRAM Athlon. For an IMO interesting technical discussion of this issue, you might want to see this thread [aceshardware.com] over at Ace's Hardware. (See if you can guess who I am.)
Re:herm... (Score:2)
Additionally, apps that really need the horsepower are going to be recompiled to take advantage of the new pipelines and SIMD instructions. What is remarkable is that this new CPU design can even keep up with pre-existing apps without recompilation.
As for having to buy new boards, cases, etc.: do you really think that the majority of CPU purchases are made by people who would even think about opening up their cases? Though I don't have numbers, it is my understanding that businesses are the primary purchasers of computers; especially servers (which is currently the only thing P4s are worthwhile in right now). They don't upgrade; they get all new machines. So the fact that the power supplies are different is irrelevant. Even in light of the fact that they'll be marginally more expensive because of their newness.
Personally I'm not satisfied with the P4. But that probably doesn't really interest Intel too much. They have the highest clock speed and will soon have some of the highest benchmarks.. And brilliant IT people will see these numbers and hurd.
-Michael
Re:herm... (Score:2)
Now *that* is a sight to see: "Hurd" and "brilliant" in the same sentence! :)
The problem with ever smaller chips (Score:2)
Having said that, I am sure our newly elected president will delegate to some smart people the task of figuring out either cloning or chips, whichever comes first.
Also, on the same subject, in Europe they finally figured out the Royal Family's behaviour for the last couple of centuries: it's a human strain of Mad Cow Disease.
Re:Hmmm (Score:2)
Re:Marketing is not engineering (Score:2)
Maybe not, but cost analysis is engineering. Don't believe me? Next time your manager asks you to outline your approach to a new problem, present him with something that requires 10,000 developers and a $60 billion equipment expenditure.
Yes boss, our next server should use the Hoover dam as a power supply, and hand-wound relays instead of transistors for the processor core. Actually, I'd kind of like to see that...
Re:Sure (Score:2)
Re:What is the DEAL with CPU cache!? (Score:2)
herm... (Score:2)
Re:DO PEOPLE STILL EVEN USE INTEL CHIPS THESE DAYS (Score:2)
Marketing is not engineering (Score:2)
The first reason is: nearly all high tech projects start with rosy goals and then reconsider when they know exactly what is essential and what is feasible. So all this crap about "we wanted to do it better" is pure marketing. If they really could do it, they would, because they need something to fight Athlon.
The second reason is: the article does not tell us anything about the compromises necessary to reach high clock speeds for the sake of marketing.
And the third reason is: the article does not even hint at the possibility that P4 might have been castrated to not appear much better than Itanium/McKinley in floating point.
A simple question... (Score:2)
Other limits exist within the architecture (Score:2)
First, the lack of registers in the x86 architecture. Having a fast cache is great, but it's not as fast as a register, and it takes extra instructions to load and store, unless you go to the more complicated addressing modes, with the problems that you note.
Second is the relatively finer granularity of the instructions available on a RISC architecture. Although there is some merit to making decisions based on information only available at runtime, that isn't a big factor with today's technology. What a modern x86 looks like is a microcode architecture with somewhat intelligent scheduling of the instructions. In most cases a compiler could do a better job.
Where microcode might really be nice is in mitigating the effects of optimizing for one single processor. You could have write-once, run-optimally-on-any-x86 code, but as we see from the real world, that's working about as well as write-once, run-anywhere is with Java.
I completely agree with you on the importance of higher-level parallelism. In most cases, the instruction-level parallelism in code is low. SIMD in particular seems a waste to me, since one of the few things less likely than code that doesn't have dependencies is identical code operating on different values that doesn't have dependencies. It has its places, but not that many of them. And with all the indirection in neat object code you get a lot of cache misses anyway, so on that account you don't even take as bad a hit as you might think from running threads in parallel.
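To illustrate the distinction (a sketch of mine, not the parent's): the first loop below is what SIMD wants; the second is what much code looks like once a dependency appears.

    /* SIMD-friendly: every iteration is independent, so four floats
       can be handled per packed SSE operation. */
    void scale(float *dst, const float *src, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = src[i] * 2.0f;
    }

    /* Not SIMD-friendly: each step consumes the previous result,
       so iterations cannot run in lockstep. */
    float running_product(const float *src, int n)
    {
        float p = 1.0f;
        int i;
        for (i = 0; i < n; i++)
            p *= src[i];
        return p;
    }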
Of course, there are alternatives. Running multiple branches so that a wrong prediction won't stall is a decent, although less efficient, use of extra execution units also.
P5 - The Art of War (Score:2)
The solution? Balancing the extreme power and cooling requirements of the processor by making the computer made out of PURE DEATH!!!
NeoNecroElectrical Engineering is the future...
Re:A simple question... (Score:2)
The only thing that matters is Quake benchmarks.
Keep telling yourself that, or we'll send you to the re-education camp.
Not much performance decrease (according to article) (Score:3)
The real reason for the chip's inherent "performance losses" is the running-string that's slowly being pulled to its breaking point -- that is, the x86 architecture. Hopefully Itanium will change all that.
More info (Score:3)
PPC (Score:4)
Re:Clock (Score:4)
A friend told me that in the early days of portable transistor radios, some manufacturers would intentionally add non-functional transistors to their radios, just so they could advertise them as N transistor radios, where larger values of N were "better".
Next breakthrough (Score:4)
Maybe it's me, but I can't think of a similar 'breakthrough' advance in recent years. I remember reading somewhere that computers are approaching the 'limits' of current architecture design - we can only crank out so much from today's motherboard/x386 technology. I know that optical computing is slated as the next wave, but I can't help thinking that to bring this to light there needs to be a new "Apple I" breakthrough. Am I off base here?
The real limits, IMO. (Score:5)
Actually, while inconvenient, the x86 architecture isn't as horrible a limit to performance as a lot of people seem to be assuming. The main problem is the extra latency in the decode stage, which lengthens the pipeline somewhat, but the P4's trace cache takes care of that.
The real problem with the P4 is that it has very weird optimization requirements (the whole "bundle" thing) and so needs a very smart compiler if code is to run quickly. Generally, even if compilers like this exist, they aren't used (remember the original Pentium?).
The other problem with the P4 is the long pipeline, which exacerbates stall problems.
As for architecture in general, heat issues are what's limiting clock speeds (for x86 and non-x86 processors alike). However, the main limit people are noticing is the limit to the number of instructions you can run in parallel. As long as you're executing only one thread, you're not going to be able to squeeze more operations per clock beyond a certain point. The "performance problem" isn't with clock speed - it's with people expecting new chips to do more, clock for clock, than old chips while running serial programs. This parallelism problem affects all chips - x86 and non-x86.
This is why the major manufacturers are starting to look seriously at SMT (Simultaneous Multi-Threading) chips. Running multiple threads in parallel on one chip doesn't take much extra hardware, and makes it *much* easier to schedule concurrent instructions and to keep running when one instruction stalls (your "Instruction-Level Parallelism" goes up in proportion to the number of threads).
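For illustration, a minimal POSIX-threads sketch (assumes a pthreads environment; the function names are mine). Each thread is an independent instruction stream, and an SMT core can issue from both in the same cycle, filling slots either one would leave empty:

    #include <pthread.h>
    #include <stdio.h>

    /* Two independent workers: no shared data, no dependencies
       between the streams, so an SMT chip can interleave them. */
    static void *worker(void *arg)
    {
        long sum = 0, i;
        for (i = 0; i < 100000000L; i++)
            sum += i & 0xff;
        return (void *)sum;
    }

    int main(void)
    {
        pthread_t t1, t2;
        void *r1, *r2;

        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, &r1);
        pthread_join(t2, &r2);
        printf("%ld %ld\n", (long)r1, (long)r2);
        return 0;
    }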
The P4 is the world's fastest microprocessor. (Score:5)
On what am I basing this apparently heretical statement? On SPEC_CPU2000 [spec.org], the most demanding, well balanced, most respected cross-platform CPU benchmark in the world. As you can see if you peruse these lists, the P4/1500 has the highest scores of any shipped CPU in the world, both in SPECint [spec.org] (base and peak) and in SPECfp [spec.org] (base only).
Before any of you reply and think you've caught a mistake, the Alpha EV67/833 is *not* publicly available, and won't be until January, at which point it will take back leadership in SPECfp_base and SPECint_peak. Of course, the P4/1700 will probably take back the lead when it's released in March or so. Indeed, the P4 and Alpha will likely trade the top SPEC spot back and forth at least until the EV68 (the EV67 moved to a smaller process) arrives.
This is why all this banal talk about the P4 being a crappy chip or (in the wake of this article) a "crippled" chip is ignorant drivel. SPEC_CPU is an exceptionally well designed, balanced, and comprehensive benchmark, stressing a CPU to its limits in all sorts of ways. Why then the P4's disappointing performance on all those other benchmarks? They are all on "legacy" code--code compiled with the P6 core in mind. Because the P4 represents the first chip with a new core architecture (the horribly misnamed "NetBurst" core) from Intel in 5 years, it has a lot of pretty radical design features which don't take well to code compiled for the P6 core. While this means the P4 is a pretty useless (or at least very overpriced) solution for running today's code--and indeed, most code released for at least the next year or so--it has nothing to do with how good a *design* it has, which is ostensibly the point of this discussion. Indeed, the PPro--the first P6 core chip--posted very "disappointing" benchmarks on legacy code when *it* was released 5 years ago; many observers wrote it and the P6 core off as underperforming, overdesigned wackiness from Intel. It was arguably the most successful and innovative CPU core ever. Not so incidentally, this was strongly foreshadowed by its brief theft of the SPECint95 performance crown from the top Alpha of the time...
Now to dispense with the most repeated "points" we've seen thus far.
1) "This just goes to show that x86 is a dead ISA with no headroom to grow." Not the most unexpected statement to be found on Slashdot.
Yes, x86 is a bad ISA, and yes, it presents a problem to be overcome by chip engineers. But it has been overcome and will continue to be overcome--today by adding to x86 processors a decoding stage that turns x86 instructions into RISC-like instructions for internal operation (taken out of the critical path by the P4's trace cache), and tomorrow perhaps by dynamic recompilation software a la Transmeta, IBM's DAISY, and HP's Dynamo, techniques which are still in their infancy and *may* end up providing better-than-compiled performance even without the benefit of converting to a more optimal ISA. The other negative of the x86 ISA, namely the paucity of compiler-visible registers, is indeed a problem, although one partially alleviated by rename registers and partially by evolutionary extensions to the x86 ISA, such as SSE2, which will eventually replace much of the god-awful stack-based x87 FPU ISA.
The real question is, does the performance hit generated by sticking with x86 exceed the performance gain generated by having a much larger target market, and thus more money to spend keeping up with the latest process technology and thus getting faster clocked CPUs? The answer thus far has been a rather resounding "no"--that is, the economies of scale granted by staying x86 have meant processors which are outright faster and cost much much less.
After all, there is no doubt that were the Alpha not around 18 months behind Intel in terms of process technology, the EV67 would be much faster than the P4. On the other hand, the EV67 gets to take advantage of resources that Intel could never dream of in a mainstream chip--like a 300+mm^2 die size, extra wide memory buses, and 4-8MB L2 caches--because of the tremendous added cost. And even with all that plus what is widely acknowledged as the best CPU design team on the planet, the Alpha only manages to keep up with the P4.
Moreover, the rest of the 64-bit world--despite the same advantages as the Alpha (well, except their design team)--can barely keep up with the P3, and that's a 5 year old design. They may be available in multi-chip boxes scaling to kingdom come, but on the level of individual chips, the best that Sun, IBM, HP or MIPS has to offer is pretty lame, despite all the advantages of a RISC ISA. Of course, the same old folks will be claiming that x86 is an inherent dead end when the P4 (or whatever Intel is calling its current NetBurst core by then) scales past 4 GHz two years from now, well ahead of anyone in the RISC world. And we'll hear it again in 4 or 5 years, when Intel releases another all-new x86 core.
2. "The P4 should have left in all those features this article talks about." Uh-huh. Sure. Um...now, who would know more about this? Would that be you, having read some article on the Internet? Or would that be Intel's engineers, who maybe understand the P4 core and the issues involved with these features a bit better than you, and who had the benefit of cycle-perfect simulations on dozens if not hundreds of possible P4 variants running every conceivable type of code?
If there's a feature which doesn't make it into a finished CPU, it's because of one of two reasons:
1) The designers didn't think of it;
2) The designers couldn't figure a way to implement it and make it work with the rest of their design in such a way that it raised performance/cost.
Needless to say, "The designers thought of it, implemented it (which they did in this case), and it was a good feature (i.e. improved performance/cost on a majority of code), but then made a boneheaded decision not to use it," is *not* on the list.
IMO, the features listed here are all better off gone from the current P4. The only really intriguing one--another FPU--was *not* left off for die size considerations (i.e. cost): FPUs are not very big. It was left off for performance reasons. You see, while "more is better" sounds like a nice philosophy, adding an extra FPU would have meant extra decoding and routing logic in the FP section of the chip. Considering Intel actually went to the considerable trouble of implementing this feature and then decided against it, it is very likely that this extra logic was in the P4's critical path. Thus while including the extra FPU would have meant extra performance per clock, it would have meant lower overall clock speeds. Obviously Intel felt the tradeoff worked better without the extra FPU than with it.
If you "disagree" with their decision, please refer to the cycle-perfect simulators which Intel has and you don't, and the P4/1500's SPECfp2000 score, which is a mere, oh, 68% better than that of the fastest P3. Also you might note that the P4 is scaling quite well with clock speed on SPECfp, that it will spend most of its life at speeds well above 2 GHz, and that it will likely sell mostly (at least for the next 2 years) in combination with a memory subsystem providing *less* bandwidth than the current dual-RDRAM i850 chipset--all of which point to this being a very smart decision on Intel's part. (The reasoning is this: if the P4's FPU can already keep up quite nicely with a larger memory bandwidth, then why increase FPU power per clock when most P4s will have higher clocks and lower bandwidth to keep them fed?)
As for the features I'd like to see added to the P4 when it moves to its