Can SSE-2 Save the Pentium 4?
Siloh writes "Ace's Hardware has posted a Floating-Point Compiler Performance Analysis which, in a nutshell, tests Intel's most important claim about the Pentium 4: 'It does not reach its full potential with today's software, but with future software (including SSE-2 optimizations) it will outclass the competition.' They test with floating-point benchmarks which have been recompiled on the latest Intel and MS compilers." Basically, another iteration of the question: can the P4 dethrone the Athlon?
Re:Anyone know when M$ VC++ will support SSE2 nati (Score:1)
Tired... (Score:1)
* Of people who don't understand most of what they write, so they dump all the data instead of focusing on the important parts
* Of over-verbose hardware sites that make you scan 5 pages before getting to the (rotten) beef
* Of clueless people who pretend to be surprised when an optimizing compiler can get a 240% speedup on very specific code.
Btw, this sort of shit reminds me of someone:
"Paul Hsieh, our local assembler guru, analyzed the assembler output of the SSE-2 optimized version of Flops. He pointed out that "some of the loops are not fully vectorized, only the lower half of the XMM octaword is being used." In other words, SSE-2 instructions which normally operate on two double precision floating point numbers are replacing the "normal" x87 instructions and are only working on one floating point number at time."
Anyone think that "Paul Hsieh" == "Bob Ababooey" ?
Cheers,
--fred
Re:Morons (Score:1)
The only sensible metric is performance at the available clock speed, which for the P4 is higher than for the Athlon.
If I had a CPU that achieved 5000 flops/GHz but only ran at 1 MHz, would you want it, or would you want the 1.5 GHz P4?
The problem with Intel (Score:5)
-A former Intel employee
Morons (Score:5)
bestover2.gif [aceshardware.com]
Now look at the place where the P4 shows the most improvement over the Athlon: the first data point, Flops 8, with the P4 using the Intel compiler and the Athlon using Microsoft's.
From the graph, the Pentium 4 clocks in at about 1140 flops while the Athlon gets only 900 flops.
But wait! We're forgetting something. You're running the Pentium 4 at a faster clock speed! For the love of crumbcake, normalize those values for clock speed, please!
Pentium 4: 1140 flops / 1.5 GHz = 760 flops/GHz
Athlon: 900 flops / 1.2 GHz = 750 flops/GHz
Now things are a bit more fair. Yes, with the absolute latest compiler from the maker of the processor, the Pentium 4 beats the Athlon in one of eight tests by a measly ten flops per gigahertz. With the latest compiler from some big software company, the Athlon beats the Pentium 4 in the other seven categories, hands down.
Don't believe everything you read.
p4 ddr chipsets (Score:1)
Re:ehh? Thats dumb (Score:2)
Re:Morons (Score:1)
Re:The answer is (Score:2)
huh? Rambus has much higher latencies than SDRAM. That is why a P3 with PC133 SDRAM outperforms the same P3 with Rambus on most benchmarks. Since you got this part wrong, I take it the rest of your post should be taken with a grain of salt as well.
___
Intel better be careful... (Score:2)
The 800 MHz Itanium has the same SPECint performance as an 800 MHz PIII... if the 1.7 GHz P4 got only 20% better SPECfp performance it would match the Itanium in SPECfp to go with its already 50% better SPECint performance.
Yeah, I know the Itanium is only at 800 MHz, but Intel needs to keep cranking out P4s to fend off the Athlon - they can't afford NOT to release new chips even if the 2 GHz P4 shames their new "top-o'-the-line" server chip.
Sure, the Merced has a better box around it and huge amounts of onboard cache, but given the same surroundings the P4 would make their VERY expensive "server" chip look pretty bad...
=tkk
Hmm. Maybe i'm missing something, but -- (Score:4)
Just seems odd that they'd pass up the opportunity for something like that. *shrug*
Re:This appears to be the typical load of slashdot (Score:2)
And the difference in processors doesn't change what you can do with the computer (while things like changing the OS do). The better analogy here is Dell beating Compaq, which beat IBM.
Even the suits listen when you say "This runs everything the Intel does, as well as the Intel does, for less" enough times.
Steven E. Ehrbar
Re:Morons (Score:3)
It is ignorant to argue that you should normalize for clock speed. The Pentium 4's deep pipelines are present precisely so that the chip can be run at a faster clock speed than otherwise.
With the exact same technology, same fabs, you can't make the Athlon run at the same clock speed as the Pentium 4.
Re:.NET to the rescue (Score:1)
Re:.NET to the rescue (Score:1)
But a troll? Come on. (eyes roll)
.NET to the rescue (Score:5)
Consider,
This means that all you have to do to get the most out of your machine is make sure you have the latest CLR for your processor - the JIT compiler can then optimize for whatever chip it finds at run time.
Of course, this also means that you don't need to recompile to work on any CPU that has the CLR available on it, which makes transferring to IA64 (or any other architecture) a lot easier.
_____
so much for PC's (Score:1)
Just in the mainstream, how many variations are we facing now, or soon?
Pentium w/ MMX is the lowest common denominator...
Intel's SSE instructions
AMD's 3D-NOW!
Aren't there separate instructions in the Athlon, like 3DNow! 2 or something?
Now we're heading towards two different x86 64-bit implementations (yes, IA-64 isn't actually x86 anymore, but since they're bolting an x86 processor onto the silicon as well, it may as well be counted as one)...
Either developers will continue as they've been doing, writing software for the lowest common denominator, which makes all of Intel's and AMD's attempts to add features to their processors useless efforts that ultimately just cost us more money since they can't manufacture as many chips per wafer, or else we're going to start seeing "Windows/Pentium 4", "Windows/AMD", "Windows/64-bit AMD" and "Windows/Itanium" sections in CompUSA and such....
And before the obligatory comment arrives, I'll state that no, I really would not like to compile my own software, which would be possible if everything in the computing world were open source/GPLed/etc...
32-bit FP or 80-bit FP? High end guys need more (Score:2)
The SSE instructions on the P-III operate on 32-bit floats, while the x87 FPU instructions work on 80-bit floats (you can load 32-bit, 64-bit and 80-bit floats into the FPU registers and they are all expanded to 80 bits). Intermediate FPU results are computed/stored as 80-bit values. For SSE I believe (I could be wrong) that everything is 32-bit internally and register-wise.
For scientific and engineering work, 32 bits of floating point (7-8 digits of precision) just doesn't cut it. Most people I know doing that kind of work on a PC (well, both of them) use the FPU but not SSE for that reason. They have apps that take days to perform a single calculation - lots of time for accumulated precision errors to become a factor.
32-bit floats are currently enough for most 3D-graphics work (at PC resolutions), and those games ^h^h^h^h^h apps are probably a bigger consideration in driving mainstream CPU development. Given that the SSE/2 instructions have multiple math units to perform ops in parallel, there must be a big transistor savings in having less precision.
I would bet that the FPU floating point precision on those Sun, Irix, and Alpha boxes is higher than 32-bits.
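If you want to see the precision gap for yourself, here's a tiny test I'd expect to show it (a sketch, not from the article; on x86 with most compilers, long double maps to the x87 80-bit format):

#include <cstdio>

int main() {
    // The exact answer is 10,000,000. The float stalls near 2^21
    // (2,097,152) because half a float ULP there exceeds the 0.1 step.
    float f = 0.0f;          // 24-bit mantissa, ~7 digits
    double d = 0.0;          // 53-bit mantissa, ~16 digits
    long double x = 0.0L;    // x87 extended: 64-bit mantissa
    for (long i = 0; i < 100000000L; ++i) {
        f += 0.1f;
        d += 0.1;
        x += 0.1L;
    }
    printf("float:       %.1f\n", (double)f);
    printf("double:      %.1f\n", d);
    printf("long double: %.1Lf\n", x);
    return 0;
}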
-Mp
Re:32-bit FP or 80-bit FP? High end guys need mor (Score:2)
For 3d apps that's an interesting trade off: More precision at 2 data items or more throughput at 4 data items.
That still doesn't invalidate the point about precision for scientific and engineering applications, and understanding that it may be a factor in deciding what systems to run said apps on.
-Mp
Wrong Hardware (Score:2)
Bryan R.
It's A Different Thrown Now (Score:5)
Bryan R.
Thats not the question... (Score:3)
The real question is the short lifespan of this P4. With Intel going to DDR (thank God) but changing socket types, how viable is a P4 purchase at this point?
Even gamers think about TCO.
I think you are mistaken (Score:1)
He was referring to pipeline length, not width. In a 20 stage processor at the same clock rate, it takes longer to fill a pipeline and consequently the branch misprediction penalty is worse.
Suppose you have two processors, each at the same clock speed. One has a 5-stage pipeline, the second a 20-stage pipeline. Suppose that there is a branch every 6 instructions (which is typical). For every mispredicted branch, the first processor need only throw away 4 instructions, but the second 19. If most branches were mispredicted, it would kill the second processor.
Pipeline length and clock speed are closely related design parameters. Longer pipes allow faster clock rates (because less is done per stage per cycle), but they increase the branch misprediction penalty. Generally there is a "happy compromise" for a processor between pipeline length and clock speed. Most recent chips have found that happy medium to be around 10 stages. The Pentium-4 is unusual in that it has 20 stages. Branch prediction therefore becomes extremely important.
Long pipelines tend to benefit floating-point code more than integer code, because FP is more loop-intensive, and the branches are therefore more easily predicted. This is why the P4, with its extremely long pipelines, performs poorly on integer performance compared to the PIII, but well on FP.
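You can feel the misprediction penalty without a simulator. A quick sketch (sizes and counts made up): time the same loop over random data and over sorted data. Sorting makes the branch predictable, and on a deeply pipelined chip the difference is dramatic.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Sum only the elements >= 128. In random order the branch is a coin
// flip; sorted, it is almost perfectly predictable.
static long long sum_big(const std::vector<int>& v) {
    long long s = 0;
    for (int x : v)
        if (x >= 128) s += x;
    return s;
}

static double seconds(const std::vector<int>& v) {
    auto t0 = std::chrono::steady_clock::now();
    volatile long long s = sum_big(v);
    (void)s;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    std::vector<int> data(1 << 24);
    for (int& x : data) x = std::rand() % 256;
    double shuffled = seconds(data);
    std::sort(data.begin(), data.end());
    double sorted = seconds(data);
    printf("random: %.3fs  sorted: %.3fs\n", shuffled, sorted);
    return 0;
}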
Re:2 things (Score:2)
Strange?? (Score:1)
Re:Strange?? (Score:2)
The interesting thing will be to see how well gcc becomes optimized for the Itanium processor, since Intel's long term plans are really to push this as the future workhorse of high performance computing. Since gcc must start over from scratch with this architecture anyway, maybe it will start out more optimized than gcc for x86, which has had to work with everything from the 386 to the P4.
Re:Strange?? (Score:3)
On the other hand, GCC *does* matter for Linux. It is true that most apps run just fine on Linux compiled with GCC. But clearly newer x86 processors are becoming more specialized, and there are applications where every drop of performance counts. I do large circuit simulations, and a 10% improvement could mean getting results hours sooner. For Linux to compete seriously in these areas, the apps will have to be compiled with a compiler whose results can compete with what's available under win32.
Re:.NET to the rescue (Score:2)
--JRZ
Re:please.... (Score:2)
Re:It's A Different Thrown Now (Score:2)
So, in a sense, Marketing is King.
Re:excuse me but um... (Score:1)
Many people (myself included) use cheap PCs to do number crunching for scientific purposes.
Normally I use the low-end machines, like my home PC (a Linux Duron 900), to develop and test the code I will put to run on Alphas.
I haven't made any calculations, but I suppose that for poor labs with many students, the cost of an Alpha (for example) could finance >2 "lower end" systems, which are also cheaper & easier to maintain and upgrade.
Still a war worth winning (Score:1)
By the way, I run Linux and compile with g++. Does anybody know if the GNU compiler does a good job of processor-specific optimizations?
There are more uses for computers than playing games and reading Slashdot. ;-)
AlpineR
Re:what about gcc? (Score:2)
gcc 3.0 apparently has an entirely new x86 back end, but from comments I've heard it produces code that's around 5% SLOWER than the old back end... It'd be nice to see some comprehensive benchmarks of gcc 2.95 vs 3.0 though.
There's a very interesting open source SIMD compiler project (mainly focusing on MMX) at Purdue university:
http://shay.ecn.purdue.edu/~swar/Index.html [purdue.edu]
Re:This appears to be the typical load of slashdot (Score:2)
AMD is also kicking Intel's ass in Europe, and is expected to continue gaining worldwide market share (from the current 20%+ to close to 30% by the end of the year).
Most consumers don't know enough to make a technical decision anyway - they're going to buy what's cheapest or what their college student geek son/daughter advises.
Re:Hmm. Maybe i'm missing something, but -- (Score:2)
Well, they should, and they should open-source them as well. Intel is primarily in the business of selling processors, not compilers, so getting their P4 performance optimizations into as many third-party compilers as possible should be their top priority.
Better general compiler support for the P4 would be an effective way to compensate for its hardware inferiority to the Athlon.
Re:Morons (Score:2)
Wow, 1140 flops. With some tight code, my VIC-20 would be competitive with this!
Re:Morons (Score:2)
A better way to normalize would be bang/buck.
Re:P4 can't dethrone Athlon in Linux (Score:1)
=================================================
Re:Uh... Hemos? (Score:1)
Spelt. The word you want is spelt
Re:Uh... Hemos? (Score:1)
spelt is the past participle
spelled is the past tense.
or at least it was when I did my O-level.
Re:excuse me but um... (Score:1)
1.5 GB RAM each for $10k for neural computations. We could have gone with Sun, Irix, or Alpha if we wanted one machine with 2-4 processors. I looked into it. It wasn't going to happen.
Re:One big problem (Score:2)
-- Pure FTP server [pureftpd.org] - Upgrade your FTP server to something simple and secure.
Re:32-bit FP or 80-bit FP? High end guys need mor (Score:1)
You guessed wrong. SSE2 can operate on two 64-bit floats in parallel.
Intel on the right path? (Score:1)
AMD had [3dnow.org] a fairly large number of developers promising 3dNow! support, and seemed to be doing the "right thing" by helping developers [3dnow.net] optimize their code.
It seems Intel has picked up on this, and has made it easy to optimize for SSE-2 with their own compiler plug-in for VC. I'm just curious whether this breaks AMD optimizations.
This is definitely a move in the right direction for Intel. I don't necessarily like it, though, because I'm an avid AMD fan.
Re:Problems? (Score:2)
Re:More meaningless numbers (Score:2)
Where is GCC in all of this? (Score:2)
Re:Strange?? (Score:2)
Re:excuse me but um... (Score:3)
>>>>>>>>>>>
Not everyone working on a scientific application is blessed to be in a huge project with infinitely deep pockets. There are tons of college students/projects doing different types of scientific computing, and x86 provides a very good price/performance ratio for these users.
Re:P4 can't dethrone Athlon in Linux (Score:4)
Re:The answer is (Score:4)
Take a processor. It hits a branch instruction. While it is working out whether or not to take the branch, it keeps itself busy by executing instructions from one side or the other of the branch. It gets it wrong, so when it realizes this, it throws away a bunch of work it has done. Hence branch misprediction is a Bad Thing.
Take a second processor, with more pipelines available for instruction issue. Again it makes a branch prediction. Since it has more pipelines available, it is able to issue more instructions while waiting for the branch to be calculated. Again it gets it wrong, and since it has been able to issue more instructions from after the branch, more are thrown away when it realizes a misprediction has taken place.
The point is that while more instructions are thrown away, this is only because more have been issued, and therefore having more pipelines in a new generation does not lead to that processor running slower than previous versions. The increased branch misprediction penalty can only diminish the amount of increased performance that the extra pipelines give you, not lead to an overall speed decrease, right?
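A back-of-the-envelope model (my own simplification, with made-up rates) suggests the answer is yes: the refill penalty grows with depth, but it only eats into the gain from the extra issue width.

#include <cstdio>

// Crude CPI model: 1/width cycles per instruction at best, plus a
// (depth - 1)-cycle refill for every mispredicted branch.
static double ipc(int width, int depth,
                  double mispredict_rate, double insns_per_branch) {
    double cpi = 1.0 / width
               + (mispredict_rate / insns_per_branch) * (depth - 1);
    return 1.0 / cpi;
}

int main() {
    // Assume a branch every 6 instructions, 5% of them mispredicted.
    printf("narrow, 10-stage: %.2f instructions/clock\n", ipc(3, 10, 0.05, 6.0));
    printf("wide,   20-stage: %.2f instructions/clock\n", ipc(6, 20, 0.05, 6.0));
    return 0;
}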
G.
excuse me but um... (Score:3)
I may sound like a troll of sorts, or anti-Intel, but when it comes to high-end scientific engineering, does anyone actually use anything outside the realms of Sun, Irix, and Alpha? Although benchmarks claim to show factual information, I've always seen them as a bit biased.
Typical PIV purchasers in my eyes: gamers, and newbies buying preconfigured PCs. What about this end user - where are the stats for the typical purchaser? Sometimes these benchmarks confuse the average person into thinking the PIV is lowly in comparison to others.
In this article we will try to answer the following three questions:
2. Can better compilers automatically create SSE-2 optimized code from simple C++ code?
3. Can Pentium 4-aware compilers boost the Pentium 4's floating-point performance past the strong FPU of the Athlon?
Again, I may be off my rocker here, but most developers I've met have always customized their own machines - dual processors, other architectures - so again, is it completely unbiased to say the PIV lacks? Confusion about this shit =\
Anyone know when M$ VC++ will support SSE2 native (Score:2)
*gets a feeling of PPro all over again*
Re:What I thought J__ was supposed to do... (Score:2)
Next, why would MS want a write-once-run-anywhere development environment for themselves? They're not about to build their drivers and Win32 API in Java, and any apps they build on top of them are pure C++, so all it would take is a simple recompile for the different platforms.
When Java came out, I don't believe that Alpha-NT was that popular, and SGI-NT was being dropped (not certain about the timing, but it seems about right).
I agree with you about Win 9x being stepping stones, but I don't think cross-platform was a big focus for NT. Yeah, they have the hardware abstraction layer, but I don't know that this wasn't more for stability and protected code than for true platform independence. Thought it was really just a carry-over from VMS.
-Michael
Re:The answer is (Score:2)
I used to call the IA-64 the P7 just so that my lay-friends could know what I was talking about. Its VLIW / speculative execution could probably be considered a new generation. But in reality it's a completely separate product with hardly any ability to compare to the x86 line.
I think, however, that I'd recognize SMT / CMP as a next generation label.
-Michael
Re:Hold on a minute... (Score:2)
My understanding of the proposed SMT on x86 is that you simply switch to another thread when there's a memory stall. I think SPARCs have done that for a while... What I believe you're referring to is the reduction in the number of times you have to context switch and thereby flush your cache. Though it's true that having fewer distinct processes (even LWP ones) requires fewer context switches, I believe that you are not given a time-delta extension simply because you have 2 or more threads associated with a process on an SMT core. Thus, I believe the time-delta is still the same for all processes (minus HW interrupts), and the number of cache flushes per second is the same. Hence, little realized benefit.
Just for completeness, what I think you do get is fewer memory stalls within your time-delta. Additionally, if each thread is stalling, then you at least have multiple concurrent memory requests, which I believe does suit RDRAM well. You could achieve a similar situation by having multiple independent banks of SDRAM (like nVidia's GeForce 3).
In summary, if anything, cache is the weak link towards multi-core / multi-threading.
Re:The answer is (Score:2)
I believe you're thinking of the number it can "issue", which is separate from the number of [semi-]independent pipes. In the PPro, some instructions (like divide) would lock other pipes or stages within their own pipe. Issuing instructions is expensive, so it's generally accepted that you issue fewer than the number of pipes, but as the P4/Athlon have significantly more pipes than their predecessors, they have augmented the number of issued instructions by 1 or so.
-Michael
Re:Morons (Score:2)
The difference is more dramatic between the P5-4 and P5-3, since you max out at about 1 GHz for the P5-3, so I'd be inclined to believe you. The Athlon, however, is not yet out of steam for its current design. If it can best the P5-4 in 50% of the categories (including legacy apps, e.g. modern ones), then the value of the P5-4 is limited, even if it can produce top-notch synthetic scores.
The point is that it is not ignorant to normalize, so long as you look at the peripheral factors. It's like taking the average, but also taking the standard deviation. You do find useful information in such numbers.
-Michael
Marketecture (Score:3)
Intel is the market leader, but they shouldn't let their marketing team design their chips!
Re:excuse me but um... (Score:2)
I do. For my master's project, I've trained hundreds of neural networks, each taking between an hour and 2 days to train. At my job, we're doing the same kind of stuff on Linux and Solaris PCs. I believe a lot of people do that too. PCs are so cheap compared to the other architectures that they're still the best thing to buy for many types of computations.
And by the way, training a neural network requires about one division for several millions of add/mul.
Re:compiler plug-ins (Score:2)
No, they do not. AMD uses Intel compilers for their SPEC scores, since they are the best x86 compilers.
Re:In that case... (Score:2)
You seem to be confused. AMD has the choice of any compiler in the world to use when submitting SPEC benchmarks. They choose to use Intel's because it is the best. If Intel crippled support for AMD processors in its compiler, then AMD would use a different compiler. Of course, if AMD had compiler expertise they would develop their own compilers optimized for their chips. But they don't know how to develop compilers (and that will be quite a performance limiter for x86-64, since they will have to rely on GCC, which has terrible performance).
Re:The problem with Intel (Score:3)
You are wrong. The DP capable P4 (known as Xeon) was launched in May, and was launched well before the DP Athlon was released. Moreover, you can buy real dual Xeon systems from Dell, IBM, Compaq, and the like, yet you cannot buy a DP Athlon system from any major vendor, since no major OEM's want it.
Re:The answer is (Score:2)
Historically, if you took code for one processor and ran it on a later processor, the later processor would always do a better job of running it than the original. (The major, glaring exception to this was the Pentium Pro, which really sucked unless you optimized the code for it.) This is why Linux distributions such as Debian just optimize for the 386 and call it good -- most of the time, for most of the applications, you won't pick up very much performance by optimizing for a specific chip architecture. (By the way, you should rebuild your kernel with chip-specific optimizations. Your kernel is running all the time, and any savings will add up quickly. Of course, all the CPUs are so fast these days that few of us will really notice any difference even with the kernel.)
But now the Pentium4 has so much wrong with it, that unless you rearrange the code specially, it chokes and underperforms. The Level 1 cache is actually a cache for decoded instructions, which is cool... but it is only 8K, which is insane! Sure, since the instructions were already decoded, the 8K cache is probably worth a bit more than a simple 8K instruction cache, but the Athlon has a 64K instruction cache! The Pentium4 has all these internal execution units, but it can only feed three of them per clock cycle from the cache, so most of them will be idle in any given clock cycle. And while earlier chips introduced cool features that would make code run really fast (bit-shifting was really fast, and there were special instructions like CMOVE) these all run dog-slow on the Pentium4.
So, the Pentium4 runs really hot, and needs special cooling and a special power supply. Right now it needs expensive RDRAM. And it needs special optimizations to allow it to run at full speed. Summary: unless you really need its special features, buy an Athlon.
When does a P4 beat an Athlon? Some specific situations where RDRAM is really appropriate, some specific situations where the SSE features really work (and assuming the code is optimized for it), and that's about it.
Can a future P4 dethrone the Athlon? Maybe. Intel claims that the P4 is slower, clock-for-clock, than the Athlon for a good reason: because the P4 will reach really high clock speeds really fast. Some breathless press release I read said something about a 10 GHz version of the P4 within four years or so. Let's face it, the P4 can stay as broken as it is and still stomp the Athlon if Intel can really get the P4 going twice as fast or more than the Athlon! But I'll believe it when I see it. The current P4 goes into thermal overload and slows to half-speed if you work it really hard, and dissipates 73 Watts at 1.5 GHz; even with a die shrink I'll bet a 10 GHz P4 would melt itself into a puddle.
Because the Athlon gets more work done per clock, and is available at clock speeds nearly as high as the P4, the Athlon is better than the P4 across the board. There are a few narrow situations where the P4 is better than the Athlon, but if you check the price/performance ratio the Athlon still wins.
steveha
Compiler vs processor (Score:3)
=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=
What I thought J__ was supposed to do... (Score:2)
I assumed that the idea of J++ was for MS to have their own Java. That would give them tremendous platform independence. You would write "cross platform" Win32 code, meaning it would run natively on any MS OS. I had always expected that this was why MS bought into Java. An MS version would work on MIPS/PPC/Alpha/x86.
Given that their RISC compilers were always a gen back, this never materialized. However, shipping a semi-compiled mode would have let them become truly cross-processor. I mean, think of it as Install Shield on crack... or a BSD port...
Alex
Or Java, or any open source application... (Score:2)
And of course, as soon as GCC can take advantage of whatever the latest CPU gizmo is, everyone who runs an open source OS or application can simply recompile for a performance boost.
All the more reason, me thinks, for the chip vendors to help the open source compiler developers.
Thad [kuro5hin.org]
Have a look at mprime (Score:2)
Re:Hmm. Maybe i'm missing something, but -- (Score:2)
Compatible with Microsoft* Visual C++* and Visual Studio*, the Intel® C++ Compiler is designed from the silicon up to let developers easily take advantage of the performance and features of the latest Intel® architecture, including the Pentium® 4 processor.
Intel is committed to customer support. See www.intel.com/software/products/prodsupport.htm for further information on product support.
Windows*NT*/98/2000 Full Product Electronic Delivery $399.00
Windows*NT*/98/2000 Full Product CD Delivery $499.00
Windows*NT*/98/2000 Upgrade Product Electronic Delivery $175.00
Windows*NT*/98/2000 Upgrade Product CD Delivery $275.00
Intel® Compilers for Linux* Field Test
Intel® Compilers for Linux, field test versions, are available for download only. No CD-ROM versions are available.
Not all of the GNU C language extensions, including the GNU inline assembly format, are currently supported and, due to this, one cannot build the Linux kernel with the beta release of the Intel compilers and the initial product release. The C language implementation is compatible with the GNU C compiler, gcc, and one can link C language object files built with gcc to build applications. However, the C++ implementation uses a different object model than the GNU C++ compiler, g++, and due to this, C++ applications cannot use C++ object files compiled by g++. For further details, see the FAQs on the support site.
Before using the compiler, we recommend you read Optimizing Applications with the Intel® C++ and Fortran Compilers for Linux to learn about the appropriate optimization switches for your application. You should have received the invitation letter that explains how to get started using the Intel compilers for Linux. All support issues, compiler updates, FAQs and support information will only be available when you register for an account on the Intel Premier Support site. Please register for a support account at http://support.intel.com/support/go/linux/compile
CPU-specific optimisations (Score:4)
As CPU designs get more complex, the compilers need to know more and more about the exact nature of the CPU. Despite the label of binary compatibility given to the CPUs from AMD (and others), those who need to squeeze the best performance out of machines are going to need to run code that is compiled for their specific machine. Despite the best efforts of the open source community, most end users do not want to recompile source, let alone spend time finding obscure /QaxW flags to make the most of the system. Really this should be a job for the OS.
Maybe in the future we will see commercial code being distributed in such a way that parts of the code are compiled on the destination machine as the code gets installed. That way the code vendor can test a variety of compiler options and not have to ship 42 different binaries for all the different CPUs in use.
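Short of compiling at install time, a vendor can also ship one binary that picks a code path at run time. A minimal sketch (using the __get_cpuid helper from modern GCC/Clang's cpuid.h; the transform functions are placeholders):

#include <cpuid.h>   // __get_cpuid, a GCC/Clang wrapper for CPUID
#include <cstdio>

static bool has_sse2() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    return (edx >> 26) & 1;            // CPUID.1:EDX bit 26 = SSE2
}

// Placeholder implementations of the same routine, built with
// different optimization targets.
static void transform_generic(float* p, int n) { /* plain x87 path */ }
static void transform_sse2(float* p, int n)    { /* SSE2 path */ }

int main() {
    // Pick the best path once, at startup, then call through a pointer.
    void (*transform)(float*, int) =
        has_sse2() ? transform_sse2 : transform_generic;
    float data[16] = {0};
    transform(data, 16);
    printf("dispatched to the %s path\n", has_sse2() ? "SSE2" : "generic");
    return 0;
}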
P4 can't dethrone Athlon in Linux (Score:2)
By next year, when many programs are SSE2-enabled, AMD's Clawhammer should take back any lead Intel gets, because it uses SSE2 as well.
Re:This is nothing new... (Score:2)
Re:Morons (Score:2)
There are processors (UltraSPARC III) where the core pipeline is not clocked (this is called wave pipelining). There are caches that are double-pumped; they do work on each edge of the clock instead of only latching on one edge.
And an even clearer fact: different processors do different amounts of work per edge of the clock. If you want a _really_ high clock rate, put only one gate between each latch. That clock rate would be obscene. But half of the work done would be latching the values (assuming you could distribute the clock over so large an area).
If you want to normalize anything, normalize over price. Unless you have stupid friends and compete over having the highest clock.
Oh yeah. Don't bother talking about FLOPS or MIPS. You'll just end up sounding stupid (and you need all the help you can get). Any benchmark not targeted to YOUR specific application is next to worthless.
Heh, some processors don't even bother to dispatch NOPs. With a little hackery, they could "execute" as many NOPs per clock as the depth of their dependency issue window.
Re:One big problem (Score:2)
Wow. You sound like a really smart guy. I bet you can think of all sorts of reasons why BSD is dying. Why don't you share some with the slashdot community?
Re:Anyone know when M$ VC++ will support SSE2 nati (Score:2)
Somehow I'm not shocked.
/Brian
This appears to be the typical load of slashdot bs (Score:3)
But back to the real world: if you turn on a computer out there in happy fun land (aka "The Real World"(TM)), then odds are it will be running Intel. Linux, your precious kernel, started out with optimized non-portable code for the i386. You geeks keep falling victim to the same trap year after year... just because it's better doesn't mean people will use it. Linux/BSD/Solaris/Irix/SVR4/MacOS/BeOS... is clearly better than Windows when you look at the track record... and yes, in some cases can be *almost* as easy to use... but Microsoft has been winning the OS war since they made a *bad* ripoff of the Macintosh (read: Xerox) GUI OS. MacOS was better, more stable, and quite a bit cleaner... but Micro$oft had the market share and they won. People listen to money, and Intel is still the processor most people/companies would prefer buying. Hackers are one of the lowest demographics in the computing industry these days, and people (outside of their community) don't pay much attention to them.
Well, I guess that's it... go ahead, return to your illusion and mod this down.
Re:The answer is (Score:5)
Rumours have it that the Pentium IV will have Simultaneous Multithreading (SMT) enabled, which lets the processor run any instruction from any thread on any unit at any time. Supposedly this feature was already included in current processor designs but not enabled, because the P6-4 is not ready for SMP yet.
AMD uses on-chip multiprocessing (CMP) in Sledgehammer, which is basically the same as subdividing the resources of the CPU (registers & units) between the threads. The benefit of this technique is that the design can be kept simpler and the clock can go faster than a similar monolithic chip with the same resources. On the other hand, a lot of resources are wasted if only one thread is operational in this setup.
Needless to say, SMT has some problems too; for example, CMP lends itself much better to branch prediction through slipstreaming than SMT does. You can find some good reading in this previous slashpost [systemlogic.net] about how Intel and AMD deal with multithreading on their single/multiprocessor designs. To be taken with a bit of salt of course, but very sharp.
My point is that if branch prediction in the form of slipstreaming is implemented (and Jackson Technology seems to be that kind of SMT), the P6-4's problems with excessive cache flushing are completely over, and SMT can take full advantage of the smaller RAMBUS latencies, easily outperforming a similar CMP setup like AMD's.
Re:Isn't the Pentium 4 in the P7 family? (Score:2)
Here's a good analogy.. (Score:2)
One thing that you forgot - it takes more time to go back and run the other branch if there is a longer pipeline. Hence, a CPU with a long pipeline will sit there idle as the data makes its way through the pipeline.
To better visualize how a pipeline works I like to think of this little analogy:
Have a line of people passing buckets of water from a well to a burning house. Given that every person works at a given speed, it takes a defined amount of time to move the water from one person to the next. The more people present, the smaller the distance each bucket has to move per hand-off. This allows them to move more buckets in the same amount of time (or operate at a higher frequency - just like the P4). The problem is it takes longer for the water to actually get to the fire (assuming 20 vs 10 people working at the same frequency). Now let's say there are two different kinds of water (a very hypothetical situation). Should the wrong type of water be sent and arrive at the house, the guy at the house would have to tell the guy at the well to send over the correct type of water. Now with more guys in between the two, it'll take longer for the correct water to get to the house. While the water is in transit, the guy at the house sits wasting his time.
So as you can see - more people increases the potential speed. The speed determines the volume of water being sent. This is great, but if the wrong thing is sent it takes a long time, because the correct "thing" has to travel through the enlarged pipeline.
A long pipeline is great if you're running code that doesn't have a pile of "branch if" instructions in it. Performing an "add" on every byte in a 4MB file (think Photoshop) will result in very efficient use of the CPU. However, if you're running code with lots of "if then" statements, then you run the risk of wasting a great deal of CPU time. This is where a smaller pipeline helps (or should I say, doesn't cause as much damage).
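One trick when you can't avoid the test itself: turn the branch into arithmetic, so there's nothing to mispredict. A sketch (hypothetical filter, not from the post):

// The same filter, branchy and branch-free. On a long pipeline the
// second form can win because there is nothing to flush.
int sum_branchy(const unsigned char* p, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        if (p[i] >= 128) s += p[i];    // data-dependent branch
    return s;
}

int sum_branchless(const unsigned char* p, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) {
        int mask = -(p[i] >= 128);     // 0 or all-ones, no branch
        s += p[i] & mask;
    }
    return s;
}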
The other big problem with a large pipeline is that it greatly increases the complexity of the chip design. More transistors result in the components of the CPU getting spread further apart - hence you need an even longer pipeline (think of the burning house example - the house just moved an extra block away from the well).
Overall, chips with smaller pipelines offer far greater efficiency. Look at a G3 PPC CPU. It has a 4-stage pipeline. Because of this it maxes out at 700MHz, but it is faster than a PIII when comparing MHz to MHz. All this and it's a third of the die size and typically uses only 5 watts. You can also look at the Alpha with its 7-stage pipeline. It might not operate as fast (MHz) as today's P4s or Athlons, but it still offers incredible performance.
The real advantage of the P4 will come with multimedia-type applications. The problem is that it will quickly max out the memory bandwidth. Now take an Athlon - it might not be quite as good for those same apps, but so long as it can also max out memory bandwidth you're not going to see a difference. As John Carmack (did I spell that right?) said in a recent /. posting - the new G4 is great, but the main problem is memory bandwidth. As CPUs double in speed this will become an even greater problem.
Willy
Re:ehh? Thats dumb (Score:2)
Having a smaller engine with more HP/Litre allows you to have a smaller (lighter) car. The reduction in torque becomes much less significant since the engine has less mass to accelerate. If GM made an small 4-cylinder engine with only 61 HP/Litre, and put it in a car the size of the S2000, it'd be terribly slow and wouldn't compete too well.
If Honda made a 5.7L V8 engine, it probably wouldn't scale linearly, but I'm sure they could easily do better than 61 HP/Litre. Why haven't they? They probably 1) aren't interested in large V8-engined cars, or 2) don't feel that it'd be a profitable market segment for them to enter, especially given their reputation for small, lightweight cars. They are planning to make a V8 NSX soon, though it'll probably be more like 4.0L.
In the end, I guess it comes down to which kind of engine you prefer in a car, and in what kind of car: a small car with a small, high-revving engine (but not much torque), or a larger, heavier car with a large, powerful engine which concentrates more on low-end torque. If you like lots of torque, Honda probably isn't the company for you to be buying from.
Re:It's A Different Thrown[sic] Now (Score:5)
Cheap chips rule in a soft market, and AMD has demonstrated the ability to produce wicked-fast chips at cheap prices. This would seem to be the best evidence yet that Intel has lost its way and the bureaucracy is in need of some serious housecleaning.
Some blunders:
Tying themselves legally to Rambus
Talk of discontinuing the P3, their best mover.
Pushing the 1.13GHz P3 out the door before it was ready and suffering the consequences.
Slashing prices and subsidizing RDRAM just to move P4 product.
The P4 may have some advantages, but imagine what it would be like if AMD had rolled it out... um hm.. It would have killed the Athlon alright, assuming the Athlon were Intel's. ;-)
The truth is out there. [faceintel.com]
-- All your .sig are belong to us!
Intel Compiler costs $???? (Score:4)
"Oh look, all of these games are optimized for Intel chips. They must be good!"
Better yet, if they want their CPUs to get on top of the server market, they should be releasing the source code for their compiler as well. This would let the gcc crew use the optimizations in their compiler, creating better/faster *nix software. (Unix being the server platform of choice for more large companies I've worked with than I can shake a stick at. I won't get into why, as that will probably start a small war.)
Bottom line, make the compiler free, and open the source, and Intel would definitely take off again.
Until that day, though, I will stick with AMD since they have better prices for equal performance.
Guys!! (Score:2)
http://www.dell.com/html/us/segments/dhs/intel_am
But go to a different page before you paste it in, so that they won't know we're all coming from slashdot.
~
Re:Isn't the Pentium 4 in the P7 family? (Score:2)
in the summer of 2000 it tried to push the aging "P6" architecture too far. The P6 design, or 6th generation of x86 processor which since 1996 has been the heart of all Pentium Pro, Pentium II, Celeron, and Pentium III processors, simply does not scale well above 1 GHz. As the aborted 1.13 GHz Pentium III launch this summer showed, Intel tried to overclock an aging architecture without doing thorough enough testing to make sure it would work. The chip was recalled on the day of the launch, costing Intel, and costing computer manufacturers such as DELL millions of dollars in lost sales as speed conscious users migrated to the faster AMD Athlon.
From the article I linked before.
 _
Re:The answer is (Score:2)
Thanks for the info.
 _
Re:The answer is (Score:2)
Typical instructions take more clock cycles to execute on the Pentium 4 (not "P4"). Longer and more numerous pipelines don't mean more instructions can be fed and executed in one clock cycle. Also, with the longer pipeline used in the Pentium 4, flow control operations (such as branches, jumps, and calls) need more time to refill the pipelines.
(Reminder: this is a very simplified view.) In theory the execution units can process 9 micro-ops per clock cycle, but thanks to the problems in the cache design, the front end can only feed 3 micro-ops per clock cycle.
The Pentium III's decoder can feed up to 3 instructions and 6 micro-ops (4+1+1) to the core per clock cycle.
The Pentium III is like a motorcycle engine in a motorcycle. The Pentium 4 is like upgrading the same engine to run a bus. (Just ignore this if you think the analogy is wrong ^_^)
I might have missed some points. Please comment.
 _
Re:Morons (Score:2)
How about this [aceshardware.com], this [aceshardware.com] and this [aceshardware.com]?
Don't believe everything you read.
Assuming you believe everything on Ace's Hardware, do you believe the graphs above?
 _
The answer is (Score:5)
No.
Let me explain it this way: the Pentium III has 6 10-stage pipelines for out-of-order superscalar execution, while the Pentium 4 (avoid using the short form P4 - the Pentium 4 is in the P6 family) has 9 20-stage pipelines.
More pipelines and more stages sounds good, huh? Unfortunately, in some benchmark tests the Pentium III beats the Pentium 4, due to the fact that the Pentium 4 will flush the entire set of pipelines on a branch misprediction/pipeline stall. As a result the Pentium III can out-perform the Pentium 4 on some occasions, as the latter tends to lose more instructions when the branch misprediction rate is too high.
The Athlon, on the other hand, only flushes half of its pipelines on average. They really need to fix this fundamental design glitch before they can beat the Athlon.
If you are very interested in this subject you can read this article [emulators.com]. You can understand why Intel cannot give up the Pentium III in favour of the market for the Pentium 4.
 _
Re:One big problem (Score:2)
Microsoft's current compilers, while inferior to Intel's on the new Intel processors, are better with the Pentium Pro-style architecture than gcc, but that's just because of different development goals (gcc tries to serve everyone; Microsoft can focus on a much more limited set of CPUs). It's not a grand conspiracy or anything.
Re:Intel should.. (Score:2)
Nevertheless, like I said, I'd be shocked to see Intel open source their compiler...But I wouldn't be shocked (and I think it makes a lot of sense for them to do this) if they started giving away the Win32 binary for free (as in beer). Otherwise the majority of developers are going to keep using Visual C++ and/or Cygwin/gcc and Intel's chips are going to continue to look inferior to AMD's, even if that view is not entirely accurate.
Re:CPU-specific optimisations (Score:3)
Intel should.. (Score:5)
I won't even get into the argument about how it might help them to Open Source the thing so that parts of the technology might be rolled into other compilers like gcc, because I just can't imagine that happening anytime soon.
More meaningless numbers (Score:4)
As for the SSE extensions, Intel tried this first back with MMX, and Apple is trying it now with AltiVec (sp?). Yes, these extensions can help, but only after software is optimized for them. It's not a case of "drop 'em in and watch out!" It takes time to develop.
Of course, all of this is just marketing. Kinda like the MHz wars. Intel needs some positive press after that oft-quoted test where the P3 trounced the P4.
Intel Compiler's Athlon Optimization (Score:2)
while (running_on_athlon) {
    putz();
    putz();
    putz();
    wait(ALONGTIME);
    putz();
    putz();
    putz();
    do_operation();    /* the one useful instruction */
    putz();
}
Using this processor-specific "optimization" for Athlon chips, the Pentium 4 has managed to outrun the Athlon. Intel's compiler cannot realistically be expected to generate optimized code for the Athlon. Any of their comparisons based on their compiler should be highly suspect.
Re:Hmm. Maybe i'm missing something, but -- (Score:5)
what about gcc? (Score:2)
Interesting results! Looks like heavily optimizing one's compiler pays huge dividends in terms of processing power.
There's an important question though. The article used the MS compilers exclusively, with the best results coming from the Intel plug-ins - since these are apparently the industry standards. However, I'm at a university, and everybody I know is using gcc. We would be very interested in the kind of performance that is displayed here. Does gcc keep rigorously up to date with the most modern CPU technology, or does it lag (and if so, how much)? How long until these optimizations will appear in a release of gcc?
One big problem (Score:2)
After one full week of testing, we found the problem wasn't with BSD at all; it was with the P4 on BSD. It would seem Intel has an enhanced instruction-set cache which is only available with Microsoft compilers. This is not a trivial thing to implement, so I doubt the OSS camp will be able to migrate it into their compilers anytime soon.
processors heading in the wrong direction.... (Score:2)
To me, it seems like we're moving toward a time when there will be different versions of OSes for each processor (MyOS for Intel / MyOS for AMD). It's going to be increasingly hard for vendors to write code that is optimized for all processors.
Anyone else think this way? Does this make sense?
Short Answer (Score:3)
-------------------
Re:Short Answer (Score:2)