
Comment Re:OMFG (Score 1) 231

These are all good points. I (as in "I who wrote the paper and presented the slides") did measure power, and for LINPACK you do hit TDP. See my other publications. And unfortunately, we don't get to choose the voltage-frequency point with such flexibility, and neither do AMD, Intel, or NVIDIA. Operating voltage started at 5V and now it is at 1V. A silicon junction switches at 0.7V, and the closer you get to 0.7V the less reliable the junction is (that's why it once was 5V). So you have about 0.3V of headroom at most. And frequency is capped at 4 GHz due to the voltage problem. So you have to live somewhere between 1 GHz and 4 GHz. Look up Dennard scaling and its demise for details of voltage, frequency and area scaling. I only make presentations about iPad apps so don't know much about hardware ;-)

Comment Very bad article (Score 3, Interesting) 396

This is a very poor quality article; I analyzed it before. There are possibly better ones mentioned by others.

Just look at the matrix multiplication case. Look at the graph and see that 1000x1000 takes 30 seconds on CPU and 7 seconds on GPU. Let's translate it to millions of operations per second: CPU -> 33 Mop/s, GPU -> 142 Mop/s. Matrix multiplication has cubic complexity, so for CPU: 1000 * 1000 * 1000 / 30 seconds / 1000000 = 33 Mop/s, and for GPU: 1000 * 1000 * 1000 / 7 seconds / 1000000 = 142 Mop/s.
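To make that arithmetic easy to check, here is a small Python sketch. The function name `mops` is just something I made up for illustration; the only inputs are the timings read off the article's graph (30 seconds on CPU, 7 seconds on GPU) and the assumption that an n x n multiply costs n^3 operations.

```python
def mops(n, seconds):
    """Millions of operations per second for an n x n matrix multiply,
    counting n**3 operations (cubic complexity, as assumed above)."""
    return n ** 3 / seconds / 1e6

# Timings read off the article's graph for the 1000x1000 case.
cpu_mops = mops(1000, 30.0)
gpu_mops = mops(1000, 7.0)

print(f"CPU: {cpu_mops:.1f} Mop/s")
print(f"GPU: {gpu_mops:.1f} Mop/s")
```

Truncated to whole numbers, these are the 33 and 142 Mop/s figures quoted above.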

Now think a while: 33 million operations per second on a 1.5 GHz Pentium 4 with SSE (I assume there is no SSE2). Pentium 4 has a fused multiply-add unit which makes it do two ops per clock. So we get 3 billion ops per second peak performance! What they claim implies that the CPU runs about 100 times below its peak for matrix multiply. That is unlikely. You can get 2/3 of peak on Pentium 4. Just look at the ATLAS or FLAME projects. If you use one of these projects you can multiply 1000x1000 matrices in half a second: 14 times faster than the quoted GPU.
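The same back-of-the-envelope estimate as a Python sketch. All the constants are the assumptions stated above (1.5 GHz clock, two ops per clock, 2/3 of peak for a tuned library), not measurements, and the variable names are only for illustration.

```python
CLOCK_HZ = 1.5e9        # assumed 1.5 GHz Pentium 4
OPS_PER_CLOCK = 2       # assumed two ops per clock, as stated above
EFFICIENCY = 2 / 3      # assumed fraction of peak a tuned library reaches
N = 1000                # matrix dimension
CPU_SECONDS = 30.0      # article's CPU timing
GPU_SECONDS = 7.0       # article's GPU timing

peak_ops = CLOCK_HZ * OPS_PER_CLOCK          # peak ops per second
article_ops = N ** 3 / CPU_SECONDS           # what the article's CPU timing implies
shortfall = peak_ops / article_ops           # how far below peak that is

tuned_seconds = N ** 3 / (peak_ops * EFFICIENCY)   # estimated time with a tuned library
speedup_vs_gpu = GPU_SECONDS / tuned_seconds       # tuned CPU vs quoted GPU

print(f"peak: {peak_ops / 1e9:.1f} Gop/s")
print(f"article's CPU result is {shortfall:.0f}x below peak")
print(f"tuned multiply: {tuned_seconds:.2f} s, {speedup_vs_gpu:.0f}x faster than quoted GPU")
```

Under these assumptions the article's CPU number sits about 90 times below peak, which is the "about 100 times" rounded.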

Another thing is the floating point arithmetic. GPU uses 32-bit numbers (at most). This is too small for most scientific codes. CPU can do 64 bits. Also, if you use 32 bits on CPU it will be 4 times as fast as 64-bit (SSE extension). So in 32-bit mode, Pentium 4 is 28 times faster than the quoted GPU.

Finally, the length of the program. The reason matrix multiply was chosen is because it can be encoded in very short code: three simple loops. This fits well within the 128-instruction vertex code length. You don't have to keep reloading the code. More challenging codes will exceed the allowed vertex code length. The three-loop matrix multiply implementation stresses memory bandwidth. And CPU has MB/s and GPU has GB/s. No wonder GPU wins. But I can guess that without making any tests.
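For reference, the "three simple loops" version looks roughly like the sketch below. This is my own minimal Python illustration, not the article's code (which I don't have), and `matmul_naive` is a made-up name. Note that every innermost iteration reads one element of each input for a single multiply-add and does no blocking for cache reuse, which is why this form leans so heavily on memory bandwidth.

```python
def matmul_naive(a, b):
    """Textbook three-loop matrix multiply on lists of lists.
    No blocking or tiling, so operands are re-read from memory constantly."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):          # rows of a
        for j in range(p):      # columns of b
            for k in range(m):  # inner product
                c[i][j] += a[i][k] * b[k][j]
    return c

result = matmul_naive([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)
```

Tuned libraries like ATLAS restructure these loops into cache-sized blocks, which is where the gap between this code and 2/3 of peak comes from.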
