Hardware

Submission + - Micro-SD Card Slot Abused as VGA-Port (qi-hardware.com)

dvdkhlng writes: The guy who did this calls it an "unexpected capability". The Ben NanoNote open-source hand-held computer has often been criticized for not being very extensible hardware-wise. A community effort now starts to challenge this by shipping the so-called UBB board, which plugs into the micro-SD slot and makes 6 I/O lines available to hardware hackers. The most impressive use so far is this VGA port, implemented with just a few resistors and with signal generation mostly controlled by software. Schematics and source code are available under the GPL.

Comment Re:Oh stop with the supercomputer bullshit (Score 1) 240

Ok sure, go ahead and run an 8k x 8k Linpack, tell me how that goes and how non-limited you are.

I guess if the 8k x 8k Linpack boils down to matrix multiplication, I'm pretty sure that the Cell will perform close to its 200 GFLOP/s peak. Matrix multiplication can be broken down very well into block-matrix multiplications that nicely fit into the 256 KB of available SRAM. Data-transfer cost per block grows as O(N^2), while FLOPs grow as O(N^3). With 64x64 blocks you have to transfer 2*64*64 input floats while computing 64^3 multiply-adds, i.e. 32 times as many madds as transferred floats. With 8 SPUs sharing the single XDR RAM, you'll only have bandwidth to transfer 1/4 float per cycle per core. However, to saturate that transfer rate the core would have to compute 32*1/4 = 8 madds per cycle, which happens to be twice the theoretical peak performance of the core. So no bandwidth problem at all. You can increase the block size to 96x96 if you want to reduce bandwidth per FLOP even further. This paper claims close to peak performance for 2304x2304 matrices.
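For what it's worth, here is that arithmetic as a tiny back-of-the-envelope C program (the 1/4 float per cycle per core and the 4 madds/cycle peak are the assumptions stated above, not measured values):

```c
#include <stdio.h>

/* Arithmetic-intensity check for block-matrix multiplication on an
 * SPU-like core: multiplying two BxB blocks transfers 2*B*B input floats
 * but performs B*B*B multiply-adds. */
int main(void)
{
    const double peak_madds_per_cycle = 4.0;   /* assumed SPU peak        */
    const double floats_per_cycle     = 0.25;  /* assumed XDR share/core  */

    for (int b = 32; b <= 128; b *= 2) {
        double floats_moved    = 2.0 * b * b;        /* two input blocks  */
        double madds           = (double)b * b * b;  /* one block product */
        double madds_per_float = madds / floats_moved;
        /* madd rate needed to keep up with the incoming data */
        double required = madds_per_float * floats_per_cycle;
        printf("block %3dx%-3d: %5.1f madds per transferred float, "
               "needs %4.1f madds/cycle vs. peak %.1f\n",
               b, b, madds_per_float, required, peak_madds_per_cycle);
    }
    return 0;
}
```

For 64x64 blocks this prints 32 madds per transferred float and a required rate of 8 madds/cycle against a peak of 4, i.e. the compute units saturate before the bandwidth does.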

Sorry man, the Cell is fine for some things but the idea that it doesn't face the same realistic limits other hardware does is silly. You can talk all you like about high speed stuff on the cache, but that applies only for things that'll fit in there. When you have larger problem sets that have to go back and forth to main memory a lot, I'm afraid it isn't so fast.

You'll have these kinds of losses on any architecture, including GPUs. What I'm saying is that the Cell is designed so that you'll almost always be able to reach close to 100% utilization of the available peak FLOP/s. That is something you won't normally achieve on a GPU. On the other hand, GPUs have a much higher peak FLOP/s, so losing half of that to bandwidth problems seems more acceptable.

Regardless, my point was simply how unrealistic it is to call the thing a supercomputer. If a couple hundred GFLOPS makes a supercomputer then my GPU is a supercomputer.

Don't downplay what GPUs can do computationally either. They are the kings of Folding, yes, ahead of the PS3. So long as your problem meets some requirements (highly parallel, single-precision FP, fits into GPU memory, not a lot of branching, and when it branches everything branches the same direction) they scream. Is that all things? No, certainly not, you can find things they drag ass on. However the same happens with the Cell when compared to something like a Core i7. For some things, due to the SPEs, the Cell is faster; for others, due to constraints of the PPE, it is slower.

Well, it's not the constraints of the PPE that make the Cell slow. The Cell is just as fast as 200 GFLOP/s can be. Modern GPUs are much faster than that, even if they cannot utilize all their horsepower. The downside is that GPUs rely on very different programming and compiler paradigms, whereas the Cell was designed to mostly use multi-core, shared-memory paradigms and standard C compilers with SIMD extensions, everything nicely managed by a standard MMU-utilizing operating system.

It is an interesting architecture and useful for some things, but it is not particularly impressive compared to other modern processors. Doesn't mean it is worthless, just that it is not "OMG this is so fast!".

It was that fast when it came out. Unfortunately, development of its successors was cancelled by IBM, probably because GPGPU programming was eating its lunch.

Comment Re:Oh stop with the supercomputer bullshit (Score 5, Informative) 240

The best they claim is 25.6 GFLOPS per cell in theoretical performance, so 205 GFLOPS is the best you theoretically get, if there are no bandwidth constraints (which there are on a PS3) for single precision math. Ok well testing my actual Radeon 5870, I get 800 GFLOPS for single precision, 227 for double precision. That is an actual benchmark of the card running on my desktop.

As somebody who has programmed Cell CPUs for signal processing (including, but not limited to, PS3s), let me tell you that the PS3's memory bandwidth is so close to unlimited that you usually don't have to think about it. At least as long as you move data only on the Element Interconnect Bus, between the 256 KB local SRAMs of the individual Cell cores, which was sufficient for most of what I did. It moves up to about 200 gigabytes per second, at most 16 bytes per 2 cycles in and out per core. The DMA engines that do those transfers have their own 1024-bit (!) read/write port into the local SRAM, so they burst 128 bytes per cycle into it and don't have to steal many SRAM cycles. The Wikipedia article has more details.

In my experience, you can usually come pretty close to the 200 GFLOP/s of the Cell CPU. When relying on the C compiler with SIMD intrinsics, you usually manage 100 GFLOP/s for algorithms that have as many read/write opcodes as arithmetic opcodes. Smaller problems can mostly be handled in registers only (each SPU has 128 16-byte registers!) and will run even faster.
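To give an idea of what "SIMD intrinsics" means here, a minimal sketch of such an inner loop (this assumes the Cell SDK's spu_intrinsics.h and only compiles with an SPU compiler such as spu-gcc; the function and buffer names are made up for the example):

```c
#include <spu_intrinsics.h>

/* Minimal SPU inner loop: y[i] += a * x[i] on local-store data,
 * 4 floats per vector, one fused multiply-add per vector operation. */
void saxpy_local(vector float *x, vector float *y, float a, int nvec)
{
    vector float va = spu_splats(a);        /* broadcast scalar to 4 lanes */
    for (int i = 0; i < nvec; i++)
        y[i] = spu_madd(va, x[i], y[i]);    /* va*x[i] + y[i]              */
}
```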

Also note that many algorithms nowadays are not bandwidth- but memory-latency-limited. Having the Cell's per-core DMA engines do background transfers into the large local SRAMs mostly eliminates these latency problems, and it is much cleaner than relying on CPU caches guessing which parts of RAM to prefetch next. BTW, these are user-space DMA engines that undergo page translation and are fully compatible with Unix VM concepts. Still, programs access the DMA registers directly and don't need any kernel calls.
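And roughly what the DMA-driven latency hiding looks like from the SPU side, as a double-buffering sketch (this assumes the SDK's spu_mfcio.h calls; the chunk size, tag handling and the process() callback are illustrative, and alignment/error handling are glossed over):

```c
#include <spu_mfcio.h>

#define CHUNK 16384                       /* bytes per DMA transfer       */
static char buf[2][CHUNK] __attribute__((aligned(128)));

/* Double buffering: while chunk i is processed out of local store, the
 * per-core DMA engine is already fetching chunk i+1 from main memory.   */
void stream(unsigned long long ea, int nchunks, void (*process)(char *, int))
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);            /* prefetch first */
    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                            /* kick off next  */
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);                   /* wait only for  */
        mfc_read_tag_status_all();                      /* current chunk  */
        process(buf[cur], CHUNK);      /* compute overlaps with next DMA  */
        cur = nxt;
    }
}
```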

Try to do that with your GPU!

Comment Re:At some point, it's just bashing... (Score 1) 120

Darwin, the core of Mac OS X, is open source, for example, as well as Webkit, Apple's browser layout engine used in most browsers today, including Google Chrome and Android. And Grand Central Dispatch. And FaceTime. I could go on.

I won't say that this argument proves much about Apple's attitude towards openness. WebKit is based on KHTML, which is LGPL-licensed with authorship not residing with Apple, so they absolutely had to open it up to satisfy the license.

The Darwin kernel is based on other free software work (mostly BSD?), BTW. Granted, the BSD license doesn't force Apple to open-source it, but it doesn't hurt much to open-source code that's freely available anyway.

Comment Re:It's all about maths, you insensitive clod! (Score 1) 448

1 picosecond (ps) is 10^(-12) secs. You can run a single instruction in a 1000 GHz CPU (please scale to your favourite multicore system) during 1 ps.

Actually, you cannot even run a single instruction during 1 ps. Modern CPUs are pipelined and only make it look like an instruction takes one cycle. Between loading the inputs, processing, and writing the output, roughly 20 cycles pass; it's just that with 20 instructions in flight in the pipeline, you get an effective throughput of 20 instructions per 20 cycles. And this does not even consider cache/memory and I/O access times.
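One way to see the latency-vs-throughput distinction on any pipelined CPU is to compare a dependent chain of additions with several independent chains (a toy sketch, not a rigorous benchmark; absolute numbers depend on the machine, and aggressive compiler optimization can blur the picture):

```c
#include <stdio.h>
#include <time.h>

static volatile double sink;   /* keeps the compiler from dropping calls */

/* Each addition depends on the previous one, so the loop runs at the
 * add latency, not at the pipelined throughput. */
static double dependent_chain(long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s = s + 1.0;
    return s;
}

/* Four independent accumulators let the pipeline overlap the additions. */
static double independent_chains(long n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (long i = 0; i < n; i += 4) {
        s0 += 1.0; s1 += 1.0; s2 += 1.0; s3 += 1.0;
    }
    return s0 + s1 + s2 + s3;
}

static double secs(double (*f)(long), long n)
{
    clock_t t0 = clock();
    sink = f(n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    long n = 1L << 26;
    printf("dependent chain:    %.3f s\n", secs(dependent_chain, n));
    printf("independent chains: %.3f s\n", secs(independent_chains, n));
    return 0;
}
```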

Total I/O latency between a signal arriving at the PC and the PC answering may well be on the order of many hundreds (if not thousands) of cycles.

Comment Re:Probably Wrong but Clearly Falsifiable (Score 1) 700

I don't have any specific sources, other than the original paper linked by TFA. For my statements about using a trellis algorithm, try any book on coding theory. Unfortunately, the Wikipedia article on trellises is not very helpful.

The more general idea behind this is dynamic programming. If you have one equation over N boolean variables, you can brute-force it in 2^N steps. If you have M equations of N variables each, where each equation shares variables only with its neighbouring equations, you can determine the solutions of every equation on its own in M*2^N steps and save those (partial) solutions. Then, to find a solution that satisfies all equations, you just have to find a "path" through all equations from left to right. The resulting graph has M*2^N nodes; its edges connect partial solutions of neighbouring equations that do not collide (i.e. where the variables shared by the two solutions take the same values). Since only neighbouring equations overlap in their variables, the number of edges per node is some (not too large) constant. Finding such a path is now straightforward and takes polynomial effort. See also here.
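A toy version of that left-to-right sweep, assuming (purely for illustration) a chain where equation i covers variables (x_i, x_{i+1}, x_{i+2}), so consecutive equations overlap in two variables; each equation is given as an 8-entry truth table:

```c
#include <stdio.h>
#include <string.h>

#define M 5                      /* number of equations                    */
#define NVARS (M + 2)            /* variables x0 .. x_{M+1}                */

int main(void)
{
    /* tt[i][a] == 1 iff local assignment a = (x_i, x_{i+1}, x_{i+2})
     * (x_i in bit 0) satisfies equation i.  Example constraint: no window
     * of three consecutive variables may be all-equal.                    */
    unsigned char tt[M][8];
    for (int i = 0; i < M; i++)
        for (int a = 0; a < 8; a++)
            tt[i][a] = (a != 0 && a != 7);

    /* reach[a] == 1 iff equations 0..i can all be satisfied by some
     * assignment whose window for equation i is a.                        */
    unsigned char reach[8], next[8];
    memcpy(reach, tt[0], sizeof reach);

    for (int i = 1; i < M; i++) {                 /* sweep left to right   */
        memset(next, 0, sizeof next);
        for (int a = 0; a < 8; a++) {
            if (!reach[a]) continue;
            for (int b = 0; b < 8; b++)
                /* windows must agree on the two shared variables          */
                if ((a >> 1) == (b & 3) && tt[i][b])
                    next[b] = 1;
        }
        memcpy(reach, next, sizeof reach);
    }

    int sat = 0;
    for (int a = 0; a < 8; a++)
        sat |= reach[a];
    printf("chain of %d equations over %d variables: %s\n",
           M, NVARS, sat ? "satisfiable" : "unsatisfiable");
    return 0;
}
```

Total work here is on the order of M*2^N*2^N table lookups, i.e. linear in the number of equations for fixed N.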

So for the triplets in this paper N is 3, so 2^N is a constant, leaving no exponential in the computational cost function. I hoped other people could comment on their impressions of the paper, but as this is /., RTFA is not so common, I guess :)

Comment Re:Probably Wrong but Clearly Falsifiable (Score 2) 700

This is not a P=NP paper. The paper solves a problem of a related data structure in polynomial time (quartic time), then shows that it can be used to solve some cases of 3SAT. The 3 outputs the algorithm can give are "the formula is not satisfiable", "the formula is satisfiable" (and the solution is given), and "failure of classification" -- it couldn't solve the problem. The important question we wait for on the experts on this paper isn't "is it correct" (it probably is) but "how effective is it".

In fact it does suffice to show that the algorithm determines satisfiability of a 3-SAT instance in polynomial time. Create a derived 3-SAT instance that adds a clause restricting one variable to the value '1'. Then see whether it is still satisfiable. It is not? OK, constrain that variable to '0' instead, and add a constraint for the next variable. Is it still satisfiable? And so on. You see the pattern? For n variables it only takes n extra satisfiability tests to turn the 3-SAT decision algorithm into a 3-SAT solver. Or in CS terms: the 3-SAT solving (search) problem is polynomial-time reducible to the 3-SAT decision problem.
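A minimal sketch of that self-reduction in C (the brute-force decide() below is only a stand-in so the example is runnable; the argument above assumes it would be replaced by the paper's polynomial-time test, and the added constraints are written as unit clauses rather than padded 3-literal clauses for brevity):

```c
#include <stdio.h>
#include <stdlib.h>

#define MAXC 64
typedef struct { int lit[3]; int len; } Clause;    /* literal +v or -v     */
typedef struct { Clause c[MAXC]; int m, n; } Cnf;  /* m clauses, n vars    */

/* Does the bit vector 'assign' (bit v-1 = value of variable v) satisfy f? */
static int sat_under(const Cnf *f, unsigned assign)
{
    for (int i = 0; i < f->m; i++) {
        int ok = 0;
        for (int j = 0; j < f->c[i].len; j++) {
            int v = abs(f->c[i].lit[j]);
            int val = (assign >> (v - 1)) & 1;
            if ((f->c[i].lit[j] > 0) == (val == 1)) ok = 1;
        }
        if (!ok) return 0;
    }
    return 1;
}

/* Decision procedure: "is f satisfiable?" (naive O(2^n) stand-in).        */
static int decide(const Cnf *f)
{
    for (unsigned a = 0; a < (1u << f->n); a++)
        if (sat_under(f, a)) return 1;
    return 0;
}

/* Turn the decision procedure into a solver with n extra decide() calls:
 * fix variables one by one; if forcing v=1 kills satisfiability, v must
 * be 0 in every remaining solution. */
static int solve(Cnf f, unsigned *assign)
{
    if (!decide(&f)) return 0;
    *assign = 0;
    for (int v = 1; v <= f.n; v++) {
        Clause unit = { { v }, 1 };                 /* constrain v = 1     */
        f.c[f.m++] = unit;
        if (decide(&f))
            *assign |= 1u << (v - 1);
        else
            f.c[f.m - 1].lit[0] = -v;               /* constrain v = 0     */
    }
    return 1;
}

int main(void)
{
    /* (x1 v x2 v x3) & (~x1 v ~x2 v x3) & (~x3 v x2 v x1)                 */
    Cnf f = { { { {1, 2, 3}, 3 }, { {-1, -2, 3}, 3 }, { {-3, 2, 1}, 3 } }, 3, 3 };
    unsigned a;
    if (solve(f, &a))
        printf("satisfiable: x1=%u x2=%u x3=%u\n",
               a & 1, (a >> 1) & 1, (a >> 2) & 1);
    else
        printf("unsatisfiable\n");
    return 0;
}
```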

Comment Re:I'll be first to say WTF (Score 1) 700

There are so many errors in your comment that I almost don't know where to start:

Comment Re:Probably Wrong but Clearly Falsifiable (Score 5, Interesting) 700

Maybe I'm overlooking something, but to me it looks like they're doing the reduction to a polynomial-time problem already at the very beginning of the paper (I guess if there is a fault, that's where it hides). As soon as they go to the "compact triplet" structure, the instance of 3-SAT is polynomial-time solvable using a trellis algorithm. Yes, very similar to the algorithm that is employed to decode convolutional codes.

In fact they're decomposing the initial 3-SAT problem into multiple "compact triplet" 3-SAT problems intersected using an AND operation. But as these intersected 3-SAT formulas use the same variables, without any interleaving (permutation) applied, the trellis algorithm still applies (just like decoding a convolutional encoder with more than one check bit per input bit).

Thinking once more about that: the compact triplet structure is clearly not general enough to express generic 3-SAT problems. This is like attempting to transform a quadratic optimization problem x^T*H*x involving a symmetric matrix H into a corresponding problem with a tri-diagonal matrix H.

The only way I see they could do the transform is by introducing exponentially many helper variables, which would bring the exponential blow-up right back. But it does not look like they're attempting something like that.
