Ok sure, go ahead and run a 8k x 8k Linpack, tell me how that goes and how non-limited you are.
I guess if 8kx8k Linpack refers to matrix multiplication, I'm pretty sure that the Cell will perform at close to its 200GFLOP/s performance. Matrix multiplication can be really well broken down into block-matrix multiplications that nicely fit in the 256KB of available SRAM. Data transfer cost per block grows O(N^2), while FLOPs grow O(N^3). With 64x64 block matrixes you have to transfer 2*64*64 floats while having to compute 32 times as many multiply&adds. With 8 SPUs sharing the single XDR RAM you'll only have bandwidth to transfer 1/4 float per cycle and core. However, corresponding to that transfer rate, the core has to compute 32*1/4=8 madds per cycle, which happens to be twice the theoretical peak performance of the core. So no bandwidth problem at all. You'll be able to increase block size to 96x96 if you want to reduce bandwidth per FLOP even further. this paper claims close to peak performance for 2304x2304 sized matrices.
Sorry man, the Cell is fine for some things but the idea that it doesn't face the same realistic limits other hardware does is silly. You can talk all you like about high speed stuff on the cache, but that applies only for things that'll fit in there. When you have larger problem sets
that have to go back and forth to main memory a lot, I'm afraid it isn't so fast.
You'll have these kind of losses on any architecture, including GPUs. What i'm saying is that the Cell is designed so that you'll almost always be able to reach close to 100% utilization of the available peak FLOP/s. That is something that you won't ever normally achieve on a GPU. On the other hand, GPUs have a much higher peak FLOP/s so loosing half of that due to bandwidth problem seems to be more acceptible.
Regardless my point was simply how unrealistic it is to call the thing a supercomputer. If a couple hundred GFLOPS makes a supercomputer then my GPU is a supercomputer.
Don't downplay what GPUs can do computationally either. They are the kings of Folding, yes ahead of the PS3. So long as your problem meets some requirements (highly parallel, single precision FP, fits in to GPU memory, not a lot of branching and when it branches everything branches the same direction) they scream. Is that all things? No, certainly not, you can find things they drag ass on. However the same happens with the Cell when compared to something like a Core i7. For some things, due to the SPEs the Cell is faster, however for others, due to constraints of the PPE it is slower.
Well it's not the constraints on the PPE that make the Cell slow. The Cell is just as fast as 200GPLOP/s can be. Modern GPUs are much faster than that, even if they cannot utilize all their horsepower. The downside is that GPUs rely on very different programming&compiler paradigms, where Cell was designed to mostly use multi-core, shared memory pradigms and standard C compilers with SIMD extensions, everything nicely managed by a standard MMU-utilizing operating system.
It is an interesting architecture and useful for some things, but it is not particularly impressive compared to other modern processors. Doesn't mean it is worthless, just that it is not "OMG this is so fast!".
It was that fast when it came out. Unfortunately development of the successors was cancelled by IBM, probably due to GPGPU programming eating its lunch.