Can it do anything my Blackberry can't?
It runs the software that you write, and you don't even need any SDK for that. Out of the box it runs Lua, Python, Tcl, Octave, Scheme, gForth, Emacs-Lisp, Shell Script and who knows what else. There's even a GCC toolchain package available, if you need it. If you're satisfied with the software that vendors throw at you or allow you to obtain via their managed app-store, than maybe NanoNote is not made for you.
[..] To this day, the only thing I find lacking is multimedia players (and I especially miss Winamp).
Did you try the Audacious audio player? It was once forked from XMMS, which was a pretty good Winamp clone (for me anyways, haven't seen any Winamp since the year 2000
Ok sure, go ahead and run a 8k x 8k Linpack, tell me how that goes and how non-limited you are.
I guess if 8kx8k Linpack refers to matrix multiplication, I'm pretty sure that the Cell will perform at close to its 200GFLOP/s performance. Matrix multiplication can be really well broken down into block-matrix multiplications that nicely fit in the 256KB of available SRAM. Data transfer cost per block grows O(N^2), while FLOPs grow O(N^3). With 64x64 block matrixes you have to transfer 2*64*64 floats while having to compute 32 times as many multiply&adds. With 8 SPUs sharing the single XDR RAM you'll only have bandwidth to transfer 1/4 float per cycle and core. However, corresponding to that transfer rate, the core has to compute 32*1/4=8 madds per cycle, which happens to be twice the theoretical peak performance of the core. So no bandwidth problem at all. You'll be able to increase block size to 96x96 if you want to reduce bandwidth per FLOP even further. this paper claims close to peak performance for 2304x2304 sized matrices.
Sorry man, the Cell is fine for some things but the idea that it doesn't face the same realistic limits other hardware does is silly. You can talk all you like about high speed stuff on the cache, but that applies only for things that'll fit in there. When you have larger problem sets that have to go back and forth to main memory a lot, I'm afraid it isn't so fast.
You'll have these kind of losses on any architecture, including GPUs. What i'm saying is that the Cell is designed so that you'll almost always be able to reach close to 100% utilization of the available peak FLOP/s. That is something that you won't ever normally achieve on a GPU. On the other hand, GPUs have a much higher peak FLOP/s so loosing half of that due to bandwidth problem seems to be more acceptible.
Regardless my point was simply how unrealistic it is to call the thing a supercomputer. If a couple hundred GFLOPS makes a supercomputer then my GPU is a supercomputer.
Don't downplay what GPUs can do computationally either. They are the kings of Folding, yes ahead of the PS3. So long as your problem meets some requirements (highly parallel, single precision FP, fits in to GPU memory, not a lot of branching and when it branches everything branches the same direction) they scream. Is that all things? No, certainly not, you can find things they drag ass on. However the same happens with the Cell when compared to something like a Core i7. For some things, due to the SPEs the Cell is faster, however for others, due to constraints of the PPE it is slower.
Well it's not the constraints on the PPE that make the Cell slow. The Cell is just as fast as 200GPLOP/s can be. Modern GPUs are much faster than that, even if they cannot utilize all their horsepower. The downside is that GPUs rely on very different programming&compiler paradigms, where Cell was designed to mostly use multi-core, shared memory pradigms and standard C compilers with SIMD extensions, everything nicely managed by a standard MMU-utilizing operating system.
It is an interesting architecture and useful for some things, but it is not particularly impressive compared to other modern processors. Doesn't mean it is worthless, just that it is not "OMG this is so fast!".
It was that fast when it came out. Unfortunately development of the successors was cancelled by IBM, probably due to GPGPU programming eating its lunch.
The best they claim is 25.6 GFLOPS per cell in theoretical performance, so 205 GFLOPS is the best you theoretically get, if there are no bandwidth constraints (which there are on a PS3) for single precision math. Ok well testing my actual Radeon 5870, I get 800 GFLOPS for single precision, 227 for double precision. That is an actual benchmark of the card running on my desktop.
As somebody who programmed Cell CPUs for signal processing (including to, but not limited to PS3s), let me tell you that the PS3's memory bandwidth is so close to unlimited, that you usually don't have to think about it. At least as long as you move data only on the Element Interconnect Bus, between the 256KB local SRAMs of each CELL core, which is sufficient for most of what I did. It moves up to 200 giga bytes per second, maximum 16 bytes per 2 cycles in and out per core. The DMA engines that do those transfer have their own 1024bit (!) read/write port into the SRAM, so they burst 128 bytes per cycle into the SRAM, and don't have to steel many RAM cycles. The wikidedia article has more details.
In my experience, you can usually come pretty close to the 200 GFLOP/s of the Cell-CPU. When relying on C-Compiler with SIMD intrinsics, you usually manage 100 GPFOP/s for algorithms that have as many read/write opcodes as arithmetic opcodes. Smaller problems can mostly be handled on registers only (per CPU we have 128 16-byte registers!) and will run even faster.
Also note that many algorithms nowadays are not bandwidth but memory latency limited. Having the Cell's per-core DMA engines do background transfers to large local S-RAMs, mostly eliminates these latency problems and is much cleaner than relying on CPU caches guessing what parts of RAM to prefetch next. BTW these are user-space DMA engines that undergo page translation and are fully compatible to unix vm concepts. Still programming directly accesses DMA registers and doesn't need any kernel calls.
Try to do that with your GPU!
Darwin, the core of Mac OS X, is open source, for example, as well as Webkit, Apple's browser layout engine used in most browsers today, including Google Chrome and Android. And Grand Central Dispatch. And FaceTime. I could go on.
I won't say that this argument proves much about Apple's attitude towards openness. Webkit is based on KHTML which is LGPL and authorship not residing with Apple, so they absolutely had to open it up to satisfy the license.
The Darwin kernel is based on other free software work (mostly BSD?), BTW. No, the BSD license might not force Apple to open-source it, but it doesn't hurt so much, open-sourcing code that's freely available anyways.
1 picosecond (ps) is 10^(-12) secs. You can run a single instruction in a 1000 GHz CPU (please scale to your favourite multicore system) during 1 ps.
Actually you cannot even run a single instruction during 1ps. Modern CPUs are pipelined and make it look like an instruction takes one cycle. However, between loading of input, processing, and output, ca 20 cycles pass. It's just that with 20 instructions in the pipeline, you might get an effective throughput of 20 instructions per 20 cycles. And this does not even consider cache/memory and i/o access times.
Total I/O latency between signal arriving at PC, and PC answering may well be in the order of many 100 (if not 1000) cycles.
This restaurant was advertising breakfast any time. So I ordered french toast in the renaissance. - Steven Wright, comedian