dvdkhlng - Slashdot User

Comment Re:Digitask (Score 1) 104

by dvdkhlng on Tuesday October 11, 2011 @11:05AM (#37679502) Attached to: German State Confesses To, Downplays Government Spyware

Comment Re:its not selling well (Score 1) 83

by dvdkhlng on Saturday June 18, 2011 @02:57PM (#36486438) Attached to: NanoNote Goes Wireless

Submission + - NanoNote goes Wireless (qi-hardware.com)

Submitted by dvdkhlng on Saturday June 18, 2011 @06:56AM

Ask Amir Taaki About Bitcoin 768

Posted by timothy on Tuesday June 14, 2011 @11:06AM from the does-george-selgin-approve? dept.

Italy Votes To Abandon Nuclear Power 848

Posted by Soulskill on Tuesday June 14, 2011 @10:25AM from the giving-progress-the-boot dept.

Submission + - A Free and Open Replacement for Wireless LAN (qi-hardware.com) 3

Submitted by dvdkhlng on Tuesday June 14, 2011 @03:47AM

Comment Re:ok (Score 2) 99

by dvdkhlng on Wednesday May 11, 2011 @04:04AM (#36091156) Attached to: Consumer Device With Open CPU Out of Beta Soon

Comment Re:One right here! (Score 1) 441

by dvdkhlng on Monday May 09, 2011 @05:20PM (#36076274) Attached to: Ubuntu Aims For 200 Million Users In Four Years

Submission + - Consumer device with open CPU out of beta soon (milkymist.org)

Submitted by lekernel on Monday May 09, 2011 @01:07PM

Comment Re:Great-grandson of "Cheap Video Cookbook" (Score 3, Informative) 77

by dvdkhlng on Saturday May 07, 2011 @12:34PM (#36056860) Attached to: Micro-SD Card Slot Abused As VGA-Port

Submission + - Micro-SD Card Slot Abused as VGA-Port (qi-hardware.com)

Submitted by dvdkhlng on Saturday May 07, 2011 @05:45AM

Comment Re:Oh stop with the supercomputer bullshit (Score 1) 240

by dvdkhlng on Thursday May 05, 2011 @06:00PM (#36041798) Attached to: Gitbrew Releases OtherOS++ PS3 Linux Dual Boot

Ok sure, go ahead and run a 8k x 8k Linpack, tell me how that goes and how non-limited you are.

If 8k x 8k Linpack refers to matrix multiplication, I'm pretty sure that the Cell will perform at close to its 200 GFLOP/s peak. Matrix multiplication breaks down really well into block-matrix multiplications that fit nicely in the 256KB of available SRAM. Data transfer cost per block grows as O(N^2), while FLOPs grow as O(N^3). With 64x64 block matrices you have to transfer 2*64*64 floats while having to compute 32 times as many multiply-adds. With 8 SPUs sharing the single XDR RAM you'll only have bandwidth to transfer 1/4 float per cycle and core. However, to keep up with that transfer rate, the core would have to compute 32*1/4=8 madds per cycle, which happens to be twice the theoretical peak performance of the core. So no bandwidth problem at all. You can increase the block size to 96x96 if you want to reduce bandwidth per FLOP even further. This paper claims close to peak performance for 2304x2304 matrices.
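
The block-size arithmetic above can be checked with a short sketch. It assumes 4-byte single-precision floats, a peak of 4 multiply-adds per cycle per SPU (implied by "8 madds is twice the peak"), and the 1/4 float per cycle and core bandwidth share stated above. The function and constant names are made up for illustration.

```python
# Back-of-the-envelope check of the block-matrix bandwidth argument.
# Assumptions (not measurements): 4-byte floats, a peak of 4 madds per
# cycle per SPU, and 1/4 float per cycle per SPU of main-memory bandwidth.

FLOAT_BYTES = 4
LOCAL_STORE_BYTES = 256 * 1024
PEAK_MADDS_PER_CYCLE = 4.0
FLOATS_PER_CYCLE = 0.25


def madds_per_float(block):
    """Multiply-adds performed per float fetched for one block product."""
    madds = block ** 3           # (block x block) times (block x block)
    floats_in = 2 * block ** 2   # the two input blocks
    return madds / floats_in


def sustainable_madds_per_cycle(block):
    """Madd rate that the memory bandwidth could keep fed at this block size."""
    return madds_per_float(block) * FLOATS_PER_CYCLE


def working_set_bytes(block):
    """Two input blocks plus one output block held in local store."""
    return 3 * block ** 2 * FLOAT_BYTES


for block in (64, 96):
    fed = sustainable_madds_per_cycle(block)
    print(block, madds_per_float(block), fed,
          fed >= PEAK_MADDS_PER_CYCLE,
          working_set_bytes(block) <= LOCAL_STORE_BYTES)
```

As long as the sustainable rate stays above the assumed peak, the core and not the memory is the bottleneck. The working-set figure ignores code size and any double buffering, so it is only a rough lower bound on local store usage.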

Sorry man, the Cell is fine for some things but the idea that it doesn't face the same realistic limits other hardware does is silly. You can talk all you like about high speed stuff on the cache, but that applies only for things that'll fit in there. When you have larger problem sets that have to go back and forth to main memory a lot, I'm afraid it isn't so fast.

You'll have this kind of loss on any architecture, including GPUs. What I'm saying is that the Cell is designed so that you'll almost always be able to reach close to 100% utilization of the available peak FLOP/s. That is something that you won't normally achieve on a GPU. On the other hand, GPUs have a much higher peak FLOP/s, so losing half of that to bandwidth problems seems more acceptable.

Regardless my point was simply how unrealistic it is to call the thing a supercomputer. If a couple hundred GFLOPS makes a supercomputer then my GPU is a supercomputer.

Don't downplay what GPUs can do computationally either. They are the kings of Folding, yes ahead of the PS3. So long as your problem meets some requirements (highly parallel, single precision FP, fits in to GPU memory, not a lot of branching and when it branches everything branches the same direction) they scream. Is that all things? No, certainly not, you can find things they drag ass on. However the same happens with the Cell when compared to something like a Core i7. For some things, due to the SPEs the Cell is faster, however for others, due to constraints of the PPE it is slower.

Well, it's not the constraints of the PPE that make the Cell slow. The Cell is just as fast as 200 GFLOP/s can be. Modern GPUs are much faster than that, even if they cannot utilize all their horsepower. The downside is that GPUs rely on very different programming and compiler paradigms, whereas the Cell was designed to mostly use multi-core, shared-memory paradigms and standard C compilers with SIMD extensions, everything nicely managed by a standard MMU-utilizing operating system.

It is an interesting architecture and useful for some things, but it is not particularly impressive compared to other modern processors. Doesn't mean it is worthless, just that it is not "OMG this is so fast!".

It was that fast when it came out. Unfortunately development of the successors was cancelled by IBM, probably due to GPGPU programming eating its lunch.

Comment Re:Oh stop with the supercomputer bullshit (Score 5, Informative) 240

by dvdkhlng on Thursday May 05, 2011 @10:46AM (#36035236) Attached to: Gitbrew Releases OtherOS++ PS3 Linux Dual Boot

The best they claim is 25.6 GFLOPS per cell in theoretical performance, so 205 GFLOPS is the best you theoretically get, if there are no bandwidth constraints (which there are on a PS3) for single precision math. Ok well testing my actual Radeon 5870, I get 800 GFLOPS for single precision, 227 for double precision. That is an actual benchmark of the card running on my desktop.

As somebody who programmed Cell CPUs for signal processing (including, but not limited to, PS3s), let me tell you that the PS3's memory bandwidth is so close to unlimited that you usually don't have to think about it. At least as long as you move data only on the Element Interconnect Bus, between the 256KB local SRAMs of each Cell core, which is sufficient for most of what I did. It moves up to 200 gigabytes per second, with a maximum of 16 bytes per 2 cycles in and out per core. The DMA engines that do those transfers have their own 1024-bit (!) read/write port into the SRAM, so they burst 128 bytes per cycle into the SRAM and don't have to steal many RAM cycles. The Wikipedia article has more details.

In my experience, you can usually come pretty close to the 200 GFLOP/s of the Cell CPU. When relying on a C compiler with SIMD intrinsics, you usually manage 100 GFLOP/s for algorithms that have as many read/write opcodes as arithmetic opcodes. Smaller problems can mostly be handled in registers only (per core we have 128 16-byte registers!) and will run even faster.
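
For reference, the roughly 200 GFLOP/s peak quoted in this thread can be reproduced from a few commonly cited Cell figures. The 3.2 GHz clock, the 4-wide single-precision SIMD unit and counting a fused multiply-add as 2 FLOPs are assumptions brought in for this sketch, not numbers from the comment itself.

```python
# Rough reconstruction of the single-precision peak figures.
# Assumptions: 3.2 GHz clock, 4 SIMD lanes per core (16-byte registers
# holding 4 floats), one fused multiply-add per lane per cycle counted
# as 2 FLOPs, and 8 SIMD cores.

CLOCK_HZ = 3.2e9
SIMD_LANES = 4
FLOPS_PER_MADD = 2
CORE_COUNT = 8


def peak_gflops_per_core():
    """Peak single-precision GFLOP/s of one SIMD core."""
    return CLOCK_HZ * SIMD_LANES * FLOPS_PER_MADD / 1e9


def peak_gflops_total():
    """Peak single-precision GFLOP/s summed over all SIMD cores."""
    return CORE_COUNT * peak_gflops_per_core()


print(peak_gflops_per_core(), peak_gflops_total())
```

Under these assumptions, the 100 GFLOP/s mentioned above for load/store-heavy code corresponds to about half of the summed peak.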

Also note that many algorithms nowadays are limited not by bandwidth but by memory latency. Having the Cell's per-core DMA engines do background transfers to large local SRAMs mostly eliminates these latency problems and is much cleaner than relying on CPU caches guessing which parts of RAM to prefetch next. BTW, these are user-space DMA engines that undergo page translation and are fully compatible with Unix VM concepts. Still, programs access the DMA registers directly and don't need any kernel calls.

Try to do that with your GPU!

Comment Re:At some point, it's just bashing... (Score 1) 120

by dvdkhlng on Tuesday April 26, 2011 @09:51AM (#35941410) Attached to: Google Announces WebM Community Cross Licensing

Comment Re:It's all about maths, you insensitive clod! (Score 1) 448

by dvdkhlng on Friday March 04, 2011 @06:24AM (#35377798) Attached to: Contemplating Financial Trading At Picosecond Resolution
