Hardware

Submission + - Micro-SD Card Slot Abused as VGA-Port (qi-hardware.com)

dvdkhlng writes: The guy who did this calls it an "unexpected capability". The Ben NanoNote open-source hand-held computer has often been criticized for not being very extensible hardware-wise. A community effort now starts to challenge this by shipping the so-called UBB board, which plugs into the micro-SD slot and makes 6 I/O lines available to hardware hackers. The most impressive use so far is this VGA port, implemented with just a few resistors and with signal generation mostly controlled by software. Schematics and source code are available under the GPL.

Comment Re:Oh stop with the supercomputer bullshit (Score 1) 240

Ok sure, go ahead and run an 8k x 8k Linpack, tell me how that goes and how non-limited you are.

I guess if the 8k x 8k Linpack boils down to matrix multiplication, I'm pretty sure that the Cell will perform close to its 200 GFLOP/s peak. Matrix multiplication can be broken down very well into block-matrix multiplications that nicely fit into the 256 KB of available SRAM. Data-transfer cost per block grows as O(N^2), while FLOPs grow as O(N^3). With 64x64 blocks you have to transfer 2*64*64 input floats while computing 64^3 multiply-adds, i.e. 32 times as many madds as transferred floats. With 8 SPUs sharing the single XDR RAM, you'll only have bandwidth to transfer 1/4 float per cycle per core. However, to saturate that transfer rate the core would have to compute 32*1/4 = 8 madds per cycle, which happens to be twice the theoretical peak performance of the core. So no bandwidth problem at all. You can increase the block size to 96x96 if you want to reduce bandwidth per FLOP even further. This paper claims close to peak performance for 2304x2304 matrices.
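For what it's worth, here is that arithmetic as a tiny back-of-the-envelope C program (the 1/4 float per cycle per core and the 4 madds/cycle peak are the assumptions stated above, not measured values):

```c
#include <stdio.h>

/* Arithmetic-intensity check for block-matrix multiplication on an
 * SPU-like core: multiplying two BxB blocks transfers 2*B*B input floats
 * but performs B*B*B multiply-adds. */
int main(void)
{
    const double peak_madds_per_cycle = 4.0;   /* assumed SPU peak        */
    const double floats_per_cycle     = 0.25;  /* assumed XDR share/core  */

    for (int b = 32; b <= 128; b *= 2) {
        double floats_moved    = 2.0 * b * b;        /* two input blocks  */
        double madds           = (double)b * b * b;  /* one block product */
        double madds_per_float = madds / floats_moved;
        /* madd rate needed to keep up with the incoming data */
        double required = madds_per_float * floats_per_cycle;
        printf("block %3dx%-3d: %5.1f madds per transferred float, "
               "needs %4.1f madds/cycle vs. peak %.1f\n",
               b, b, madds_per_float, required, peak_madds_per_cycle);
    }
    return 0;
}
```

For 64x64 blocks this prints 32 madds per transferred float and a required rate of 8 madds/cycle against a peak of 4, i.e. the compute units saturate before the bandwidth does.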

Sorry man, the Cell is fine for some things but the idea that it doesn't face the same realistic limits other hardware does is silly. You can talk all you like about high speed stuff on the cache, but that applies only for things that'll fit in there. When you have larger problem sets that have to go back and forth to main memory a lot, I'm afraid it isn't so fast.

You'll have these kinds of losses on any architecture, including GPUs. What I'm saying is that the Cell is designed so that you'll almost always be able to reach close to 100% utilization of the available peak FLOP/s. That is something you won't normally achieve on a GPU. On the other hand, GPUs have a much higher peak FLOP/s, so losing half of that to bandwidth problems seems more acceptable.

Regardless, my point was simply how unrealistic it is to call the thing a supercomputer. If a couple hundred GFLOPS makes a supercomputer then my GPU is a supercomputer.

Don't downplay what GPUs can do computationally either. They are the kings of Folding, yes, ahead of the PS3. So long as your problem meets some requirements (highly parallel, single-precision FP, fits into GPU memory, not a lot of branching, and when it branches everything branches the same direction) they scream. Is that all things? No, certainly not, you can find things they drag ass on. However the same happens with the Cell when compared to something like a Core i7. For some things, due to the SPEs, the Cell is faster; for others, due to constraints of the PPE, it is slower.

Well, it's not the constraints of the PPE that make the Cell slow. The Cell is just as fast as 200 GFLOP/s can be. Modern GPUs are much faster than that, even if they cannot utilize all their horsepower. The downside is that GPUs rely on very different programming and compiler paradigms, whereas the Cell was designed to mostly use multi-core, shared-memory paradigms and standard C compilers with SIMD extensions, everything nicely managed by a standard MMU-utilizing operating system.

It is an interesting architecture and useful for some things, but it is not particularly impressive compared to other modern processors. Doesn't mean it is worthless, just that it is not "OMG this is so fast!".

It was that fast when it came out. Unfortunately, development of its successors was cancelled by IBM, probably because GPGPU programming was eating its lunch.

Comment Re:Oh stop with the supercomputer bullshit (Score 5, Informative) 240

The best they claim is 25.6 GFLOPS per cell in theoretical performance, so 205 GFLOPS is the best you theoretically get, if there are no bandwidth constraints (which there are on a PS3) for single precision math. Ok well testing my actual Radeon 5870, I get 800 GFLOPS for single precision, 227 for double precision. That is an actual benchmark of the card running on my desktop.

As somebody who has programmed Cell CPUs for signal processing (including, but not limited to, PS3s), let me tell you that the PS3's memory bandwidth is so close to unlimited that you usually don't have to think about it. At least as long as you move data only on the Element Interconnect Bus, between the 256 KB local SRAMs of the individual Cell cores, which was sufficient for most of what I did. It moves up to about 200 gigabytes per second, at most 16 bytes per 2 cycles in and out per core. The DMA engines that do those transfers have their own 1024-bit (!) read/write port into the local SRAM, so they burst 128 bytes per cycle into it and don't have to steal many SRAM cycles. The Wikipedia article has more details.

In my experience, you can usually come pretty close to the 200 GFLOP/s of the Cell CPU. When relying on the C compiler with SIMD intrinsics, you usually manage 100 GFLOP/s for algorithms that have as many read/write opcodes as arithmetic opcodes. Smaller problems can mostly be handled in registers only (each SPU has 128 16-byte registers!) and will run even faster.
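To give an idea of what "SIMD intrinsics" means here, a minimal sketch of such an inner loop (this assumes the Cell SDK's spu_intrinsics.h and only compiles with an SPU compiler such as spu-gcc; the function and buffer names are made up for the example):

```c
#include <spu_intrinsics.h>

/* Minimal SPU inner loop: y[i] += a * x[i] on local-store data,
 * 4 floats per vector, one fused multiply-add per vector operation. */
void saxpy_local(vector float *x, vector float *y, float a, int nvec)
{
    vector float va = spu_splats(a);        /* broadcast scalar to 4 lanes */
    for (int i = 0; i < nvec; i++)
        y[i] = spu_madd(va, x[i], y[i]);    /* va*x[i] + y[i]              */
}
```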

Also note that many algorithms nowadays are not bandwidth- but memory-latency-limited. Having the Cell's per-core DMA engines do background transfers into the large local SRAMs mostly eliminates these latency problems, and it is much cleaner than relying on CPU caches guessing which parts of RAM to prefetch next. BTW, these are user-space DMA engines that undergo page translation and are fully compatible with Unix VM concepts. Still, programs access the DMA registers directly and don't need any kernel calls.
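And roughly what the DMA-driven latency hiding looks like from the SPU side, as a double-buffering sketch (this assumes the SDK's spu_mfcio.h calls; the chunk size, tag handling and the process() callback are illustrative, and alignment/error handling are glossed over):

```c
#include <spu_mfcio.h>

#define CHUNK 16384                       /* bytes per DMA transfer       */
static char buf[2][CHUNK] __attribute__((aligned(128)));

/* Double buffering: while chunk i is processed out of local store, the
 * per-core DMA engine is already fetching chunk i+1 from main memory.   */
void stream(unsigned long long ea, int nchunks, void (*process)(char *, int))
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);            /* prefetch first */
    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                            /* kick off next  */
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);                   /* wait only for  */
        mfc_read_tag_status_all();                      /* current chunk  */
        process(buf[cur], CHUNK);      /* compute overlaps with next DMA  */
        cur = nxt;
    }
}
```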

Try to do that with your GPU!

Comment Re:At some point, it's just bashing... (Score 1) 120

Darwin, the core of Mac OS X, is open source, for example, as well as Webkit, Apple's browser layout engine used in most browsers today, including Google Chrome and Android. And Grand Central Dispatch. And FaceTime. I could go on.

I won't say that this argument proves much about Apple's attitude towards openness. WebKit is based on KHTML, which is LGPL-licensed with authorship not residing with Apple, so they absolutely had to open it up to satisfy the license.

The Darwin kernel is based on other free software work (mostly BSD?), BTW. Granted, the BSD license doesn't force Apple to open-source it, but it doesn't hurt much to open-source code that's freely available anyway.

Comment Re:It's all about maths, you insensitive clod! (Score 1) 448

1 picosecond (ps) is 10^(-12) secs. You can run a single instruction in a 1000 GHz CPU (please scale to your favourite multicore system) during 1 ps.

Actually, you cannot even run a single instruction during 1 ps. Modern CPUs are pipelined and only make it look like an instruction takes one cycle. Between loading the inputs, processing, and writing the output, roughly 20 cycles pass; it's just that with 20 instructions in flight in the pipeline, you get an effective throughput of 20 instructions per 20 cycles. And this does not even consider cache/memory and I/O access times.
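One way to see the latency-vs-throughput distinction on any pipelined CPU is to compare a dependent chain of additions with several independent chains (a toy sketch, not a rigorous benchmark; absolute numbers depend on the machine, and aggressive compiler optimization can blur the picture):

```c
#include <stdio.h>
#include <time.h>

static volatile double sink;   /* keeps the compiler from dropping calls */

/* Each addition depends on the previous one, so the loop runs at the
 * add latency, not at the pipelined throughput. */
static double dependent_chain(long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s = s + 1.0;
    return s;
}

/* Four independent accumulators let the pipeline overlap the additions. */
static double independent_chains(long n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (long i = 0; i < n; i += 4) {
        s0 += 1.0; s1 += 1.0; s2 += 1.0; s3 += 1.0;
    }
    return s0 + s1 + s2 + s3;
}

static double secs(double (*f)(long), long n)
{
    clock_t t0 = clock();
    sink = f(n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    long n = 1L << 26;
    printf("dependent chain:    %.3f s\n", secs(dependent_chain, n));
    printf("independent chains: %.3f s\n", secs(independent_chains, n));
    return 0;
}
```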

Total I/O latency between a signal arriving at the PC and the PC answering may well be on the order of many hundreds (if not thousands) of cycles.

Comment Re:Probably Wrong but Clearly Falsifiable (Score 1) 700

I don't have any specific sources, other than the original paper linked by TFA. For my statements about using a trellis algorithm, try any book on coding theory. Unfortunately, the Wikipedia article on trellises is not very helpful.

The more general idea behind this is dynamic programming. If you have one equation over N boolean variables, you can brute-force it in 2^N steps. If you have M equations of N variables each, where each equation shares variables only with its neighbouring equations, you can determine the solutions of every equation on its own in M*2^N steps and save those (partial) solutions. Then, to find a solution that satisfies all equations, you just have to find a "path" through all equations from left to right. The resulting graph has M*2^N nodes; its edges connect partial solutions of neighbouring equations that do not collide (i.e. where the variables shared by the two solutions take the same values). Since only neighbouring equations overlap in their variables, the number of edges per node is some (not too large) constant. Finding such a path is now straightforward and takes polynomial effort. See also here.
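A toy version of that left-to-right sweep, assuming (purely for illustration) a chain where equation i covers variables (x_i, x_{i+1}, x_{i+2}), so consecutive equations overlap in two variables; each equation is given as an 8-entry truth table:

```c
#include <stdio.h>
#include <string.h>

#define M 5                      /* number of equations                    */
#define NVARS (M + 2)            /* variables x0 .. x_{M+1}                */

int main(void)
{
    /* tt[i][a] == 1 iff local assignment a = (x_i, x_{i+1}, x_{i+2})
     * (x_i in bit 0) satisfies equation i.  Example constraint: no window
     * of three consecutive variables may be all-equal.                    */
    unsigned char tt[M][8];
    for (int i = 0; i < M; i++)
        for (int a = 0; a < 8; a++)
            tt[i][a] = (a != 0 && a != 7);

    /* reach[a] == 1 iff equations 0..i can all be satisfied by some
     * assignment whose window for equation i is a.                        */
    unsigned char reach[8], next[8];
    memcpy(reach, tt[0], sizeof reach);

    for (int i = 1; i < M; i++) {                 /* sweep left to right   */
        memset(next, 0, sizeof next);
        for (int a = 0; a < 8; a++) {
            if (!reach[a]) continue;
            for (int b = 0; b < 8; b++)
                /* windows must agree on the two shared variables          */
                if ((a >> 1) == (b & 3) && tt[i][b])
                    next[b] = 1;
        }
        memcpy(reach, next, sizeof reach);
    }

    int sat = 0;
    for (int a = 0; a < 8; a++)
        sat |= reach[a];
    printf("chain of %d equations over %d variables: %s\n",
           M, NVARS, sat ? "satisfiable" : "unsatisfiable");
    return 0;
}
```

Total work here is on the order of M*2^N*2^N table lookups, i.e. linear in the number of equations for fixed N.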

So for the triplets in this paper N is 3, so 2^N is a constant, leaving no exponential in the computational cost function. I hoped other people could comment on their impressions of the paper, but as this is /., RTFA is not so common, I guess :)

Comment Re:Probably Wrong but Clearly Falsifiable (Score 2) 700

This is not a P=NP paper. The paper solves a problem of a related data structure in polynomial time (quartic time), then shows that it can be used to solve some cases of 3SAT. The 3 outputs the algorithm can give are "the formula is not satisfiable", "the formula is satisfiable" (and the solution is given), and "failure of classification" -- it couldn't solve the problem. The important question we wait for on the experts on this paper isn't "is it correct" (it probably is) but "how effective is it".

In fact it does suffice to show that the algorithm determines satisfiability of a 3-SAT instance in polynomial time. Create a derived 3-SAT instance that adds a clause restricting one variable to the value '1'. Then see whether it is still satisfiable. It is not? OK, constrain that variable to '0' instead, and add a constraint for the next variable. Is it still satisfiable? And so on. You see the pattern? For n variables it only takes n extra satisfiability tests to turn the 3-SAT decision algorithm into a 3-SAT solver. Or in CS terms: the 3-SAT solving (search) problem is polynomial-time reducible to the 3-SAT decision problem.
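A minimal sketch of that self-reduction in C (the brute-force decide() below is only a stand-in so the example is runnable; the argument above assumes it would be replaced by the paper's polynomial-time test, and the added constraints are written as unit clauses rather than padded 3-literal clauses for brevity):

```c
#include <stdio.h>
#include <stdlib.h>

#define MAXC 64
typedef struct { int lit[3]; int len; } Clause;    /* literal +v or -v     */
typedef struct { Clause c[MAXC]; int m, n; } Cnf;  /* m clauses, n vars    */

/* Does the bit vector 'assign' (bit v-1 = value of variable v) satisfy f? */
static int sat_under(const Cnf *f, unsigned assign)
{
    for (int i = 0; i < f->m; i++) {
        int ok = 0;
        for (int j = 0; j < f->c[i].len; j++) {
            int v = abs(f->c[i].lit[j]);
            int val = (assign >> (v - 1)) & 1;
            if ((f->c[i].lit[j] > 0) == (val == 1)) ok = 1;
        }
        if (!ok) return 0;
    }
    return 1;
}

/* Decision procedure: "is f satisfiable?" (naive O(2^n) stand-in).        */
static int decide(const Cnf *f)
{
    for (unsigned a = 0; a < (1u << f->n); a++)
        if (sat_under(f, a)) return 1;
    return 0;
}

/* Turn the decision procedure into a solver with n extra decide() calls:
 * fix variables one by one; if forcing v=1 kills satisfiability, v must
 * be 0 in every remaining solution. */
static int solve(Cnf f, unsigned *assign)
{
    if (!decide(&f)) return 0;
    *assign = 0;
    for (int v = 1; v <= f.n; v++) {
        Clause unit = { { v }, 1 };                 /* constrain v = 1     */
        f.c[f.m++] = unit;
        if (decide(&f))
            *assign |= 1u << (v - 1);
        else
            f.c[f.m - 1].lit[0] = -v;               /* constrain v = 0     */
    }
    return 1;
}

int main(void)
{
    /* (x1 v x2 v x3) & (~x1 v ~x2 v x3) & (~x3 v x2 v x1)                 */
    Cnf f = { { { {1, 2, 3}, 3 }, { {-1, -2, 3}, 3 }, { {-3, 2, 1}, 3 } }, 3, 3 };
    unsigned a;
    if (solve(f, &a))
        printf("satisfiable: x1=%u x2=%u x3=%u\n",
               a & 1, (a >> 1) & 1, (a >> 2) & 1);
    else
        printf("unsatisfiable\n");
    return 0;
}
```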

Comment Re:I'll be first to say WTF (Score 1) 700

There are so many errors in your comment that I almost don't know where to start:

Comment Re:Probably Wrong but Clearly Falsifiable (Score 5, Interesting) 700

Maybe I'm overlooking something, but to me it looks like they're doing the reduction to a polynomial-time problem already at the very beginning of the paper (I guess if there is a fault, that's where it hides). As soon as they go to the "compact triplet" structure, the instance of 3-SAT is polynomial-time solvable using a trellis algorithm. Yes, very similar to the algorithm that is employed to decode convolutional codes.

In fact they're decomposing the initial 3-SAT problem into multiple "compact triplet" 3-SAT problems intersected using an AND operation. But as these intersected 3-SAT formulas use the same variables, without any interleaving (permutation) applied, the trellis algorithm still applies (just like decoding a convolutional encoder with more than one check bit per input bit).

Thinking once more about that: the compact triplet structure is clearly not general enough to express generic 3-SAT problems. This is like attempting to transform a quadratic optimization problem x^T*H*x involving a symmetric matrix H into a corresponding problem with a tri-diagonal matrix H.

The only way I see they could do the transform is by introducing exponentially many helper variables, which would bring the exponential blow-up right back. But it does not look like they're attempting something like that.
