


The Potential of Science With the Cell Processor

prostoalex writes "High Performance Computing Newswire is running an article on a paper by computer scientists at the U.S. Department of Energy's Lawrence Berkeley National Laboratory. They have evaluated the Cell processor's performance in running several scientific application kernels, then compared this performance against other processor architectures. The full paper is available from the Computer Science department at Berkeley."


  • by suv4x4 ( 956391 ) on Sunday May 28, 2006 @08:19AM (#15419971)
    The paper did a lot of hand-optimization, which is irrelevant to most programmers. What gcc -O3 does is way more important than what an assembly wizard can do for most projects.

    Actually, bullshit. We're talking about scientific applications here, and it's not uncommon for programs written to run on supercomputers to *be* optimized by an assembly wizard to squeeze every cycle out of them.

  • by Poromenos1 ( 830658 ) on Sunday May 28, 2006 @08:33AM (#15419995) Homepage
    Doesn't the Cell's design mean that it can scale up very easily, without requiring any changes to the software? Just add more computing units (SPEs, I think they're called?) and the Cell runs faster without changing your software.

    I'm not entirely sure of this, can someone corroborate/disprove?
  • Ease of Programming? (Score:3, Interesting)

    by MOBE2001 ( 263700 ) on Sunday May 28, 2006 @09:56AM (#15420235) Homepage Journal
    FTA: "While their current analysis uses hand-optimized code on a set of small scientific kernels, the results are striking. On average, Cell is eight times faster and at least eight times more power efficient than current Opteron and Itanium processors."

    The Cell processor may be faster, but how easy is it to implement an optimizing development system that eliminates the need to hand-optimize the code? Isn't programming productivity just as important as performance? I suspect that the Cell's design is not as elegant (from a programmer's POV) as it could have been, only because it was not designed with an elegant software model in mind. I don't think it is a good idea to design a software model around a CPU. It is much wiser to design the CPU around an established model. In this vein, I don't see the Cell as a truly revolutionary processor because, like every other processor in existence, it is optimized for the algorithmic software model. A truly innovative design would have embraced a non-algorithmic, reactive, synchronous model, thereby killing two birds with one stone: solving the current software reliability crisis while leaving other processors in the dust in terms of performance. One man's opinion.
  • by cfan ( 599825 ) on Sunday May 28, 2006 @11:23AM (#15420516) Homepage
    > So the Cell is great because there's going to be millions of them sold in PS3's so they'll be cheap. But it's only really great if a new custom variant is built. Sounds kind of contradictory.

    No, the Cell is great because, as the pdf shows, it has an incredible Gflops/Power ratio, even in its current configuration.

    For example, here are the Gflops (double precision) obtained in 2d FFT:

              Cell+   Cell    X1E     AMD64   IA64
    1K^2      15.9    6.6     6.99    1.19    0.52
    2K^2      26.5    6.7     7.10    0.19    0.11
    So a single, ordinary Cell can be compared with the processor of a Cray (which uses three times more power and costs a lot more).
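    A quick sanity check on the quoted 2K^2 numbers (the Gflops figures are from the table above; the ratio arithmetic is mine, using AMD64 as the baseline):

    ```python
    # Double-precision 2D FFT throughput (Gflops) quoted above, 2K^2 case.
    gflops = {"Cell+": 26.5, "Cell": 6.7, "X1E": 7.10, "AMD64": 0.19, "IA64": 0.11}

    # Express each architecture as a multiple of the AMD64 result.
    for name, g in gflops.items():
        print(f"{name}: {g / gflops['AMD64']:.1f}x AMD64")
    # A stock Cell comes out at roughly 35x the Opteron on this kernel,
    # in the same league as the Cray X1E (~37x).
    ```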
  • by john.r.strohm ( 586791 ) on Sunday May 28, 2006 @11:24AM (#15420522)
    Irrelevant to most C/C++ code wallahs doing yet another Web app, perhaps.

    Irrelevant to people doing serious high-performance computing, not hardly.

    I am currently doing embedded audio digital signal processing. On one of the algorithms, even with maximum optimization for speed, the C/C++ compiler generated about 12 instructions per data point, where I, an experienced assembly language programmer (although with no previous experience on this particular processor), did it in 4 instructions per point. That's a factor-of-3 speedup for that algorithm. Considering that we are still running at high CPU utilization (pushing 90%), and taking into account the fact that we can't go to a faster processor because we can't handle the additional heat dissipation in this system, I'll take it.

    I have another algorithm in this system. Written in C, it is taking about 13% of my timeline. I am seriously considering an assembly language rewrite, to see if I can improve that. The C implementation as it stands is correct, straightforward, and clean, but the compiler can only do so much.

    In a previous incarnation, I was doing real-time video image processing on a TI 320C80. We were typically processing 256x256 frames at 60 Hz. That's a little under four million pixels per second. The C compiler for that beast was HOPELESS as far as generating optimal code for the image processing kernels. It was hand-tuned assembly language or nothing. (And yes, that experience was absolutely priceless when I landed my current job.)
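    The back-of-the-envelope math for that second algorithm (a sketch; the 13% is from the post above, and the factor-of-3 speedup is an assumption carried over from the first algorithm, not a measurement):

    ```python
    # What an assembly rewrite might buy on the second algorithm,
    # assuming the same 3x improvement observed on the first one.
    timeline_share = 0.13   # fraction of the real-time budget used by the C version
    speedup = 3.0           # measured on the first algorithm; assumed here

    new_share = timeline_share / speedup
    freed = timeline_share - new_share
    print(f"new share: {new_share:.1%}, headroom freed: {freed:.1%}")
    # With CPU utilization pushing 90%, even a few percent of freed
    # timeline is significant headroom.
    ```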
  • by golodh ( 893453 ) on Sunday May 28, 2006 @11:26AM (#15420527)
    Although I agree with your point that crafting optimised assembly language routines is way beyond most users (and indeed a waste of time for all but an expert), there are certain "standard operations" that:

    (a) lend themselves extremely well to optimisation

    (b) lend themselves extremely well to incorporation in subroutine libraries

    (c) tend to isolate the most compute-intensive low-level operations used in scientific computation


    If you read the article, you will find (among others) a reference to an operation called "SGEMM". This stands for Single precision General Matrix Multiplication. These are the sort of routines that make up the BLAS library (Basic Linear Algebra Subprograms) (see e.g. http://www.netlib.org/blas/ [netlib.org]). High performance computation typically starts with creating optimised implementations of the BLAS routines (if necessary handcoded at assembler level), sparse-matrix equivalents of them, Fast Fourier routines, and the LAPACK library.
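    For the curious: the SGEMM contract is C <- alpha*A*B + beta*C in single precision. A minimal pure-Python sketch of those semantics (real BLAS implementations are heavily blocked, vectorized, and nothing like this naive loop):

    ```python
    def sgemm(alpha, A, B, beta, C):
        """Naive reference for the BLAS SGEMM contract: C <- alpha*A@B + beta*C.
        A is m x k, B is k x n, C is m x n (lists of lists of floats)."""
        m, k, n = len(A), len(B), len(B[0])
        for i in range(m):
            for j in range(n):
                acc = sum(A[i][p] * B[p][j] for p in range(k))
                C[i][j] = alpha * acc + beta * C[i][j]
        return C

    # 2x2 example: alpha=1, beta=0 reduces to plain matrix multiplication.
    C = sgemm(1.0, [[1, 2], [3, 4]], [[5, 6], [7, 8]], 0.0, [[0, 0], [0, 0]])
    print(C)  # [[19.0, 22.0], [43.0, 50.0]]
    ```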


    There is a general movement away from optimised assembly language coding for the BLAS, as embodied in the ATLAS software package (Automatically Tuned Linear Algebra Software; see e.g. http://math-atlas.sourceforge.net/ [sourceforge.net]). The ATLAS package provides the BLAS routines and produces fairly optimal code on any machine using nothing but ordinary compilers. How? If you run the makefile for the ATLAS package, it may take about 12 hours to compile (depending on your computer, of course; this is a typical number for a PC). In that time the makefile simply runs through multiple combinations of compiler switches and tuning parameters for the BLAS routines, running test suites for all of them over varying problem sizes. It then picks the best combination for each routine and each problem size on the machine architecture on which it's being run. In particular, it takes account of the size of the caches. That's why it produces much faster subroutine libraries than those produced by simply compiling e.g. the reference BLAS routines with an -O3 optimisation switch thrown in.
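    The ATLAS idea in miniature (a toy sketch, not ATLAS itself): benchmark one kernel at several candidate blocking parameters on the machine at hand, then keep whichever is fastest there.

    ```python
    import random
    import time

    def blocked_matmul(A, B, n, bs):
        """Multiply two n x n matrices using bs x bs cache blocking."""
        C = [[0.0] * n for _ in range(n)]
        for ii in range(0, n, bs):
            for kk in range(0, n, bs):
                for jj in range(0, n, bs):
                    for i in range(ii, min(ii + bs, n)):
                        for k in range(kk, min(kk + bs, n)):
                            a = A[i][k]
                            for j in range(jj, min(jj + bs, n)):
                                C[i][j] += a * B[k][j]
        return C

    n = 64
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]

    # "Install-time" search: time each candidate block size, keep the winner.
    timings = {}
    for bs in (4, 8, 16, 32, 64):
        t0 = time.perf_counter()
        blocked_matmul(A, B, n, bs)
        timings[bs] = time.perf_counter() - t0
    best = min(timings, key=timings.get)
    print(f"best block size on this machine: {best}")
    ```

    ATLAS does this for every BLAS routine, over many more parameters and problem sizes, which is why the build takes hours rather than seconds.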

    Specially tuned versus automatic: MATLAB

    The question is of course: who wins, specially tuned code or automatic optimisation? This can be illustrated with the example of the well-known MATLAB package. Perhaps you have used MATLAB on PCs and wondered why its matrix and vector operations are so fast? That's because for Intel and AMD processors it uses a special vendor-optimised subroutine library (see http://www.mathworks.com/access/helpdesk/help/techdoc/rn/r14sp1_v7_0_1_math.html [mathworks.com]). For SUN machines, it uses SUN's optimised subroutine library. For other processors (for which there are no optimised libraries) MATLAB uses the ATLAS routines. Despite the great progress and portability that the ATLAS library provides, carefully optimised libraries can still beat it (see the Intel Math Kernel Library at http://www.intel.com/cd/software/products/asmo-na/eng/266858.htm [intel.com]).
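    That dispatch pattern (vendor library if present, portable fallback otherwise) is easy to sketch. The module names here are hypothetical stand-ins for MKL/ACML/ATLAS bindings, purely for illustration:

    ```python
    def _generic_dot(A, B):
        """Portable pure-Python fallback, analogous to reference BLAS."""
        n, k, m = len(A), len(B), len(B[0])
        return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
                for i in range(n)]

    def pick_backend():
        # 'vendor_blas' and 'atlas_blas' are hypothetical module names
        # standing in for a vendor-tuned library and an ATLAS build.
        for name in ("vendor_blas", "atlas_blas"):
            try:
                mod = __import__(name)
                return name, mod.dot
            except ImportError:
                continue
        return "generic", _generic_dot

    backend, dot = pick_backend()
    print(backend)                    # "generic" unless a tuned library is installed
    print(dot([[1, 2]], [[3], [4]]))  # [[11]]
    ```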


    In summary:

    -large tracts of scientific computation depend on optimised subroutine libraries

    -hand-crafted assembly-language optimisation can still outperform machine-optimised code.

    Therefore the objections that the hand-crafted routines described in the article distort the comparison or are not representative of real-world performance are invalid.

    However... it's so expensive and difficult that you only ever want to do it if you absolutely must. For scientific computation this typically means that only "inner loop primitives" such as the BLAS routines, FFTs, SPARSEPACK routines, etc. are considered for this treatment, and that you just don't attempt to do it yourself.

  • by Duncan3 ( 10537 ) on Sunday May 28, 2006 @01:26PM (#15420953) Homepage
    I love how they manage to completely ignore all the other vector-type architectures already on the market, and just compare it to Intel/AMD, which are not even designed for floating point performance.

    Scream "my computer beats your abacus" all you want.

    But then it is from Berkeley, so that's normal. ;)
