
GRAPE6, Now With GNU/Linux Frontend, At 32 TFlops

teuben writes "I am attending the "Astrophysical Supercomputing using Particle Simulations" conference here in Tokyo, and during the first session yesterday Jun Makino announced that the GRAPE6 is now operational, with a 4-headed Linux frontend of individual 1.7GHz PCs (not a quad, just 4 separate PCs). This prototype is now running at 32 Tflops! Best of all, this prototype is scalable, and this configuration is only 1/4 of the final one. Funding currently limits building faster GRAPEs. Check out for the conference website, and for the GRAPE website." But that's not all -- Peter also has word on how you (or more likely your local astrophysics department, since that's what it's best for) can get a GRAPE of your own, and on electronics in Japan.

You can also get a baby GRAPE, see pictures on 90014.html, which runs a good fraction of a TFlop and will cost somewhere around $10k.

I have some more pictures on which show the 1/4-size GRAPE6 running at 32 Tflops. The final full version would cost about 1M$. Compare that to the AsciWhite at 12 Tflop for 100M$. The drawback of course is that the GRAPE only computes things similar to the gravitational N-body problem (also useful for the pharmaceutical industry).

Btw, I also spent some time in Akihabara on Sunday. I guess we're deprived on the US east coast; the number of DVD writers you can get here is amazing. Also very popular here are all kinds of embedded units, e.g. the GPS in your car so you don't get lost in Tokyo!

There was an ABC news story earlier in the year on the GRAPE, but at the time it was running Alphas with their Unix. They have now fully switched to Linux, and this system has been running since July 5."

  • The interesting thing to me here is how well some simple special-purpose hardware can do at certain classes of problems. This sort of flies in the face of the trend towards general COTS hardware and general languages for computation.

    The last time I saw cool and useful specialized hardware was the EFF's cracking machine that won the distributed contest.

    We talk about, for example, Java being fast enough to compete with compiled languages, but the fact of the matter is that a general system could not achieve anywhere near 32 TFlops peak performance on standard PC clusters, where you really just need raw computational speed. I think some other people mentioned that SIMD will get you into the GFlops range, but that is 3 orders of magnitude below the GRAPE machines.

    Before Seymour Cray was killed, one of the last things he was working on was a project aiming for Petaflops performance. You can see just what a high goal that still is. (A Petaflop is 1000 Teraflops!)

    I remember when transputers used to be advertised a lot in Byte and other computer magazines. I wonder if we'll ever see a return of something similar. GRAPE seems pretty specialized, but something a little more general, like FPGA add-on boards, might be a good way to get good price/performance on a PC base (i.e. using a PC cluster instead of an expensive supercomputer). The applications would be limited to computationally intensive things. But, for example, 3D rendering for movie animation might be better done on more specialized hardware.

  • The write-up for this article is just a tad bit misleading. The 32 TFLOPS figure is the "theoretical peak". This is a favorite number for hardware manufacturers to quote, since the theoretical peak far, far exceeds what anyone will see in practice, even when solving the most amenable of problems. To suppose that this hardware will get anywhere near 32 TFLOPS during actual use is just nuts.
  • Imagine a Beowulf Cluster of these!


    (Yes, I know that it's limited hardware. It's just sorta expected.)

  • It looks like the GRAPE boards should be re-configurable. Why'd they do it with custom chips?
    A whack of FPGAs should be pretty decent, and you can configure them for more than just an N-body gravitational problem.

    BTW, Akihabara is over-rated. There's a WHACK of stuff there that we don't get in North America. But wander for a few hours, and you soon realize that the same store exists on every block, repeated over and over and over...
    Besides, Akihabara doesn't usually have the best prices. I loafed around all over Tokyo, and it usually has the highest prices. Just pop over on the train to Ikebukuro or something; they'll have the same mega chain stores (just not repeated every block) and usually lower prices.

    I found digital cameras weren't cheaper or better. The MD players kick ass! 320 min of playtime per disc now, in about the size of three 3.5" floppies stacked up.
    And then there's the colour digital cell phones, about 1/2 the size of ours, for about $10-50US. Woohoo!
  • Checked out the latest Alteras?

    One thing you get with an FPGA is parallelism: you can have as many execution units as you have gates to implement them, and if you need more, add chips. You do have to pay the I/O penalty, but it can work out to single-cycle operations without any of the pipeline stalls you get in a general-purpose processor.
    The other nice thing: most of these parts are reprogrammable, so algorithm tweaks are possible.
  • Maybe now we can compute the emotional state of your girlfriend's mind, and know when to just hit the doghouse before we hit the door.
  • That is, in six years we will hit that, and in 7 years beat that limit. Now to see who's right.
  • That's how they do work. A 100Tflop grape6 machine would have 3072 individual units powering it.
  • Grape6 is 32Tflops, not 32Gflops. You are off by three orders of magnitude. COTS cannot achieve anywhere near this level of speed currently.
  • As a matter of fact, my advisor (who's at that conference - leaving me back here to play with my new laptop while my workstation analyzes simulations for me) has one. Just 4 nodes with one board each for now to get the code working, but will be scaled up when we're confident in the code. :-)=

  • As other posters have stated, the GRAPE6 system uses custom special purpose hardware, but one should also keep in mind that 32 Teraflops is the theoretical peak performance. A complete GRAPE6 system is supposed to have a peak speed of 100 Tflops.

    If you read the paper at about the prototype GRAPE6 system, you would have noticed that, according to the paper, when they actually did a simulation of the evolution of a galactic nucleus containing triple massive black holes, they only got about half the theoretical peak performance of the prototype.

  • It's correct that the GRAPE [GRAvity PipelinE] machines are not general purpose computers -- they only compute forces [the O(n^2) part of starfield calculations, for example]. We've got a GRAPE4 and two of the MDGRAPEs [Molecular Dynamics, ie more general force laws]. A 16-processor GRAPE6 is due RSN.

    We're running some tests later this month, shipping data from the Tokyo GRAPE farm across TransPac to the Indianapolis HPSS silo, testing differentiated QoS [another whole thread, involving Napster and GriPhyN]. The idea is to eventually send slices of the data on to the American Museum of Natural History Planetarium in New York, linking three "specialized instruments."

  • Tight clothing should be removed and a cold compress or ice pack applied. If the fever is very high, the body should be wrapped in a wet sheet and the patient kept in an airy place

    Yow! These "*BSD is Dying" posts are getting weirder and weirder...

  • I should have waited for more of the grape website to actually load, and installed Chinese text support. I am dumb and apologize for wasting your time. Below is some useless crap I concocted to try to save my ass from looking like an idiot. I failed, just like I failed out of college and at just about every other aspect of my miserable fucking life. It is funny how after a while you get used to it, and sleep a lot to pass the time--because if you don't, you end up wanting to impale your head on a pointy wrought-iron fence and just be done with it.

    I guess the specialized boards are doing far more floating point operations per set of data sent to them than I thought. I didn't see how they could do this without saturating the bus to the number-crunching hardware -- or especially saturating the PC's CPU itself, because it would have to send the information to the boards, retrieve the results, and do I/O.

  • I find it hard to believe that any 1.7GHz PC can pull off 8 Teraflops, as the article states (32/4). A single Gigaflop is attainable, but it depends on what FLOPS weighting they are using... e.g. flops1, flops2, etc., which have different amounts of floating point divides, additions, and the like.

    -Mike (on a 24 MegaFlop Indigo2)
  • the kind of RC5 rate one of these would get!
  • I think there's a few corrections necessary.
  • Sounds like a catapult.
  • Drawback of course is that the Grape only computes things similar to the gravitational N-body problem (also useful for pharmaceutical industries).
    So? If it's Turing-complete I can read slashdot on it -- or any other app.[1] Just a question of how long...

    after all, linux itself was a hack to get unix onto x86...

    [1] of course Slashdot will run equally slowly. But imagine your {FPS title} frame rate!
  • Something is wrong, I could actually read the article, and... see the pictures! wow.
  • it's not being served by the GRAPE6.
  •'s not always best, humor-wise, to go for the low-hanging fruit...
  • Although it breaks my heart to see my precious Karma go, I do believe that the 'Redundant' moderation of my earlier post is most appropriate!
  • I hate to correct someone, but a dual-G4 set-up running at 733 MHz will get you 7 gigaflops. Not too shabby, now is it?
  • I can imagine it easily: Zero. The GRAPE units can't do RC5. Read the article.

  • do they have seeds?

    Having a vinyard would be quite the cluster of GRAPEs.
  • It looks like the GRAPE boards should be re-configurable. Why'd they do it with custom chips?

    A whack of FPGAs should be pretty decent, and you can configure them for more than just an N-body gravitational problem.

    Because in silicon it's pretty much twice as fast as an FPGA gets? (EE's rule of thumb, admittedly referring to microwave apps.)

    Not only that, but why would they want to do something other than N-body gravitational problems? _You_ might, but there are a lot of such problems to do, and that's what this is designed for.
  • Ahh yes, that's right... I've never been given a chance to use ours, so I was speaking from what I remembered from the little reading I've done ;)

  • ...through the GRAPEvine?

    Worldcom [] - Generation Duh!
  • >I've never been given a chance to use ours

    Probably because we're not convinced that the lockout works properly on the GRAPE5s. I know it works well on the GRAPE3, but VE and MS have done some tests on the GRAPE5 where they've tried hammering it with 2 different jobs, and the results haven't been kosher.

    Anyone have experience with GRAPE5 and notice this? Anything that could be changed in the API? I suppose we could write a wrapper around the g5_open and g5_close calls that does additional locking, but that seems inelegant.
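
    A minimal sketch of what such a wrapper might look like: serialize board sessions with an exclusive file lock taken before g5_open and released after g5_close. The g5_open/g5_close names come from the thread above; the lock-file path and the Python framing are assumptions for illustration, not the real GRAPE-5 API.

```python
import fcntl
from contextlib import contextmanager

LOCKFILE = "/tmp/grape5.lock"   # assumed path; one lock file per board


@contextmanager
def locked_grape5(g5_open, g5_close):
    """Hold an exclusive flock around the board session so two jobs
    can't drive the GRAPE-5 at once."""
    with open(LOCKFILE, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # blocks until the other job is done
        try:
            g5_open()
            yield
        finally:
            g5_close()                     # flock released when the file closes
```

    Closing the lock file releases the flock even if the job dies mid-run, which is the main thing a bare g5_open/g5_close pair doesn't guarantee.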


  • Hi Doug! :-)=

    Just a clarification:

    (actually some of them do SPH (smoothed particle hydrodynamics) as well)

    The boards themselves don't do the SPH calculations. What they do is return neighbour lists for each particle, which reduces the load necessary to compute the hydro forces.
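
    A sketch of that division of labour: a stand-in for the board call returns, for each particle, the indices of its neighbours within the smoothing radius h, and the host then runs its hydro loop over only those short lists. The function name and the plain all-pairs search are illustrative stand-ins, not the real board interface.

```python
import numpy as np


def neighbour_lists(pos, h):
    """Stand-in for the board call: indices of all particles within
    distance h of each particle (self excluded)."""
    n = len(pos)
    lists = []
    for i in range(n):
        dx = pos - pos[i]                  # offsets from particle i to all j
        r2 = (dx * dx).sum(axis=1)         # squared distances
        near = (r2 < h * h) & (np.arange(n) != i)
        lists.append(np.where(near)[0])
    return lists
```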


  • Fast?

    Nope... the GRAPES are each one sweet system!
  • Enough people did answer that, but in that vein I should comment that a colleague of mine added some assembler code (incidentally for the same N-body code). He's been using the 3DNow SIMD instruction set on the AMD directly, and was able to get about 2 billion PP interactions in 45 seconds, which translates to about 2 Gflops on a 1.2 GHz AMD (his math). With 8-10 such Athlons they could compete with the Grape5 in speed. Of course that's still far from the Grape6 speed. But depending on your problem and budget, you can still get pretty far with COTS.
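
    His arithmetic roughly checks out if you assume the conventional N-body benchmark count of about 38 floating-point operations per particle-particle interaction (an assumption here; the exact weighting varies by convention):

```python
interactions = 2e9          # pairwise interactions computed
seconds = 45.0
flops_per_interaction = 38  # assumed convention; exact count varies
rate = interactions * flops_per_interaction / seconds
print("%.2f Gflops" % (rate / 1e9))   # about 1.7 Gflops, i.e. "about 2 Gflops"
```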
  • by sharkey ( 16670 ) on Tuesday July 10, 2001 @02:03PM (#93403)
    Actually, I think it would be called a "bunch" of GRAPEs.

  • by RobertFisher ( 21116 ) on Tuesday July 10, 2001 @03:08PM (#93404) Homepage Journal
    The key point in this analysis is that they get 32 TFlops in doing the gravity summation for a discrete particle simulation.

    If you have additional physics (hydrodynamics, etc.), that processing must happen on the workstation which is running the simulation. So the performance is ultimately bottlenecked by the workstation. In practice, GRAPE practitioners typically do not see anything close to the theoretical peak of their boards.

  • by drudd ( 43032 ) on Tuesday July 10, 2001 @01:40PM (#93405)
    Read the article...

    Grape boards are highly specialized chips which do nothing but N^2 direct summation gravity force calculations (actually some of them do SPH (smoothed particle hydrodynamics) as well).

    You take a pc/sun/beowulf cluster and link it to a set of grape boards. You then send particles to the boards and get accelerations back.

  • by bentini ( 161979 ) on Tuesday July 10, 2001 @01:38PM (#93406)

    FPGAs are, unfortunately, slow.

    They're made with a worse process than custom chips. For inner loops, you want as fast as you can get. You pay for programmability, and if it's always the same task, special-purpose is best.

    It's like the difference between hand-assembled code and a compiler. You get it easier with the compiler, but hand-assembling can be better when you know the specifics.

    The n-body gravitational problem is going to be around for a while, so it makes sense to customize to it.

  • by 2Bits ( 167227 ) on Tuesday July 10, 2001 @02:33PM (#93407)
    The final full version would cost about 1M$. Compare that to the AsciWhite at 12 Tflop for 100M$.

    What??? A machine like that would cost one Microsoft? Either I have been sleeping thru all this time while inflation is running rampant, or M$ is not worth that much anymore.

  • by grammar nazi ( 197303 ) on Tuesday July 10, 2001 @01:13PM (#93408) Journal
    Q: What did the Grape9 say when it was crushed with my Windows number-factoring algorithm?

    Give up?

    A: Nothing, it just made a little Wine.

  • by Drakula ( 222725 ) <tolliver&ieee,org> on Tuesday July 10, 2001 @01:18PM (#93409) Homepage Journal
    Does this mean that other OSes (cough, Windows, cough) should have sour grapes?
  • by DarkMan ( 32280 ) on Tuesday July 10, 2001 @01:43PM (#93410) Journal
    Read the article.

    With reference to the calculations they are doing, they are simply computing

    F_i = G * m_i * SumOverAll(j .NE. i) m_j * (x_j - x_i) / |x_j - x_i|^3

    They are doing this with custom hardware.

    This is not a general purpose computer.

    Despite what the blurb said, there are 96 independent units doing the calculation in each machine, to get the 32 TFlops across the system.

    There is a picture of an earlier model, which is about the size of one of my filing cabinets.

    Remember these are scientists, not marketing people, making those claims. They expect to be asked to justify them - and they have.
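
    For illustration, here is that direct-summation force law as a naive O(N^2) Python loop (G, masses, and positions in arbitrary code units -- this is just what the hardware computes, not how it computes it):

```python
import numpy as np

G = 1.0  # gravitational constant in code units


def force(i, x, m):
    """F_i = G * m_i * sum over j != i of m_j * (x_j - x_i) / |x_j - x_i|^3"""
    f = np.zeros(3)
    for j in range(len(x)):
        if j == i:
            continue
        dx = x[j] - x[i]                       # vector from particle i to j
        f += m[j] * dx / np.linalg.norm(dx) ** 3
    return G * m[i] * f
```

    Two unit masses a unit distance apart pull on each other with equal and opposite unit forces; the GRAPE pipelines exactly this inner loop in silicon.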
  • by Matt2000 ( 29624 ) on Tuesday July 10, 2001 @01:57PM (#93411) Homepage

    How slashdot slows scientific progress in the world:

    1. Oh look, an interesting story on academic research on slashdot.
    2. Oh look, a lovely link to those poor academics' website. Surely they have the $40k necessary to make a server that can handle the load from slashdot?
    3. Oh look, the reeking Sun Ultra 5 that they were using for web duties has burst into flames, destroying the lab and scaring a small puppy that lives in the lab next door.

    To hell with you slashdot for burning puppies.
  • by devphil ( 51341 ) on Tuesday July 10, 2001 @01:27PM (#93412) Homepage

    A: GRAPEs and chess-playing computers, such as the one that tackled Kasparov (Deep Blue?), both accomplish their opening-up-of-cans-of-mathematical-whoopass via the same approach: functions in the innermost loops are done via calls to special-purpose hardware cards. The rest is done with software.

    So, say I take a GRAPE, and replace its special N-body gravitational daughtercard with one containing a few FPGAs programmed for, say, RC5; now I have a cracking machine. And then reprogram the FPGA to do image manipulation instead; now I have a renderer to make my own Toy Story. And then reprogram the FPGA to do, etc, etc.

    Of course, I'm still lacking the software. So actually this post is mostly babbling. :-)

Behind every great computer sits a skinny little geek.