
IBM Creates New Fastest Beowulf Cluster

shawnb writes "It seems that IBM has created the world's fastest Linux cluster, built from lots of small servers (64 IBM Netfinity 256 servers). The Netfinity servers are linked together using 'special clustering software and high-speed networking hardware, which causes the separate units to act as one computer, delivering a processing speed of 375 gigaflops, or 375 billion operations per second.' They also go on to say that although this is the fastest Linux supercomputer, 'it will only rank 24th on the list of the top 500 fastest supercomputers.'"
  • by Anonymous Coward
    According to this article, it's not 64 Netfinity 256s (does a Netfinity 256 model even exist?), but 256 two-CPU Netfinity servers. (Maybe the new Netfinity 4000s?)
  • by Anonymous Coward
    "it will only rank 24th on the list of the top 500 fastest supercomputers. " - don't they remember the first beowulf clusters that ranked at 250 or lower? It's like "my car only drives at mach 3" ;-)
  • by Anonymous Coward
    I wonder if we can run Linux on this. Oh, wait we can.

    Then I wonder if we can make a Beowulf of these things! Oh wait, that's what they are doing with 'em...

    Ummm, ahhh, ummmm... Grits anyone?
  • That seems pretty amazing to me, that one can 'throw' 64 readily available systems together to create the 24th most powerful system in the world.
  • From the article they say that clusters max out at 64 machines, limiting their size - but also it's claimed that the cluster acts like a single machine, so my question is, why can't you cluster the clusters to use 4096 machines. Is it simply a case of (lack of) bandwidth linking the machines together?

    It's probably a software limitation, but probably not a bad one. Large clusters get unwieldy quickly, and network latency and bandwidth are the bane of any parallel programmer's existence. Communication between the nodes of a cluster is several orders of magnitude slower than referencing internal memory, and no real parallel program has fully autonomous nodes. It's no coincidence that Donald Becker, a major contributor to Beowulf on Linux, also wrote huge chunks of the kernel networking, tons of network card drivers, and a network channel bonding implementation for Linux.

    My point is that you can create a cluster with thousands of nodes, but doing so is an administrative and technological nightmare. For most parallel problems, it's much easier (and generally more efficient) to have a smaller number of more powerful nodes.

  • Argh!!! I meant to say under three quarters of a million... It's not even early morning, so I don't have that excuse...
  • by jjr ( 6873 )
    This is pure processing power, but the biggest problem is that when you have lots of data, hard drives are your biggest enemy. We need faster hard drives. When I see hard drive speeds reach the level of processor speeds, then we will really be cooking.
  • I wonder if there are _any_ Microsoft clusters in that top 500 and if on some miraculous chance there are where they stand... Hmmm, 64 of each running side by side - a new benchmark for Mindcraft to screw up? :)

  • No, it wouldn't, unless you put all of your program into the kernel. You don't want to pay the price of going into and out of the kernel.

    Ideally you'll implement everything userlevel, including your networking, so you never go into the kernel.

  • The UP machines might have a bus for each CPU, but the SMP machines' busses are still a lot faster than having to go through Ethernet for inter-CPU communication. Two SMP machines will beat four UP machines any day, at least until we get faster networking (or more importantly, lower network latency).
  • Since these are 733MHz PIII machines, I suspect they have 64bit PCI as well. That gives you 266MB/s bandwidth on the bus.

  • by BJH ( 11355 )

    If you bothered to read the RC5 FAQ, you'd see that they've been asked this question enough times for it to have an item in the FAQ all of its own. They say, basically, that the RC5 distributed network is pretty much the same as a Beowulf network in terms of processing efficiency for RC5, so a Beowulf would perform much the same as the same number of machines operating independently on the project.
  • This isn't true scalability in the sense that server manufacturers mean. By that I mean you can't just take MySQL, Apache, Oracle, or any other application, code it to Linux's APIs, install it on a Beowulf cluster, and watch it magically take advantage of all the nodes. This is a completely different beast, whose only commonality with desktop and server Linux OSes is that a common kernel is running on each machine.

    Microsoft could do the same thing, install the NT kernel on a bunch of machines and run a program on top of it to actually pool machine resources, if they felt like it, but that's not at all their target market. Really, no pure OS vendor is going to approach the supercomputer market. There's no demand for just OSes... people want machines. Since people are kind enough to develop Beowulf and Linux, vendors can play around with the idea of using Linux clusters.
  • Yeah, we'll be cooking bacon, eggs, steaks, and whatever else you dare leave in the same room as your nifty 100,000 RPM hard drives.

    Really, there are so few applications out there that are actually hard disk intensive. And for those, it's actually beneficial to use disk arrays for the redundancy.

    Everything else can benefit from more RAM to avoid having to swap to the disk. RAM's solid state. Hard drives aren't. Don't expect them to ever have more than a very small fraction of the speed of RAM, cache, etc...
  • AFAIK, Mosix can run over any TCP/IP, so ethernet included.

  • Just thought I would mention that a cluster of Linux computers is by definition a Beowulf Cluster.

    Not completely, there is also mosix [].
  • That's why I wrote "IMNSHO", and I didn't think long enough about it. You're absolutely right.

    My background is variational methods (the theory of it) and I should have been more specific in narrowing the applications down.

  • numeric applications of all kind for solving partial differential equations

    finite elements, particle methods esp. in 3-dimensions (very computing intensive), finite difference (nobody uses that anymore IMNSHO), whatever they use in molecule modelling.

    What is that used for:

    crash simulation (cars etc.), airplane design, engine design, meteorology (very ugly because of 3-D problems), fluid dynamics, nuclear weapons simulation (somewhat ugly for other reasons).

    There's no way there will ever be enough computing power, even for very non-esoteric applications. For instance, imagine doubling the density of weather stations in three dimensions: this results in 8 times more input data, which causes 64 times more computing (theoretically) and perhaps much slower convergence of your solving algorithms.
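The weather-station arithmetic can be written out as a small sketch. The 8x comes straight from doubling along three axes; the 64x only follows if the solver's work grows with the square of the amount of input data. That exponent is an assumption made here to reproduce the quoted figure, not something the post states, and `refinement_cost` is just an illustrative name.

```python
def refinement_cost(factor, dims=3, work_exponent=2.0):
    """Growth in input data and in computing work when the sampling
    density is multiplied by `factor` along each of `dims` axes.

    work_exponent is an ASSUMPTION: 2.0 (work ~ data squared) is the
    value that reproduces the "64 times more computing" figure.
    """
    data_growth = factor ** dims                 # 2 ** 3 = 8
    work_growth = data_growth ** work_exponent   # 8 ** 2 = 64
    return data_growth, work_growth


data_growth, work_growth = refinement_cost(2)
print(f"input data grows {data_growth}x, work grows {work_growth:.0f}x")
```

With a different solver the exponent changes, and so does the bill: a method whose work is linear in the data would only cost 8x, while a cubic one would cost 512x.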

  • IMHO, SMP is going to run out of steam quickly

    I believe you're wrong there. Although Beowulf clusters are nice for applications that are computationally intensive but don't require much bandwidth, a lot of problems require a lot of bandwidth. This is where SMP boxes rule, since bandwidth between processors on a single motherboard/backplane is a LOT larger than across a network. For these types of problems, the time required is more a function of bandwidth than of CPU, so an SMP box with fewer processors but more bandwidth between them will beat even a large Beowulf/clustering solution.
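A toy model makes the bandwidth argument concrete. This is only a sketch: it assumes compute and communication do not overlap and that all exchanged data crosses one link, and every number below (processor counts, per-processor speed, link speeds, job sizes) is made up for illustration rather than measured on any real machine.

```python
def run_time(work_gflop, procs, gflops_per_proc, data_gb, link_gb_per_s):
    """Estimated wall time in seconds: perfectly parallel compute plus the
    time to move `data_gb` over a link of `link_gb_per_s`.
    Simplifying ASSUMPTION: no overlap of compute and communication."""
    compute = work_gflop / (procs * gflops_per_proc)
    comm = data_gb / link_gb_per_s
    return compute + comm


# Hypothetical machines: an 8-way SMP box with a fast backplane versus
# 64 single-CPU nodes on 100 Mbit Ethernet (0.0125 GB/s).
SMP = dict(procs=8, gflops_per_proc=0.5, link_gb_per_s=1.0)
CLUSTER = dict(procs=64, gflops_per_proc=0.5, link_gb_per_s=0.0125)

for label, data_gb in (("bandwidth-heavy job", 100.0), ("compute-heavy job", 0.1)):
    smp = run_time(1000.0, data_gb=data_gb, **SMP)
    cluster = run_time(1000.0, data_gb=data_gb, **CLUSTER)
    print(f"{label}: SMP {smp:.1f} s, cluster {cluster:.1f} s")
```

Under these assumed numbers the SMP box wins easily once the job moves a lot of data, while the cluster wins when communication is negligible, which is the trade-off both sides of this thread are describing.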

  • yup. top500 []

  • We maintain a 64 node Sun Ultra5 cluster at SUNY Buffalo. It uses Myrinet for interconnect, utilizing eight 16 port switches, with four of those each connected to the other four with a single link. It scales REALLY well: for Linpack, we get about 85.5% of the single processor speed no matter how many processors we use. I can't imagine that going down very much if we add more boxes. There are Myrinet installations out there that have 1000+ nodes (many search engines use Myrinet), so it's got to scale well for 64+ nodes.

    Of course that's only an educated guess on my part. If I'm incorrect, note that Myrinet is coming out with single unit 64 and 128 way switches later this year. That should help improve the interconnect situation for larger clusters a great deal. Prices will be dropping too, possibly putting Myrinet in reach of groups with smaller budgets.


  • One dual processor SMP box isn't necessarily better than two UP boxes. Contention for memory and the network card are going to be big issues. Using Myrinet with MPI on an SMP box isn't as good as it could be at the moment, either. MPICH-GM (MPI w/ support for Myrinet) doesn't support communication between processes on the same machine using shared memory; they have to go to the Myrinet switch and back. This should be resolved soon, though.


  • Sure. There are basically two major kinds of supercomputer: shared memory (e.g. SGI's Origin2000) and distributed memory (e.g. IBM SP). Beowulf style clusters fall into the distributed memory category. More expensive interconnect (e.g. Myrinet) starts to approach the speed & latency of that in commercial offerings at a lower price.

    Tasks such as 3D rendering are not very communications intensive, so a beowulf-style machine with processors that compare to those in commercially available machines will run at about the same speed. Communication intensive tasks, such as meteorological simulations, don't run as well unless you shell out the big bucks for better interconnect.

    The linpack benchmark, which solves a system of linear equations, is used to determine ranking on the top500 list. See the website for more info.


  • The theoretical peak performance (TPP) for one 733MHz PIII is 733MFlops, as it can do one FP op per clock. That's a TPP for the whole cluster (512 processors) of about 375GFlops. Now let's say it scales as well across 256 nodes as it does on our 64 node cluster that uses Myrinet (85.5%). That gives 320GFlops.

    Unfortunately, one chip isn't actually going to produce 733MFlops on Linpack. A PIII-500 gets about 200, which is 40% of the TPP. Dunno much about the 733MHz chips (except the cache runs at full processor speed, but if it's only 512KB it's not going to offer much improvement), but I'll be nice and say it gets 75% of the TPP. OK, I'm probably being REALLY nice there. That leaves us with ~240GFlops.

    Of course, for a press release 375GFlops looks a lot better :-)
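For anyone who wants to play with the numbers, the estimate above can be sketched in a few lines. The inputs are the ones from this post (733 MHz, one FP op per clock, 256 dual-CPU nodes, 85.5% scaling, a generous 75% per-chip efficiency); the function and constant names are just illustrative.

```python
CLOCK_MHZ = 733        # PIII clock speed from the post
FLOPS_PER_CLOCK = 1    # one FP op per clock, as stated above
NODES = 256            # per the UNM press release
CPUS_PER_NODE = 2      # dual-processor boxes


def cluster_estimate(scaling_eff, per_chip_eff):
    """Return (peak, scaled, sustained) in GFlops.

    scaling_eff:  fraction of single-CPU speed kept across the cluster
    per_chip_eff: fraction of theoretical peak one chip reaches on Linpack
    """
    peak = CLOCK_MHZ * FLOPS_PER_CLOCK * NODES * CPUS_PER_NODE / 1000.0
    scaled = peak * scaling_eff
    sustained = scaled * per_chip_eff
    return peak, scaled, sustained


peak, scaled, sustained = cluster_estimate(scaling_eff=0.855, per_chip_eff=0.75)
print(f"peak {peak:.1f}, scaled {scaled:.1f}, sustained {sustained:.1f} GFlops")
```

Plugging in the 40% that the PIII-500 actually manages instead of the generous 75% drops the sustained figure to roughly 128 GFlops, so the real Linpack number probably lands somewhere between the two guesses.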


  • How about the fact that your post and the other post that pointed this out have both been marked redundant, which, while not inaccurate, defeats the entire purpose of having a redundant rating.
  • Or in typical slashdot fashion clustering+linux=beowulf, regardless of the realities of the situation.
  • Leaving out harddrives, etc. would just slow you down. Extra load on your communications channels is not something you want in a distributed mp machine.
  • Am I the only one around that doesn't know what kind of a machine a "Netfinity 256" is?

    In fact, I took a glance on the IBM web site and I couldn't find any such machine...

    Am I blind or simply stupid?
  • Most likely the high speed networking hardware they refer to is not just an off-the-shelf 100baseTX switch and a bunch of Ethernet cards. If I'm not mistaken, the servers themselves have very fast communications among the processors inside them; the special networking software quite likely takes advantage of the network topology, and some problems (including the test problem they ran, I'd guess) may be well-suited for this type of machine. Communications latency is the bugaboo for high speed computing, so this may explain why they were apparently able to do so much with so little.
  • Wouldn't an implementation of PVM or equivalent in the Linux kernel improve the performance of parallel computing greatly?
    There is a kernel-httpd in the development branch of the Linux Kernel already - So wouldn't a feature in the kernel to enable parallel computing or clustering be the next big step? :-)
  • Well, I think the new Netfinity boxes should have some 64/66 slots, as well as some 64/33 slots... the 66MHz would buy you a good bit of performance, since you can transfer the data just that much faster (sometimes you have to wait for the bus, after all...)
  • the IBM cluster is 256 boxes, so if they are SMP, you have 512 procs

    Assuming they are using the Netfinity 8500R (8-way SMP), there could be a total of 2048 processors.
  • well I am not sure how big a petabyte is so
    I went with 10K Gigabyte which means all you need is 134 of those new 75 Gig hd's
    So how about a 140 dual processors alpha systems with 500Mb ram, 75 Gig hd, dual fiberchannel
    networking with a cray as the controlling host.

    Oh pardon the drool. ahhh ahah PORK CHOPS!!!!!

  • I think there is a significant clue that it is a Beowulf or Beowulf derivative in that the name of the machine is Los Lobos, 'The Wolves'.

    Of course it could just be a reference to the UNM basketball team, the Lobos.

  • Gigabit should do the trick. I believe that 32-bit, 33 MHz PCI is 132 MB/s. Gigabit Ethernet is at most 125 MB/s on the wire, so as an off-the-shelf solution it would be pretty close to any proprietary one you could implement. 64-bit 66MHz PCI is another matter.
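The bus figures being traded in this thread all come from the same arithmetic: width in bytes times clock for PCI, and bits divided by eight for Ethernet. A minimal sketch, using the round 33 and 66 MHz clocks quoted here (the nominal 33.33 MHz clock gives the slightly higher 133 and 266 MB/s figures) and ignoring all protocol overhead:

```python
def pci_mb_per_s(width_bits, clock_mhz):
    """Peak PCI transfer rate in MB/s: one bus-width transfer per clock."""
    return width_bits / 8 * clock_mhz


def ethernet_mb_per_s(mbit_per_s):
    """Raw Ethernet line rate in MB/s, before any protocol overhead."""
    return mbit_per_s / 8


for width, clock in ((32, 33), (64, 33), (64, 66)):
    print(f"PCI {width}-bit/{clock} MHz: {pci_mb_per_s(width, clock):.0f} MB/s")

for name, rate in (("Fast Ethernet", 100), ("Gigabit Ethernet", 1000)):
    print(f"{name}: {ethernet_mb_per_s(rate):.1f} MB/s")
```

Gigabit's 125 MB/s line rate is a ceiling rather than what an application will see, so something nearer 100 MB/s is a plausible usable figure, which still nearly fills a 32-bit/33 MHz slot but leaves plenty of headroom on 64-bit/66 MHz.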
  • Shut up. If you are going to speak bollocks, only speak it in your pants. There is nothing in the open source principle that says communism. Just as there is nothing in the US principle of free speech that implies you're borderline communists either.
    AND people have been making money out of Linux, certainly more money than Amazon makes.
  • Just thought I would mention that a cluster of Linux computers is by definition a Beowulf cluster. It was the term used to describe them, which is why M$ came out with their version of doing the same thing with NT... WolfPack. Don't you just hate those guys?

    Btw, if anyone wants a copy of Extreme Linux (the Beowulf clustering CD) that was made a few years back, drop me a line at
    I'll give you a place to download it, or if you send some $$$ I'll send you a CD copy of it.
  • Unless Mosix uses another type of data transfer than ethernet, I am pretty sure it is a Beowulf cluster.
    Parallel Processing over Ethernet is a Beowulf Cluster.
  • well, if you want to be semantic, a beowulf cluster of beowulf clusters is kind of redundant.

    Instead of referring to the post itself as redundant, maybe the moderator was referring to the content of the post. Oh heck, I don't know.
  • Check the stats for Team IBM []

    Dunno if this cluster is being used in the effort, though.
  • I wonder if 64 RS/6000s running LinuxPPC would have a better or worse price/GFLOP ratio?

  • The Ohio Supercomputer Center has a similar cluster here []

    Though I don't see anything about its speed in terms of gigaflops

    And here's one press release here. []
  • But how fast will it crack RC5? Does the processor underlying all this have that nice shift-n in one clock cycle?
  • Maybe if you'd bother to read the article...

    The Los Lobos supercluster is part of the National Science Foundation's National Computational Sciences Alliance program, which gives scientists remote access to the fast machines needed for scientific research.

  • Interesting that the Wired article claims that the machine has 64 nodes but that UNM press release says 256. I wonder what else Wired got wrong...

  • The latest copy of the paper I have seen is a pre-print, so even if I did know the specific details of the experiments the authors used, I would be uncomfortable going into details.

    I do know that they compared the NCSA NT Supercluster with the Linux cluster at Sandia National Laboratory. This is the second "large" cluster within the Alliance and sort of provides a counterbalance to our NT cluster.

    You are quite right in that there should be no big difference between operating systems for applications that are largely computation intensive. The big differences would come from applications that are heavily file intensive or communications intensive. Both clusters use a high performance interconnect called Myrinet, but the NT Supercluster uses a messaging layer developed here at the University of Illinois called Fast Messages (FM), which provides very low latency and high bandwidth. The last I heard, the Sandia Linux cluster used TCP/IP over Myrinet, and I do not believe this offers as low a latency as FM.
  • At no point in my message do I believe I implied that you should favor NT over Linux. I simply pointed out that it is possible to deploy clustering technology on NT as well as with Linux. I'm well aware of what NASA is doing with clusters; my funding comes from them.

    One reason you might want to use NT as a target for deploying clustering technology is that it could be argued to be more widely available than Linux (right now). If I extend my definition of a "cluster" to include machines on peoples' desktops, then if my solution can utilize the operating system present on many of these machines, perhaps I can do something interesting. The fact is, people DO pay money for NT, so building technology on top of it may make some bit of sense.

    Also, there is a Linux version of our clustering software available. It just happens that the NT version was developed first.
  • To be honest, keeping the machines up for long periods of time has never really been a problem. My NT desktop system, on which I do a lot of development and other "weird" work, can easily have uptimes of 2-3 months, which is about all I really demand out of a desktop OS. (This is in distinct contrast to my wife's home machine which runs Windows 98 and crashes multiple times per day.)

    But, your point about remote administration is quite valid. Actually, I'd extend this to being "any remote access" to NT as being the crux of the problem. Launching jobs on a group of remote NT machines and getting the stdout back to the submitting machine has definitely been a pain with this project. With a Unix operating system, you could simply rsh each of the components of your parallel job. With NT, we ended up using a piece of commercial software that provides this functionality, although there are issues with it that make it not as nice of a solution as you would have out-of-the-box with Unix.

    Another related idea about NT being oriented to "the user sits at the console" as opposed to using the machine from remote comes when a job abnormally terminates. The NT debugger, Dr. Watson, tends to pop up and wait for the user to press "OK" to continue. If you're not sitting at the physical console, this is a problem. We've had to use some workarounds to deal with this situation.

    So, all-in-all, while there have been some studies that suggest NT performs slightly better for some types of scientific applications and some interesting results have been obtained by using NT, there have been many many days when I have really wished we had used Linux instead. Fortunately there are BOTH NT and Linux clusters, so scientists can choose which one they think will work best for their particular application.
  • Even though I know I shouldn't respond to a troll, I'll do so anyway because so many others get it wrong.

    The key to the problem is that you are comparing clustering to SMP. If I can put a 256 proc SGI O2k box out, it's an order of magnitude faster than a clustered Intel box for most applications. To cover the rest of those applications, introduce clustering: both SGI and Sun have clustering software.

    So, you take advantage of the extremely fast SMP capability in your Sun or SGI and use the flexibility of clustering... let's say you combine 5x 256 CPU SGI Origin boxes, and you've got a kickin' option because they only have to go over the EXTREMELY slow network bottleneck when they have to go between 5 machines, instead of going over the EXTREMELY slow network to go between the 1280 single proc Intel machines (or 160 8 way SMP boxes).

    I'm not knocking Beowulf (side note: Beowulf is NOT an IBM product), it's great for specific applications that need lots of power for low cost. But the ability to have 256 procs access the same memory at extreme speeds is something that can't be ignored.

    Hmmm... wanna rethink that light years statement?

    Spelling and grammar checker off because I don't care
  • hey... back off about the amiga there buddie.

  • Well, this is not a very tech article. It doesn't say how fast is the high speed network.

    Anyway, I'm managing a cluster of 60 Alpha XP1000s connected with a 100Mbit network, and I'm using a Cisco Catalyst. If you check their page, you'll see that the Catalyst 8500 can have 128 switch ports for 10/100 Mbit, but only 64 for Gigabit connections.

    Indeed you should see how much your parallel jobs are communicating with each other and how much traffic they have to support. Sure you don't want to spend a fortune in gigabit connections if you're sending out very few packets.
    I'm pretty confident that a 100Mb will do for most of you out there!

    Check this page, for some more on the LosLobos! paaffair/news/news%20releases/Mar21hpcc.html []
  • Indeed they are 256 dual processor 733 MHZ Intel IA-32.

    Check the UNM press release at paaffair/news/news%20releases/Mar21hpcc.html []
  • Ahaha... Well then, they are no doubt fast, and faster than the 70 machine DEC Alpha 533 for sure... But still... Fastest cluster in the world? Faster than 1000 p2 350's? Well, let's do the statistical math on the clustering, and I believe it would say no... (70 x 2) x 733 = 102,620 vs. 1000 x 350 = 350,000. Even though those boxes are dual processor, the speed gains aren't that impressive when you realize that both processors share the same bus (unlike the coming dual Athlon boards, where each processor will have its own bus to the North Bridge), whereas in this cluster, though there are many more separate processors, they all have their very own bus. Intel's processors, like the company itself, don't share nicely and don't play well with others, so I can't be impressed by the fact that they have 140 processors, limited to half their transaction rates by having to share a bus for concurrent processes, though I'd take one of the boxes, or the whole cluster if IBM wanted to give it to me.
  • "This low-cost supercomputer will allow researchers and developers access to computational power they previously could not afford," according to this yahoo story [] .

  • Apart from (maybe) calculating in different currencies here, there's the issue that for a node in a Beowulf cluster, it is not necessary to have RAID (== costly), or even a hard drive. Nor is it very likely that they used dual processors. Because of the crappy multiprocessing with x86 it would hardly be of benefit over using only single-processor boxes, while being much more cost-effective.

    If they're not using dual procs, then I find it hard to believe that it's more powerful than this. []

    Quote from site: The FSL cluster (called "Jet") currently consists of 276 nodes, organized into three long banks. The nodes are unmodified, off-the-shelf Compaq Alpha systems with 667 MHz processors and 512 MB of memory.

    ...but then again I ain't no hexpert. Anybody care to comment?

  • The original number is correct. 375 floating point operations per second is laughable. A pocket calculator can probably do better.
  • I doubt you could actually get a NT cluster to all be up at once...
  • I'm only partly sarcastic when I ask just how one keeps 128 NT computers up all at once? Based on NT's standard downtime ratio I'd figure you'd spend all your time running from machine to machine rebooting them. I mean you can't properly administer an NT box remotely ... someone needs to hit CTRL-ALT-DEL after all. ;-)
  • Lists the top 500 super computers in the world. ASCI Red, ASCI Blue-Pacific SST, ASCI Blue Mountain are the top three. ASCI=Accelerated Strategic Computing Initiative.
  • 24th doesn't sound bad to me, it's a start anyway.

    From the article they say that clusters max out at 64 machines, limiting their size - but also it's claimed that the cluster acts like a single machine, so my question is, why can't you cluster the clusters to use 4096 machines. Is it simply a case of (lack of) bandwidth linking the machines together?

  • Not to mention High Availability Linux, and the Failover protocols being developed as we speak...
  • That didn't answer my question. A scientist doing research... What kind of research?
  • It's great that IBM can manage to put together a cluster this large in capacity, however, is there any usefulness in this project? Are they doing any heavy number crunching with it or is it just a million+ dollar showpiece IBM can brag about? I'm curious as to know what they're going to use this for... something scientific? OR just a big ass rc5des machine? :-)

    Is there a list of these computers to see what's ranked 24?

  • I have a friend at SGI that said SGI's bid for this was almost at cost and it was like $1 million. IBM is losing a lot of money on this deal but they get the publicity and the fact that all the geeks at UNM will see big blue's logo all over it.

    Also, the cost of the boxes is almost unimportant with something like this, you have to take into account the actual construction of the network (usually the most time-consuming part of any super computer) and the main cost is the support contract. You have to have people available to fix this guy on a moment's notice whenever it breaks.
  • I guess that would depend largely on the software run; however, I believe that SMP will be useful on the nodes. Simple SMP systems are dirt cheap nowadays, and they actually cost less than two UP boxes (1 power supply, 1 MB, 1 bus, 1 network interface, etc ...)
  • How about something like Myrinet []? Iffen you gots the $$$ ;)

    Your Working Boy,
  • Only 24th in Supercomputer rankings, but what do those 64 Netfinities cost? I'm pretty sure it would come in at under the cost of most of the top 500; I'm estimating 10,000 UKP per box, which is under a quarter of a million. Not bad. However, it must be a bitch to manage 64 separate nodes to make a single 'unit'.

    Also, it mentions the limitations of networking; can't you link together 3com (now defunct, I know) switches in a stack to make larger switches? If not, I'm pretty sure that we'll have larger switches in the next few years.

  • Couldn't see a spec in the article, but we have a couple here (dual Xeon 500, 1GB RAM, RAID) and their list price is around 16k.

    and just to be picky 64x10k=640k or over half a million!

  • Because the joke is so damn obvious. Christ sakes my mother could have thought of that one.
  • What possessed someone to mark the above post as 'flamebait'?

    I literally read it 3 times through to find the 'flamebait' there: nada. The only moderation down, which could have had a sliver of merit would have been 'overrated' but this is ridiculous. Hopefully this gets caught in meta-moderation...

  • You're making a category mistake. MPI and PVM are message-passing libraries. LINDA is a programming language that uses a tuple space stored in distributed shared memory (see here for more info). HACMP is a completely different beast, see IBM's homepage.

    Beowulf != any of these. Beowulf is the idea that one can take commodity, off the shelf (COTS) components and build a powerful machine at a price far less than a comparable commercial offering.

    Codes run on Beowulf, and really any parallel machine, typically use MPI, PVM, or custom message passing libraries. The Beowulf idea includes the use of MPI & PVM, among other freely available software packages. Codes that run on shared memory machines typically use the shared memory device of MPI, shared memory, or pthreads.

    For CPU intensive tasks the Beowulf idea is great. Codes that perform lots of disk I/O suffer, as adding higher performance (i.e. SCSI) disks increases system cost greatly. Communication intensive tasks perform the worst on beowulf style clusters compared to commercial computers, as the interconnect on beowulf-style clusters can't compare. For a relatively large increase in cost, one can use Myrinet []. With Myrinet bandwidth and latency begin to approach that of the switch found on the IBM SP series of machines.

    With high bandwidth, low latency interconnect technologies that scale well (e.g. Myrinet), one can build a cluster that outperforms a comparable commercial offering at, say, one quarter to one eighth the price. The difference at that point is software. There's really not a lot out there to configure and administer Beowulf-style clusters, and commercial implementations of some packages beat the pants off of their freely available counterparts (compilers, for example). Until the software situation changes there is still reason to buy your big iron from IBM, SGI, and Sun.


  • Actually, I'm not making a category mistake (I even noted in my original post that the things I was using as examples were not necessarily interchangeable), but you've made my point for me.

    As you note, the real power of distributed/parallel computing comes from the message passing libraries, most commonly MPI or PVM. Beowulf per se is almost nothing more than a label for the generic concept of distributed computing on Linux. The same thing can be done with any other reasonably modern networked computer you have lying around, even those running Windows - you can even mix OSes in a cluster, although this introduces new and interesting problems. (There are a serious lot of underutilized cycles sitting out there on the corporate world's desktops if they're not running OpenGL screensavers...)

    BTW: If the phenomenal success of Sun's E10000 Starfire has taught us anything at all, it's that where I/O is important, a big honkin' SMP box kicks cluster butt! Seriously, the interconnect technology between boxes just *can't* be fast enough to compete effectively with a huge multi-level crossbar packet switch like the ones in the E10K. Sun and the other SMP vendors can win here because they own the domain in which the simpler problem resides...

    Don't assume by this that I'm against Beowulf clusters at all - they are a great and amazing thing, but there's more than one way to skin a cat, and Beowulf isn't the only path to Linux distributed computing.
  • The article doesn't say, but despite what you'd think by reading the rantings of the ill-informed 3l337 d00dZ on slashdot, Beowulf isn't even a very good clustering technology for most problems.

    There are far more serious, industrial-strength solutions out there, things like MPI, PVM, LINDA, and IBM's own HACMP. (Note these cover a lot of ground and are not necessarily even comparable to one another.)

    Beowulf (or any of the others listed above) is not automatically the correct distributed computing methodology. Selecting the proper solution for the job at hand is far more complex than you might imagine. There is a lot more developer activity on some of these than there is on Beowulf - MPI in particular is maturing rapidly and is used for solving big/tough problems in many of the largest companies in the world. (No particular MPI advocacy or bias, it just seems like I run into it more often than the others...)
  • The only thing I could think of is that the moderator was an SMP designer and took offense to the "SMP is going to run out of steam" comment. Otherwise, I can't see any there either...
  • by Tower ( 37395 )
    If you need it, get Ultra3 SCSI with solid-state drives. Sure it'll cost you an entire lifetime's salary, but hey, they're great.

    Really though, using a solid state drive as a cache for a disk subsystem is an easy way to enhance performance, and is already being sold. You perform a write - instant gratification, and with proper caching algorithms, you can get the same thing for reads. A multi-gig SS Drive can easily max out a bus. Multi-level caching is a necessity as speeds increase in systems.

    In this sort of system, the interconnect fabric (as fast as it is) can still be a little bit of a bottleneck, too... A good cached RAID disk system on the one end can really keep things smoking, though.
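
The "instant gratification" on writes described above is usually called write-back caching. A minimal sketch, assuming a plain dict stands in for the slow disk and a least-recently-used eviction policy (the class name and the policy are illustrative choices, not anything the comment specifies):

```python
from collections import OrderedDict


class WriteBackCache:
    """Small fast store in front of a slow one; evicts least-recently-used blocks."""

    def __init__(self, backing: dict, capacity: int):
        self.backing = backing        # stands in for the slow disk
        self.capacity = capacity      # number of blocks the fast store can hold
        self.blocks = OrderedDict()   # block id -> (data, dirty flag)

    def write(self, block_id, data):
        # The write completes as soon as the fast store holds the block.
        self.blocks[block_id] = (data, True)
        self.blocks.move_to_end(block_id)
        self._evict()

    def read(self, block_id):
        if block_id in self.blocks:              # cache hit
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id][0]
        data = self.backing[block_id]            # cache miss: go to the slow disk
        self.blocks[block_id] = (data, False)
        self._evict()
        return data

    def _evict(self):
        # Drop the oldest blocks, writing back only the ones that were modified.
        while len(self.blocks) > self.capacity:
            block_id, (data, dirty) = self.blocks.popitem(last=False)
            if dirty:
                self.backing[block_id] = data

    def flush(self):
        # Push every modified block to the slow disk and mark it clean.
        for block_id, (data, dirty) in list(self.blocks.items()):
            if dirty:
                self.backing[block_id] = data
                self.blocks[block_id] = (data, False)
```

The trade-off is the usual one: writes that have not been flushed yet are lost if the fast store fails, which is why real products pair this with battery backup or non-volatile storage.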
  • Other reports have it listed as 256... /~paaffair/news/news%20releases/Mar21hpcc.html []

  • Just point him at

    Of course, it's a little more technical than just "a bunch of computers hooked up via high speed links (i.e. fast/gigabit ethernet) to provide a parallel solution for complex calculations", but there is a lot of info...
  • Darn barrel shifters are so expensive in hardware (and so rarely used, aside from some specialized apps). The processors are Xeons, nothing special there (aside from full speed L2). So the RC5 speed would be ~the same as any group of machines with this proc/speed (since it isn't a network intensive computing project). Grab and crunch... crunching takes far more time (several orders of magnitude if done right) than grabbing, so it's not really a situation where beowulf clustering even helps... a second or two to d/l a new keychunk, and, depending on the size of that chunk, anywhere from a few minutes to a few weeks to finish it (even with mighty Xeons)... nice as a distributed app, no gain from a cluster.
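
To put rough numbers on the "grab and crunch" point above, here is a small sketch. The 2-second download and the two crunch times are hypothetical figures picked from the ranges the comment mentions, not measurements, and `network_fraction` is an illustrative name:

```python
def network_fraction(download_s: float, crunch_s: float) -> float:
    """Fraction of wall-clock time spent fetching a work unit rather than computing on it."""
    return download_s / (download_s + crunch_s)


# Hypothetical figures: a 2 s download, then 5 minutes or 2 weeks of crunching.
five_minutes = 5 * 60
two_weeks = 14 * 24 * 3600

print(f"{network_fraction(2, five_minutes):.4%}")
print(f"{network_fraction(2, two_weeks):.6%}")
```

Even in the short-chunk case the network accounts for well under 1% of the time, so a faster interconnect has almost nothing to speed up.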
  • True, but they are using 2-way boxen... s/Mar21hpcc.html

    "The National Computational Science Alliance (Alliance) will take delivery of a 512-processor Linux supercluster within the next month - a move that will give this nationwide partnership the largest open production Linux supercluster aimed at the research community. The new supercluster, called LosLobos, will be located at the University of New Mexico's (UNM) Albuquerque High Performance Computing Center (AHPCC), one of the Alliance Partners for Advanced Computational Resources sites."
  • Well, if you can keep most of your working set in the 2MB L2, you can stay off of the bus more often, so there won't be so much holdoff or contention with the other proc... but a switched fabric is better for SMP than a bus... far more costly, though. Can't wait for a nice dual Athlon board myself... Where's the 70 from? Looks like you are referencing the Alphas there (which use the same bus as the Athlons)... the IBM cluster is 256 boxes, so if they are SMP, you have 512 procs, and like I said, if your software is designed properly, you shouldn't be heading to memory constantly anyway (though this *is* unavoidable sometimes).
  • Huh? Finite differences are still very widely used; most (like 80%+) of CFD is done using finite differences. Actually, in any model where you're examining a finite volume of space, rather than an object, you're most likely using finite differences (i.e. fluid problems, including AFAIK weather simulation).

    Finite elements are mainly used for objects/mechanisms, particularly structures analysis (incl. car simulation) and manufacturing (metal stamping).

    engineers never lie; we just approximate the truth.
  • There are 4096-machine clusters, but the marginal performance gain per machine drops for many kinds of computation because of the 'segmented' architecture. That is, the limited bandwidth inter-cluster relative to the bandwidth intra-cluster makes programming the whole cluster a problem of finding a 64-way very loosely-bound approach, each segment of which is 64-way loosely-bound (each segment of which is N-way tightly bound because of the possibility of SMP). Finding algorithms to split up problems this way is very difficult, and in some cases, impossible. For a given problem, there is a network width/speed for which the limiting factor is the processor speed (i.e. you're not losing performance to overhead); this is the case for many more problems at switched N bps than at shared N bps (in/out-side a cluster).

    On the other hand, there are algorithmic techniques for masking (network) latency (e.g. time-skewing), so it's possible to make better use of 'loosely-coupled' (relative to the algorithm's interconnect requirements) compute elements (machines/clusters/etc) than you may think.
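
A toy model of the bandwidth argument above, offered as a sketch only. It assumes each step does a fixed amount of compute and then exchanges a fixed number of bytes, and it ignores latency and contention entirely. The `overlap` flag stands in for latency masking in general (communication hidden behind computation); it is not an implementation of time-skewing. All names and figures are hypothetical:

```python
def step_time(compute_s: float, bytes_exchanged: float,
              bandwidth_bytes_per_s: float, overlap: bool = False) -> float:
    """Wall-clock time for one compute-then-exchange step."""
    comm_s = bytes_exchanged / bandwidth_bytes_per_s
    # With overlap, communication hides behind computation (or the reverse).
    return max(compute_s, comm_s) if overlap else compute_s + comm_s


def efficiency(compute_s: float, bytes_exchanged: float,
               bandwidth_bytes_per_s: float, overlap: bool = False) -> float:
    """Share of each step spent on useful computation."""
    return compute_s / step_time(compute_s, bytes_exchanged,
                                 bandwidth_bytes_per_s, overlap)


# Hypothetical figures: 1 s of compute and 50 MB exchanged per step,
# 100 MB/s inside a cluster versus 10 MB/s between clusters.
for label, bandwidth in (("intra-cluster", 100e6), ("inter-cluster", 10e6)):
    plain = efficiency(1.0, 50e6, bandwidth)
    masked = efficiency(1.0, 50e6, bandwidth, overlap=True)
    print(f"{label}: {plain:.2f} without overlap, {masked:.2f} with overlap")
```

In this model the processor is the limiting factor exactly when the exchange takes no longer than the compute, which is the "network width/speed" threshold the comment describes.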

  • I really think that more attention needs to be paid to clustering technologies. I first got started with my cluster [] about 18 months ago. Beowulf was fairly new, and almost considered to be a black art. After reading and tinkering for a few weeks, I was amazed at how easy it actually is to get things running in parallel. Now, don't get me wrong, like many things it's easy to do but hard to do *well*...nevertheless, I'd sure like to see more activity in this area. (IMHO, SMP is going to run out of steam quickly)

    All you slashdotters with three or more systems in your basement! Go! Get them networked! Load MPICH or PVM! This should be your mantra:

    "If SuperID can do it, then I certainly can"
  • So to answer your question, no. MS clustering is really more for failover than load balancing. The load balancing works but not nearly as well as you'd like. The basic problem is that Win2k/NT is not designed to cluster at its core. Linux can be made to do that and thus has a distinct advantage.
  • Interesting.

    Were the Linux cluster users using gcc/g77? It is well known that (at least for most scientific codes) you can get 50-100% speedup by switching from the GNU compilers to commercial ones from Portland or elsewhere.

    If there is still a difference, then the next thing to try is the latest dev kernels, which have better SMP (if SMP nodes are used), and significantly faster disk io through the elimination of double caching.

    Since most scientific apps should spend most of their time eating user CPU cycles, I wouldn't expect there to be very much difference between one OS and another; however, node uptime and more established remote admin are points in Linux's favour for big clusters.
  • It doesn't seem like a particularly technical article, so someone probably just glossed over that particular detail. Your average Joe won't remember it anyway and someone may have been trying to avoid confusing the reporter.
  • Why does the "Doesn't Scale Well" MS FUD persist in the face of stories like this? Is there an NT cluster anywhere in the top 500 much less the top 100?
  • I dunno... My guess, if any computer were to download all of the internet, and be able to categorize it, it would be cracked of all its porn.

    Just one man's guess...

  • The inter-machine communication requirements for genetic programming are low. Basically each machine can operate independently, and at the end of each generation, transmit only the fitness of each individual evaluated back to the server - very little data has to be sent over the network. See l

    In fact, they're using a very standard 100Mbps ethernet, while my impression is that the IBM supercomputer will be much more tightly coupled. The GP cluster is rated at 0.37 THz, about the same as the IBM machine.

    On a side note, genetic programming will be an ideal for; it's disappointing that GP isn't being attempted there.
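
The low-bandwidth pattern described above (evaluate locally, send back only fitness values) can be sketched in a few lines. This is a toy under stated assumptions: individuals are bit strings rather than evolved programs, the "workers" are just list slices in one process, and the function names are hypothetical:

```python
import struct


def evaluate_slice(individuals, fitness_fn):
    """Run on one worker: score every individual locally and return only the scores."""
    return [fitness_fn(ind) for ind in individuals]


def pack_fitness(scores):
    """Encode the scores as 32-bit floats, the only payload sent back per generation."""
    return struct.pack(f"<{len(scores)}f", *scores)


# Hypothetical toy problem: fitness is the number of 1 bits in the string.
population = ["1011", "0000", "1111", "0101", "1110", "0011"]
workers = 3
slices = [population[i::workers] for i in range(workers)]
payloads = [
    pack_fitness(evaluate_slice(s, lambda ind: float(ind.count("1"))))
    for s in slices
]
print([len(p) for p in payloads])  # bytes sent by each worker: [8, 8, 8]
```

Four bytes per individual per generation is tiny next to the CPU time a real fitness evaluation takes, which is why ordinary 100Mbps ethernet is enough for this kind of workload.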
  • Now, these Netfinity machines at best are 550 MHz Xeons (according to the best model I saw in IBM's Homepage) so I seriously doubt they outrun the Beowulf cluster of 1000 Pentium 2 350MHz machines (and the controlling host) being used for Genetic Programming [], were they in fact clustered together to achieve the type of speed benchmarks IBM was after, rather than being used for a useful purpose as these are. Do they even outrun the 70 machine cluster of 533MHz DEC Alphas that had previously been used for Genetic Programming? I doubt it. I tried to submit the Genetic Programming thing as an article once a while back, but it was rejected for some reason or another, even though someone posted about it in a forum like this once long ago, and people seem to continuously forget about this amazing cluster and what possibilities it presents to the computing world. Imagine if you told it to try to create a better version of itself? Once we have the storage capacity (the Petabyte, theorized to be necessary to store the totality of a human consciousness) what would happen if you gave it a pipe to the internet and told it to absorb data, correlate it with data it already has, "remember" or "forget" the data as is considered relevant based on things it already "knows"? Anyway, that's beside the point... Which is this IBM cluster isn't amazingly new or ground breaking at all, and I have to doubt IBM's claim as fastest.
  • Come on, IBM! We want to know how fast this thing is in BogoMIPS [].
  • Excuse me? "Darn barrel shifters are so expensive in hardware"? Have you ever even seen a hardware VLSI design tool? Have you even heard of Verilog? In terms of hardware cost, a barrel shifter takes much less space than a fast carry-save or carry-branch adder. Flinging bits around is something that hardware is very good at doing cheaply. Arbitrary permutations basically boil down to renaming the inputs by shifting the output positions. While this is not exactly easy to implement for the general case, shifting or rotating bits, especially if the size of the object being rotated == the natural word size of the processor, is absolutely trivial. Even the naive implementation (selection tree ~5 clocks, forward, permute, issue) takes only 8 clocks, is easily pipelined, and takes marginal space.
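
For readers who have not met one, a logarithmic barrel rotator can be modelled in software as a fixed number of stages, each either rotating by a power of two or passing the word through. The sketch below assumes a 32-bit word (RC5's data-dependent rotations are on words of this kind) and is one common textbook structure, not necessarily the exact design either commenter has in mind:

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1


def barrel_rotl(x: int, amount: int) -> int:
    """Rotate a 32-bit word left using log2(32) = 5 fixed stages.

    Stage i rotates by 2**i when bit i of `amount` is set and passes the
    word through otherwise, mirroring a row of 2-to-1 muxes in hardware.
    """
    x &= MASK
    amount %= WORD_BITS
    for stage in range(5):
        step = 1 << stage
        if amount & step:
            x = ((x << step) | (x >> (WORD_BITS - step))) & MASK
    return x


print(hex(barrel_rotl(0x12345678, 8)))  # 0x34567812
```

The hardware cost grows as word size times the number of stages, which is the basis for the argument that rotation on a native word is cheap.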
  • by Anonymous Coward on Wednesday March 22, 2000 @05:11AM (#1184135)
    While I'm all in favor of breakfast cereal based supercomputers, particularly the linux driven variety, like everything else, there are many unforeseen problems in implementing such a device. I've built seven such clusters myself (including a personal 60 box cluster of old 486 and P1s. It's mighty nifty), and each time I run into the same fscking problem.


    That's right, squirrels.

    It has nothing to do with software conflicts, or processors overstepping each other. That can all be taken care of with a little bit of clever coding and hacks/workarounds. But that doesn't take care of the squirrel problem. Every time I finally work around all the network tomfoolery, and get the power for umpteen boxes manageable, like clockwork the squirrels come.

    And they come not in single spies, they come in battalions.

    It's never the same. Sometimes they just chew the wires. Sometimes they try and make off with a box or two. What the hell does a rodent need with a computer?!?!? Wait, I don't want to know the answer to that. My repeated attempts at hunting down and exterminating the wascally bastards are met with comic hijinks and failure. And as far as I'm aware, there aren't any open sourced squirrel repellent systems. I can't trust a proprietary system to not conflict with the many many tweaks I've made to the system. But alas, I'm stuck with my ACME catalogue and a variety of clever devices which only fail and fail again, each one making me look successively worse.

    So let's hope IBM can manage a good rodent-security system, and release it back into the community. God knows I've tried. I'm sure they will realize the importance of this issue after the first few attacks. This is a much overlooked problem, but we need a solution. And as soon as possible.

  • by Anonymous Coward on Wednesday March 22, 2000 @07:56AM (#1184136)
    A better account of what's going on -- complete with a description of how Beowulf is used -- can be found at 6/1/
  • by BackSpace ( 41879 ) on Wednesday March 22, 2000 @05:46AM (#1184137)
    it is on []
  • by Greg Koenig ( 92609 ) on Wednesday March 22, 2000 @05:50AM (#1184138)
    You jest, but actually there is an NT supercluster within the National Computational Science Alliance. See here [] for more information. I was part of the original group which developed this clustering technology while it was a research project in the computer science department. Now that it has been deployed as a real computation resource, one of my projects is to make it available to the national computational grid which the Linux article discusses.

    Deploying an NT cluster was certainly a challenge in some ways that would have been easier with Unix, but not impossible. Some of our collaborators have published results favorably comparing the performance of the NT supercluster to that of Linux clusters, so there seem to be good reasons to continue building at least some technology like this on NT.
  • by 348 ( 124012 ) on Wednesday March 22, 2000 @04:58AM (#1184139) Homepage
    A Beowulf cluster of these. . .

    Sorry couldn't resist.

  • by Apps ( 21158 ) <appelbe@yah[ ]com ['oo.' in gap]> on Wednesday March 22, 2000 @05:04AM (#1184140)
    There is no mention of Beowulf in the article,
    just "special clustering software"

  • by 348 ( 124012 ) on Wednesday March 22, 2000 @05:41AM (#1184141) Homepage
    Well to really drive this redundant topic into the ground. How about, maybe the moderator thought the beowulf cluster of beowulf clusters was kind of repeatedly redundant and then the repeated repeating of repeating the original repeating of the beowulf of beowulfs was being repeated and this was eyed, seen and viewed as redundant, then the moderating moderator moderated the repeated post outlining and stating in the verbiage of the written post that in his/her point of view and from his/her mindset he/she thought that the repeated repeating of repeating the original repeating of the beowulf of beowulfs was being repeated and this was eyed, seen and viewed as redundant. So the moderating moderator moderated the repeated post outlining and stating in the verbiage of the written post that in his/her point of view and from his/her mindset he/she thought that the repeated repeating of repeating the original repeating of the beowulf of beowulfs was being repeated and this was eyed, seen and viewed as redundant. This being said, the moderator more than likely tagged the repeatedly repeating repeat posts stating the redundancy of the redundancy was redundant.
  • by luckykaa ( 134517 ) on Wednesday March 22, 2000 @05:06AM (#1184142)
    it will only rank 24th on the list of the top 500

    I like the comment that it's "only" 24th. As though being only the 24th richest person alive, or only having the world's 24th fastest car would also be something to be sneezed at.

    Anyway, is there a list of world supercomputer rankings?
