Cluster Interconnect Review

Posted by ScuttleMonkey
from the sparring-with-scandinavian-warriors dept.
deadline writes to tell us that Cluster Monkeys has an interesting review of cluster interconnects. From the article: "An often asked question from both 'clusters newbies' and experienced cluster users is, 'what kind of interconnects are available?' The question is important for two reasons. First, the price of interconnects can range from as little as $32 per node to as much as $3,500 per node, yet the choice of an interconnect can have a huge impact on the performance of the codes and the scalability of the codes. And second, many users are not aware of all the possibilities. People new to clusters may not know of the interconnection options and, sometimes, experienced people choose an interconnect and become fixated on it, ignoring all of the alternatives. The interconnect is an important choice and ultimately the choice depends upon on your code, requirements, and budget."

  • News? (Score:2, Funny)

    by Ramble (940291)
    Interesting article, but I'm not sure how many Slashdotters can fit a cluster powerful enough to saturate a GigE interconnect in their mother's basement.
    • Re:News? (Score:2, Funny)

      by Anonymous Coward
      My mom has a bigger basement than yours! :D
    • Most people don't get killed by terrorists but the news media sure likes to focus on it anyway.

      News media usually focuses on the exception, not the norm. Besides, I find clusters interesting.

    • At least one Slashdotter works on several clusters daily - his job involves benchmarking applications and tuning the hardware such that customers running specific applications can have an officially recommended hardware configuration that will help them to optimize their expensive hardware. He might be interested in reading about new cluster interconnects... /glad Slashdot is almost geared to my personal interests for once ;)
  • /.ed (Score:5, Funny)

    by xming (133344) on Saturday April 22, 2006 @04:43PM (#15181876) Homepage
    I bet they use the $32 interconnect for their server
  • Mirror (Score:5, Informative)

    by tempest69 (572798) on Saturday April 22, 2006 @04:49PM (#15181894) Journal
    http://www.mirrordot.net/stories/57bdef42b0ad596ff35350041a22b442/index.html [mirrordot.net]

    because some people practice what they preach

    Storm

    • except that only the first page is mirrored, and the next pointers go to the dead site.

      if you care, use ib. the linux support is still a little funky, but in terms of application performance for the dollar, it's hard to beat. tcp is gorgeous for sharing buffer space in the wide area, but it's a lot of work for a tightly coupled machine.
      • by Erbo (384)
        Yeah. The fact that only the first page of the article was mirrored means I didn't get to see their notes on Myrinet or InfiniBand...which are the high-speed interconnects that the company I work for, Aspen Systems [aspsys.com], generally uses when building clusters. (This is in addition to a standard GigE network used as the "control" network, freeing up the high-speed network for application data. The part I'm responsible for, the management software, uses the control network exclusively.) I'm interested to see wha
  • by mg2 (823681)
    Where I work, I deal with 30-40GBps average read/write total throughput on our distributed filesystem using GigE and Cisco 6509s.

    I have trouble imagining an application that could eat up more than that. It's bananas.
    • Have you looked at the needs of Willie Wonka lately? He had to develop his own interconnect made out of midgets and chocolate. Now THAT is bananas.
    • Yes it might include bananas, or possibly some other fruit or veggie.
    • Re:GigE FTW (Score:2, Interesting)

      by multimediavt (965608)
      You're talking bandwidth in a read/write to a filesystem. You are not taking into consideration applications that are latency bound, or are both latency and bandwidth bound when passing information from node to node, let alone writing to a filesystem. We run a number of scientific codes on our IB-based cluster. Some of these codes are slinging around up to 20GB of data passing messages between nodes, and this is memory copies not filesystem read/writes. It has to be fast (lower the latency the better for
    • well you obviously aren't pushing 30 gigabytes (your capital B did mean bytes right?) per second down gig-e links unless you are running several hundred in parallel.
      • Just to make clear what is required...
        We can halve it to 15 GB/s for most conversation (because he implied concurrent read/write, and most people just discuss unidirectional bandwidth)
        That brings us down to about 150-170 Gb/s required to measure a cross-sectional bandwidth of 30 GB/s.

        So, say, a 256 node cluster running something like GPFS or Lustre even on gigabit ethernet might play in the realm of 30 GB/s concurrent read-write throughput. This is assuming nodes contributing their storage to a pool and not
      • Distributed read/write. Pieces of files exist in multiple places and are needed by multiple processes in multiple locations. Overall, I see an average of 30 gigabytes throughput across the cluster, not from node to node. The node to node speed could, of course, never exceed that of the transmission medium.
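
For anyone following the arithmetic in this sub-thread, here is a rough sketch of the bytes-to-bits conversion and the resulting link count. The function name and the `efficiency` parameter are made up for illustration, and the 0.75 value is only an assumed usable fraction of line rate, chosen because it happens to land inside the 150-170 Gb/s range quoted above. It is not a measured number.

```python
import math

def links_needed(aggregate_gbytes_per_s, link_gbits_per_s=1.0, efficiency=1.0):
    """Minimum number of links whose combined usable rate covers the aggregate.

    efficiency is a hypothetical fraction of line rate left for payload
    after protocol overhead (an assumption, not a measurement).
    """
    required_gbits = aggregate_gbytes_per_s * 8      # gigabytes/s -> gigabits/s
    usable_per_link = link_gbits_per_s * efficiency  # payload rate per link
    return math.ceil(required_gbits / usable_per_link)

# 30 GB/s counting reads and writes together, ideal GigE links
print(links_needed(30))
# 15 GB/s in one direction only, ideal GigE links
print(links_needed(15))
# 15 GB/s in one direction, assuming only 75% of line rate is usable
print(links_needed(15, efficiency=0.75))
```

Under these assumptions the count comes out in the low hundreds of GigE links, which fits the "several hundred in parallel" remark and the 256 node estimate.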
  • Nitpicking... (Score:3, Informative)

    by isj (453011) on Saturday April 22, 2006 @05:05PM (#15181935) Homepage

    "every time a packet gets sent to the NIC, the kernel must be interrupted to at least look at the packet header to determine if the data is destined for that NIC"

    Uhm.. no. That is only the case if the NIC has been set into promiscuous mode, or if it has joined too many multicast groups.

    "...because the packets are not TCP, they are not routable across Ethernet"

    Uhm. If they were IP they would still be routable. I suspect he meant "not IP".

    I also get irritated by the spelling out of "Sixty-four-bit PCI"

    But the article still has a lot of good reviews and a load of links to other sites with interesting info.

    • by billstewart (78916) on Saturday April 22, 2006 @05:32PM (#15182011) Journal
      I found the article's comments implying "Not TCP => Not Routable" quite annoying also, but I don't think he just meant "Not IP". Obviously if the application uses TCP or UDP it's going to have IP underneath and therefore be routable (unless you're doing some leftover-1980s hackery like implementing TCP over ISO CLNP or whatever.) And you could build an application that took a different flow-control approach than TCP that might be more efficient but still use IP and therefore still be routable (though usually people who want to do that sort of thing keep UDP and roll their own apps at Layer 7.)

      But he's probably talking about some kind of application that's intended for local-area application only and wants to avoid the overhead of TCP, UDP, and IP addressing, header-bit-twiddling, flow control, slow-start, kernel implementations optimized for wide-area general-purpose Internet networks, etc., and rolls its own protocols that assume a simpler problem definition, much different response times, and probably just pastes some simple packet counters over Layer 2 Ethernet, probably with jumbo frames.

      If you've implemented your ugly hackery properly, you still _could_ bridge it over wide areas using standard routers even though it doesn't have an IP layer. That doesn't mean it would work well - TCP's flow control mechanisms were designed (particularly during Van Jacobson's re-engineering) to deal with real-world problems including buffer allocation at destination machines and intermediate routers and congestion behaviour in wide-area networks with lots of contending systems, which a LAN-cluster protocol might not handle because it "knows" problems like that won't happen. Timing mechanisms are especially dodgy - they might have enough buffering to allow for LAN latencies in the small microseconds, but not enough to support Wide Area Network latencies that include speed-of-light (~1ms per 100 miles one-way in fiber) or insertion delay (spooling the packet onto a transmission line takes time, e.g. a 1500 byte packet onto a 1.5 Mbps T1 line takes about 8ms, and jumbo-frames obviously take longer.)
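
The two wide-area delays mentioned above are easy to check with a few lines. This is only a back-of-the-envelope sketch: the function names are invented for illustration, the 9000 byte jumbo frame size is an assumption (the comment does not give one), and the propagation figure simply reuses the ~1 ms per 100 miles rule of thumb from the comment.

```python
def serialization_delay_ms(packet_bytes, line_rate_bps):
    """Time to clock a packet onto the line, in milliseconds."""
    return packet_bytes * 8 * 1000 / line_rate_bps

def propagation_delay_ms(distance_miles, ms_per_100_miles=1.0):
    """One-way delay in fiber, using the ~1 ms per 100 miles rule of thumb."""
    return distance_miles / 100 * ms_per_100_miles

T1_BPS = 1.5e6  # T1 line rate as rounded in the comment above

# Standard 1500 byte frame onto a T1: about 8 ms
print(serialization_delay_ms(1500, T1_BPS))
# Jumbo frame, assumed here to be 9000 bytes: about 48 ms
print(serialization_delay_ms(9000, T1_BPS))
# One-way propagation over a hypothetical 3000 mile path: about 30 ms
print(propagation_delay_ms(3000))
```

Either number is thousands of times larger than the few-microsecond latencies a LAN-only protocol would be tuned for, which is the buffering problem described above.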

  • though I would argue that there was too much time spent discussing GigE, and not enough on the performance and scaling issues seen with the more exotic cards.

    Not a technical issue, but a little note about the Infiniband cards reading, "Unlike the alternatives, just try to get information on pricing one of these without leaving all of your contact information for a salesman to use now and in perpetuity." I've been through this recently, and have considered (given the similar performance), purchasing Myri
    • by chemguru (104422)
      We've built a Sun Cluster with SCI for 10G RAC. It's not just about the bandwidth of the interconnect, but the latency. With 10G RAC, you can use SCI to allow shared memory segments between each node. Damn good stuff. Too bad SCI is rarely used.
      • It's always looked good on paper, but my one experience in the field has made me radically gun-shy. What are they like to deal with these days? I'd like to be able to run some codes such as CPMD and Siesta distributed parallel.

        and lurking there in the background, there's always NWChem....
    • Let me leave my impressions of the article out and just give a little tidbit about why the IB (and other high-end interconnect companies) don't publish their prices. The reason is, they really don't have fixed prices. They will almost always do a custom pricing based on the size of your cluster and whether or not you will be a repeat customer (based on the nature of your business, etc.). Don't be afraid to call these guys, they're really not hounds that will bug the crap out of you like a software sales gu
    • Further benchmarks.. (Score:2, Informative)

      by keithoc (916498)
      ..can be gotten from the results [utk.edu] of the 2005 HPC Challenge [hpcchallenge.org] - real world results, no marketing blurb.
    • Try here [sun.com] if you want an idea .. my complaint was that the entire site was very linux centric .. there's some pretty good ideas going into Solaris [opensolaris.org] and with their big push in amd64 and i386, it can be a more affordable and stable platform to work with .. in fact if you want you can reference the Infiniband source here [opensolaris.org] or here [opensolaris.org] for example ..
  • We found that for disk intensive parallel computing that gigabit ethernet can be almost as fast as very, very expensive networking equipment. Of course our throughput requirements are small compared to many other types of applications. So try it cheap before you invest in the expensive networking equipment. You can always use the cheap stuff for login type work and distributed shells if you opt for the expensive equipment later.
  • What sort of fun projects could a home experimenter with a pile of hardware dive into? It sounds like all these machines are used for a fairly narrow set of scientific applications. Anything a non-academic would find interesting?
    • Not really (Score:3, Informative)

      by Junta (36770)
      One, the price of all this stuff is exorbitant, and most home applications could barely benefit from going from 100 MBit to Gigabit. Realistically, getting 1.5 microsecond latency and the ability to transfer GigaBYTES per second has no home use right now. Really exorbitant High definition streams top out at about 20 MBit/second for 1920x1080 MPEG-2, and of course no game demands that much throughput. Hard drives for home use can only theoretically dump out 300 MB/s or so anyway (SATA II), and realisticall
  • by Junta (36770) on Saturday April 22, 2006 @08:17PM (#15182517)
    Lately, the big contenders are:
    -Ethernet
    -Infiniband
    -Myrinet

    I haven't heard much about SCI or Quadrics lately, and just these three have been tossed around a lot lately. Points on each:
    -Ethernet is cheap, and frequently adequate. Low throughput and high latency, but it's ok. 10GbE ethernet is starting to proliferate to eliminate the throughput shortcomings, and RDMA is starting to possibly help latency for particular applications. Note that though overwhelmingly clusters put together using ethernet use IP stack to communicate over it, it is not exclusively true. There are MPI implementations available that sit right under the ethernet header layer. It bypasses the OS IP stack which can be very slow and reduces overhead per message. Increasing MTU also helps throughput efficiency. But for now only 1 Gigabit ethernet is remotely affordable at any scale (primarily due to current 10GbE switch densities/prices, adapters are no more expensive than Myrinet/Infiniband).

    -Myrinet. With their PCI-E cards they achieve about 2 GBytes/sec bidirectional throughput, very nearly demonstrating full saturation of their 10GBit fabric. They also are among the lowest latency sitting right about 2.5 microsecond node-to-node latency as a PingPong minimum. Currently the highest single-link throughput technology realistically available to a customer (Infiniband SDR doesn't quite achieve it, about 200 or so MByte/s short, but DDR will overtake it as it realistically is available). Very focused on HPC and until recently also the only popular high-speed cluster interconnect that was very mature, easy to set up and maintain, and efficient. Now they are starting to embrace more interoperability with 10GbE, probably in response to the rise of infiniband.

    -Infiniband. Until very recently immature: huge memory consumption for large MPI jobs, a software stack that is highly complex and not easily maintainable, and the prominent vendor of chips (Mellanox) didn't achieve good latency. With Mellanox chips you are lucky to get into the 4 microsecond range or so. With Pathscale's alternative implementation (particularly on HTX), the lowest latency interconnect becomes possible (I have done runs with 1.5 microsecond end-to-end latency even with a switch involved). The maximum throughput is on the order of 1.7-1.8 GByte/s and more importantly is one of the faster technologies in ramping up to that. No technology achieves its peak throughput until about 4 MB message sizes, and Pathscale IB is remarkably a good performer down to 16k-32k message sizes. Additionally, IB has a broader focus and some interesting efforts. They make efforts to not only be a good HPC interconnect, but also to be a good SAN architecture that in many ways significantly outshines fibre channel. The OpenIB efforts are interesting as well. The huge downside is that for whatever reason no Infiniband provider has been able to demonstrate good IP performance over their technology. This particularly is an issue because most all methods of storage sharing from hosts are IP based. SRP is ok for the little amount of flexibility that strategy gives to be Fibre-Channel like, but nfs, smb, and image access like NBD and iSCSI all perform very poorly on Infiniband compared to Myrinet. iSER promises to alleviate that, but for the moment you are restricted to performance on the order of 2.4 gigabit/s for IP transactions. Myrinet has been able to deliver 6-7 Gigabit/s for the same measurements. You could overcome this by sharing storage enclosures and use something like lustre, GFS, or GPFS to communicate more directly with the storage over SRP, but generally speaking some applications demand flexibility not achievable without IP performance.
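
The point about throughput ramping up with message size can be illustrated with the usual first-order model, where transfer time is a fixed latency plus size divided by peak bandwidth. This is only a sketch: the model and the function names are assumptions added for illustration, and the inputs (1.5 microseconds, 1.75 GByte/s) are taken loosely from the Pathscale figures above. Measured curves like the ones described ramp up more slowly than this idealized model, which ignores protocol switches, memory registration and copy costs.

```python
def transfer_time_s(message_bytes, latency_s, peak_bytes_per_s):
    """First-order model: fixed per-message latency plus serialization time."""
    return latency_s + message_bytes / peak_bytes_per_s

def effective_bandwidth(message_bytes, latency_s, peak_bytes_per_s):
    """Bytes per second actually delivered for one message of this size."""
    return message_bytes / transfer_time_s(message_bytes, latency_s, peak_bytes_per_s)

LATENCY_S = 1.5e-6  # end-to-end latency quoted for Pathscale IB
PEAK = 1.75e9       # midpoint of the 1.7-1.8 GByte/s range, in bytes/s

for size in (64, 1024, 16 * 1024, 4 * 1024 * 1024):
    fraction = effective_bandwidth(size, LATENCY_S, PEAK) / PEAK
    print(f"{size:>8} bytes: {fraction:.1%} of peak")
```

Even in this optimistic model, small messages see only a few percent of peak, which is why latency matters more than headline bandwidth for codes that exchange many small messages.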

    And at the end of the day, I come home and run my home network on 100MBit ethernet, sigh. It is enough to run a diskless MythFrontend for HD content at least.
    • Just want to add... (Score:5, Informative)

      by Junta (36770) on Saturday April 22, 2006 @08:37PM (#15182582)
      For those not aware of how ethernet is limited latency wise regardless of what is done, I will explain a tad.

      Ethernet is well architected for large deployments (enterprise-wide) with the packet routing (not IP routing) done on the switches. Meaning a computer sending a packet asks its switch to get it to 0A:0B:0C:01:02:03, having no idea where it will go. The switch only knows its immediate neighbors, and will check/populate its MAC address table to figure out the next entity to hand off to. This means switches have to be really powerful because they are responsible for a lot of heavy lifting for all the relatively dumb nodes. This is not TCP, it is not IP, it is raw reality of ethernet networking. Aside from Spanning tree (which is not maintained for any other reason than keeping a network from getting screwed over by incorrect connections, not for performance), no single entity in the network has a map of how things look beyond its immediate neighbors.

      IB, Myrinet, etc, are source routed. Every node has a full network map of every switch and system in the fabric. The task of computing communication pathways is distributed rather than concentrated (fits well with the whole point of clusters). node1 doesn't blindly say to the switch, 'send this to node636', it says to switch 'send this to port 5, and the next switch, put it out port 2, and the next switch, do port 9 and then it should be where it needs to be'.

      There are more complicated issues there, but a lion's share of the inherent strength of non-ethernet interconnects is this.
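
The contrast described above, per-switch lookups versus a route chosen by the sender, can be sketched in a few lines. Everything here is hypothetical: the three-switch fabric, the table contents and the function names are invented for illustration, and only the port sequence 5, 2, 9 and the name node636 come from the example in the comment. It is a toy model, not a description of any real switch or fabric manager.

```python
# Hypothetical fabric: for each switch, which neighbor sits behind each output port.
FABRIC = {
    "sw1": {5: "sw2"},
    "sw2": {2: "sw3"},
    "sw3": {9: "node636"},
}

# Hop-by-hop forwarding: every switch keeps its own destination -> port table.
FORWARDING_TABLES = {
    "sw1": {"node636": 5},
    "sw2": {"node636": 2},
    "sw3": {"node636": 9},
}

def deliver_hop_by_hop(first_switch, destination):
    """The sender names only the destination; each switch does a table lookup."""
    current, path = first_switch, []
    while current in FABRIC:
        port = FORWARDING_TABLES[current][destination]
        path.append((current, port))
        current = FABRIC[current][port]
    return current, path

def deliver_source_routed(first_switch, ports):
    """The sender supplies the full port list; switches just follow it."""
    current, path = first_switch, []
    for port in ports:
        path.append((current, port))
        current = FABRIC[current][port]
    return current, path

print(deliver_hop_by_hop("sw1", "node636"))
print(deliver_source_routed("sw1", [5, 2, 9]))
```

Both calls reach the same node over the same path; the difference is where the routing decision lives. In the second case the switches need no per-destination state, at the cost of every sender knowing the fabric layout.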
    • There is a growing consensus that Infiniband is effectively a dead-end. Last year it would have been a tough call between Infiniband, Ethernet, and other more proprietary interconnects. The market seems to be favoring the backward compatibility of Ethernet, and now that low latency Ethernet (~200ns) appears to be at hand there does not appear to be any reason to prefer the less general technologies (Myrinet, Infiniband, etc). My friends at Myrinet hint that they are looking to using the Myrinet protocol l
  • See page 3 of the Clustering Software Product Description [hp.com]

    Cluster systems are configured by connecting multiple systems with a communications medium, referred to as an interconnect. OpenVMS Cluster systems communicate with each other using the most appropriate interconnect available. In the event of interconnect failure, OpenVMS Cluster software automatically uses an alternate interconnect whenever possible. OpenVMS Cluster software supports any combination of the following interconnects:

    CI (computer inte


    • Yes Digital invented clustering with OpenVMS. But since then the "clustering" brand has fragmented into HA (Highly Available) clusters and Compute Clusters like Beowulf (and variations on the theme). OpenVMS is a HA cluster, and still rules the HA roost because it does such an amazing job of combining shared storage with SSI (single system image) functionality. On the Unix front, only Tru64 has come close to OpenVMS, but sadly that's in decline and will very soon vanish.

      There also seems to be zero inter
