Cluster Interconnect Review

Posted by ScuttleMonkey
from the sparring-with-scandinavian-warriors dept.
deadline writes to tell us that Cluster Monkeys has an interesting review of cluster interconnects. From the article: "An often asked question from both 'clusters newbies' and experienced cluster users is, 'what kind of interconnects are available?' The question is important for two reasons. First, the price of interconnects can range from as little as $32 per node to as much as $3,500 per node, yet the choice of an interconnect can have a huge impact on the performance of the codes and the scalability of the codes. And second, many users are not aware of all the possibilities. People new to clusters may not know of the interconnection options and, sometimes, experienced people choose an interconnect and become fixated on it, ignoring all of the alternatives. The interconnect is an important choice and ultimately the choice depends upon your code, requirements, and budget."

  • Mirror (Score:5, Informative)

    by tempest69 (572798) on Saturday April 22, 2006 @04:49PM (#15181894) Journal
    http://www.mirrordot.net/stories/57bdef42b0ad596ff35350041a22b442/index.html [mirrordot.net]

    because some people practice what they preach

    Storm

  • Nitpicking... (Score:3, Informative)

    by isj (453011) on Saturday April 22, 2006 @05:05PM (#15181935) Homepage

    "every time a packet gets sent to the NIC, the kernel must be interrupted to at least look at the packet header to determine if the data is destined for that NIC"

    Uhm.. no. That is only the case if the NIC has been set into promiscuous mode, or if it has joined too many multicast groups.

    "...because the packets are not TCP, they are not routable across Ethernet"

    Uhm. If they were IP they would still be routable. I suspect he meant "not IP".

    I also get irritated by the spelling out of "Sixty-four-bit PCI"

    But the article still has a lot of good reviews and a load of links to other sites with interesting info.

  • Re:The Codes? (Score:2, Informative)

    by gedhrel (241953) on Saturday April 22, 2006 @05:30PM (#15182002)
    "codes" used like that is a term for parallel software that's particularly prevalent amongst the number-crunching crowd.
  • by billstewart (78916) on Saturday April 22, 2006 @05:32PM (#15182011) Journal
    I found the article's comments implying "Not TCP => Not Routable" quite annoying also, but I don't think he just meant "Not IP". Obviously if the application uses TCP or UDP it's going to have IP underneath and therefore be routable (unless you're doing some leftover-1980s hackery like implementing TCP over ISO CLNP or whatever.) And you could build an application that took a different flow-control approach than TCP that might be more efficient but still use IP and therefore still be routable (though usually people who want to do that sort of thing keep UDP and roll their own apps at Layer 7.)

    But he's probably talking about some kind of application that's intended for local-area application only and wants to avoid the overhead of TCP, UDP, and IP addressing, header-bit-twiddling, flow control, slow-start, kernel implementations optimized for wide-area general-purpose Internet networks, etc., and rolls its own protocols that assume a simpler problem definition, much different response times, and probably just pastes some simple packet counters over Layer 2 Ethernet, probably with jumbo frames.

    If you've implemented your ugly hackery properly, you still _could_ bridge it over wide areas using standard routers even though it doesn't have an IP layer. That doesn't mean it would work well - TCP's flow control mechanisms were designed (particularly during Van Jacobson's re-engineering) to deal with real-world problems including buffer allocation at destination machines and intermediate routers and congestion behaviour in wide-area networks with lots of contending systems, which a LAN-cluster protocol might not handle because it "knows" problems like that won't happen. Timing mechanisms are especially dodgy - they might have enough buffering to allow for LAN latencies of a few microseconds, but not enough to support Wide Area Network latencies that include speed-of-light delay (~1ms per 100 miles one-way in fiber) or insertion delay (spooling the packet onto a transmission line takes time, e.g. a 1500 byte packet onto a 1.5 Mbps T1 line takes about 8ms, and jumbo frames obviously take longer.)
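
    The delay arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, the 1.5 Mbps figure is the rounded T1 rate used in the comment, and the 9000-byte jumbo frame size is an assumption (the comment does not give one).

```python
def serialization_delay_ms(packet_bytes: int, link_bps: float) -> float:
    """Insertion delay: time to clock every bit of a packet onto the line (ms)."""
    return packet_bytes * 8 / link_bps * 1000.0


def propagation_delay_ms(miles: float, ms_per_100_miles: float = 1.0) -> float:
    """One-way delay in fiber, using the ~1 ms per 100 miles rule of thumb."""
    return miles / 100.0 * ms_per_100_miles


if __name__ == "__main__":
    t1_bps = 1.5e6  # rounded T1 rate, as in the comment above
    print(f"1500-byte packet on T1: {serialization_delay_ms(1500, t1_bps):.1f} ms")
    # 9000 bytes is an assumed jumbo frame size, not a figure from the thread.
    print(f"9000-byte jumbo on T1:  {serialization_delay_ms(9000, t1_bps):.1f} ms")
    print(f"3000 miles of fiber:    {propagation_delay_ms(3000):.1f} ms one-way")
```

    Either number alone is thousands of times larger than the microsecond-scale latencies a LAN-only protocol would be tuned for, which is the point about timers and buffering.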

  • by chemguru (104422) <infinite1der@gmail.cELIOTom minus poet> on Saturday April 22, 2006 @06:18PM (#15182122) Homepage
    We've built a Sun Cluster with SCI for 10G RAC. It's not just about the bandwidth of the interconnect, but the latency. With 10G RAC, you can use SCI to allow shared memory segments between each node. Damn good stuff. Too bad SCI is rarely used.
  • by Junta (36770) on Saturday April 22, 2006 @08:17PM (#15182517)
    Lately, the big contenders are:
    -Ethernet
    -Infiniband
    -Myrinet

    I haven't heard much about SCI or Quadrics lately; these three are the ones being tossed around a lot. Points on each:
    -Ethernet is cheap, and frequently adequate. Low throughput and high latency, but it's ok. 10GbE is starting to proliferate to eliminate the throughput shortcomings, and RDMA is starting to possibly help latency for particular applications. Note that although the overwhelming majority of clusters built on ethernet use the IP stack to communicate over it, that is not universally true. There are MPI implementations available that sit right under the ethernet header layer. They bypass the OS IP stack, which can be very slow, and so reduce the overhead per message. Increasing the MTU also helps throughput efficiency. But for now only 1 Gigabit ethernet is remotely affordable at any scale (primarily due to current 10GbE switch densities/prices; the adapters are no more expensive than Myrinet/Infiniband).

    -Myrinet. With their PCI-E cards they achieve about 2 GBytes/sec bidirectional throughput, very nearly demonstrating full saturation of their 10GBit fabric. They are also among the lowest latency, sitting right about 2.5 microseconds node-to-node as a PingPong minimum. Currently the highest single-link throughput technology realistically available to a customer (Infiniband SDR doesn't quite achieve it, about 200 or so MByte/s short, but DDR will overtake it once it is realistically available). Very focused on HPC and until recently also the only popular high-speed cluster interconnect that was very mature, easy to set up and maintain, and efficient. Now they are starting to embrace more interoperability with 10GbE, probably in response to the rise of Infiniband.

    -Infiniband. Until very recently immature (huge memory consumption for large MPI jobs, a software stack that is highly complex and not easily maintainable), and the prominent vendor of chips (Mellanox) didn't achieve good latency. With Mellanox chips you are lucky to get into the 4 microsecond range or so. With Pathscale's alternative implementation (particularly on HTX), the lowest latency interconnect becomes possible (I have done runs with 1.5 microsecond end-to-end latency even with a switch involved). The maximum throughput is on the order of 1.7-1.8 GByte/s and, more importantly, it is one of the faster technologies in ramping up to that. No technology achieves its peak throughput until about 4 MB message sizes, and Pathscale IB is remarkably a good performer down to 16k-32k message sizes. Additionally, IB has a broader focus and some interesting efforts. They make efforts to not only be a good HPC interconnect, but also to be a good SAN architecture that in many ways significantly outshines fibre channel. The OpenIB efforts are interesting as well. The huge downside is that for whatever reason no Infiniband provider has been able to demonstrate good IP performance over their technology. This is particularly an issue because almost all methods of storage sharing from hosts are IP based. SRP is ok for the little amount of flexibility that strategy gives to be Fibre-Channel like, but nfs, smb, and image access like NBD and iSCSI all perform very poorly on Infiniband compared to Myrinet. iSER promises to alleviate that, but for the moment you are restricted to performance on the order of 2.4 gigabit/s for IP transactions. Myrinet has been able to deliver 6-7 Gigabit/s for the same measurements. You could overcome this by sharing storage enclosures and using something like lustre, GFS, or GPFS to communicate more directly with the storage over SRP, but generally speaking some applications demand flexibility not achievable without IP performance.
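
    To see why small-message performance depends on latency as much as on peak throughput, here is a rough sketch using the simplest possible model: transfer time = fixed latency + size / peak bandwidth. The model and the function names are assumptions for illustration, not anything a vendor publishes. Real interconnects switch protocols at various message sizes, so measured curves will deviate from this (in particular, the model does not reproduce the "about 4 MB" observation above). The 1.5 microsecond and 1.75 GByte/s inputs are taken from the Pathscale figures quoted above, using the midpoint of the 1.7-1.8 range.

```python
def effective_bandwidth(msg_bytes: float, latency_s: float, peak_bytes_per_s: float) -> float:
    """Achieved bytes/s for one message under the latency + size/peak model."""
    transfer_time = latency_s + msg_bytes / peak_bytes_per_s
    return msg_bytes / transfer_time


def half_peak_message_size(latency_s: float, peak_bytes_per_s: float) -> float:
    """Message size at which this model reaches exactly half of peak bandwidth."""
    return latency_s * peak_bytes_per_s


if __name__ == "__main__":
    latency = 1.5e-6  # seconds, end-to-end figure quoted above
    peak = 1.75e9     # bytes/s, midpoint of the quoted 1.7-1.8 GByte/s
    print(f"half-peak message size: {half_peak_message_size(latency, peak):.0f} bytes")
    for size in (1024, 16 * 1024, 32 * 1024, 4 * 1024 * 1024):
        fraction = effective_bandwidth(size, latency, peak) / peak
        print(f"{size:>8} bytes -> {fraction:.0%} of peak")
```

    Under this model, halving the latency halves the message size needed to reach any given fraction of peak, which is one way to read the claim that a low-latency implementation ramps up faster.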

    And at the end of the day, I come home and run my home network on 100MBit ethernet, sigh. It is enough to run a diskless MythFrontend for HD content at least.
  • Just want to add... (Score:5, Informative)

    by Junta (36770) on Saturday April 22, 2006 @08:37PM (#15182582)
    For those not aware of how ethernet is limited latency wise regardless of what is done, I will explain a tad.

    Ethernet is well architected for large deployments (enterprise-wide) with the packet routing (not IP routing) done on the switches. Meaning a computer sending a packet asks its switch to get it to 0A:0B:0C:01:02:03, having no idea where it will go. The switch only knows its immediate neighbors, and will check/populate its MAC address table to figure out the next entity to hand off to. This means switches have to be really powerful because they are responsible for a lot of heavy lifting for all the relatively dumb nodes. This is not TCP, it is not IP, it is the raw reality of ethernet networking. Aside from Spanning tree (which is maintained only to keep a network from getting screwed over by incorrect connections, not for performance), no single entity in the network has a map of how things look beyond its immediate neighbors.

    IB, Myrinet, etc, are source routed. Every node has a full network map of every switch and system in the fabric. The task of computing communication pathways is distributed rather than concentrated (fits well with the whole point of clusters). node1 doesn't blindly say to the switch, 'send this to node636', it says to switch 'send this to port 5, and the next switch, put it out port 2, and the next switch, do port 9 and then it should be where it needs to be'.

    There are more complicated issues there, but a lion's share of the inherent strength of non-ethernet interconnects is this.
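
    A toy sketch of the source-routing idea described above, in Python. Everything here is hypothetical: the fabric layout, the switch names, and the function names are invented for illustration and do not correspond to any real Myrinet or IB API. The sending node holds the full map, computes the list of output ports itself, and each switch just follows the next port in the list without looking anything up.

```python
from collections import deque

# Hypothetical fabric map held by every node: switch -> {output port: neighbor}.
FABRIC = {
    "sw1": {1: "node1", 5: "sw2"},
    "sw2": {7: "sw1", 2: "sw3"},
    "sw3": {4: "sw2", 9: "node636"},
}


def compute_source_route(fabric, first_switch, dest):
    """Breadth-first search over the full map; returns one output port per switch."""
    queue = deque([(first_switch, [])])
    seen = {first_switch}
    while queue:
        switch, ports = queue.popleft()
        for port, neighbor in fabric[switch].items():
            if neighbor == dest:
                return ports + [port]
            if neighbor in fabric and neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, ports + [port]))
    raise ValueError(f"no path from {first_switch} to {dest}")


def forward_source_routed(fabric, first_switch, route):
    """Each switch sends the packet out the next listed port; no table lookup."""
    current = first_switch
    for port in route:
        current = fabric[current][port]
    return current


if __name__ == "__main__":
    route = compute_source_route(FABRIC, "sw1", "node636")
    print(route)  # [5, 2, 9]
    print(forward_source_routed(FABRIC, "sw1", route))  # node636
```

    The path computation happens once on the sender, so the per-hop work in the switch shrinks to "read the next port number and forward".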
  • Not really (Score:3, Informative)

    by Junta (36770) on Saturday April 22, 2006 @08:53PM (#15182637)
    One, the price of all this stuff is exorbitant, and most home applications could barely benefit from going from 100 MBit to Gigabit. Realistically, getting 1.5 microsecond latency and the ability to transfer GigaBYTES per second has no home use right now. Really exorbitant High definition streams top out at about 20 MBit/second for 1920x1080 MPEG-2, and of course no game demands that much throughput. Hard drives for home use can only theoretically dump out 300 MB/s or so anyway (SATA II), and realistically except for cache operations you almost never achieve it.

    Going to gigabit ethernet makes diskless systems close to theoretically working as fast as UDMA 66 drives, which allows for fun home projects working more smoothly. Latency for network operations is already similar to drive seek times, so going to insane latency won't help too much either.
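
    The comparisons above mix megabits and megabytes, so a quick conversion helps. This is a minimal sketch using raw line rates only; it ignores framing and protocol overhead, so real throughput will be lower. The function names are made up for illustration, and the figures plugged in are the ones quoted in this comment.

```python
def mbit_to_mbyte_per_s(mbit_per_s: float) -> float:
    """Convert a raw line rate in Mbit/s to MByte/s (8 bits per byte)."""
    return mbit_per_s / 8.0


def streams_per_link(link_mbit: float, stream_mbit: float) -> int:
    """How many whole streams of a given bitrate fit in the raw line rate."""
    return int(link_mbit // stream_mbit)


if __name__ == "__main__":
    print(f"100 Mbit ethernet: {mbit_to_mbyte_per_s(100):.1f} MB/s raw")
    print(f"Gigabit ethernet:  {mbit_to_mbyte_per_s(1000):.1f} MB/s raw")
    print("UDMA 66 interface: 66 MB/s, SATA II interface: 300 MB/s")
    print(f"20 Mbit HD streams on 100 Mbit: {streams_per_link(100, 20)}")
```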

    Systems that benefit from this have to have large (many-drive) storage architectures to pull throughput from and large numbers of systems to have enough computational data to make the interconnect fabrics worth while. Before you begin to ever approach a system that large, your power/cooling bill would be insane.

    If you are into this stuff because it is intrinsically interesting, you can learn most of the principles involved with good old ethernet, and fill the gaps with google research. It is undeniable that you learn more hands on, but if you ever really need to use it with a company or something and you have your bases covered, chances are you'd exceed most other candidates who aren't even aware of the technology.
  • by tengu1sd (797240) on Sunday April 23, 2006 @03:00AM (#15183624)
    See page 3 of the Clustering Software Product Description [hp.com]

    Cluster systems are configured by connecting multiple systems with a communications medium, referred to as an interconnect. OpenVMS Cluster systems communicate with each other using the most appropriate interconnect available. In the event of interconnect failure, OpenVMS Cluster software automatically uses an alternate interconnect whenever possible. OpenVMS Cluster software supports any combination of the following interconnects:

    CI (computer interconnect) (Alpha and VAX)

    DSSI (Digital Storage Systems Interconnect) (Alpha and VAX)

    SCSI (Small Computer System Interface) (storage only, Alpha and limited support for I64)

    FDDI (Fiber Distributed Data Interface) (Alpha and VAX)

    Ethernet (10/100, Gigabit) (I64, Alpha and VAX)

    Asynchronous transfer mode (ATM) (emulated LAN configurations only, Alpha only)

    Memory Channel (Version 7.1 and higher only, Alpha only)

    Fibre Channel (storage only, Version 7.2-1 and higher only, I64 and Alpha only)

  • Further benchmarks.. (Score:2, Informative)

    by keithoc (916498) on Sunday April 23, 2006 @05:39AM (#15183962)
    ..can be gotten from the results [utk.edu] of the 2005 HPC Challenge [hpcchallenge.org] - real world results, no marketing blurb.
