Cluster Interconnect Review
deadline writes to tell us that Cluster Monkeys has an interesting review of cluster interconnects. From the article: "An often-asked question from both 'cluster newbies' and experienced cluster users is, 'what kind of interconnects are available?' The question is important for two reasons. First, the price of interconnects can range from as little as $32 per node to as much as $3,500 per node, yet the choice of an interconnect can have a huge impact on the performance and scalability of the codes. And second, many users are not aware of all the possibilities. People new to clusters may not know of the interconnection options and, sometimes, experienced people choose an interconnect and become fixated on it, ignoring all of the alternatives. The interconnect is an important choice, and ultimately the choice depends upon your code, requirements, and budget."
Interconnect Failure? (Score:2)
Re:No comments and it's slashdotted? (Score:3, Insightful)
Interesting? (Score:3, Funny)
Re:Interesting? (Score:2)
Re:No comments and it's slashdotted? (Score:2)
This is the very first comment on this story, and it's moderated redundant? What kind of a moron moderates like that?
I mean, I could understand if it were moderated as offtopic, or something, but redundant?
Dumbass.
News? (Score:2, Funny)
Re:News? (Score:2, Funny)
Re:News? (Score:1)
News media usually focuses on the exception, not the norm. Besides, I find clusters interesting.
Re:News? (Score:2)
/.ed (Score:5, Funny)
Mirror (Score:5, Informative)
because some people practice what they preach
Storm
Re:Mirror (Score:2)
If you care, use IB. The Linux support is still a little funky, but in terms of application performance for the dollar, it's hard to beat. TCP is gorgeous for sharing buffer space in the wide area, but it's a lot of work for a tightly coupled machine.
Re:Mirror (Score:3)
GigE FTW (Score:1)
I have trouble imagining an application that could eat up more than that. It's bananas.
Re:GigE FTW (Score:1)
Re:GigE FTW (Score:2)
Re:GigE FTW (Score:2, Interesting)
I call BS (Score:2)
Re:I call BS (Score:2)
We can take that down to 15 GB/s for most conversations (because he implied concurrent read/write, and most people only discuss unidirectional bandwidth).
That brings us down to about 150-170 Gb/s required to demonstrate a cross-sectional bandwidth of 30 GB/s.
So, say, a 256-node cluster running something like GPFS or Lustre, even on Gigabit Ethernet, might play in the realm of 30 GB/s concurrent read-write throughput. This is assuming nodes contributing their storage to a pool and not
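For what it's worth, the aggregate arithmetic roughly works out. A back-of-the-envelope sketch in C (the 256-node count comes from the parent post; the per-link usable fraction is just an assumption, not a measurement):

    /* Back-of-the-envelope check of the aggregate-bandwidth claim above.
     * 256 nodes is the parent's number; the 0.8 usable fraction is an
     * assumed allowance for TCP/IP and filesystem overhead. */
    #include <stdio.h>

    int main(void)
    {
        const int    nodes     = 256;   /* cluster size from the parent post */
        const double link_gbit = 1.0;   /* GigE line rate, Gbit/s            */
        const double usable    = 0.8;   /* assumed usable fraction per link  */

        double per_node_GB  = link_gbit * usable / 8.0;  /* GByte/s per node   */
        double aggregate_GB = per_node_GB * nodes;       /* GByte/s, all nodes */

        printf("per-node usable throughput : %.3f GB/s\n", per_node_GB);
        printf("aggregate over %d nodes    : %.1f GB/s\n", nodes, aggregate_GB);
        /* ~25-30 GB/s aggregate, which is why a parallel filesystem like
         * Lustre or GPFS can quote numbers in that range even on GigE,
         * while no single link ever carries more than ~1 Gbit/s. */
        return 0;
    }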
Re:I call BS (Score:1)
Nitpicking... (Score:3, Informative)
"every time a packet gets sent to the NIC, the kernel must be interrupted to at least look at the packet header to determine of the data is destined for that NIC"
Uhm.. no. That is only the case if the NIC has been set into promiscuous mode, or if it has joined too many multicast groups.
"...because the packets are not TCP, they are not routable across Ethernet"
Uhm. If they were IP they would still be routable. I suspect he meant "not IP".
I also get irritated by the spelling out of "Sixty-four-bit PCI"
But the article still has a lot of good reviews and a load of links to other sites with interesting info.
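The promiscuous-mode case the parent mentions is easy to see concretely: on Linux it is literally one interface flag, and only when it is set does the kernel have to look at every frame on the wire. A minimal sketch ("eth0" is a placeholder; needs root):

    /* Minimal sketch: put an interface into promiscuous mode on Linux,
     * i.e. the state in which the kernel really does have to look at
     * every frame that arrives.  "eth0" is a placeholder; needs root. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>

    int main(void)
    {
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);  /* any socket works for the ioctl */
        if (fd < 0) { perror("socket"); return 1; }

        memset(&ifr, 0, sizeof ifr);
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);

        if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0) { perror("SIOCGIFFLAGS"); return 1; }
        ifr.ifr_flags |= IFF_PROMISC;  /* normally this flag is off and the NIC
                                          filters on its own MAC plus joined
                                          multicast groups, so the host is not
                                          interrupted for other nodes' traffic */
        if (ioctl(fd, SIOCSIFFLAGS, &ifr) < 0) { perror("SIOCSIFFLAGS"); return 1; }

        puts("eth0 is now promiscuous");
        close(fd);
        return 0;
    }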
Confusing TCP and IP *is* annoying (Score:5, Informative)
But he's probably talking about some kind of application that's intended for local-area use only and wants to avoid the overhead of TCP, UDP, and IP addressing, header-bit-twiddling, flow control, slow-start, kernel implementations optimized for wide-area general-purpose Internet networks, etc., and so rolls its own protocols that assume a simpler problem definition and much different response times, and probably just pastes some simple packet counters over Layer 2 Ethernet, likely with jumbo frames.
If you've implemented your ugly hackery properly, you still _could_ bridge it over wide areas using standard routers even though it doesn't have an IP layer. That doesn't mean it would work well - TCP's flow control mechanisms were designed (particularly during Van Jacobson's re-engineering) to deal with real-world problems including buffer allocation at destination machines and intermediate routers and congestion behaviour in wide-area networks with lots of contending systems, which a LAN-cluster protocol might not handle because it "knows" problems like that won't happen. Timing mechanisms are especially dodgy - they might have enough buffering to allow for LAN latencies in the small microseconds, but not enough to support wide-area latencies that include speed-of-light delay (~1ms per 100 miles one-way in fiber) or insertion delay (spooling the packet onto a transmission line takes time, e.g. a 1500 byte packet onto a 1.5 Mbps T1 line takes about 8ms, and jumbo frames obviously take longer).
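To make the "rolls its own protocols over Layer 2 Ethernet" idea concrete, here is a sketch of the kind of thing such a cluster protocol does: send a frame with its own EtherType straight over Ethernet on Linux, with no IP or TCP anywhere. The EtherType 0x88B5 is the IEEE local-experimental value; the interface name, destination MAC, and payload are placeholders for illustration (needs root):

    /* Sketch of a "roll your own" cluster protocol: one frame sent straight
     * over Layer 2, no IP or TCP.  EtherType 0x88B5 is the IEEE
     * local-experimental value; interface name, destination MAC and payload
     * are placeholders.  Linux only, needs root. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/if_packet.h>
    #include <net/ethernet.h>
    #include <net/if.h>
    #include <arpa/inet.h>

    #define MY_ETHERTYPE 0x88B5   /* local experimental EtherType */

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(MY_ETHERTYPE));
        if (fd < 0) { perror("socket"); return 1; }

        unsigned char dst[6] = {0x0A, 0x0B, 0x0C, 0x01, 0x02, 0x03}; /* placeholder MAC */
        unsigned char frame[128] = {0};
        struct ether_header *eh = (struct ether_header *)frame;

        memcpy(eh->ether_dhost, dst, 6);
        memset(eh->ether_shost, 0, 6);        /* zero source MAC is fine for a toy
                                                 demo; a real protocol fills its own */
        eh->ether_type = htons(MY_ETHERTYPE);
        strcpy((char *)frame + sizeof *eh, "hello node636");  /* toy payload */

        struct sockaddr_ll addr;
        memset(&addr, 0, sizeof addr);
        addr.sll_family  = AF_PACKET;
        addr.sll_ifindex = if_nametoindex("eth0");            /* placeholder NIC */
        addr.sll_halen   = 6;
        memcpy(addr.sll_addr, dst, 6);

        if (sendto(fd, frame, sizeof frame, 0,
                   (struct sockaddr *)&addr, sizeof addr) < 0)
            perror("sendto");

        close(fd);
        return 0;
    }

A peer would receive this by opening an AF_PACKET socket bound to the same EtherType. Bridging it across a WAN is possible in principle, but nothing here does the flow control or retransmission that TCP would give you for free, which is exactly the parent's point.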
Re:Confusing TCP and IP *is* annoying (Score:1)
Re:Confusing TCP and IP *is* annoying (Score:1)
Actually TUBA [ietf.org] was early to mid-90's.
(At least we ended up with IPv6 instead which is way, way, better because... um... never mind.)
Re:The Codes? (Score:1)
Re:The Codes? (Score:2, Informative)
Re:The Codes? (Score:2)
Re:The Codes? (Score:1)
Code = one application or one logical part of an application
Codes = multiple applications or multiple logical parts of an application
Code is code is code when you are talking about programming in general. All of us are programmers and we write code.
"Code" can refer to a specific entity. Sometimes you'll hear it as "codebase" or "source" or "sourcebase". An example of a specific set of code is the Linux kernel. Another example of a specific set of code is Firefox.
Codes refers
Re:The Codes? (Score:2)
Good Intro Article (Score:1)
Not a technical issue, but a little note about the Infiniband cards reading, "Unlike the alternatives, just try to get information on pricing one of these without leaving all of your contact information for a salesman to use now and in perpetuity." I've been through this recently, and have considered (given the similar performance) purchasing Myri
Re:Good Intro Article (Score:2, Informative)
Re:Good Intro Article (Score:1)
and lurking there in the background, there's always NWChem....
Re:Good Intro Article (Score:1)
Further benchmarks.. (Score:2, Informative)
IB Pricing (Re:Good Intro Article) (Score:2)
For parallel computation cheap can be okay (Score:1)
So once you have one, what do you do with it? (Score:2)
Not really (Score:3, Informative)
Not able to RTFA, but my perspective... (Score:5, Informative)
-Ethernet
-Infiniband
-Myrinet
I haven't heard much about SCI or Quadrics lately; these three are the ones that have been tossed around the most. Points on each:
-Ethernet is cheap, and frequently adequate. Low throughput and high latency, but it's OK. 10GbE is starting to proliferate to eliminate the throughput shortcomings, and RDMA may start to help latency for particular applications. Note that although clusters built on Ethernet overwhelmingly use the IP stack to communicate over it, that is not exclusively true. There are MPI implementations available that sit directly on top of the Ethernet frame layer, bypassing the OS IP stack (which can be very slow) and reducing per-message overhead. Increasing the MTU also helps throughput efficiency. But for now only 1 Gigabit Ethernet is remotely affordable at any scale (primarily due to current 10GbE switch densities/prices; the adapters are no more expensive than Myrinet/Infiniband).
-Myrinet. With their PCI-E cards they achieve about 2 GBytes/sec bidirectional throughput, very nearly demonstrating full saturation of their 10 Gbit fabric. They are also among the lowest-latency options, sitting right around 2.5 microseconds node-to-node as a PingPong minimum (a stripped-down PingPong sketch follows this comment). Currently the highest single-link throughput technology realistically available to a customer (Infiniband SDR doesn't quite achieve it, falling about 200 MByte/s short, but DDR will overtake it once it is realistically available). Very focused on HPC, and until recently also the only popular high-speed cluster interconnect that was very mature, easy to set up and maintain, and efficient. Now they are starting to embrace more interoperability with 10GbE, probably in response to the rise of Infiniband.
-Infiniband. Until very recently it was immature: huge memory consumption for large MPI jobs, a software stack that is highly complex and not easily maintainable, and the prominent chip vendor (Mellanox) didn't achieve good latency. With Mellanox chips you are lucky to get into the 4 microsecond range or so. With Pathscale's alternative implementation (particularly on HTX), the lowest-latency interconnect becomes possible (I have done runs with 1.5 microsecond end-to-end latency even with a switch involved). The maximum throughput is on the order of 1.7-1.8 GByte/s and, more importantly, it is one of the faster technologies in ramping up to that. No technology achieves its peak throughput until about 4 MB message sizes, and Pathscale IB is remarkably good down to 16k-32k message sizes. Additionally, IB has a broader focus and some interesting efforts. It aims not only to be a good HPC interconnect, but also a good SAN architecture that in many ways significantly outshines Fibre Channel. The OpenIB efforts are interesting as well. The huge downside is that, for whatever reason, no Infiniband provider has been able to demonstrate good IP performance over their technology. This is particularly an issue because almost all methods of sharing storage from hosts are IP based. SRP is OK for the little flexibility that Fibre-Channel-like strategy gives you, but NFS, SMB, and block image access like NBD and iSCSI all perform very poorly on Infiniband compared to Myrinet. iSER promises to alleviate that, but for the moment you are restricted to performance on the order of 2.4 gigabit/s for IP transactions. Myrinet has been able to deliver 6-7 gigabit/s for the same measurements. You could overcome this by sharing storage enclosures and using something like Lustre, GFS, or GPFS to talk more directly to the storage over SRP, but generally speaking some applications demand flexibility that is not achievable without IP performance.
And at the end of the day, I come home and run my home network on 100MBit ethernet, sigh. It is enough to run a diskless MythFrontend for HD content at least.
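Since all the latency and throughput figures above come from PingPong-style runs, here is a stripped-down sketch of that measurement in MPI C. The message size and repetition count are arbitrary; real suites like the OSU micro-benchmarks or the Intel MPI Benchmarks sweep message sizes and do proper warm-up, but the core loop is just this:

    /* Stripped-down PingPong: rank 0 and rank 1 bounce a buffer back and
     * forth; half the round-trip time is the usual "PingPong latency", and
     * bytes / one-way time gives the throughput curve people quote.
     * Compile with mpicc; run with 2 ranks on 2 nodes. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int reps  = 1000;
        const int bytes = 1 << 20;          /* 1 MB message, arbitrary choice */
        char *buf = malloc(bytes);
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double one_way = (MPI_Wtime() - t0) / (2.0 * reps);  /* seconds, one direction */

        if (rank == 0)
            printf("%d-byte PingPong: %.2f us one-way, %.2f MB/s\n",
                   bytes, one_way * 1e6, bytes / one_way / 1e6);

        MPI_Finalize();
        free(buf);
        return 0;
    }

Small messages (a few bytes) give the latency number people quote; large ones give the bandwidth curve. That is why the 2.5 microsecond Myrinet figure and the 1.7-1.8 GByte/s IB figure both come from the same kind of run, just at opposite ends of the message-size sweep.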
Just want to add... (Score:5, Informative)
Ethernet is well architected for large deployments (enterprise-wide) with the packet forwarding (not IP routing) done on the switches. Meaning a computer sending a packet asks its switch to get it to 0A:0B:0C:01:02:03, having no idea where it will go. The switch only knows its immediate neighbors, and will check/populate its MAC forwarding table to figure out the next entity to hand off to. This means switches have to be really powerful, because they are responsible for a lot of heavy lifting for all the relatively dumb nodes. This is not TCP, it is not IP, it is the raw reality of Ethernet networking. Aside from Spanning Tree (which is maintained for no other reason than keeping a network from getting screwed over by incorrect connections, not for performance), no single entity in the network has a map of how things look beyond its immediate neighbors.
IB, Myrinet, etc., are source routed. Every node has a full network map of every switch and system in the fabric. The task of computing communication pathways is distributed rather than concentrated (which fits well with the whole point of clusters). node1 doesn't blindly say to the switch, 'send this to node636'; it says to the switch, 'send this out port 5, and at the next switch put it out port 2, and at the next switch port 9, and then it should be where it needs to be'.
There are more complicated issues there, but the lion's share of the inherent strength of non-Ethernet interconnects is this.
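A toy illustration of what "the route travels with the packet" means in practice. The struct layout and names below are invented for this sketch, not Myrinet's or InfiniBand's actual wire format; the point is simply that each switch pops the next precomputed port instead of doing any table lookup:

    /* Toy illustration of source routing: the sending node precomputes the
     * list of switch output ports and ships it in the header; each switch
     * just reads the next entry.  Struct layout and names are invented for
     * this sketch, not a real wire format. */
    #include <stdio.h>
    #include <stdint.h>

    #define MAX_HOPS 8

    struct src_route_hdr {
        uint8_t hop;              /* index of the next port to use       */
        uint8_t n_hops;           /* total switch hops on this path      */
        uint8_t port[MAX_HOPS];   /* precomputed output port at each hop */
    };

    /* What a switch does: no forwarding table, just read the next port. */
    static int next_output_port(struct src_route_hdr *h)
    {
        if (h->hop >= h->n_hops)
            return -1;            /* route exhausted: packet has arrived */
        return h->port[h->hop++];
    }

    int main(void)
    {
        /* node1 -> node636 via three switches, ports 5, 2, 9: the example
         * path used in the comment above */
        struct src_route_hdr h = { .hop = 0, .n_hops = 3, .port = {5, 2, 9} };
        int p;
        while ((p = next_output_port(&h)) >= 0)
            printf("forward out port %d\n", p);
        return 0;
    }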
Re:Not able to RTFA, but my perspective... (Score:3, Insightful)
Re:If a packet hits a pocket (Score:1)
Re:If a packet hits a pocket (Score:1)
Would have been a lot funnier had it been attributed to the original author in my opinion.
Re:If a packet hits a pocket (Score:1)
More or less took it at face value. Your post raises two questions.
How about sharing the answers - I'm curious
If it doesn't say OpenVMS it's not a cluster (Score:2, Informative)
Cluster systems are configured by connecting multiple systems with a communications medium, referred to as an interconnect. OpenVMS Cluster systems communicate with each other using the most appropriate interconnect available. In the event of interconnect failure, OpenVMS Cluster software automatically uses an alternate interconnect whenever possible. OpenVMS Cluster software supports any combination of the following interconnects:
CI (computer inte
Re:If it doesn't say OpenVMS it's not a cluster (Score:2)
Yes, Digital invented clustering with OpenVMS. But since then the "clustering" brand has fragmented into HA (Highly Available) clusters and compute clusters like Beowulf (and variations on the theme). OpenVMS is an HA cluster, and still rules the HA roost because it does such an amazing job of combining shared storage with SSI (single system image) functionality. On the Unix front, only Tru64 has come close to OpenVMS, but sadly that's in decline and will very soon vanish.
There also seems to be zero inter