Cray XT-3 Ships
anzha writes "Cray's XT-3 has shipped. Using AMD's Opteron processor, it scales to a total of 30,580 CPUs. The starting price is $2 million for a 200-processor system. One of its strongest advantages over the standard Linux cluster is its excellent interconnect, built by Cray. Sandia National Labs and Oak Ridge National Labs are among the very first customers. Read more here."
Sic transit gloria mundi (Score:2, Insightful)
Re:we're getting closer... (Score:3, Insightful)
It was funny like a year ago. Now it's as overused as an SNL skit.
Re:Opterons and PowerPC together (Score:3, Insightful)
It's not that they're the best thing since sliced bread, it's mainly that all their competition went down the chute for one reason or another.
HP/Compaq/DEC was the king of supercomputers. Now they're only supporting their formerly glorious products, with practically nothing new coming to replace them.
Sun seems to really be sitting on their ass.
Intel was trying, but screwed the pooch with Itanium/Itanic.
SGI was a competitor, but they've just faded out.
Motorola could compete if they put some effort into it, but they've been out of it for some time.
etc.
Re:real FPU operations (Score:4, Insightful)
Re:Nuclear Simulations (Score:2, Insightful)
I admire your positive outlook on the prospects of simulations, but as an experimentalist, I find this "soon we won't need experiments at all" (see Rev. Mod. Phys. 64, 1045-1097 (1992), for instance) attitude very dangerous. Simulations and models, even at the first-principles level, should never be trusted implicitly. The only sure way to tell how nature works is via experimentation.
I can sort of understand simulating nuclear explosions, but simulating the aging process of a warhead doesn't make that much sense to me - unless the simulations are accompanied by direct observation of the (accelerated) aging of a warhead.
Re:imagine a... (Score:5, Insightful)
When you have a single CPU, designing the system to be pretty fast is easy. There's no major contention to deal with.
Two CPUs? Slightly harder, but reasonably straightforward. You don't see a 2x improvement in speed over one CPU, but it's around 1.95x, give or take a bit.
Four CPUs? Now you're starting to see less improvement ... probably around 3.2x, because of all the contention issues.
Sixty-four CPUs? You'll be lucky to get a 50x speed up over a single CPU.
When you get to 200 CPUs, the issue of access to shared memory and other shared resources becomes critically important. It's also an issue that most computer buyers don't need to worry about, because they don't have 200 CPUs in their system. This means that you have a lot of highly specialised research going on, and relatively few buyers to spread the cost of that research over.
Two million for a 200 CPU box which has low latency, low contention, and solid reliability is not a lot at all. You might not buy it. That doesn't mean nobody will.
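For the curious, those diminishing returns follow the shape of Amdahl's law. Here's a minimal sketch in Python; the serial fraction is an arbitrary assumption, and the model ignores the contention overhead described above, so treat it as an optimistic upper bound rather than a prediction for any real machine:

# Amdahl's law: idealized speedup for N CPUs given a serial fraction s.
# s = 0.01 is an arbitrary assumption; real contention makes it worse.
def speedup(n_cpus, serial_fraction=0.01):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)

for n in (1, 2, 4, 64, 200):
    print("%4d CPUs: %6.2fx" % (n, speedup(n)))

Even under this ideal model, 200 CPUs with a 1% serial fraction only buy you about a 67x speedup, which is why the shared-memory and interconnect engineering matters so much at that scale.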
Re:The math for a comparable Xserve system (Score:5, Insightful)
What a value!!
That is, until you throw a tightly coupled problem at it and the Cray is 10 times faster because it has much better internode bandwidth and lower latency.
And you forgot to count the cost of the InfiniBand interconnect that the VT cluster used. That's a couple grand per node.
Bottom line, apples and oranges. If your application is easily parallelizable (i.e., doesn't require much communication between the nodes), you'd be stupid to piss away your money on a "real" supercomputer instead of a cluster. And vice versa.
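To put some toy numbers behind that, here's a crude model of per-iteration time where compute divides across nodes but communication doesn't. Every figure in it is invented for illustration, not a benchmark of either the Cray or the VT cluster:

# Toy model: per-iteration time = compute (divides across nodes) plus
# communication (latency + transfer time, which does not divide).
# All latency/bandwidth/message-size figures are invented assumptions.
def iteration_time(nodes, work_s=10.0, latency_s=50e-6,
                   msg_bytes=1e6, bandwidth_bps=250e6):
    compute = work_s / nodes
    comm = latency_s + msg_bytes / bandwidth_bps
    return compute + comm

# Loosely coupled (tiny messages) keeps scaling with node count;
# tightly coupled (large messages) bottoms out at the comm floor.
for nodes in (1, 10, 100):
    print("%3d nodes: loose %.4fs  tight %.4fs" %
          (nodes, iteration_time(nodes, msg_bytes=1e3),
           iteration_time(nodes, msg_bytes=1e8)))

A faster, lower-latency interconnect lowers that communication floor, which is exactly where the tightly coupled codes live.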
Re:My new dream toy (Score:1, Insightful)
Re:we're getting closer... (Score:5, Insightful)
Back in my day we spelled "enuff" without the 'f' character and it was good enough for us.
Re:Just the name brings back memories (Score:5, Insightful)
Curiously, the XT-3 IS about shaving dollars off the price. If you go read the original whitepapers on the system, they go through EXTENSIVE cost-return analysis. They studied their (then-)current generation of cluster systems, as well as future Linux/Solaris/AIX clusters, and rejected them as (interestingly) FAR TOO EXPENSIVE once the administrative costs are factored in. They then looked at, and rejected, Cray's vector solution, the X1. They then decided that the (amazingly) most cost-effective solution was to underwrite Cray's product development cycle on a wholly new product. Basically they asked for an update to the system they already had (ASCI Red, i.e., Intel Paragon++). Nobody was building such a thing. Since Cray had a really strong similar product in the '90s (the T3D and T3E), the Department of Energy asked them to create an update. Some designs never die.
What I'm most interested in is the reliability. One of the biggest difficulties in the T3D engineering cycle was dealing with memory failure. Red Storm is going to have 10,000 processors. Let's assume each has 2 banks times 3 DIMMs (with chipkill) of memory, at 18 chips per DIMM. That means there are 10,000 x 6 x 18 = 1 million+ memory chips in the system. If 1/100th of a percent of these fail, that's still a lot of memory failures. How well are faults isolated? That's the big question for systems this big.
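Working out those numbers (the per-chip annual failure rate below is purely a guessed figure):

# Chip count from the assumptions above; the failure rate is a guess.
processors = 10000
chips_per_node = 2 * 3 * 18    # 2 banks x 3 DIMMs x 18 chips per DIMM
total_chips = processors * chips_per_node
annual_fail_rate = 0.0001      # assume 0.01% of chips fail per year
print("%d chips, ~%d expected failures/year"
      % (total_chips, total_chips * annual_fail_rate))

That's over a million chips and on the order of a hundred failures a year, so fault isolation isn't optional at this scale.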
I'm also a little wary of Cray's use of Lustre. I've used Lustre before, as well as other cluster FSes. While I'm not aware of other filesystems that will scale to 700+ I/O nodes, I'm not confident in Lustre. It's an immature product at best. (I don't mean to disparage the people working on it; it's a neat architecture, but it's a hard problem, and I'm not sure it's ready for prime time.)