Tirian - Slashdot User

Comment Re:"and they halt operations when they do so" (Score 2) 112

by Tirian on Wednesday November 21, 2012 @09:32PM (#42062479) Attached to: Supercomputers' Growing Resilience Problems

Many supercomputers that use specialized hardware just can't tolerate component failure. For example, on a Cray XT5, if a single system interconnect link (SeaStar) goes dead, the entire system will come to a screeching halt, because with SeaStar all the interconnect routes are calculated at boot and cannot be updated during operation. In any tightly coupled system these failures are a real challenge, not just because the entire system may crash, but also because users may submit jobs requesting 50,000 cores when only 49,900 cores are available.

Checkpoints are necessary, but in large-scale situations they are often difficult. You usually have a walltime allocation for your job and you certainly don't want to use 20% of it writing checkpoint files to Lustre (or whatever high-performance filesystem you are utilizing). Perhaps frequent checkpointing works on smaller systems/jobs, but for a capability job on a large system you are talking about a significant block of non-computational cycles being burned.
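To put a rough number on that overhead, here is a minimal back-of-the-envelope sketch. It assumes a job simply alternates between a fixed stretch of computation and a fixed-length checkpoint write; the function names and the 40-minute/10-minute figures are made up for illustration and are not measurements from any real system.

```python
def checkpoint_fraction(compute_interval_s, checkpoint_write_s):
    """Fraction of walltime spent writing checkpoints.

    Assumes the job alternates: compute for compute_interval_s seconds,
    then write one checkpoint that takes checkpoint_write_s seconds.
    """
    cycle = compute_interval_s + checkpoint_write_s
    if compute_interval_s < 0 or checkpoint_write_s < 0 or cycle <= 0:
        raise ValueError("intervals must be non-negative and not both zero")
    return checkpoint_write_s / cycle


def useful_compute_hours(walltime_h, compute_interval_s, checkpoint_write_s):
    """Hours of the walltime allocation left for actual computation."""
    frac = checkpoint_fraction(compute_interval_s, checkpoint_write_s)
    return walltime_h * (1.0 - frac)


if __name__ == "__main__":
    # Hypothetical numbers: 40 minutes of compute per 10-minute checkpoint.
    frac = checkpoint_fraction(40 * 60, 10 * 60)
    hours = useful_compute_hours(24, 40 * 60, 10 * 60)
    print(f"checkpoint share of walltime: {frac:.0%}")
    print(f"useful compute in a 24 h allocation: {hours:.1f} h")
```

Under those assumed numbers a fifth of the allocation goes to I/O rather than computation, which is the kind of cost that makes frequent checkpointing hard to justify on a capability job.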

Comment Re:Haven't we seen this before? (Score 5, Informative) 95

by Tirian on Friday May 04, 2012 @09:57AM (#39889745) Attached to: Mars Rover Turns Up Evidence Of Water

The working theory is that the lack of a strong magnetosphere on Mars has allowed the solar wind to cause much of the water that was once present to be lost to space.

--Michael
