Supercomputers Face Growing Resilience Problems
Submitted by howardd21
howardd21 writes "CIO reports that as the number of supercomputer nodes grows, so will the problem of failures. Clusters made up of many nodes, each with its own components and Mean Time Between Failure (MTBF) statistics, can mean a lot of downtime and recovery.
Today's techniques for dealing with system failure may not scale very well. They require checkpointing, in which a running program is temporarily halted and its state is saved to disk. Should the program then crash, the system is able to restart the job from the last checkpoint. On a 100,000-node supercomputer, for example, only about 35 percent of the activity will go toward useful work. The rest will be taken up by checkpointing and, should a system fail, recovery operations."
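To get a feel for why checkpointing overhead grows with node count, here is a rough back-of-the-envelope sketch. It is not the model behind the article's figure. It assumes independent, exponentially distributed node failures (so system MTBF is node MTBF divided by node count) and uses Young's first-order approximation for the checkpoint interval, which only holds while the checkpoint cost is small compared with the system MTBF. The function names and all input values (5-year node MTBF, 1-minute checkpoint and restart) are hypothetical, picked purely for illustration, and are not meant to reproduce the 35 percent number.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """System MTBF, assuming independent exponential node failures
    and that any single node failure kills the job."""
    return node_mtbf_hours / nodes


def young_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def useful_fraction(checkpoint_cost_hours: float,
                    restart_cost_hours: float,
                    mtbf_hours: float) -> float:
    """Rough fraction of wall-clock time spent on real work.

    Waste per unit time = time spent writing checkpoints
                        + (average lost work + restart) per failure.
    Clamped to [0, 1] because the approximation breaks down when
    the checkpoint cost approaches the system MTBF.
    """
    tau = young_interval_hours(checkpoint_cost_hours, mtbf_hours)
    checkpoint_waste = checkpoint_cost_hours / tau
    failure_waste = (tau / 2.0 + restart_cost_hours) / mtbf_hours
    return max(0.0, min(1.0, 1.0 - checkpoint_waste - failure_waste))


# Hypothetical inputs, for illustration only.
NODE_MTBF_HOURS = 5 * 365 * 24   # assumed 5-year MTBF per node
CHECKPOINT_HOURS = 1.0 / 60.0    # assumed 1-minute checkpoint write
RESTART_HOURS = 1.0 / 60.0       # assumed 1-minute restart

for n in (1_000, 10_000, 100_000):
    mtbf = system_mtbf_hours(NODE_MTBF_HOURS, n)
    frac = useful_fraction(CHECKPOINT_HOURS, RESTART_HOURS, mtbf)
    print(f"{n:>7} nodes: system MTBF {mtbf:8.3f} h, useful work ~{frac:.0%}")
```

The trend is the point here, not the exact percentages: as nodes are added, system MTBF shrinks, checkpoints have to be taken more often, and a growing share of time goes to saving state and redoing lost work.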
Link to Original Source