
Comment: In Other Words (Score 1) 81

by Tirian (#42607059) Attached to: DARPA Wants Distributed Network of Deep Sea Storage Units

In other words:

"We are famously over-stocked on items that we are not actually using because of huge budget allocations. We don't want to lose those budget numbers, and the government is saying we need to buy their defense contractor friends' goods. The plan is to purchase a billion dollars of equipment and just sink it, never to be seen again. Everybody wins, except maybe the taxpayers."

--Tirian

Comment: Re:"and they halt operations when they do so" (Score 2) 112

by Tirian (#42062479) Attached to: Supercomputers' Growing Resilience Problems

Many supercomputers that use specialized hardware just can't tolerate component failure. For example, on a Cray XT5, if a single system interconnect link (SeaStar) goes dead, the entire system will come to a screeching halt, because with SeaStar all the interconnect routes are calculated at boot and cannot be updated during operation. In any tightly coupled system these failures are a real challenge, not just because the entire system may crash, but also because users may submit jobs requesting 50,000 cores when only 49,900 cores are available.

Checkpoints are necessary, but in large-scale situations they are often difficult. You usually have a walltime allocation for your job and you certainly don't want to use 20% of it writing checkpoint files to Lustre (or whatever high-performance filesystem you are utilizing). Perhaps frequent checkpointing works on smaller systems/jobs, but for a capability job on a large system you are talking about a significant block of non-computational cycles being burned.
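To put rough numbers on that, here is a minimal sketch of the arithmetic. The function name and all the figures (a 12-hour walltime, 48 minutes of compute between checkpoints, 12 minutes per checkpoint write) are made up for illustration, not measurements from any real system, and it assumes every checkpoint costs the same and blocks computation while it is written.

```python
def checkpoint_overhead(walltime_s, compute_interval_s, write_s):
    """Return (checkpoints written, fraction of walltime spent writing them).

    Hypothetical model: the job alternates between computing for
    compute_interval_s seconds and writing a checkpoint for write_s
    seconds, and computation stops while the checkpoint is written.
    """
    if walltime_s <= 0 or compute_interval_s <= 0 or write_s < 0:
        raise ValueError("times must be positive (write_s may be zero)")
    cycle_s = compute_interval_s + write_s
    # Only complete compute+write cycles that fit in the allocation count.
    n_checkpoints = int(walltime_s // cycle_s)
    fraction = n_checkpoints * write_s / walltime_s
    return n_checkpoints, fraction


# Illustrative numbers only: 12 h walltime, 48 min compute, 12 min write.
count, fraction = checkpoint_overhead(12 * 3600, 48 * 60, 12 * 60)
print(f"{count} checkpoints, {fraction:.0%} of walltime spent on I/O")
```

With those made-up numbers, a fifth of the allocation goes to checkpoint writes. Stretching the compute interval cuts that fraction, but it also means more work is lost when something does fail.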

