Comment: Re:Their methoid is nothing new. (Score 1) 112

by Yvan Fournier (#42062855) Attached to: Supercomputers' Growing Resilience Problems

Actually, things are much more complex, and as some other posters mentioned, these issues are a continuing subject of research and have been expected by the supercomputing community for quite a few years (simply projecting current statistics, the time required to checkpoint a full-machine job would at some point become bigger than the MTBF...)
The PhD student mentioned seems to be just one of many working on this subject. Different research teams have different approaches, some trying to hide as much as possible in the runtimes and hardware, others experimenting with more robust algorithms in applications.
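
To make the checkpoint-versus-MTBF projection concrete, here is a rough back-of-the-envelope sketch. It assumes independent node failures at a constant rate (so the machine MTBF is the node MTBF divided by the node count) and uses Young's first-order approximation for the checkpoint interval, which is only one of several models in the literature. The node MTBF and checkpoint time below are made-up illustrative values, not measurements from any real machine.

```python
import math

HOURS_PER_YEAR = 24 * 365


def system_mtbf_hours(node_mtbf_hours: float, n_nodes: int) -> float:
    """MTBF of the whole machine, assuming independent nodes that
    each fail at a constant rate of 1 / node_mtbf_hours."""
    return node_mtbf_hours / n_nodes


def young_interval_hours(checkpoint_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the time between checkpoints.
    Only meaningful while checkpoint_hours is much smaller than mtbf_hours."""
    return math.sqrt(2.0 * checkpoint_hours * mtbf_hours)


# Hypothetical values, chosen for illustration only.
NODE_MTBF = 5 * HOURS_PER_YEAR  # assumed: one failure per node every 5 years
CHECKPOINT = 0.5                # assumed: 30 minutes to write a full checkpoint

for n in (1_000, 10_000, 100_000, 1_000_000):
    mtbf = system_mtbf_hours(NODE_MTBF, n)
    if CHECKPOINT >= mtbf:
        # The machine is expected to fail before a checkpoint completes.
        print(f"{n:>9} nodes: MTBF {mtbf:7.3f} h, checkpointing cannot keep up")
    else:
        interval = young_interval_hours(CHECKPOINT, mtbf)
        print(f"{n:>9} nodes: MTBF {mtbf:7.3f} h, checkpoint every ~{interval:.2f} h")
```

With these assumed numbers, the machine MTBF drops below the 30-minute checkpoint time at around 100,000 nodes, which is the kind of crossover the projection refers to.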

Tradeoffs on HPC clusters are not the same as on "business" type computers (high-throughput vs. high availability). For tightly coupled computations, a lot of data is flying around the network, and networks on these machines are fast, high throughput, and especially low-latency networks, with specific hardware, and in quite a few cases, partial offload of message management, using DMA writes or other techniques which might make checkpointing message queues a tad complex. The new MPI-3 standard has only minimal support for error handling, simply because this field is not mature/consensual enough, in the sense that not everyone agrees on the best solutions yet, and these may depend on the problem being solved and its expected running time. Avoiding too much additional application complexity and major performance hits is not trivial.

In addition, up to now, when medium to large computations are batch jobs that may run a few hours to a few days on several thousand cores, re-running a computation that failed due to a hardware failure once in a while (usually much less than one time in 10) is much more cost-effective than duplicating everything, in addition to being faster. These applications do not usually require real-time results, and even for many time-constrained applications (such as tomorrow's weather), running almost twice as many simulations (or running them twice as fast in the case of ideal speedup) might often be more effective. This logic only breaks with very large computations.
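
The rerun-versus-duplicate argument can be checked with a few lines of arithmetic. This is a minimal sketch under simplifying assumptions of mine: each attempt fails independently with the same probability, a failed attempt is charged as a full run (pessimistic, since a crash usually happens partway through), and duplication is charged at exactly twice the base cost. The function names are just for illustration.

```python
def expected_runs(p_fail: float) -> float:
    """Expected number of attempts until one succeeds, when each attempt
    fails independently with probability p_fail (geometric distribution)."""
    if not 0.0 <= p_fail < 1.0:
        raise ValueError("p_fail must be in [0, 1)")
    return 1.0 / (1.0 - p_fail)


def rerun_cost(base_cost: float, p_fail: float) -> float:
    """Expected cost of simply re-running failed jobs, charging every
    failed attempt as a full run (an upper bound on the real cost)."""
    return base_cost * expected_runs(p_fail)


def duplicate_cost(base_cost: float) -> float:
    """Cost of running everything twice for redundancy."""
    return 2.0 * base_cost


# One failure in 10 runs, the pessimistic end of the estimate above.
base = 100.0  # arbitrary cost units
print(f"rerun on failure: {rerun_cost(base, 0.1):.1f}")
print(f"duplicate:        {duplicate_cost(base):.1f}")
```

Under this model, re-running stays cheaper than duplicating as long as the per-run failure probability is below one half, which is why the logic only breaks for very large jobs.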

Also, regarding similarity to the cloud, when 1 node goes down on most clusters, the computation running on it will usually crash, but when a new computation is started by a decent resource manager/queuing system, that node will not be used, so everything does not need to be replaced immediately (that issue is at least solved). So most jobs running on 1/100th or 1/10th of a 100,000 to 1 million node cluster will not be affected too much by random failures, but a job running on the full machine will be much more fragile.
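
How much more fragile a full-machine job is can be estimated the same way. The sketch below assumes independent node failures at a constant rate; the 100,000-node machine, 10-year node MTBF and 1-hour run time are hypothetical values picked only to show the trend.

```python
import math


def p_job_hit(n_nodes_used: int, run_hours: float, node_mtbf_hours: float) -> float:
    """Probability that at least one of the nodes used by a job fails
    during the run, assuming independent exponential failures."""
    return 1.0 - math.exp(-n_nodes_used * run_hours / node_mtbf_hours)


# Hypothetical machine and job, for illustration only.
MACHINE_NODES = 100_000
NODE_MTBF = 10 * 24 * 365  # assumed: 10 years per node, in hours
RUN_HOURS = 1.0

for label, divisor in (("1/100th", 100), ("1/10th", 10), ("full machine", 1)):
    nodes = MACHINE_NODES // divisor
    p = p_job_hit(nodes, RUN_HOURS, NODE_MTBF)
    print(f"{label:>12}: {nodes:>7} nodes, P(hit by a failure) = {p:.1%}")
```

With these assumed values the probability of a job being hit goes from roughly 1% on 1/100th of the machine to roughly 11% on 1/10th and roughly 68% on the full machine.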

So, as machines get bigger and these failures become statistically more of a problem, an increasing portion of the HPC hardware and software effort needs to be devoted to them, but the urgency is not quite the same as if your bank had forgotten to use high-availability features for its customers' account data, and the tradeoffs reflect that.

Comment: Re:Code_Saturne (Score 1) 105

by Yvan Fournier (#32672356) Attached to: Best OSS CFD Package For High School Physics?
Code_Saturne is also now packaged in Debian unstable (as well as under Gentoo Linux and FreeBSD), so it should be even easier to install soon. I have also heard that a new version of CAELinux is under preparation, and that the SALOME platform is also being packaged under Debian (disclaimer: I am a Code_Saturne developer). Please use the new 2.0-rc versions found on instead of the old 1.4 version from the current CAELinux, as we've really improved the GUI and scripts, beyond the additional physical modeling capabilities that were added...

For post processing, ParaView or VisIt are much better than SALOME's current visualization module (a new ParaView-based module will be available in SALOME 6).

A few years ago, courses with Code_Saturne used pre-generated meshes, while students this year were taught to handle the whole process, including meshing under SALOME (actually, I believe both high-quality pre-generated meshes were made available and simpler meshes were generated by students, with the added advantage that the influence of mesh quality on result quality could be shown). The improvements in both SALOME and Code_Saturne's GUI certainly helped.

Still, the students that were trained are from some of France's top engineering schools, and I have no idea how simple high school students would cope. If the hands-on session is well prepared, it would seem feasible. Even if the students lack the theoretical background to really understand what the code does or judge the quality of a simulation, it can be an interesting experiment.
