Actually, things are much more complex, and as another poster mentioned, these issues are an ongoing subject of research and have been anticipated by the supercomputing community for quite a few years (simply projecting current statistics, the time required to checkpoint a full-machine job would at some point become bigger than the machine's MTBF...).
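To make that projection concrete, here is a tiny back-of-envelope sketch in C. Every number in it (node count, memory per node, filesystem bandwidth, per-node MTBF) is an illustrative assumption, not a measurement of any real machine:

    /* Back-of-envelope: when does checkpoint time approach system MTBF?
     * All numbers below are illustrative assumptions. */
    #include <stdio.h>

    int main(void) {
        double nodes           = 100000.0; /* assumed machine size          */
        double mem_per_node_gb = 64.0;     /* assumed memory per node (GB)  */
        double io_bw_gbs       = 1000.0;   /* assumed aggregate filesystem
                                              bandwidth (GB/s)              */
        double node_mtbf_hours = 5.0e5;    /* assumed per-node MTBF (hours) */

        double ckpt_hours  = nodes * mem_per_node_gb / io_bw_gbs / 3600.0;
        double system_mtbf = node_mtbf_hours / nodes; /* failures treated as
                                                         roughly independent */

        printf("checkpoint time : %.1f hours\n", ckpt_hours);
        printf("system MTBF     : %.1f hours\n", system_mtbf);
        /* Once ckpt_hours is no longer small compared to system_mtbf,
         * naive global checkpoint/restart stops paying off. */
        return 0;
    }

With these made-up numbers a full checkpoint already takes about 1.8 hours against a 5-hour system MTBF; scale the node count up and the two curves cross.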
The PhD student mentioned seems to be just one of many working on this subject. Different research teams have different approaches, some trying to hide as much as possible in the runtimes and hardware, others experimenting with more robust algorithms in the applications themselves.
Tradeoffs on HPC clusters are not the same as on "business" type computers (high throughput vs. high availability). For tightly coupled computations, a lot of data is flying around the network, and the networks on these machines are fast, high-throughput, and especially low-latency, with specific hardware and, in quite a few cases, partial offload of message management using DMA writes or other techniques, which can make checkpointing message queues a tad complex. The new MPI-3 standard has only minimal support for error handling, simply because this field is not mature/consensual enough, in the sense that not everyone agrees on the best solutions yet, and these may depend on the problem being solved and its expected running time. Avoiding too much additional application complexity and major performance hits is not trivial.
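For the curious, here is roughly what that minimal error handling amounts to in practice: a small C/MPI sketch that switches the default abort-on-the-spot handler to MPI_ERRORS_RETURN and checks a return code. What to do after the error (shrink the communicator, restart ranks, ...) is left to the application or to proposed extensions, not to MPI-3 itself:

    /* Minimal error handling available through standard MPI: ask for error
     * codes instead of an automatic abort, then check them. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* By default MPI aborts the whole job on any error; ask it to
         * return error codes to the caller instead. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int payload = rank;
        /* Deliberately invalid destination rank to provoke an error. */
        int err = MPI_Send(&payload, 1, MPI_INT, -42, 0, MPI_COMM_WORLD);
        if (err != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(err, msg, &len);
            fprintf(stderr, "rank %d: send failed: %s\n", rank, msg);
            /* Recovery strategy is entirely up to the application. */
        }

        MPI_Finalize();
        return 0;
    }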
In addition, up to now, medium to large computations have been batch jobs that may run a few hours to a few days on several thousand cores, and re-running a computation that failed due to a hardware failure once in a while (usually much less than one run in 10) is much more cost-effective than duplicating everything, in addition to being faster. These applications do not usually require real-time results, and even for many time-constrained applications (such as tomorrow's weather), running almost twice as many simulations (or running them twice as fast in the case of ideal speedup) might often be more effective. This logic only breaks down with very large computations.
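The arithmetic behind that tradeoff is simple enough to show. With a failure probability p per run, the expected number of attempts until one succeeds is 1/(1-p), which stays well below the 2x cost of duplicating everything as long as p is modest; the probabilities below are made up for illustration:

    /* Expected cost of "just re-run it" vs. the 2x cost of duplication. */
    #include <stdio.h>

    int main(void) {
        double failure_probs[] = {0.01, 0.05, 0.10, 0.30, 0.50};
        int n = sizeof(failure_probs) / sizeof(failure_probs[0]);

        for (int i = 0; i < n; i++) {
            double p = failure_probs[i];
            double expected_runs = 1.0 / (1.0 - p); /* geometric distribution */
            printf("p = %.2f -> expected cost %.2fx (duplication: 2.00x)\n",
                   p, expected_runs);
        }
        return 0;
    }

At p = 0.10 the expected cost is about 1.11x, so re-running clearly wins; duplication only starts to look attractive once failures become the norm rather than the exception.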
Also, regarding similarity to the cloud: when a node goes down on most clusters, the computation running on it will usually crash, but when a new computation is started by a decent resource manager/queuing system, that node will not be used, so failed hardware does not need to be replaced immediately (that issue, at least, is solved). So most jobs running on 1/100th or 1/10th of a 100,000 to 1 million node cluster will not be too much affected by random failures, but a job running on the full machine will be much more fragile.
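A small sketch of why, assuming independent node failures with a made-up per-node failure probability over one job's runtime: a job spanning k nodes survives with probability (1-p)^k, which degrades quickly as k approaches the full machine.

    /* Survival probability of a job vs. the fraction of the machine it uses.
     * The per-node failure probability is an illustrative assumption. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double p = 1.0e-6;        /* assumed per-node failure probability
                                     over one job's runtime               */
        long machine = 1000000;   /* assumed total nodes in the machine   */
        long divisors[] = {100, 10, 1};

        for (int i = 0; i < 3; i++) {
            long k = machine / divisors[i];
            double survive = pow(1.0 - p, (double)k);
            printf("job on %8ld nodes: survival probability %.4f\n",
                   k, survive);
        }
        return 0;
    }

With these numbers the 1/100th-of-the-machine job survives ~99% of the time, while the full-machine job only survives about a third of the time.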
So, as machines get bigger and these failures become statistically more significant, an increasing portion of the HPC hardware and software effort needs to be devoted to them, but the urgency is not quite the same as if your bank had forgotten to use high-availability features for its customers' account data, and the tradeoffs reflect that.