You can't checkpoint jobs at this scale. It will take longer to checkpoint a job than to compute an answer. This is further compounded when the job takes several months to run. A 1000-node cluster is tiny compared to the scale they're talking about.
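A rough way to put numbers on this is Young's first-order approximation for the checkpoint interval, sqrt(2 * C * M), where C is the time to write one checkpoint and M is the system's mean time between failures. The sketch below is only an illustration: the per-node MTBF, the node counts, and the checkpoint cost are made-up assumptions, failures are assumed independent, and the approximation stops being meaningful once C approaches M.

```python
import math

# Hypothetical figures, chosen only for illustration.
NODE_MTBF_S = 5 * 365 * 24 * 3600   # assume each node fails about once in 5 years
CHECKPOINT_COST_S = 30 * 60          # assume one full checkpoint takes 30 minutes


def system_mtbf(node_mtbf_s, nodes):
    """MTBF of the whole machine, assuming independent node failures."""
    return node_mtbf_s / nodes


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation of the compute time between checkpoints.

    Only reasonable when checkpoint_cost_s is much smaller than mtbf_s.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpointing_is_viable(checkpoint_cost_s, mtbf_s):
    """A checkpoint that takes longer than the MTBF will rarely complete."""
    return checkpoint_cost_s < mtbf_s


if __name__ == "__main__":
    for nodes in (1_000, 100_000):
        mtbf = system_mtbf(NODE_MTBF_S, nodes)
        viable = checkpointing_is_viable(CHECKPOINT_COST_S, mtbf)
        print(f"{nodes} nodes: system MTBF {mtbf:.0f} s, viable={viable}")
        if viable:
            interval = young_interval(CHECKPOINT_COST_S, mtbf)
            print(f"  checkpoint roughly every {interval:.0f} s")
```

With these assumed figures, the 1000-node machine can still checkpoint comfortably, while at 100,000 nodes the system MTBF drops below the time needed to write a single checkpoint.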
Google is having the same problems that this article describes -- they haven't fixed it either.
If your problem domain can always be broken down into map-reduce, you can easily solve it with a Hadoop-like environment to get fault tolerance. If your application falls outside of map-reduce (as the applications this article refers to do), you need to start duplicating state (very expensive on systems of this scale) to recover from failures.
How do you give the work to another node when the failed node contains the only copy of its state (like in an MPI job)? Duplicating the state on multiple nodes is way too expensive.
Checkpoints will probably stick around for quite some time, but the model will need to change. Rather than serializing everything all the way down to a parallel filesystem, the data could potentially be checkpointed to a burst buffer (assuming a per-node design) or a nearby node (experimental SCR design). Of course, it's correct that even this won't scale to larger systems.
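The two-level idea above can be sketched roughly as follows. This is a toy illustration, not SCR's API or any real burst buffer interface: the class name `TwoLevelCheckpointer`, the `flush_every` parameter, and the use of plain directories to stand in for node-local storage and the parallel filesystem are all assumptions made for the example.

```python
import os
import pickle
import shutil
import tempfile


class TwoLevelCheckpointer:
    """Hypothetical helper: frequent cheap local checkpoints, and only
    every `flush_every`-th one copied to slower shared storage."""

    def __init__(self, local_dir, shared_dir, flush_every=10):
        self.local_dir = local_dir      # stands in for a burst buffer
        self.shared_dir = shared_dir    # stands in for the parallel filesystem
        self.flush_every = flush_every
        self._saved = 0

    def save(self, step, state):
        path = os.path.join(self.local_dir, f"ckpt_{step}.pkl")
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, path)           # atomic rename: no half-written files
        self._saved += 1
        if self._saved % self.flush_every == 0:
            shutil.copy2(path, os.path.join(self.shared_dir, os.path.basename(path)))

    @staticmethod
    def _steps(directory):
        found = {}
        for name in os.listdir(directory):
            if name.startswith("ckpt_") and name.endswith(".pkl"):
                found[int(name[5:-4])] = os.path.join(directory, name)
        return found

    def load_latest(self):
        """Return (step, state) for the newest checkpoint on either level."""
        candidates = {**self._steps(self.shared_dir), **self._steps(self.local_dir)}
        if not candidates:
            return None
        step = max(candidates)
        with open(candidates[step], "rb") as f:
            return step, pickle.load(f)


def demo():
    with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as shared:
        ckpt = TwoLevelCheckpointer(local, shared, flush_every=2)
        for step in range(1, 6):
            ckpt.save(step, {"step": step})
        newest = ckpt.load_latest()
        # Simulate losing the node: its local checkpoints disappear.
        for name in os.listdir(local):
            os.remove(os.path.join(local, name))
        after_loss = ckpt.load_latest()
        return newest, after_loss
```

In `demo()`, a restart on the same node resumes from the newest local checkpoint, while a restart after losing the node falls back to the last copy that reached shared storage and redoes the work since then. That lost work is the price of not writing every checkpoint all the way down.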
I think we'll probably have problems with getting data out to the nodes of the cluster before we start running into problems with checkpointing. The typical NFS home directory isn't going to scale. We'll need to switch over to something like udsl projections or another IO forwarding layer in the near future.
ANL's Mira is going to be roughly half as fast as LLNL's Sequoia.
If the developer is schizophrenic, then it might be n^2, otherwise it's probably (n(n-1))/2.
So, this economist and a computer scientist are sitting at a bar.... and these 5 girls walk in....
Intel CPUs are not defective, they just act that way. -- Henry Spencer