Suppose that your job is computing along, using about three-quarters of the memory on node 146 (and on every other node in the job), when that node's fourth DIMM dies and the whole node hangs or powers off. Where should the data that was in node 146's memory come from so that it can be migrated to node 187?
There are two usual options: 1) a checkpoint file written to disk earlier, or 2) a hot spare node that held the same data but wasn't fully participating in the computation.
Option 2 basically means throwing away half of your compute resources in case there is a failure. Essentially no computation being done today for scientific research is valuable enough to warrant that. Some version of Option 1 (periodic checkpointing and restarting) is always more cost effective. These systems, in the US at least, are generally oversubscribed by the science community by a factor of 2 to 10. Taking half the machine away to seamlessly ride out an occasional job death just isn't worth the lost opportunity to utilize the resources more fully.
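To see why periodic checkpointing wins, it helps to put numbers on it. Young's first-order approximation gives an optimal checkpoint interval of sqrt(2 × C × MTBF), where C is the time to write one checkpoint and MTBF is the mean time between failures. A sketch (the ten-minute checkpoint cost and 24-hour MTBF are made-up illustrative numbers, not measurements from any particular system):

```python
import math

def young_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal time between
    checkpoints: tau = sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Hypothetical numbers: a 10-minute checkpoint write on a system
# whose nodes fail, on average, once per 24 hours of job time.
interval = young_checkpoint_interval(600.0, 24 * 3600.0)
print(f"checkpoint every {interval / 3600.0:.1f} hours")  # → checkpoint every 2.8 hours
```

Even with a fairly expensive checkpoint, the job spends only a small fraction of its time on fault tolerance, versus the 50% a hot spare costs continuously.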
Option 1 implies taking some time away from your job to do the checkpointing. The vast majority of the time, OS-level automated checkpointing would be overkill as well. The author of the code knows better than the OS when checkpointing is cheap and when it's a bad idea. For example, you might checkpoint at a phase of the calculation when the data volume required to restore the state is at a minimum, even if that means redoing part of a later calculation after a restart. Redoing that work is generally cheap compared with writing a checkpoint of a large data volume.
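The pattern above, checkpointing only at phase boundaries where the restart state is small, can be sketched in a few lines. This is a minimal illustration, not any production scheme: the file name, state layout, and phase length are all invented, and the write is made atomic (temp file plus rename) so a crash mid-checkpoint never corrupts the previous one:

```python
import os
import pickle
import tempfile

def write_checkpoint(path: str, state: dict) -> None:
    """Atomically replace the checkpoint: write to a temp file in the
    same directory, fsync, then rename over the old checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic on POSIX filesystems

def load_checkpoint(path: str, default: dict) -> dict:
    """Resume from the last checkpoint if one exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return default

# Hypothetical driver loop: only the step counter and a small result
# vector are saved; per-step scratch data is never checkpointed.
ckpt = os.path.join(tempfile.gettempdir(), "sim.ckpt")
state = load_checkpoint(ckpt, {"step": 0, "result": [0.0] * 4})
while state["step"] < 10:
    # ... expensive per-step work using large scratch arrays ...
    state["result"] = [x + 1.0 for x in state["result"]]
    state["step"] += 1
    if state["step"] % 5 == 0:  # end of a phase: restart state is minimal
        write_checkpoint(ckpt, state)
```

If the job dies between phase boundaries, the restart replays at most one phase of work, which is exactly the "cheap to redo" tradeoff described above.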
In addition, OS-level checkpointing is a hard problem. For example, if there are messages in flight on the network, do you try to log them so they can be replayed on restart, or do you only checkpoint when the network is quiet? If the network is never quiet across all the nodes in the job, do you inject needless synchronization, potentially ruining the job's parallel efficiency, just to find a place to take your automated checkpoint? If you decide to log instead, where do you write the log data so that it survives the failure of any node, and what does that cost?
If these were just a bunch of VMs running a LAMP stack, this wouldn't be hard; that's basically a solved problem. Migrating tasks in HPC jobs is a genuinely hard problem with real tradeoffs to weigh.