That's a great strategy; the only problem with it is that, per TFA, the indication they received was noticing the rover was not behaving the way it was supposed to. They had to look around to figure out why.
Yes, because normal operations were suspended.
You don't have to assume anything. You KNOW the block is invalid. A bad block should not cripple the computer so that it can't do anything else. There is no indication from TFA there were any other faults.
No, you don't. You don't know if the block is bad, if the data bus is suffering an intermittent fault that happened to occur while that block was being read, if it's the BIST or ECC mechanisms that are faulty, or if it's a software error corrupting the data. Going from "we got a fault on reading this block" to "that block and only that block is affected, let's get on with it" with no consideration is a great way to lose a rover.
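To make the two positions concrete, here is a minimal sketch of the fault-response policies being argued about. The names and structure are purely illustrative, not from any actual flight software:

```python
# Illustrative sketch of two fault-response policies (hypothetical names,
# not real flight software).

def respond_to_read_fault(policy: str) -> str:
    """Decide what to do when an ECC/read fault is reported for a flash block.

    'optimistic'   -> assume the fault is confined to that one block.
    'conservative' -> assume nothing: the fault could be the block, the data
                      bus, the ECC machinery, or software, so cease normal
                      activity and let ground control diagnose it.
    """
    if policy == "optimistic":
        # Mark the block bad and carry on as if nothing else could be wrong.
        return "mark block bad, continue normal operations"
    if policy == "conservative":
        # Stop normal activity and preserve state for diagnosis.
        return "enter safe mode, wait for ground control"
    raise ValueError(f"unknown policy: {policy}")


print(respond_to_read_fault("conservative"))
```

The disagreement in this thread is precisely which branch a multi-billion-dollar rover should take when it cannot yet tell which component actually faulted.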
All I do is write software, and I refuse to follow this shitty advice. Every error should be checked and handled. Besides, the fricking hardware does all the heavy lifting for us.
Ah, so you only allow your software to be run on hardware with ECC-corrected RAM and ECC caches and ECC data buses... seems weird to call this a "PC app" when it's excluding most of the PC market. Unless you're doing the checking yourself, you're only catching a subset of errors.
Now, assuming it's one that you can see, how do you "handle" that error? Do you just not read from that file again but continue on under the assumption that it was a singular event of no further consequence? Or do you have the software notify you so you can identify what the actual source of the read error was?
The former, I presume, which is fine for the situation of a PC app. Just let the hardware do the heavy lifting, don't worry about what it can't find, and don't worry about what the errors it signals actually mean. If it's a more serious error that ends up causing rampant corruption, it's not your problem! Contact your OEM; your help desk can tell them.
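For the sake of argument, here is what "check and handle every error" typically amounts to in a PC app. This is a hypothetical sketch, and the point it illustrates is that the caught exception tells you *that* the read failed, not *why* the underlying hardware did:

```python
import errno


def read_with_checking(path: str, retries: int = 2):
    """Read a file, dutifully 'checking and handling' every error.

    In practice that means logging and retrying: the OSError says the read
    failed (EIO, ENOENT, ...), but says nothing about whether the medium,
    the bus, the controller, or the filesystem is at fault.
    """
    for attempt in range(retries + 1):
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError as e:
            code = errno.errorcode.get(e.errno, str(e))
            print(f"read failed (attempt {attempt}): {code}")
    # Past this point we have no idea what actually broke -- we just give up.
    return None
</antml>```

Which is exactly the "singular event of no further consequence" posture: fine for a desktop app, not obviously fine for a rover.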
We're talking about an I/O failure to flash, not crashing an OS or broken hardware.
So, you don't see how an I/O failure could cause an OS to crash, say if it's reading a code page, and you're still assuming that an ECC error on reading a block of flash can only mean that it's the flash cell itself, and only that cell, that could be affected? You're willing to bet 2.5 billion dollars and the rest of the mission on these assumptions?
Okay.
From TFA, this is exactly what they did do... they waited to notice the rover not doing what it was supposed to be doing.
Which is a consequence of the rover doing what it should be doing -- ceasing normal activities on detecting a fault. We're talking about what the rover should be doing -- assume it's a one-off fault and continue normal operation minus that one block as you would have it, or try to prevent anything else bad from happening by going into safe mode (!= sleep mode, btw) and waiting for ground control to figure out the problem.
"We have probably several days, maybe a week of activities to get everything back and reconfigured."
Yes, and? Are you quibbling over "couple" vs "several" -- is this attempted pedantry, or are you actually implying that a couple days is fine, but waiting a week to finish the historic first-time analysis of an interior rock sample on Mars to make sure it has the maximum chance of success instead of bricking the rover crosses the line?