> Wow. How impressive. Oh wait, Linux has had EDAC since 2006. But you keep paying your millions to Oracle. I'm sure its worth it.
Actually, this might be worth an illustration. It was a long time back, so I'm sure I've forgotten a few details, but I'll give you the big picture.
Around 2000, Sun Microsystems had a problem with the L2 cache on their 400 MHz CPUs. It seems that IBM misrepresented the error rate on the chips, and Sun was seeing bit error rates much higher than specified. Because the error rate was supposed to be incredibly low, they had engineered the L2 cache with parity protection. That's enough to detect an error and raise a UE (uncorrectable error) event, but not enough to correct it. So I know that what Linux got with EDAC in 2006 was in Solaris well before 2000.
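To make the "detect but not correct" point concrete, here's a toy sketch of a single even-parity bit over a data word. This is plain Python for illustration, not how any real cache controller is built; the function names and the 8-bit word are my own made-up example.

```python
def parity_bit(word: int) -> int:
    """Even parity: 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """True if the word still matches the parity bit stored alongside it."""
    return parity_bit(word) == stored_parity


original = 0b10110010
stored = parity_bit(original)          # parity saved when the line was written

flipped_bit3 = original ^ (1 << 3)     # a single-bit error
flipped_bit6 = original ^ (1 << 6)     # a different single-bit error
double_flip = original ^ 0b00000011    # two bits flipped at once

# A single flipped bit is detected...
print(parity_ok(flipped_bit3, stored))
# ...but both single-bit errors look identical to the checker (just "bad"),
# so there is no way to tell which bit to flip back.
print(parity_ok(flipped_bit6, stored))
# An even number of flips slips through entirely.
print(parity_ok(double_flip, stored))
```

All the checker can ever say is "this word is bad", which is why a parity hit ends up reported as an uncorrectable error.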
After that problem, Sun Microsystems did two things. First, they mirrored the L2 cache. Second, they completely beefed up their handler for CEs and UEs (correctable and uncorrectable errors) along the memory/cache/bus/CPU path to bring it up to enterprise-level error handling. Say you get an uncorrectable error in your CPU's L2 cache. Do you panic? I looked over the EDAC documentation and I could be wrong (please correct me if so), but it looks like that would result in a panic. Or you could just have it log that the UE event happened but take no action.
What would Solaris do differently? It would find the page of virtual memory that had the corresponding error. Has it been modified? If not, just discard the page, log the event, and go on. There is a whole set of rules it goes through to determine the best way to keep the system running when it hits an uncorrectable error. Let's say that the page was modified and that there was an uncorrectable error in the L2 cache. We panic now, right? No. Solaris checks who owns the page of memory. If it belongs to a user process, then that process is simply killed (and the event logged) and the OS continues running. Only if it is a dirty page of active kernel memory do we have a panic.
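Here's a rough sketch of the three cases I just described, written as a little decision function. To be clear, this is not Solaris source and it leaves out the rest of that "whole set of rules"; the `Page`, `Owner`, and `Action` names are hypothetical, invented just to show the ordering of the checks.

```python
from dataclasses import dataclass
from enum import Enum


class Owner(Enum):
    USER = "user process"
    KERNEL = "kernel"


class Action(Enum):
    DISCARD_PAGE = "discard page, log event, keep running"
    KILL_PROCESS = "kill owning process, log event, keep running"
    PANIC = "panic"


@dataclass
class Page:
    dirty: bool    # has the page been modified since it was loaded?
    owner: Owner   # who the page belongs to


def handle_uncorrectable_error(page: Page) -> Action:
    """Pick the least drastic response to a UE on the given page."""
    if not page.dirty:
        # Clean page: a good copy exists elsewhere, so just drop this one.
        return Action.DISCARD_PAGE
    if page.owner is Owner.USER:
        # Dirty user page: only that process has lost data.
        return Action.KILL_PROCESS
    # Dirty kernel page: nothing trustworthy left to fall back on.
    return Action.PANIC


for page in (
    Page(dirty=False, owner=Owner.KERNEL),
    Page(dirty=True, owner=Owner.USER),
    Page(dirty=True, owner=Owner.KERNEL),
):
    print(page, "->", handle_uncorrectable_error(page).value)
```

The point of the ordering is that a panic is the last resort, not the first: it is reached only when the page is both dirty and kernel-owned.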
That isn't just recovering from a soft error. That's recovering from a hard error. So, as this story illustrates, there are quite a number of things happening behind the scenes in an enterprise-level OS. You picked a good example with Linux EDAC.