Working in a lab environment, I test DDR3 memory on a daily basis, and we run into a lot of failures, from JEDEC violations to blatant byte/word/dword corruption and even single-bit failures. Single-bit failures are by far the worst to debug. Kudos to this guy for tracking it down. I am going to add these debug procedures to my arsenal!
When I encounter a failure, logging all information is of course the first thing I do, but reproducibility is key! With reproducibility, like the article says, you're able to throw as many experiments at it as you can think up. We run memtest86+, among other tools, to gather data on whether the failure reproduces with other tests. When we believe it is a DRAM part failure, we use logic analyzers and oscilloscopes to determine and prove that the failure is on a specific component.
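To give a rough idea of the kind of thing worth logging per failure, here is a minimal Python sketch that compares the expected pattern against what was read back and reports which bits and byte lanes differ. The function name, the returned fields, and the 64-bit data width are all my own assumptions for illustration, not taken from memtest86+ or any real tool, and mapping a byte lane to a physical DRAM component depends on the DIMM layout, so that step is left out.

```python
def classify_miscompare(expected: int, observed: int, width: int = 64) -> dict:
    """Summarize how an observed data word differs from the expected pattern.

    Hypothetical helper: 'width' is the assumed data bus width in bits.
    """
    mask = (1 << width) - 1
    diff = (expected ^ observed) & mask  # set bits mark the flipped positions
    bits = [i for i in range(width) if (diff >> i) & 1]
    lanes = sorted({b // 8 for b in bits})  # byte lanes holding a flipped bit

    if not bits:
        kind = "match"
    elif len(bits) == 1:
        kind = "single-bit"
    elif len(lanes) == 1:
        kind = "single-byte"
    else:
        kind = "multi-byte"

    return {"kind": kind, "bits": bits, "byte_lanes": lanes, "xor": hex(diff)}


if __name__ == "__main__":
    # One flipped bit in an all-ones pattern.
    print(classify_miscompare(0xFFFF_FFFF_FFFF_FFFF, 0xFFFF_FFFF_FFFF_FFEF))
```

If the same bit or byte lane keeps showing up across runs, that is at least a hint about where to point the logic analyzer first; it is not proof of a bad part on its own.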
Sometimes the failures we encounter are DIMM vendor issues; sometimes they are our own, induced by bad in-house memory test software or hardware.
UNIX was not designed to stop you from doing stupid things, because that would also stop you from doing clever things. -- Doug Gwyn