Flawed AMD Chip Can Lead To Data Corruption
Brandonski writes "Apparently AMD allowed some flawed chips to slip through their detection grid. The problem affects only a small number of chips and only single core 2.6 and 2.8 GHz CPUs." From the article: "It is believed that the glitch is triggered when the affected chip's FPU is made to loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations. The loop has to run over and over again for long enough to cause localized heating which together with high ambient temperatures could combine to cause the result of the operation to be recorded incorrectly, leading to data corruption."
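As a rough illustration of the kind of loop the article describes, here is a minimal Python sketch: repeated memory fetches feeding multiply and add operations, with no checks on intermediate results inside the loop. The function name `multiply_add_loop` and the input values are made up for illustration. Interpreted Python will not heat an FPU the way hand-tuned native code would, so this shows only the shape of the workload, not a reproduction of the glitch.

```python
def multiply_add_loop(a, b, passes):
    """Fetch operands, multiply, and accumulate, with no checks inside the loop."""
    total = 0.0
    for _ in range(passes):
        for x, y in zip(a, b):
            # fetch x and y, multiply, add; nothing inspects `total` here
            total += x * y
    return total

# Hypothetical inputs, chosen so every product and partial sum is exact in binary.
a = [0.5, 1.5, 2.0, 4.0]
b = [2.0, 2.0, 0.25, 0.125]

result = multiply_add_loop(a, b, passes=1000)
print(result)  # each pass adds exactly 5.0, so a healthy CPU prints 5000.0
```

The only check happens after the loop finishes, which is the pattern the article says is needed to trigger the fault.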
Corruption (Score:3, Informative)
An old problem (Score:5, Informative)
I'm too young to remember the details (I think it goes back to the early eighties at least), but perhaps some of the elder gods that lurk around here might be able to supply more details.
Re:Deja Vu: Intel Processor's Bug in 1994 (Score:4, Informative)
Fourth paragraph in TFA.
Re:An old problem (Score:2, Informative)
You're thinking about magnetic cores.
Whenever you reverse a core's magnetic field, its temperature rises a little. Keep reversing the field fast enough and for a long enough period of time, and the core (or maybe the wires running through it?) will melt, permanently damaging that bit.
Re:Kernel fix? (Score:5, Informative)
Not likely. This is valid user code that is being executed. On other CPUs, the same code wouldn't cause a problem. Something like the F00F bug is fixable in the kernel by mucking with the exception handler. This is pure user-land code.
Re:An old problem (Score:5, Informative)
I'm pretty sure that gave rise to the joke "Halt and Catch Fire"...
I always figured that if you were to burn out a register from overuse, it would be the carry bit.
Anyway, as to the story at hand, it sounds like this would only ever occur a) to only 3000 processors total - MAYBE, and b) would only ever happen under such an artificially contrived laboratory stress-test/benchmark situation. Any CPU running in a real system would a) have to do other things like service hardware interrupts, and b) wouldn't do something useless like perform a looping calculation without checking to see if it was done periodically. It really sounds like this is a big non-issue in reality.
Re:nice! (Score:3, Informative)
source [net-security.org]
Fearmongering? No, you misunderstand ... (Score:3, Informative)
I think you are misunderstanding the nature of the problem. This is not data corruption as in buffer overflow, this is data corruption as in the calculation comes up with an incorrect answer. For some people that is not acceptable.
Prime95 as a detection tool? (Score:2, Informative)
Could Prime95 be used to identify those AMD chips?
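One way to think about this kind of detection: run the same deterministic floating-point workload several times and flag any run whose result differs from the first. Below is a minimal Python sketch of that idea. The names `workload` and `find_mismatches` are hypothetical, and this is not Prime95's actual code, just an illustration of a repeat-and-compare self-check. Whether such a check would actually expose the affected chips is an open question, since the article says the fault also needs sustained heating and high ambient temperature.

```python
import math

def workload(n):
    """Deterministic floating-point workload: a long chain of multiplies and adds."""
    acc = 0.0
    for i in range(1, n + 1):
        acc += math.sqrt(i) * 1.000001
    return acc

def find_mismatches(compute, runs):
    """Call compute() `runs` times; return indices of runs that differ from run 0."""
    reference = compute()
    return [i for i in range(1, runs) if compute() != reference]

mismatches = find_mismatches(lambda: workload(100_000), runs=5)
print(mismatches)  # an empty list means every run agreed with the first
```

On healthy hardware the same inputs should give bit-identical results every time, so any non-empty list would point at a hardware (or cooling) problem rather than at the code.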
Re:Sounds familiar (Score:5, Informative)
Some of the tendering spreadsheets I've seen for a few companies I've worked for have had quite a lot of calculation going on in them - change a few cells that others depend on that have others depending on them, etc.... do that all day, it adds up quick.
You only need 1 of those operations in that instance to screw up and you could be down a few million dollars, if it's not picked up.
Even forgetting that it's just the moral thing to do... Risk vs replacement cost = no brainer. If only 3000 CPUs are affected at say $300 each for AMD to sell retail (I'm sure their cost is FAR less), they'd be mad not to just do it (maybe even offer a free speed bump) and reap the positive PR.
All it needs is for ONE company to blame a budget blowout on them and it's well and truly paid for...
smash.
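For scale, the replacement-cost estimate in the comment above works out as follows. Both inputs are the commenter's own guesses (3000 affected chips, $300 retail each), not figures from AMD.

```python
affected_cpus = 3000       # upper estimate quoted in the thread
retail_price_usd = 300     # the commenter's assumed retail price per chip

replacement_cost_usd = affected_cpus * retail_price_usd
print(f"${replacement_cost_usd:,}")  # prints $900,000
```

That is under a million dollars at retail, and less at AMD's actual cost, which is the comparison the comment draws against a single multi-million-dollar spreadsheet error.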
Re:An old problem (Score:5, Informative)
I'm not saying AMD should be let off the hook completely, but the bug isn't a big problem, they are offering free replacements, and they publicized it. With the FDIV bug, the flaw was bigger (though still hardly catastrophic), Intel refused (at first) to offer replacements, and they sat on it. The two scenarios are nowhere near similar. Maybe AMD just has more character than Intel, or maybe they were watching in 94/95 when the FDIV bug happened and they've actually learned from Intel's mistakes. Regardless, this whole story is more of a heads-up to concerned buyers than a criticism of AMD.
This will not happen to you (Score:5, Informative)
So under normal conditions on normal PC hardware, this simply won't happen.
Re:Quality Control at AMD must be good. (Score:3, Informative)
The actions needed to cause the problem to arise are so extreme that they'd never happen in the field.
This kind of thing is standard practice. If you want to stress test a piece of hardware, you write specialised test code which will consume the maximum amount of power possible, not a real world program. You have to be sure that nobody will be able to write software which will drive the processor harder than your tests have. It's good that AMD found this fault, and even better that they owned up to it, but it's not remarkable.
Re:Corruption (Score:3, Informative)
Similar to vehicle recalls. Most people would never be affected by many of the things vehicles are recalled for, but that doesn't mean that known *serious* defects are simply let go.
I have actually studied this bug, and it is only observed when the FPU code is iterated MILLIONS of times without ever executing another instruction (only a tight FPU loop). In addition, the environmental temperature must also be high (think tropical). AMD has stated (1) that this problem has never been identified in actual production code (only a single benchmark in these environmental conditions); and (2) that they are identifying and replacing (for free) all affected CPUs. It is estimated that 2-3,000 chips have this particular defect (out of the millions shipped). Further, AMD has added an additional validation step to identify processors affected by this glitch, which will cause them to be pushed down to a lower speed grade (i.e. 2.8GHz affected CPUs will be sold as 2.6GHz parts), where this problem does not manifest itself.
I for one am happy that this story broke 2 days ago, and 1 day ago AMD had already figured out which CPU batches could potentially be affected, and is offering free replacements (without the customer complaining first). Now today it's on Slashdot. At least this isn't the F00F bug which Intel didn't tell anyone about until the public discovered it and raised hell. Further, the likelihood of data corruption caused by this glitch, even in FPU-heavy code, is extremely low, as there would be other non-FPU instructions executed in between in nearly every case (except extreme benchmarking, i.e. the reason AMD discovered the problem in the first place).
Surprising. AMD uses my `cpuburn` (Score:5, Informative)
Of course, I expect AMD's production testing dept to have far better code, since they will devote more job hours to it and know proprietary chip details. Still, different parts of AMD have emailed me several times to thank me because they found the pgms useful. Great.
But these guys know what they're doing. Heat transfer from the hot multipliers has to be carefully analysed [3D finite element heat transfer analysis]. I suspect something far more mundane, like someone reducing die or slug thickness, or a mfg problem with the die/slug gap or thermal goop.