Flawed AMD Chip Can Lead To Data Corruption
Brandonski writes "Apparently AMD allowed some flawed chips to slip through their detection grid. The problem affects only a small number of chips and only single core 2.6 and 2.8 GHz CPUs." From the article: "It is believed that the glitch is triggered when the affected chip's FPU is made to loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations. The loop has to run over and over again for long enough to cause localized heating which together with high ambient temperatures could combine to cause the result of the operation to be recorded incorrectly, leading to data corruption."
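A rough sketch of the kind of loop the article describes, for anyone who wants to picture it. This is an illustration only: `multiply_accumulate` and the sample arrays are made up here, not taken from the article. Because the Python interpreter itself branches constantly, this would not reproduce the fault; a real trigger would presumably be a tight compiled or hand-written assembly loop.

```python
# Hypothetical illustration of the loop shape described in the article:
# each step fetches operands from memory, multiplies, and adds, and nothing
# inside the loop ever inspects the running result.

def multiply_accumulate(xs, ys, passes):
    """Repeat a fetch/multiply/add sweep over two arrays `passes` times."""
    acc = 0.0
    for _ in range(passes):
        for i in range(len(xs)):
            acc += xs[i] * ys[i]  # fetch, multiply, add; no check on acc
    return acc

# Sample values chosen so every product and partial sum is exactly
# representable in binary floating point.
xs = [0.5, 1.5, 2.0, 4.0]
ys = [2.0, 4.0, 0.25, 0.125]
total = multiply_accumulate(xs, ys, passes=1000)
print(total)
```

The result is only looked at after the loop finishes, which is the "no condition checks" property the article singles out.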
I dub thee (Score:2, Funny)
Re:I dub thee (Score:2)
Re:I dub thee (Score:2)
I Have an AMD CPU (Score:5, Funny)
Re:I Have an AMD CPU (Score:5, Funny)
Interesting Perl script.
Re:I Have an AMD CPU (Score:5, Funny)
Interesting Perl script.
It's also rule number 26 in sendmail.cf.
Re:I Have an AMD CPU (Score:2)
Re:I Have an AMD CPU (Score:2)
It's an operating system.
With drivers.
And GUI.
And emacs.
Re:I Have an AMD CPU (Score:2)
Re:I Have an AMD CPU (Score:2, Funny)
Yeah, but unfortunately I hired an MCSE and training him is turning out to be tougher than I thought.
An old problem (Score:5, Informative)
I'm too young to remember the details (I think it goes back to the early eighties at least), but perhaps some of the elder gods that lurk around here might be able to supply more details.
Re:An old problem (Score:3, Funny)
Re:An old problem (Score:2)
Re:An old problem (Score:5, Funny)
Re:An old problem (Score:2, Informative)
You're thinking about magnetic cores.
Whenever you reverse a core's magnetic field, its temperature rises a little. Keep reversing the field fast enough and for a long enough period of time, and the core (or maybe the wires running through it?) will melt, permanently damaging that bit.
Re:An old problem (Score:3, Funny)
Shit, so for once reversing the polarity does more harm than good?
Re:An old problem (Score:5, Informative)
I'm pretty sure that gave rise to the joke "Halt and Catch Fire"...
I always figured that if you were to burn out a register from overuse, it would be the carry bit
Anyway, as to the story at hand, it sounds like this would only ever occur a) to only 3000 processors total - MAYBE, and b) would only ever happen under such an artificially contrived laboratory stress-test/benchmark situation. Any CPU running in a real system would a) have to do other things like service hardware interrupts, and b) wouldn't do something useless like perform a looping calculation without checking to see if it was done periodically. It really sounds like this is a big non-issue in reality.
Re:An old problem (Score:3, Insightful)
Re:An old problem (Score:5, Insightful)
No sufficiently complex system can ever be completely bug-free.
and its corollary:
It is impossible to completely test a sufficiently complex system in every possible way to be certain that it's bug-free.
In that vein, someone once said "Foolproof is impossible because fools are so ingenious", and "As soon as an idiot-proof system is devised, they go and invent a better idiot!"
Professor Turing once contemplated this... (Score:2)
I have formed my own personal postulate/theory/law... and its corollary: It is impossible to completely test a sufficiently complex system in every possible way to be certain that it's bug-free.
Along those lines - many years ago, Professor Turing set out to find a test for [among other things] the possible presence of an infinite loop within a computer program.
Sadly, though, he didn't get very far with that line of inquiry... [google.com]
Re:An old problem (Score:2)
This is why we have automated reasoning systems, theorem provers, etc. They allow us to reduce the set of all possible states down to a set of orthogonal equivalence classes, only one example from each of which needs to actually be tested.
Now, of course, at some point non-ideal physical characteristics can
Re:An old problem (Score:3, Interesting)
When the system can be used, it helps clear out logic bugs very efficiently.
That being said, today's
Re:Woah! (Score:2)
Re:An old problem (Score:2)
A defect that is known to give incorrect calculations is a serious issue that should be rectified via a microcode update, or by exchanging the CPU for free (if microcode can't fix it).
Intel got raked over the coals for the FDIV problem, and so should AMD unless they do the right thing and offer an exchange/free fix so that users get the functional CPU they intended to purchase.
s
Re:An old problem (Score:5, Informative)
I'm not saying AMD should be let off the hook completely, but the bug isn't a big problem, they are offering free replacements, and they publicized it. The FDIV bug was bigger (though still hardly catastrophic), Intel refused (at first) to offer replacements, and they sat on it. The two scenarios are nowhere near similar. Maybe AMD just has more character than Intel, or maybe they were watching in 94/95 when the FDIV bug happened and they've actually learned from Intel's mistakes. Regardless, this whole story is more of a heads-up to concerned buyers than a criticism of AMD.
Re:An old problem (Score:4, Insightful)
But you're right, since Intel blundered so badly on their handling of the FDIV bug, everyone else learned from it.
Re:An old problem (Score:2)
It is a textbook case in many MBA programs how _WELL_ Intel handled this.
They recalled EVERY CPU at their own expense of millions of dollars. Managing the recall, the disposal, the resupply, the competition, AND the PR nightmare was handled so well that this incident has become canon for MBA candidates.
Re:An old problem (Score:3, Interesting)
http://www.trnicely.net/pentbug/pentbug.html/ [trnicely.net]
Pay close attention to questions 9, 10, and 11. It explains what REALLY happened, and the author's opinions on the matter, which to my memory are quite accurate. How do I know? At the time I owned a Gateway Pent
Re:An old problem (Score:2)
Re:An old problem (Score:2)
My lord, this reinforces just about every stereotype of b-school students I developed while living in Schwab.
First Intel refused to replace the chips, except fo
Re:An old problem (Score:3, Interesting)
The PR nightmare was *caused* specifically by the way Intel handled the discovery. They thought that they had the righ
Re:An old problem.. now usedto fight the overlords (Score:2)
Re:An old problem (Score:2)
Fearmongering? (Score:2, Interesting)
Re:Fearmongering? (Score:2, Insightful)
The intel fanboys have been too noisy lately! AMD has more than 50% of the market since this year already!
Fearmongering? No, you misunderstand ... (Score:3, Informative)
I think you are misunderstanding the nature of the problem. This is not data corruption as in buffer overflow, this is data corruption as in the calculation comes up with an incorrect answer. For some people that is not acceptable.
Re:Fearmongering? No, you misunderstand ... (Score:2, Funny)
"allowed...to slip through...detection grid..." (Score:2)
Obligatory (Score:2)
I've been saying that for ages, check your results, but naah! Them young'uns and their series of memory-fetch, multiplication and addition operations.
Uh oh.. (Score:5, Funny)
10 PRINT "HELLO WORLD"
20 GOTO 10
AMD is always innovating.
Re:Uh oh.. (Score:2)
This loop won't crash. Memory fetch, addition and multiplication, remember? So you'd need something like this:
10 I = 10
20 K = I
30 K = K + 2
40 K = K * 2
50 GOTO 20
Re:Uh oh.. (Score:3, Interesting)
10 I = 10.1
20 K = I
21 K2 = I
22 K3 = I
23 K4 = I
30 K = K + 2.1
40 K = K * 2.1
50 K2 = K2 + 2.1
60 K2 = K2 * 2.1
70 K3 = K3 + 2.1
80 K3 = K3 * 2.1
90 K4 = K4 + 2.1
100 K4 = K4 * 2.1
110 GOTO 20
Deja Vu: Intel Processor's Bug in 1994 (Score:3, Insightful)
AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?
Re:Deja Vu: Intel Processor's Bug in 1994 (Score:4, Informative)
Fourth paragraph in TFA.
Re:Deja Vu: Intel Processor's Bug in 1994 (Score:2)
AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?
AMD have probably learn
CALL ESP (Score:4, Interesting)
According to Intel errata documents, this is a bug in the Pentium Pro that has been kept for several generations. The Pentium and below, except the 8086 and 8088, worked correctly with this instruction.
If you want to differentiate Intel and AMD in your program and don't want to use CPUID, you can set up a test with CALL ESP.
Melissa
Re:Deja Vu: Intel Processor's Bug in 1994 (Score:2, Insightful)
If AMD does "the right thing" it won't be because of a moral high road. It's because Intel already stepped on a similar PR landmine long ago. Learning from your rival's huge mistakes is not worth high praise. It's just common sense.
nice! (Score:3, Interesting)
Judging from the posting date, I *really* need to be updating my sources more often.
20060419: p7 FreeBSD-SA-06:14.fpu
Correct a local information leakage bug affecting AMD FPUs.
(could be an unrelated correction, I guess, it doesn't provide much more information in
Re:nice! (Score:2)
FXSAVE and FXRSTOR [freebsd.org]
Re:nice! (Score:3, Informative)
source [net-security.org]
Re:nice! (Score:2)
It's like you're overclocking when you're not (Score:5, Insightful)
99% of reported Pentium bugs were program flaws (Score:2)
Re:99% of reported Pentium bugs were program flaw (Score:2)
I don't consider this as bad as the Sept. 1998 batch of K6-2 450 MHz CPUs that could not run certain 32-bit code AT ALL (neit
Could be worse (Score:2, Funny)
Prime95 as a detection tool? (Score:2, Informative)
Could Prime95 be used to identify those AMD chips?
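The general idea behind using a stress tool as a detector can be sketched as follows: run the same deterministic floating-point workload several times and compare the results bit for bit, since a healthy CPU should give identical answers every time. This is a minimal sketch under stated assumptions, not a description of how Prime95 works internally; `fp_workload`, `bits` and `count_mismatches` are names invented for this example, and an interpreted loop like this is unlikely to stress the FPU the way the article describes.

```python
import struct

def fp_workload(n):
    """A deterministic chain of floating-point multiplies and adds."""
    acc = 0.0
    x = 1.0
    for _ in range(n):
        x = x * 1.0000001 + 1e-9
        acc += x * 0.5
    return acc

def bits(value):
    """Exact 64-bit pattern of a float, so the comparison is bit-for-bit."""
    return struct.pack("<d", value)

def count_mismatches(runs, n):
    """Repeat the workload and count runs that differ from the first result."""
    reference = bits(fp_workload(n))
    return sum(1 for _ in range(runs) if bits(fp_workload(n)) != reference)

mismatches = count_mismatches(runs=5, n=10_000)
print("mismatching runs:", mismatches)
```

Any nonzero count would mean the same inputs produced different outputs, which points at hardware (or cooling) rather than the program.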
Re:Prime95 as a detection tool? (Score:2)
Re:Prime95 as a detection tool? (Score:2)
Quality Control at AMD must be good. (Score:5, Interesting)
The actions needed to cause the problem to arise are so extreme that they'd never happen in the field. i.e. Loop through tight floating-point only instructions without any comparisons for maybe hours before the error occurs.
This would *NEVER* happen in the field. Firstly, in any modern OS the process would have been pre-empted long before any problem could occur (causing other instructions to run and hence stopping the overheating). Secondly, no real-world program would ever do this sort of thing as there would always be a comparison in the loop within the timeframe.
In the real world this is only a theoretical problem, especially as it affects only about 3000 processors in total (it has been quoted). This is why AMD gave it such a low priority. We should just forget about it and move on.
Re:Quality Control at AMD must be good. (Score:3, Informative)
The actions needed to cause the problem to arise are so extreme that they'd never happen in the field.
This kind of thing is standard practice. If you want to stress test a piece of hardware, you write specialised test code which will consume the maximum amount of power possible, not a real world program. You have to be sure that nobody will be able to write software which will drive
Re:Quality Control at AMD must be good. (Score:2)
I am not so sure. The TFA said millions of instructions and the chips are capable of billions. So with HZ=100 there is room enough for 28e6 instructions to be ex
This will not happen to you (Score:5, Informative)
So under normal conditions on normal PC hardware, this simply won't happen.
NOOOOOOOO! (Score:2)
Tom
Re:NOOOOOOOO! (Score:2)
phew..
Tom
Coincident Advertising (Score:2)
Humanly reproducible :) (Score:2)
I thought it was the graphic card at first, but the type of crash I've been experiencing and the difficulty to reproduce it (I generally have to play AT with a pro gamer and go on about a 7 game win streak to get game conditions right for the crash) and it does have to be warm in my room...
WC3TFT can reproducibly cre
Re:Humanly reproducible :) (Score:2, Insightful)
Surprising. AMD uses my `cpuburn` (Score:5, Informative)
Of course, I expect AMD's production testing dept to have far better code, since they will devote more job hours to it and know proprietary chip details. Still, different parts of AMD have emailed me several times to thank me because they found the pgms useful. Great.
But these guys know what they're doing. Heat transfer from the hot multipliers has to be carefully analysed [3D finite element heat transfer analysis]. I suspect something far more mundane, like someone reducing die or slug thickness, or a mfg problem with the die/slug gap or thermal goop.
Re:Surprising. AMD uses my `cpuburn` (Score:2)
Care to go into a bit more detail for us noobs?
This flaw seems damned serious to me... (Score:2, Insightful)
You are apt to be doing this extensively when processing audio or video streams.
Interesting! (Score:2, Interesting)
Re:What? (Score:3, Interesting)
Re:What? (Score:3, Interesting)
So getting heat local to the FPU isn't too surprising. There are various
Corruption (Score:3, Informative)
Re:Corruption (Score:5, Insightful)
Data corruption in integrated circuits can come from several different sources. Cosmic rays are likely to alter memory values, especially so in DRAM cells. Typically, only ICs for space applications are actually radiation hardened. Much less likely, transistor device noise can corrupt data. Transistor device noise is usually more an issue in RF circuits. Finally, not all manufacturing defects can be found during manufacturing test, since most test sequences don't even achieve 100% fault coverage under currently used fault models, and this does not even consider how closely the models represent the actual circuit failure modes.
Really, for most people this floating point data corruption is probably a non-issue. It is even more unlikely that errors in floating point data lead to exploits. It is more likely that some bits of your DRAM memory will get corrupted. On my system with ECC RAM that is a few years old, logs show that I get about 1 or 2 (correctable) errors per day...
Re:Corruption (Score:2)
However, when a CPU is KNOWN DEFECTIVE in a repeatable, data-corrupting way, it is the vendor's responsibility to replace/fix it.
Similar to vehicle recalls. Most people would never be affected by many of the things vehicles are recalled for, but that doesn't mean that known *serious* defects are simply let go.
smash.
Re:Corruption (Score:3, Informative)
Similar to vehicle recalls. Most people would never be affected by many of the things vehicles are recalled for, but that doesn't mean that known *serious* defects are simply let go.
I have actually studied this bug, and it is only observed when the fpu code is iterated in the MIL
Mainframe ;) (Score:2)
Re:Corruption (Score:2)
Re:What? (Score:2)
Since a normal temperature of functioning is written in the specifications of the hip
Re:What? (Score:2)
I can't find any written instructions on my hip. Which is another piece of circumstantial evidence of my theory that my parents bought me from a chinese clone factory.
Re:Kernel fix? (Score:5, Insightful)
I am curious how a virus could possibly exploit this. It would have to a) hog the resources so that it ran nearly exclusively, which would mean the virus already had control, and b) somehow cause a floating point error to result in a privilege error (privilege and security routines rarely use floating point numbers). Also, why would a kernel patch be released for this? It would hurt performance for the rest of us; customers with defective chips should simply return and replace them.
Overclocking ... (Score:2)
And then end users will overclock these CPU
Re:Kernel fix? (Score:2)
Turn its clock down, right, yep done that.
So now I'll never be affected by this obscure glitch that is almost totally unreproducible outside of synthetic testing, oh thanks very much.
can I have the check now please?
*check arrives*
*cashes check*
*clocks cpus back up*
Re:Kernel fix? (Score:2)
Re:Kernel fix? (Score:2)
Re:Kernel fix? (Score:5, Informative)
Not likely. This is valid user code that is being executed. On other CPUs, the same code wouldn't cause a problem. Something like the F00F bug is fixable in the kernel by mucking with exception handler. This is pure user-land code.
Mod parent up please (Score:2)
There's no way the kernel can do anything about it, from the description of the problem.
And, contrary to AMD's attempts to downplay this issue, there are two immediate areas that I can think of which are affected. The first is certain scientific calculations (even worse, those involving Beowulf clusters). The
Re:Mod parent up please (Score:2)
For the time-intensive calculations, people actually spend a lot of time optimizing the code. First they put it into assembly, and then they pore over every single assembly statement. You see, a tiny efficiency tweak saving X number of cycles does indeed add up if you're running it for days or weeks at a time.
This is why I mentioned that mods to gcc might be a solution, but I doubt any chan
Re:Kernel fix? (Score:2)
Linux is immune (Score:2)
Crank up the clock rate even more if you are worried and you just have to run your CPU in tropical temperatures. You could also ping flood the machine, causing plenty of network interrupts.
Re:Quality Assurance? (Score:2)
Also, an AMD chip is only rated up to around 75°C anyway, from memory.
Re:Quality Assurance? (Score:2)
However, very few people cared because very few people use itanium chips, and those who do are used to them not performing as advertised.
Re:Quality Assurance? (Score:2)
Hard to say. This is a design margin thing, depending upon worst case conditions plus localized heating, and localized heating (AFAIK) isn't generally modeled. Writing test vectors to find all logic errors is difficult, unpleasant, and labor intensive work. Even if software identifies the worst case path, it won't account for localized heating.
I'd guess there are other problems out there li
Re:Sounds familiar (Score:2)
There, that wasn't so hard to think of?
smash.
Re:Sounds familiar (Score:5, Informative)
Some of the tendering spreadsheets i've seen for a few companies i've worked for have had quite a lot of calculation going on in them - change a few cells that others depend on that have others depending on them, etc.... do that all day, it adds up quick.
You only need 1 of those operations in that instance to screw up and you could be down a few million dollars, if it's not picked up.
Even forgetting that it's just the moral thing to do...Risk vs replacement cost = no brainer. If only 3000 cpus are affected at say $300 each for amd to sell retail (i'm sure their cost is FAR less), they'd be mad not to just do it (maybe even offer a free speed bump) and reap the positive PR.
All it needs is for ONE company to blame a budget blowout on them and it's well and truly paid for...
smash.
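The risk-versus-replacement-cost arithmetic in the comment above can be written out as a tiny sketch. The 3000 CPUs and $300 retail price are the figures quoted in the comment; the $2,000,000 loss is a hypothetical stand-in for "a few million dollars" and is purely an assumption.

```python
# Figures from the comment: about 3000 affected CPUs at roughly $300 retail.
affected_cpus = 3000
retail_price_usd = 300

# Upper bound on replacing every affected chip at full retail price.
replacement_cost_usd = affected_cpus * retail_price_usd

# Hypothetical: one customer blaming a spreadsheet error for a budget blowout.
hypothetical_single_loss_usd = 2_000_000

print(replacement_cost_usd)
print(replacement_cost_usd < hypothetical_single_loss_usd)
```

Even priced at full retail, replacing every affected chip comes to less than the single hypothetical loss assumed here.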
Re:Sounds familiar (Score:2)
You only need 1 of those operations in that instance to screw up and you could be down a few million dollars, if it's not picked up.
Even forgetting that it's just the moral thing to do...Risk vs
Re:Haha (Score:2, Insightful)
Re:Phew, I'm not affected (Score:2)
Obnoxious? (Score:2)