Flawed AMD Chip Can Lead To Data Corruption


Posted by Zonk on Saturday April 29, 2006 @01:39AM from the crunchy-mistakes dept.

Brandonski writes "Apparently AMD allowed some flawed chips to slip through their detection grid. The problem affects only a small number of chips and only single core 2.6 and 2.8 GHz CPUs." From the article: "It is believed that the glitch is triggered when the affected chip's FPU is made to loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations. The loop has to run over and over again for long enough to cause localized heating which together with high ambient temperatures could combine to cause the result of the operation to be recorded incorrectly, leading to data corruption."
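For a concrete picture of the loop the article describes, here is a minimal Python sketch of a fetch-multiply-add kernel (a dot product) with no checks on intermediate results. The function name and sample data are illustrative assumptions, not anything from AMD or the article. Python will not stress an FPU the way hand-written assembly would, so this only shows the shape of the computation.

```python
def fetch_multiply_add(a, b, passes=1):
    """Repeatedly compute a dot product: fetch, multiply, add, nothing else."""
    total = 0.0
    for _ in range(passes):
        total = 0.0
        for x, y in zip(a, b):
            # Fetch two operands, multiply them, add to the accumulator.
            # The result is never compared or tested inside the loop.
            total += x * y
    return total


a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 0.25, 2.0, 1.0]
result = fetch_multiply_add(a, b, passes=1000)
print(result)
```

Per the article, the reported failure additionally needs the loop to run long enough, at high ambient temperature, on one of the affected parts.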

This discussion has been archived. No new comments can be posted.


I dub thee (Score:2, Funny)

by Anonymous Coward writes:

Fetch Div, son of Eff Div, Heir to Count Zero, and Lord of a new generation of digital serfs, soon to be labeled as having "emotional problems."
- Re:I dub thee (Score:2)
  
  by Overly Critical Guy ( 663429 ) writes:
  
  Since this is an AMD problem, expect lots of justifications and defensiveness compared to if this was an Intel problem.
  - Re:I dub thee (Score:2)
    
    by Anarke_Incarnate ( 733529 ) writes:
    
    If you read the article, it sounded like some sort of apologetic approach anyhow: "If star A is in Uranus while star B is in Neptune, and you look at a pr0n site, then there is a chance that the girl's boobies will appear larger than they are" Basically it states that if the thermals are bad, and you cause a hotspot on the CPU, that there MAY be corruption in the data processed. That's a bit of a stretch. It's not good, but it's not major.
I Have an AMD CPU (Score:5, Funny)

by ozmanjusri ( 601766 ) writes: <aussie_bob@hoMOSCOWtmail.com minus city> on Saturday April 29, 2006 @01:50AM (#15226384) Journal

Hey, I have an AMD 2.8Ghz. Maybe I should stop refresðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{

- Re:I Have an AMD CPU (Score:5, Funny)
  
  by zaguar ( 881743 ) writes: on Saturday April 29, 2006 @02:52AM (#15226564)
  
  ðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{
  Interesting Perl script.
  
  - Re:I Have an AMD CPU (Score:5, Funny)
    
    by Minwee ( 522556 ) writes: <dcr@neverwhen.org> on Saturday April 29, 2006 @09:33AM (#15227488) Homepage
    
    ðN9'óI]öR9ù¥Î6ýPoe}+èa(ê{
    Interesting Perl script.
    It's also rule number 26 in sendmail.cf.
    
    - Re:I Have an AMD CPU (Score:2)
      
      by Dolda2000 ( 759023 ) writes:
      
      It's also "Hello World" in APL.
  - Re:I Have an AMD CPU (Score:2)
    
    by chris_eineke ( 634570 ) writes:
    
    Yeah, I tested it out...
    
    It's an operating system.
    
    With drivers.
    
    And GUI.
    
    And emacs. ;)
- Re:I Have an AMD CPU (Score:2)
  
  by arkhan_jg ( 618674 ) writes:
  
  You have a 2.8Ghz Opteron in your desktop PC at home? Don't you have someone to press the refresh button for you?
  - Re:I Have an AMD CPU (Score:2, Funny)
    
    by ozmanjusri ( 601766 ) writes:
    
    Don't you have someone to press the refresh button for you?
    Yeah, but unfortunately I hired a MCSE and it's turning out to be tougher than I thought training him.
An old problem (Score:5, Informative)

by AndrewStephens ( 815287 ) writes: on Saturday April 29, 2006 @01:53AM (#15226392) Homepage

Something similar used to happen on very old processors, back in the day. If certain instructions were executed in tight loops, the chips would experience localised heating and eventually malfunction (sometimes with permanent damage).
I'm too young to remember the details (I think it goes back to the early eighties at least), but perhaps some of the elder gods that lurk around here might be able to supply more details.

- Re:An old problem (Score:3, Funny)
  
  by Alien Being ( 18488 ) writes:
  
  I used to burn out a lot of abacus beads.
  - Re:An old problem (Score:2)
    
    by dhall ( 1252 ) writes:
    
    Beads? Using our sexagesimal system, we didn't have the true concept of zero as a number!
- Re:An old problem (Score:5, Funny)
  
  by Jerf ( 17166 ) writes: on Saturday April 29, 2006 @02:20AM (#15226480) Journal
  
  Do not meddle in the affairs of the Elder Gods [wikipedia.org], for you are crunchy, and good with ketchup.
  
- Re:An old problem (Score:2, Informative)
  
  by Dadoo ( 899435 ) writes:
  
  If certain instructions were executed in tight loops, the chips would experience localised heating and eventually malfunction (sometimes with permanent damage).
  
  You're thinking about magnetic cores.
  
  Whenever you reverse a core's magnetic field, its temperature rises a little. Keep reversing the field fast enough and for a long enough period of time, and the core (or maybe the wires running through it?) will melt, permanently damaging that bit.
  - Re:An old problem (Score:3, Funny)
    
    by myowntrueself ( 607117 ) writes:
    
    "Keep reversing the field fast enough and for a long enough period of time, and the core (or maybe the wires running through it?) will melt"
    
    Shit, so for once reversing the polarity does more harm than good?
- Re:An old problem (Score:5, Informative)
  
  by Mister Transistor ( 259842 ) writes: on Saturday April 29, 2006 @02:41AM (#15226538) Journal
  
  You may be referring to the early MC6800 8-bit processors. The first ones had a major problem in that the internal registers were dynamic RAM style memory, and synchronized to the internal state clock. If you halted the processor for an extended period of time, the refresh clock to them ceased and the registers got hot, drew too much current and burned up!
  
  I'm pretty sure that gave rise to the joke "Halt and Catch Fire"...
  
  I always figured that if you were to burn out a register from overuse, it would be the carry bit ;)
  
  Anyway, as to the story at hand, it sounds like this would only ever occur a) to only 3000 processors total - MAYBE, and b) would only ever happen under such an artificially contrived laboratory stress-test/benchmark situation. Any CPU running in a real system would a) have to do other things like service hardware interrupts, and b) wouldn't do something useless like perform a looping calculation without checking to see if it was done periodically. It really sounds like this is a big non-issue in reality.
  
  - Re:An old problem (Score:3, Insightful)
    
    by AndrewStephens ( 815287 ) writes:
    
    I agree with your comments on the current story. In reality, all modern processors have flaws that only occur in extremely unlikely circumstances. This one is not any different.
    - Re:An old problem (Score:5, Insightful)
      
      by Mister Transistor ( 259842 ) writes: on Saturday April 29, 2006 @03:05AM (#15226596) Journal
      
      I'll go you one better - I have formed my own personal postulate/theory/law that:
      
      No sufficiently complex system can ever be completely bug-free.
      
      and its corollary:
      
      It is impossible to completely test a sufficiently complex system in every possible way to be certain that it's bug-free.
      
      In that vein, someone once said "Foolproof is impossible because fools are so ingenious", and "As soon as an idiot-proof system is devised, they go and invent a better idiot!"
      
      - Professor Turing once contemplated this... (Score:2)
        
        by mosel-saar-ruwer ( 732341 ) writes:
        
        I have formed my own personal postulate/theory/law... and its corollary: It is impossible to completely test a sufficiently complex system in every possible way to be certain that it's bug-free.
        Along those lines - many years ago, Professor Turing set out to find a test for [among other things] the possible presence of an infinite loop within a computer program.
        Sadly, though, he didn't get very far with that line of inquiry... [google.com]
      - Re:An old problem (Score:2)
        
        by QuantumFTL ( 197300 ) * writes:
        
        No sufficiently complex system can ever be completely bug-free.
        
        and its corollary:
        
        It is impossible to completely test a sufficiently complex system in every possible way to be certain that it's bug-free.
        
        This is why we have automated reasoning systems, theorem provers, etc. They allow us to reduce the set of all possible states down to a set of orthogonal equivalence classes, only one example from each of which needs to actually be tested.
        
        Now, of course, at some point non-ideal physical characteristics can
      - Re:An old problem (Score:3, Interesting)
        
        by Soul-Burn666 ( 574119 ) writes:
        
        Actually hardware IS different. As complex as hardware is, it is much less complex than software and has much simpler logic to check. This allows for systems for "formal verification" which happen to work exceedingly well for hardware. For example IBM's "RuleBase" is a system that uses temporal logic to verify a certain piece of "code" (which will later be compiled to hardware) against a set of logical rules.
        When the system can be used, it helps clear out logic bugs very efficiently.
        
        That being said, today's
      - Re:Woah! (Score:2)
        
        by Mister Transistor ( 259842 ) writes:
        
        I get a calculator! Actually, I get 54. Why do you ask?
    - Re:An old problem (Score:2)
      
      by smash ( 1351 ) writes:
      
      Now, not having a go at the parent post, but if intel was to release a statement like this, the /. community would be having a field day.
      A defect that is known to give incorrect calculations is a serious issue that should be rectified via microcode update or exchange CPU for free (if microcode can't fix it).
      Intel got raked over the coals for the FDIV problem, and so should AMD unless they do the right thing and offer an exchange/free fix so that users get the functional CPU they intended to purchase.
      s
      - Re:An old problem (Score:5, Informative)
        
        by something_wicked_thi ( 918168 ) writes: on Saturday April 29, 2006 @04:14AM (#15226767)
        
        RTFA. They are offering a free replacement. However, the FDIV bug was overblown. For most people, it didn't matter (few people were using software that required division precise enough to be affected). This bug is even less worrisome. Its effect is, at the moment, completely unobserved in the wild using real world applications. The FDIV bug was apparent to anyone with a calculator.
        
        I'm not saying AMD should be let off the hook completely, but the bug isn't a big problem, they are offering free replacements, and they publicized it. The FDIV bug was bigger (though still hardly catastrophic), refused (at first) to offer replacements, and they sat on it. The two scenarios are nowhere near similar. Maybe AMD just has more character than Intel, or maybe they were watching in 94/95 when the FDIV bug happened and they've actually learned from Intel's mistakes. Regardless, this whole story is more of a heads-up to concerned buyers than a criticism of AMD.
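The "calculator" check usually cited for the FDIV bug can be sketched in a few lines of Python. The operands 4195835 and 3145727 are the widely circulated test pair; the function name and the tolerance are assumptions made for this illustration. On a correct FPU the residual is essentially zero, whereas flawed Pentiums were reported to return 256.

```python
def fdiv_residual(x=4195835.0, y=3145727.0):
    """Return x - (x / y) * y, which should be (nearly) zero on a correct FPU."""
    return x - (x / y) * y


residual = fdiv_residual()
# Allow for ordinary rounding; a flawed divider gives a far larger error.
fpu_looks_ok = abs(residual) < 1e-6
print(residual, fpu_looks_ok)
```

Nothing this simple would expose the AMD problem discussed here, since that one reportedly depends on sustained load and temperature rather than on particular operands.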
        
        
        Re:An old problem (Score:4, Insightful)
        
        by AWhistler ( 597388 ) writes: on Saturday April 29, 2006 @08:50AM (#15227345)
        
        There is a HUGE difference between this AMD problem and the FDIV bug. The FDIV bug, once found, was one of those "1,2, BANG" bugs (do step 1, then step 2 and BANG, the bug is there). With this AMD bug, you have to do the same operation many times before you see the problem, and then the problem is random (only if it overheats enough). Another possible solution to this is to use better heat sinks. This AMD problem isn't 1,2,BANG. Bugs that are of this nature are orders of magnitude harder to find and characterize.
        
        But you're right, since Intel blundered so badly on their handling of the FDIV bug, everyone else learned from it.
        
        
        Re:An old problem (Score:2)
        
        by Sebastopol ( 189276 ) writes:
        
        What exactly do you mean "blundered badly"?
        
        It is a textbook case in many MBA programs how _WELL_ Intel handled this.
        
        They recalled EVERY CPU at their own expense of millions of dollars. Managing the recall, the disposal, the resupply, the competition, AND the PR nightmare was handled so well that this incident has become canon for MBA candidates.
        
        Re:An old problem (Score:3, Interesting)
        
        by AWhistler ( 597388 ) writes:
        
        Then you REALLY need to get new MBA textbooks, since the one you have been reading is too politically correct to be useful. Here is a link from the guy who discovered the bug which includes a timeline (I can't believe his FAQ is still online!)...
        
        http://www.trnicely.net/pentbug/pentbug.html/ [trnicely.net]
        
        Pay close attention to questions 9, 10, and 11. It explains what REALLY happened, and the author's opinions on the matter, which to my memory are quite accurate. How do I know? At the time I owned a Gateway Pent
        
        Re:An old problem (Score:2)
        
        by AWhistler ( 597388 ) writes:
        
        This is what made the FDIV bug so insidious. If you only bought one brand of computer in your lab (or in your home), which even today is a very common practice, you would always get the same results, never thinking anything is wrong. If you published your results (financial, scientific) or were responsible for something critical (space launches, health care), you could lose millions of dollars, any scientific credibility, expensive equipment, or even cause people's deaths. Even if the possibility of this
        
        Re:An old problem (Score:2)
        
        by jizmonkey ( 594430 ) writes:
        
        What exactly do you mean "blundered badly"? It is a textbook case in many MBA programs how _WELL_ Intel handled this. They recalled EVERY CPU at their own expense of millions of dollars. Managing the recall, the disposal, the resupply, the competition, AND the PR nightmare was handled so well that this incident has become canon for MBA candidates.
        My lord, this reinforces just about every stereotype of b-school students I developed while living in Schwab.
        First Intel refused to replace the chips, except fo
        
        Re:An old problem (Score:3, Interesting)
        
        by LurkerXXX ( 667952 ) writes:
        
        What the hell kind of crappy MBA program did you go to? Intel did *NOT* handle it well. I had one of those CPUs. Intel tried to tell me (a scientific researcher) that my computations didn't need that level of FPU accuracy, and that they wouldn't replace it. It was only after we, the users, screamed bloody murder and brought lawsuits that they decided to back down and replace them all.
        The PR nightmare was *caused* specifically by the way Intel handled the discovery. They thought that they had the righ
  - Re:An old problem.. now usedto fight the overlords (Score:2)
    
    by __aaijsn7246 ( 86192 ) writes:
    
    These flaws only occur in unlikely circumstances, but they will be useful tools when fighting our new computer overlords.
- - Re:An old problem (Score:2)
    
    by Tim C ( 15259 ) writes:
    
    Yes and no; I've never heard any such rumour about the ZX80 or the BBC Micro. I heard that POKEing a certain memory location on the Commodore Pet would cause it to burst into flames, but never saw it happen so can't confirm it. A quick google turned up this page [old-computers.com], which has details about the Pet rumour and the BBC Micro one, but nothing about the ZX80.
Fearmongering? (Score:2, Interesting)

by zaguar ( 881743 ) writes:

Is it reasonable to be afraid of this? To exploit this, in a way to allow running of arbitrary code, you would need a buffer overflow, which is what this AMD weakness is purported to allow. However, how many are affected? Only a few of the AMD chips, and AMD has only what, 30% of the market. So to code an exploit, you would be writing for a very limited audience, to the point where it is futile. Why not just exploit the latest createTextRange or WMF exploit in IE/Windows? Much more money in that.
- Re:Fearmongering? (Score:2, Insightful)
  
  by Saven Marek ( 739395 ) writes:
  
  > Only a few of the AMD chips, and AMD has only what, 30% of the market.
  
  The intel fanboys have been too noisy lately! AMD has more than 50% of the market since this year already!
- Fearmongering? No, you misunderstand ... (Score:3, Informative)
  
  by AHumbleOpinion ( 546848 ) writes:
  
  Is it reasonable to be afraid of this? To exploit this, in a way to allow running of arbitrary code, you would need a buffer overflow
  
  I think you are misunderstanding the nature of the problem. This is not data corruption as in buffer overflow, this is data corruption as in the calculation comes up with an incorrect answer. For some people that is not acceptable.
  - Re:Fearmongering? No, you misunderstand ... (Score:2, Funny)
    
    by larry bagina ( 561269 ) writes:
    
    yeah, 1.0 + 1.0 = 3.0, for sufficiently large values of 1.0
- "allowed...to slip through...detection grid..." (Score:2)
  
  by CarpetShark ( 865376 ) writes:
  
  I don't know about fearmongering, but it's certainly going to some trouble to make a false accusation. Either it slipped through their "detection grid", or it was detected and ignored. It can't have been both.
Obligatory (Score:2)

by suv4x4 ( 956391 ) writes:

loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations

I've been saying that for ages, check your results, but naah! Them young'uns and their series of memory-fetch, multiplication and addition operations.
Uh oh.. (Score:5, Funny)

by BigZaphod ( 12942 ) writes: on Saturday April 29, 2006 @02:11AM (#15226450) Homepage

Wow! AMD has invented a way to crash an infinite loop! Awesome! Intel? I bet their solution will take twice as long to crash this loop:

10 PRINT "HELLO WORLD"
20 GOTO 10

AMD is always innovating.

- Re:Uh oh.. (Score:2)
  
  by ultranova ( 717540 ) writes:
  
  Wow! AMD has invented a way to crash an infinite loop! Awesome! Intel? I bet their solution will take twice as long to crash this loop:
  
  10 PRINT "HELLO WORLD"
  20 GOTO 10
  
  This loop won't crash. Memory fetch, addition and multiplication, remember ? So you'd need something like this:
  
  10 I = 10
  20 K = I
  30 K = K + 2
  40 K = K * 2
  50 GOTO 20
  - Re:Uh oh.. (Score:3, Interesting)
    
    by JollyFinn ( 267972 ) writes:
    
    No, that won't crash; it's a FLOATING POINT memory fetch, addition and multiplication loop! Then we need to unroll the loop enough to hide the floating-point unit latency, so that it stays active.
    
    10 I = 10.1
    20 K = I
    21 K2 = I
    22 K3 = I
    23 K4 = I
    30 K = K + 2.1
    40 K = K * 2.1
    50 K2 = K2 + 2.1
    60 K2 = K2 * 2.1
    70 K3 = K3 + 2.1
    80 K3 = K3 * 2.1
    90 K4 = K4 + 2.1
    100 K4 = K4 * 2.1
    110 GOTO 20
Deja Vu: Intel Processor's Bug in 1994 (Score:3, Insightful)

by reporter ( 666905 ) writes: on Saturday April 29, 2006 @02:23AM (#15226491) Homepage

In 1994, Intel's Pentium processor suffered from a division error [willamette.edu]. Intel handled the problem by initially requiring customers to "prove" that the error caused a serious impact on the customers' lives before Intel would agree to replace the defective chips. Later, after much pressure and lost credibility, Intel agreed to replace all the defective chips without requiring the customer to "prove" his case.
AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?

- Re:Deja Vu: Intel Processor's Bug in 1994 (Score:4, Informative)
  
  by Anonymous Coward writes: on Saturday April 29, 2006 @02:28AM (#15226505)
  
  "The company is also working with OEMs to identify affected parts and contact customers who could be affected - if they are, they will be offered free replacements."
  
  fourth paragraph in TFA.
  
- Re:Deja Vu: Intel Processor's Bug in 1994 (Score:2)
  
  by cowbutt ( 21077 ) writes:
  
  Later, after much pressure and lost credibility, Intel agreed to replace all the defective chips without requiring the customer to "prove" his case.
  AMD has a unique opportunity to do the right thing: offering to replace all the defective chips. If AMD does the right thing, then it will only help AMD in its litigation against Intel and in various attempts to increase marketshare. After all, would you not prefer to buy from a reputable company instead of a dishonest, shifty company?
  AMD have probably learn
  - CALL ESP (Score:4, Interesting)
    
    by Myria ( 562655 ) writes: on Saturday April 29, 2006 @03:42AM (#15226680)
    
    Probably the easiest erratum to come by is the instruction "CALL ESP" (or "CALL RSP"). On AMD CPUs, "CALL ESP" will jump to the address in ESP, *then* push the return address. However, on Intel CPUs, it will push the return address first, then jump to the value it just pushed. This is, of course, disastrous if you try to use it.
    
    According to Intel errata documents, this is a bug in the Pentium Pro that has been kept for several generations. The Pentium and below, except the 8086 and 8088, worked correctly with this instruction.
    
    If you want to differentiate Intel and AMD in your program and don't want to use CPUID, you can set up a test with CALL ESP.
    
    Melissa
    
- Re:Deja Vu: Intel Processor's Bug in 1994 (Score:2, Insightful)
  
  by mojotooth ( 53330 ) writes:
  
  Jesus. The things that people attribute to AMD's "moral superiority" here on Slashdot... It's astounding.
  
  If AMD does "the right thing" it won't be because of a moral high road. It's because Intel already stepped on a similar PR landmine long ago. Learning from your rival's huge mistakes is not worth high praise. It's just common sense.
nice! (Score:3, Interesting)

by B3ryllium ( 571199 ) writes: on Saturday April 29, 2006 @02:28AM (#15226506) Homepage

Wow, that was fast. FreeBSD already has a patch for this.

Judging from the posting date, I *really* need to be updating my sources more often. :)

20060419: p7 FreeBSD-SA-06:14.fpu
Correct a local information leakage bug affecting AMD FPUs.

(could be an unrelated correction, I guess, it doesn't provide much more information in /usr/src/UPDATING)

- Re:nice! (Score:2)
  
  by B3ryllium ( 571199 ) writes:
  
  Ah, I believe I may be incorrect - the longer description sounds like an unrelated FPU bug:
  
  FXSAVE and FXRSTOR [freebsd.org]
- Re:nice! (Score:3, Informative)
  
  by larry bagina ( 561269 ) writes:
  
  it is an unrelated correction:
  ...As a result of this discrepancy remaining unnoticed until now, the FreeBSD kernel does not restore the contents of the FOP, FIP, and FDP registers between context switches.
  source [net-security.org]
  - Re:nice! (Score:2)
    
    by B3ryllium ( 571199 ) writes:
    
    Caught that already. Sorry for the disinformation. :)
It's like you're overclocking when you're not (Score:5, Insightful)

by IvyMike ( 178408 ) writes: on Saturday April 29, 2006 @02:50AM (#15226554)

This is different than the Intel bug; that was a logic flaw, where the chip computed a floating point quantity using an incorrect algorithm. This is an implementation error. In fact, the article mentions that they're going to re-spec the parts and they'll be fine. So if you've got a 2.8Ghz part, and you run this loop at 2.8Ghz (within the old spec), it's like you're "overclocking" (because you're actually outside of AMD's new spec). My guess is that if you over-bought your heatsink and got something better than the stock OEM cooling solution, you would be fine even if you ran this loop all day. Yay, arctic silver!

- 99% of reported Pentium bugs were program flaws (Score:2)
  
  by expro ( 597113 ) writes:
  
  Floating point is hopelessly problematic for the average programmer and too many average programmers wrote the programs from Excel to MS Calculator and by any number of other vendors, all of which had "Pentium bugs" reported, that didn't need particular Intel hardware to be reproduced.
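A small illustration of the kind of floating-point surprise that needs no faulty hardware, sketched in Python (the variable names are illustrative): binary floating point cannot represent 0.1 exactly, so a naive equality test fails on any correct IEEE 754 machine.

```python
import math

total = 0.1 + 0.2

# Naive equality fails because 0.1 and 0.2 have no exact binary representation.
naive_equal = (total == 0.3)

# Comparing with a tolerance is the usual remedy.
close_enough = math.isclose(total, 0.3, rel_tol=1e-9)

print(naive_equal, close_enough)
```

A result like this was easy to misreport as a "Pentium bug" even though it is ordinary rounding behaviour.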
  - Re:99% of reported Pentium bugs were program flaw (Score:2)
    
    by Reziac ( 43301 ) * writes:
    
    I have a P90 (one of those that was remarked down to P75 for the market sweet spot, but because it's really a P90, it runs fine at 90MHz) that has some sort of FP bug... it passes the Calculator test, but locks up with certain math-intensive screen savers, like the old After Dark kaleidoscope. It never showed any other symptoms in its 6 years of useful life, so I didn't bother to RMA it.
    
    I don't consider this as bad as the Sept.1998 batch of K6-2 450Mhz CPUs that could not run certain 32bit code AT ALL (neit
Could be worse (Score:2, Funny)

by Khith ( 608295 ) writes:

Just imagine if you had one of those Pinnacle chips and accidentally pressed @[=g3,8d]\&fbb=-q]/hk%fg followed by delete..
Prime95 as a detection tool? (Score:2, Informative)

by Antiocheian ( 859870 ) writes:

I have used Prime95 in the past to identify problematic configurations. It's a tool whose main goal is to find prime numbers, but it can be used as an excellent stress test for the processor and memory units.

Could Prime95 be used to identify those AMD chips?
- Re:Prime95 as a detection tool? (Score:2)
  
  by smallfries ( 601545 ) writes:
  
  It seems unlikely. The synthetic benchmark that they describe is four particular FP instructions, repeated several million times. Firstly no compiler would unroll a loop that far. Secondly, the four instructions would never occur in practice because you would need some memory accesses at some point to load/store data. Between the loop control, and the memory accesses, the FP instructions are broken up enough that they don't trigger the failure.
- - Re:Prime95 as a detection tool? (Score:2)
    
    by ettlz ( 639203 ) writes:
    
    The GIMPS clients do use the FPU (read the FAQ). Something to do with intensive use of FFTs.
Quality Control at AMD must be good. (Score:5, Interesting)

by MROD ( 101561 ) writes: on Saturday April 29, 2006 @03:46AM (#15226690) Homepage

Having read a lot about this flaw it's actually amazing that AMD's quality control found the problem in the first place.

The actions needed to cause the problem to arise are so extreme that they'd never happen in the field. i.e. Loop through tight floating-point only instructions without any comparisons for maybe hours before the error occurs.

This would *NEVER* happen in the field. Firstly, in any modern OS the process would have been pre-empted long before any problem could occur (causing other instructions to run and hence stopping the overheating). Secondly, no real-world program would ever do this sort of thing as there would always be a comparison in the loop within the timeframe.

This is a theoretical problem only in the real world, especially as it only affects about 3000 processors in total (it has been quoted). This is why AMD gave it such a low priority. We should just forget about it and move on.

- Re:Quality Control at AMD must be good. (Score:3, Informative)
  
  by kinnell ( 607819 ) writes:
  
  Having read a lot about this flaw it's actually amazing that AMD's quality control found the problem in the first place.
  The actions needed to cause the problem to arise are so extreme that they'd never happen in the field.
  This kind of thing is standard practice. If you want to stress test a piece of hardware, you write specialised test code which will consume the maximum amount of power possible, not a real world program. You have to be sure that nobody will be able to write software which will drive
- Re:Quality Control at AMD must be good. (Score:2)
  
  by Ruie ( 30480 ) writes:
  
  This would *NEVER* happen in the field. Firstly, in any modern OS the process would have been pre-empted long before any problem could occur (causing other instructions to run and hence stopping the overheating). Secondly, no real-world program would ever do this sort of thing as there would always be a comparison in the loop within the timeframe.
  I am not so sure. The TFA said millions of instructions and the chips are capable of billions. So with HZ=100 there is room enough for 28e6 instructions to be ex
This will not happen to you (Score:5, Informative)

by Bloater ( 12932 ) writes: on Saturday April 29, 2006 @04:38AM (#15226828) Homepage Journal

If you have any interrupts coming in, or your loop has a termination condition. I think you have to have your hardware set to send an interrupt many hours in the future then start an otherwise nonterminating loop.

So under normal conditions on normal PC hardware, this simply won't happen.

NOOOOOOOO! (Score:2)

by tomstdenis ( 446163 ) writes:

... I just got my pair of 285s! ... well fortunately I don't do a lot of FPU work like that. That and I run cpufreq in "ondemand" mode so I don't care about heat...

Tom
- - Re:NOOOOOOOO! (Score:2)
    
    by tomstdenis ( 446163 ) writes:
    
    Admittedly I didn't RTFA until after I posted...
    
    phew..
    
    Tom
Coincident Advertising (Score:2)

by lildogie ( 54998 ) writes:

Funny, the ad that appeared on the comments page had some code [falkag.net]. P.S. Anyone remember the HCF instruction (halt and catch fire)?
Humanly reproducible :) (Score:2)

by kesuki ( 321456 ) writes:

Although the article specifies 2.6 and 2.8 GHz Opterons, I've crashed my Venice core 3000+ socket 754 7 times from online gaming conditions generated by a particular application (warcraft 3 TFT)

I thought it was the graphic card at first, but the type of crash I've been experiencing and the difficulty to reproduce it (I generally have to play AT with a pro gamer and go on about a 7 game win streak to get game conditions right for the crash) and it does have to be warm in my room...

WC3TFT can reproducably cre
- Re:Humanly reproducible :) (Score:2, Insightful)
  
  by fimbulvetr ( 598306 ) writes:
  
  I think someone's confusing user error/not enough troubleshooting with an almost not reproducible issue. TFA mentions a lot of instructions without enough pause of FPU code to cool down. This isn't your bug if you're playing WC3. WC3 uses TCP/IP. TCP/IP generates interrupts - lots of interrupts. So many interrupts that your FPU has plenty of time to cool down between calculations. There are many handy ways of troubleshooting this issue of yours, and I'd bet you're not going to identify the problem by some s
Surprising. AMD uses my `cpuburn` (Score:5, Informative)

by redelm ( 54142 ) writes: on Saturday April 29, 2006 @12:27PM (#15228282) Homepage

About 7 years ago, I wrote a suite of open-source CPU stress-tests I called `cpuburn` [sbcglobal.net]. Little optimized assembler pgms designed to stress different parts of the CPU. `burnK7` does precisely this FPU dot product.
Of course, I expect AMD's production testing dept to have far better code, since they will devote more job hours to it and know proprietary chip details. Still, different parts of AMD have emailed me several times to thank me because they found the pgms useful. Great.
But these guys know what they're doing. Heat transfer from the hot multipliers has to be carefully analysed [3D finite element heat transfer analysis]. I suspect something far more mundane, like someone reducing die or slug thickness, or a mfg problem with the die/slug gap or thermal goop.
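
For readers who have not seen this kind of stress loop, here is a minimal Python sketch of the pattern TFA describes: fetch two operands, multiply, add to a running sum, with no check on the intermediate results. This is not the `burnK7` source (that is hand-tuned assembler); the names `dot_product_burn`, `looks_corrupted`, `a`, `b` and `passes` are made up for illustration, and interpreter overhead means it will not heat an FPU the way the real thing does. It only shows the shape of the loop and one way to check the answer afterwards.

```python
import math

def dot_product_burn(a, b, passes):
    """Repeat a dot product `passes` times; nothing is checked inside the hot loop."""
    total = 0.0
    for _ in range(passes):
        acc = 0.0
        for x, y in zip(a, b):   # memory fetch
            acc += x * y         # multiply, then add
        total += acc
    return total

def looks_corrupted(a, b, passes):
    """Compare the looped result against a reference computed a different way."""
    expected = passes * math.fsum(x * y for x, y in zip(a, b))
    got = dot_product_burn(a, b, passes)
    return not math.isclose(got, expected, rel_tol=1e-9)

a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 0.25, 2.0, 1.0]
result = dot_product_burn(a, b, 1000)
```

On a healthy CPU `looks_corrupted(a, b, 1000)` should come back `False`. The check runs only after the loop finishes, which matches the "no condition checks on the result" condition quoted in the summary.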

- Re:Surprising. AMD uses my `cpuburn` (Score:2)
  
  by fimbulvetr ( 598306 ) writes:
  
  I suspect something far more mundane, like someone reducing die or slug thickness, or a mfg problem with the die/slug gap or thermal goop.
  
  Care to go into a bit more detail for us noobs?
This flaw seems damned serious to me... (Score:2, Insightful)

by anubi ( 640541 ) writes:

... because the multiply-add is the basic building block of digital signal processing.
You are apt to be doing this extensively when processing audio or video streams.
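
To make that concrete, here is a minimal sketch of a direct-form FIR filter in Python, where every output sample is a chain of multiply-adds over the input. The function name `fir_filter` and the 3-tap smoothing coefficients are illustrative assumptions, not taken from any particular codec or DSP library.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, coeff in enumerate(taps):
            if n - k >= 0:                     # samples before the start count as zero
                acc += coeff * samples[n - k]  # the multiply-add
        out.append(acc)
    return out

taps = [0.25, 0.5, 0.25]
smoothed = fir_filter([0.0, 4.0, 8.0, 4.0, 0.0], taps)
```

The inner loop is the same fetch, multiply, add pattern as the stress case, just with far fewer iterations per call. Whether a real audio or video pipeline would run it long enough and hot enough to hit the bug is a separate question.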
Interesting! (Score:2, Interesting)

by seebs ( 15766 ) writes:

A friend of mine and I can reliably crash some similar-generation AMD chips with a loop setting a region of memory to all zeroes, but not with a loop setting it to 0xaaaaaaaa. The chips just lock up. Takes anywhere from a few seconds (linux) to a few minutes (windows).
- Re:What? (Score:3, Interesting)
  
  by qbwiz ( 87077 ) * writes:
  
  Generally, chips aren't supposed to have localized heating problems. Either it should all have a problem, or none of it should.
  - Re:What? (Score:3, Interesting)
    
    by tomstdenis ( 446163 ) writes:
    
    There are two parts to that. First off, the composition of the die is varied. Some parts are the ALU, FPU, cache, etc. So where the current is going determines where the heat is [no duh]. The FPU is particularly nasty as unlike the ALU it takes at least 2 EX cycles to do anything and most complicated instructions are at least 4 EX cycles. This means something in the FPU is running for 4 cycles at a time, cannot be interrupted, etc.
    
    So getting heat local to the FPU isn't too surprising. There are various
- Corruption (Score:3, Informative)
  
  by XanC ( 644172 ) writes:
  
  Corruption is the cardinal sin of a CPU. If it can't compute a result accurately, it should shut down rather than give a wrong answer.
  - Re:Corruption (Score:5, Insightful)
    
    by leendertv ( 969527 ) writes: on Saturday April 29, 2006 @02:51AM (#15226558)
    
    No CPU can guarantee to be free of corruption; the goal of the designer is just to minimize the likelihood of corruption. The design margins are usually such that proper operation is ensured, except for the statistical outliers. However, even CPUs with several error checking and correcting mechanisms can still corrupt data; it is just extremely unlikely. A CPU can never know for sure if it can compute a result accurately, or if an operation was performed correctly, just like no communications system can achieve bit error rates of 0.
    
    Data corruption in integrated circuits can come from several different sources. Cosmic rays are likely to alter memory values, especially so in DRAM cells. Typically, only ICs for space applications are actually radiation hardened. Much less likely, transistor device noise can corrupt data. Transistor device noise is usually more an issue in RF circuits. Finally, not all manufacturing defects can be found during manufacturing test, since most test sequences don't even achieve 100% fault coverage under currently used fault models, and this does not even consider how closely the models represent the actual circuit failure modes.
    
    Really, for most people this floating point data corruption is probably a non-issue. It is even more unlikely that errors in floating point data lead to exploits. It is more likely that some bits of your DRAM memory will get corrupted. On my system with ECC RAM that is a few years old, logs show that I get about 1 or 2 (correctable) errors per day...
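    
    As a toy illustration of how a "correctable" error gets detected and fixed, here is a Hamming(7,4) sketch in Python. Real ECC memory uses wider codes over whole memory words, so this is only the same idea in miniature, and `encode` and `correct` are hypothetical names.
    
    ```python
    def encode(d):
        """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
        p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
        p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
        return [p1, p2, d1, p3, d2, d3, d4]
    
    def correct(word):
        """Return (data bits, 1-based position of the flipped bit, or 0 if none)."""
        w = list(word)
        s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
        s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
        s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
        pos = s1 + 2 * s2 + 4 * s3   # the syndrome spells out the error position
        if pos:
            w[pos - 1] ^= 1          # flip the offending bit back
        return [w[2], w[4], w[5], w[6]], pos
    
    codeword = encode([1, 0, 1, 1])
    damaged = list(codeword)
    damaged[4] ^= 1                  # simulate a single flipped bit at position 5
    data, pos = correct(damaged)
    ```
    
    One flipped bit is located and repaired. Two flipped bits in the same word would be "corrected" to the wrong value by this code, which is the point above: checking lowers the odds of silent corruption, it does not bring them to zero.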
    
    - Re:Corruption (Score:2)
      
      by smash ( 1351 ) writes:
      
      Granted, for most people this may well be a non-issue, and data corruption is a fact of life.
      However, when a CPU is KNOWN DEFECTIVE in a repeatable, data-corrupting way, it is the vendor's responsibility to replace/fix it.
      Similar to vehicle recalls. Most people would never be affected by many of the things vehicles are recalled for, but that doesn't mean that known *serious* defects are simply let go.
      smash.
      - Re:Corruption (Score:3, Informative)
        
        by caspper69 ( 548511 ) writes:
        
        Granted, for most people this may well be a non-issue, and data corruption is a fact of life. However, when a CPU is KNOWN DEFECTIVE in a repeatable, data-corrupting way, it is the vendor's responsibility to replace/fix it.
        
        Similar to vehicle recalls. Most people would never be affected by many of the things vehicles are recalled for, but that doesn't mean that known *serious* defects are simply let go.
        
        I have actually studied this bug, and it is only observed when the fpu code is iterated in the MIL
    - Mainframe ;) (Score:2)
      
      by kompiluj ( 677438 ) writes:
      
      Yes - the ability to take corruption into account is what differentiates mainframes [ibm.com] (and also high-end IBM UNIX servers like p595) from PCs.
    - Re:Corruption (Score:2)
      
      by afidel ( 530433 ) writes:
      
      This is why the old HP MIPS CPUs were so cool: every memory area was ECC and all calculations were run on two cores. If the cores disagreed, they ran the calculation again; if they disagreed again, the CPU shut down and the operation was offloaded to another CPU in the machine.
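      
      The compare-and-retry scheme described here can be sketched in a few lines of Python. This only illustrates the control flow as the comment describes it, not the actual hardware logic; `run_checked`, `healthy`, `make_flaky` and `spare` are hypothetical names.
      
      ```python
      def run_checked(op, args, unit_a, unit_b, spare, retries=1):
          """Run op on two units; retry on disagreement, then fail over to a spare."""
          for _ in range(retries + 1):
              ra = unit_a(op, args)
              rb = unit_b(op, args)
              if ra == rb:
                  return ra, "ok"
          # Every attempt disagreed: give up on this pair and offload the work.
          return spare(op, args), "offloaded"
      
      def healthy(op, args):
          return op(*args)
      
      def make_flaky(bad_runs):
          """Build a unit that returns a wrong result for its first `bad_runs` calls."""
          state = {"left": bad_runs}
          def unit(op, args):
              if state["left"] > 0:
                  state["left"] -= 1
                  return op(*args) + 1
              return op(*args)
          return unit
      
      def mul_add(a, b, c):
          return a * b + c
      
      transient = run_checked(mul_add, (2, 3, 4), healthy, make_flaky(1), healthy)
      persistent = run_checked(mul_add, (2, 3, 4), healthy, make_flaky(5), healthy)
      ```
      
      A one-off glitch is absorbed by the retry, while a unit that keeps disagreeing gets its work handed to the spare. The price is doing every calculation at least twice.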
- Re:What? (Score:2)
  
  by Yvanhoe ( 564877 ) writes:
  
  Overheating leading to data corruption? Since when is this a flaw in chip design?
  
  Since a normal temperature of functionning is written in the specifications of the hip
  - Re:What? (Score:2)
    
    by LarsG ( 31008 ) writes:
    
    Since a normal temperature of functionning is written in the specifications of the hip
    
    I can't find any written instructions on my hip. Which is another piece of circumstantial evidence of my theory that my parents bought me from a chinese clone factory. ;-)
- Re:Kernel fix? (Score:5, Insightful)
  
  by Umbral Blot ( 737704 ) writes: on Saturday April 29, 2006 @02:00AM (#15226411) Homepage
  
  The big question is will someone write malware/virus to somehow take advantage of this flaw?
  
  I am curious how a virus could possibly exploit this. It would have to a) hog the resources so that it ran nearly exclusively, which would mean the virus already had control, and b) somehow cause a floating-point error to result in a privilege error. (Privilege and security routines rarely use floating-point numbers.) Also, why would a kernel patch be released for this? It would hurt performance for the rest of us; customers with defective chips should simply return and replace them.
  
  - - Overclocking ... (Score:2)
      
      by AHumbleOpinion ( 546848 ) writes:
      
      AMD says that from now on, chips that have this problem will be rerated to lower clock speeds..
      
      And then end users will overclock these CPUs ...
    - Re:Kernel fix? (Score:2)
      
      by Lucractius ( 649116 ) writes:
      
      Yes i have a faulty cpu..
      
      Turn its clock down, right, yep done that.
      
      So now I'll never be affected by this obscure glitch that is almost totally unreproducible outside of synthetic testing, oh thanks very much.
      
      Can I have the check now, please?
      
      *check arrives*
      *cashes check*
      *clocks cpus back up*
      - Re:Kernel fix? (Score:2)
        
        by Lucractius ( 649116 ) writes:
        
        as long as i have the nice fat "refund" check for clocking down the cpus, who cares :P
      - Re:Kernel fix? (Score:2)
        
        by Lucractius ( 649116 ) writes:
        
        See, the point was a response to the GP's idea that they should give people money to downclock the chips themselves... and I was pointing out the inherent flaw: there's nothing stopping people from just pretending to have a faulty chip, pretending to underclock it, leaving it unchanged, and pocketing a pile of money.
- Re:Kernel fix? (Score:5, Informative)
  
  by larry bagina ( 561269 ) writes: on Saturday April 29, 2006 @02:39AM (#15226534) Journal
  
  I'm sure someone will have a kernel patch to prevent this from happening in linux in very short order.
  Not likely. This is valid user code that is being executed. On other CPUs, the same code wouldn't cause a problem. Something like the F00F bug is fixable in the kernel by mucking with the exception handler. This is pure user-land code.
  
  - Mod parent up please (Score:2)
    
    by btarval ( 874919 ) writes:
    
    Agreed; the GP doesn't understand the problem. At best, you might modify gcc; but I suspect that might be a pain, considering it's such a limited problem (according to the rumor mentioned in TFA).
    There's no way the kernel can do anything about it, from the description of the problem.
    And, contrary to AMD's attempts to downplay this issue, there are two immediate areas that I can think of which are affected. The first are certain scientific calculations (even worse, those involving Beowulf clusters). The
    - - Re:Mod parent up please (Score:2)
        
        by btarval ( 874919 ) writes:
        
        Ummm, I take it you've never had to optimize FP code, have you? Nor have you even looked at code in this category.
        For the time-intensive calculations, people actually spend a lot of time optimizing the code. First they put it into assembly; and then they pore over every single assembly statement. You see, a tiny efficiency tweak saving X number of cycles does indeed add up if you're running it for days or weeks at a time.
        This is why I mentioned that mods to gcc might be a solution, but I doubt any chan
  - Re:Kernel fix? (Score:2)
    
    by Coppit ( 2441 ) writes:
    
    On the other hand, a compiler fix is plausible. The idea would be to avoid generating this kind of code. I'm sure some compiler gurus can point out precedent for this sort of thing.
  - Linux is immune (Score:2)
    
    by r00t ( 33219 ) writes:
    
    OK, so Windows is immune too. These operating systems have a clock tick that interrupts at 100, 250, or 1000 Hz. That interrupts the FPU.
    
    Crank up the clock rate even more if you are worried and you just have to run your CPU in tropical temperatures. You could also ping flood the machine, causing plenty of network interrupts.
- Re:Quality Assurance? (Score:2)
  
  by fabs64 ( 657132 ) writes:
  
  Actually it's very common for cpu manufacturers to just underclock overheating chips.
  Also, from memory, an AMD chip is only rated up to around 75°C anyway.
- Re:Quality Assurance? (Score:2)
  
  by Bert64 ( 520050 ) writes:
  
  Intel had a very similar problem with some of their Itanium chips recently too. However, I don't recall them offering free replacements; I believe they just told customers to clock down affected processors!
  
  However, very few people cared because very few people use Itanium chips, and those who do are used to them not performing as advertised.
- Re:Quality Assurance? (Score:2)
  
  by ChrisMaple ( 607946 ) writes:
  
  "At least this bug was found. How many more like it are there, but we simply don't have the proper trace to find it?"
  Hard to say. This is a design margin thing, depending upon worst case conditions plus localized heating, and localized heating (AFAIK) isn't generally modeled. Writing test vectors to find all logic errors is difficult, unpleasant, and labor intensive work. Even if software identifies the worst case path, it won't account for localized heating.
  I'd guess there are other problems out there li
- Re:Sounds familiar (Score:2)
  
  by smash ( 1351 ) writes:
  
  Business spreadsheets (price = cost + (cost*markup%))? Scientific modelling?
  There, that wasn't so hard to think of?
  smash.
  - - Re:Sounds familiar (Score:5, Informative)
      
      by smash ( 1351 ) writes: on Saturday April 29, 2006 @04:08AM (#15226749) Homepage Journal
      
      Hmm.... I doubt you'd need a few million cells though.
      Some of the tendering spreadsheets i've seen for a few companies i've worked for have had quite a lot of calculation going on in them - change a few cells that others depend on that have others depending on them, etc.... do that all day, it adds up quick.
      You only need 1 of those operations in that instance to screw up and you could be down a few million dollars, if it's not picked up.
      Even forgetting that it's just the moral thing to do...Risk vs replacement cost = no brainer. If only 3000 CPUs are affected at say $300 each for AMD to sell retail (I'm sure their cost is FAR less), they'd be mad not to just do it (maybe even offer a free speed bump) and reap the positive PR.
      All it needs is for ONE company to blame a budget blowout on them and it's well and truly paid for...
      smash.
      
      - Re:Sounds familiar (Score:2)
        
        by caspper69 ( 548511 ) writes:
        
        Hmm.... I doubt you'd need a few million cells though. Some of the tendering spreadsheets i've seen for a few companies i've worked for have had quite a lot of calculation going on in them - change a few cells that others depend on that have others depending on them, etc.... do that all day, it adds up quick.
        
        You only need 1 of those operations in that instance to screw up and you could be down a few million dollars, if it's not picked up.
        
        Even forgetting that it's just the moral thing to do...Risk vs
- Re:Haha (Score:2, Insightful)
  
  by WilliamSChips ( 793741 ) writes:
  
  When AMD has a problem, it only affects 3000 or so processors and causes minor corruption when a million-line-long piece of code is called without being stopped at any time. When Intel has a problem [linuxmafia.com] it affects millions of processors and crashes your computer when a single 32-bit command is called. I know whom I'll be buying from.
- Re:Phew, I'm not affected (Score:2)
  
  by springbox ( 853816 ) writes:
  
  I tried, but I think I might have one of the affected chips:
  
  C:\>cat /proc/cpu meow
- - Obnoxious? (Score:2)
    
    by freaker_TuC ( 7632 ) writes:
    
    The NO CARRIER joke was so nice but does not fit with this probl*NO DATA*
