
Tracking Down The AMD "Processor Bug"
tercero writes: "Over at the Gentoo Linux website there is an update on the AMD processor bug mentioned here. In short, AMD claims it's not a bug with the Athlon processor, but with the motherboard. More detailed information can be found in this LKML post."
An Anonymous Coward points to a similar explanation at Linux Weekly News.
Update: 01/25 01:25 GMT by T : Daniel Robbins from Gentoo clarifies: "AMD is not
calling this a 'motherboard' issue, it is an interaction between a
feature of the Athlon called 'speculative writes' and the design of the
GART, which is not cache-coherent. It's an 'Athlon/cache coherency/GART'
problem, not a 'motherboard' problem."
Bug? (Score:4, Funny)
Oh wait, that was Intel.
Think "Matrix" (Score:5, Funny)
According to young bald children everywhere, "There is no bug".
In related news, the motherboard manufacturers are quoted as saying, "It's not a bug with the motherboard, but with the Athlon processor."
--SC
All of the above. (Score:5, Informative)
According to young bald children everywhere, "There is no bug".
In related news, the motherboard manufacturers are quoted as saying, "It's not a bug with the motherboard, but with the Athlon processor."
Funny, I didn't think I was bald...
It's an Athlon bug if you think doing speculative writes is a bug.
It's a motherboard chipset bug if you think that the AGP controller should play nicely with cache-coherence protocols (right now it doesn't, presumably to gain a speed boost).
It's an OS bug if you think that the OS should be bright enough not to make AGP-touched memory cacheable (it wasn't intended to be).
I'm voting for option 3), myself.
Re:All of the above. (Score:1, Offtopic)
Re:All of the above. (Score:2)
I'm voting for option 3), myself.
I don't know about that... if you'll recall there have already been pre-emptive blamings from the Linux gurus, so since they're always right, well, it must be something else.
So I quote from Gentoo.org:
The bad news is that a major Athlon CPU bug has been discovered, and it affects Linux 2.4. Note that this is a bug in the actual CPU itself, and is not a Linux bug.
And:
And, the kind folks at AMD even created a simple patch for Windows 2000 that disables extended paging by tweaking the registry.
Cache coherency aside, what about logic coherency? These were posts in the same article. To me, it seems as if the Linux guys and gals are ready to blame everyone else but themselves (sounds like the phone company!).
So quick to blame others... but why not? I mean, if everyone believes you, then you're good to go!
Re:All of the above. (Score:2)
Re:All of the above. (Score:4, Funny)
Probably quite good. I imagine if you examine both systems carefully you'll see a BSD license agreement in the system binaries that deal with AGP.
Re:All of the above. (Score:3, Insightful)
I'm voting for option 3), myself.
I thought one of the main benefits of AGP was the ability to remap a bunch of non-contiguous physical blocks into one address space, so the entire bunch could be marked as cachable (for instance when DMA'ing a bunch of vertices across the bus).
Re:All of the above. (Score:2)
Re:All of the above. (Score:2)
The Nature of the Bug (Score:3, Insightful)
Hmmmm.
Is the Bug...
These kinds of bugs would have significantly shorter duration if the specifications for all four possible culprits in (A)-(D) were openly published, completely, for all to see.
Re:The Nature of the Bug (Score:3)
Open specification? The OS is open source. That doesn't mean that anybody in their right mind would want to read through it, but it's available.
Chipset specs are all agreed upon by a standards body, as are the bus specs. It's not like any one manufacturer was keeping a protocol secret so everybody just guessed at it. If you want to see them, go ahead, I'm sure you can find information. It's not going to be dumbed down and summarized for you though. It's going to be white papers with lots of weird electrical diagrams and acronyms.
What we're seeing here is when several layers don't correctly define or handle an error condition. You can compare it to U.S. laws. They define the general outline but don't actually define every possible parameter, they let the courts handle the specific details. Consider the OS the court.
So most likely, something was left out of the spec and assumptions were made by several different manufacturers, each of whom has their own take on how it should be done.
But, since the software is easier to change than the hardware, that's the logical place to fix it. It doesn't take all the complaining and finger pointing that's going on on this site. Just fix the damn thing, release an update, and move on. Although it's very interesting to see how so many bruised egos defend themselves in the ways that so many of us complain about when our enemies use those tactics.
Re:Think "Matrix" (Score:2)
WHICH MOTHERBOARD?? I've got a bunch of customers with AMD's and NVIDIA-custom-driver setups out there in the field, KT-133A and KT266 and AMD-760 chipsets, that are seeing zero problems and wondering WTF is going on.
So am I.
Is this specific to the NVIDIA chipset or something? I've never seen this thing manifest.... what does it look like? I saw the guy say Windows went blooey... what would it do in Windows? Signal 11 the X server? Hang the box? Oops?
The world wonders here, and I'm not getting very many details.
I work in software (Score:4, Funny)
Meanwhile, we're feverishly fixing your bug in our software.
"Yes sir, we've patched around the OS problem and this should get rid of that nasty bug you were seeing."
Re:I work in software (Score:3, Funny)
So you code Windows apps then...
Don't blame AMD entirely (Score:5, Insightful)
Re:Don't blame AMD entirely (Score:2)
It would have been fixed months ago if AMD had labeled it a hardware bug. It was billed as a "Win2k Bug" and quite naturally Linux hackers don't tend to pay much attention to that class of problems.
It's not a bug!! (Score:4, Funny)
Re:It's not a bug!! (Score:1)
ha
*cough*
ha ha
*cough*
percentage of affected chips? (Score:1)
Not a bug, a design issue (Score:1)
Re:percentage of affected chips? (Score:1)
Red Hat 7.1
PNY Verto AGP vidcard (nVidia GeForce 2 MX400)
AMD Athlon 1200
Asus A7A-266 motherboard with ALi 1647 chipset (bad AGP problems)
The result? It's working like a dream so far.
To be fair I should point out that I've not tested the video card with anything more stressful than playing a few DVDs, and the longest I've had the computer turned on was about 4 hours.
One thing that confused me was that the documentation for the nVidia driver said it would automatically disable AGP if it detected the chipset I have on my motherboard, but the drivers output from X startup says it's running in 4x AGP mode. Curious, but not unpleasing as long as things keep running well.
the day after : (Score:1)
In other news, Nintendo has signed a deal with Microsoft and Sony to port their world-famous games to the competitors' console systems, and slashdot users deem Windows XP 'pretty okay'.
Film at 11.
This is embarrassing (Score:3, Insightful)
What I don't understand is how this could have made it so far? This is exactly the sort of problem I have been telling people we don't have in the Linux world, and now it looks like I was wrong. Is this pointing out an underlying problem we have with QA in the Linux kernel? With Open Source in general? What can we do to make sure that bugs of this magnitude are detected more quickly?
Re:This is embarrassing (Score:2, Insightful)
Actually, it isn't embarrassing at all. It wasn't the "Linux Community"'s fault. This is the fault of AMD who announced/classified the bug as a Windows 2000 issue instead of a hardware issue. Many posters have pointed out that kernel hackers probably don't follow hardware bug reports for OTHER operating systems.
The failing was on AMD's part, and nobody else. But don't get me wrong, I love AMD, and this won't change my overall opinion of them. If things like this continually happen, then I may have to reconsider. But if this is a one time thing, I'm not going to get overly mad, and I hope no-one else does either.
Re:This is embarrassing (Score:3, Informative)
If you read the technical writeup on LKML, you'll see that it's not a hardware issue, but a software bug. Which is why AMD announced the bug as a Windows 2000 issue--it is one. Linux also happens to have the same bug (it's a subtle issue and an easy mistake to make, IMO), but how was AMD supposed to know that Linux was doing the same bad thing--mapping the AGP GART area cacheable, when the GART is non-cacheable?
Re:This is embarrassing (Score:2, Informative)
Oh, that's easy. The engineer who discovered the problem should have realized that it's not necessarily a Windows-specific issue, but a problem that any OS could have. He should have then tried to contact all the OS vendors, not just Microsoft.
Considering how Linux is used by a higher percentage of AMD customers than Intel customers, AMD should have paid more attention to an important segment of its customer base.
You are assuming... (Score:5, Insightful)
You are assuming that AMD's current explanation is 100% true, correct, and complete. There are good reasons to doubt this.
The "explanation" so far has just raised more questions. Why does the same code that causes the athlon to crash work fine on pentiums? Apparently the GART is cacheable on pentium systems? And the Athlon is billed as pentium-compatible...
Why does disabling large pages fix the problem? If their explanation is correct, that fix should not work, because it doesn't address the issue they claim to be the problem.
I'm sure this will get worked around in software (and the Linux fix will actually work around the underlying problem, rather than just making it less likely as the Windows world seems to be satisfied with) once the real details of this are known. But to claim it's not a hardware bug is ludicrous. It's a bug with the Athlon CPU, or with certain GARTs found in Athlon chipsets, or both. If AMD were less worried about spin-controlling it and claiming it's the software at fault maybe they would be more forthcoming about what is really going on here.
Re:You are assuming... (Score:5, Interesting)
There are different types and levels of compatibility. The Athlon claims base-instruction-set and register compatibility with the Pentium, but it's not pin-compatible and may also differ in any number of behavioral/timing characteristics. This is one such case. The behavior in question is perfectly acceptable within the bounds of the compatibility and standards compliance that AMD claims.
Because it's the large pages that are (incorrectly) marked as cacheable. No large pages, no incorrect mappings, no problem.
Nope. It's a bug in the OS. Anyone who works with memory systems should know the dangers inherent in mixing cache-coherent and non-coherent accesses to the same memory, and should mark pages accordingly.
It's very tempting to criticize AMD for their handling of speculative writes, but that handling is really irrelevant. It seems to me that the cache line's contents should not be marked dirty before the processor has actually written to it (which in this case it never does). Under normal conditions, though, this would only be a performance issue. If a coherent access were made from elsewhere, invalidation and writeback would ensue; the writeback would be unnecessary but not harmful, because it would be writing the same data that were already in main memory. However, the cache wouldn't be involved in the first place if the pages were mapped correctly. There would be no write-allocate, no invalidation, no writeback, and no problem. The invalid mapping turns a slightly silly but legal and normally-harmless processor behavior into a serious coherency problem.
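To make the sequence above concrete, here is a toy sketch in Python. It is not a model of the real Athlon or of any chipset: `ToyCache`, `run`, the address `0x1000`, and the "old"/"new" values are all made up for illustration. The assumption is simply that a line gets allocated and marked dirty without a real store, something else then updates main memory without the cache being told, and the stale line is later written back.

```python
# Toy model only: one cached word, one word of main memory, and a write
# that reaches memory without the cache being snooped or invalidated.
# All names and values here are hypothetical, chosen for illustration.

class ToyCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> (cached value, dirty flag)

    def speculative_allocate(self, addr, mark_dirty):
        # Pull the line in ahead of a store that may never actually happen.
        self.lines[addr] = (self.memory[addr], mark_dirty)

    def evict(self, addr):
        # A dirty line is written back to memory; a clean one is dropped.
        value, dirty = self.lines.pop(addr)
        if dirty:
            self.memory[addr] = value


def run(cacheable, mark_dirty=True):
    memory = {0x1000: "old"}
    cache = ToyCache(memory)
    if cacheable:
        cache.speculative_allocate(0x1000, mark_dirty)
    memory[0x1000] = "new"  # non-coherent write: the cache never hears of it
    if cacheable:
        cache.evict(0x1000)
    return memory[0x1000]


print(run(cacheable=True))                     # stale writeback wins
print(run(cacheable=False))                    # cache never involved
print(run(cacheable=True, mark_dirty=False))   # clean line, no writeback
```

In this sketch only the first case loses the new data, which matches the argument above: either mapping the page uncacheable or not marking the line dirty avoids the stale writeback.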
Re:You are assuming... (Score:2)
That's only half of the issue -- and certainly not illegal behavior. If the line first went into 'E' state, you would still have a coherency problem if you later did a write. Although, true, the issue only becomes visible if you go straight into 'M' state.
I think the real root of the problem is that they are doing speculative writes (obviously, they mean 'speculative reads-for-ownership' since speculative writes are highly illegal in IA32) into a page which the processor is not storing to.
If the OS knows that it is not going to do any loads or stores to a page, it should have the right to give the page any memory type it wants, and it shouldn't have to worry about coherency issues, because, as far as the software is concerned, the processor is not a participating agent on the bus.
IMHO, it's a processor bug, but obviously one that's very easy to work around (make the memory UC, even though you're not using it).
Re:You are assuming... (Score:5, Informative)
There are Pentium systems with an AGP port? If you mean the Pentium II and up, I don't see why the GART would be cacheable there either; I don't know if the P4 chipsets have changed things, but with the PII and PIII, here's what Intel had to say about the subject:
(Emphasis added). As for why the bug doesn't happen on Intel CPUs, it sounds like the Athlon has more aggressive speculative writes and can change memory that wasn't explicitly written to, dirtying the cache. But in any case, even on Intel CPUs, the AGP area is supposed to be mapped non-cacheable.
Why does disabling large pages fix the problem?
Don't know about that one; I haven't read the various tech docs for the Athlon. Perhaps the cache works slightly differently with 4MB pages vs 4KB pages?
Re:This is embarrassing (Score:2, Interesting)
How did AMD know that Windows-* was doing the bad things? I guess it didn't occur to AMD to download and inspect the kernel source code or talk to the linux kernel mailing list(s) and developers? It seems to me that that is effectively what they would have had to do with Microsoft.
OFF-TOPIC: This sort of touches on the point another poster made 1-2 weeks ago or so. I probably will recall the specific accusations incorrectly (and hence flamed), but the gist of that post was the hypothesis that AMD has a loyal following of users, in particular linux users, and it would be nice if AMD reciprocated a little in recognition of that. I am largely ignorant of AMD's contributions to the community per se, so put the flame on a low setting, ok?, as I am an AMD newlywed myself
Re:This is embarrassing (Score:3, Insightful)
Maybe because Microsoft reported the problem to them and asked for help?
Perhaps I worded my question poorly--why would AMD even think that Linux had the same bug as Windows 2000? Whenever you see a Windows bug, do you usually wonder if Linux has the same bug? They're completely different codebases, and there's no reason to think that a bug in one OS would be present in the other.
Re:Easy - Buy Intel. The cost of using 2nd party.. (Score:2, Informative)
If you paid attention to benchmarks you'd see that in almost every case AMD has a higher cost effectiveness than Intel. If you have some specific examples of why AMD is not a good choice (as opposed to vague, illogical ramblings) then why don't you share them? Prove that your mumblings are, "not made up of bugus stuff"
Re:Easy - Buy Intel. The cost of using 2nd party.. (Score:3, Interesting)
What planet are you from? Lower costs (in the case of demonstrated similarity in performance) typically means lower demand and lower consumer valuation of the brand name, which means smaller user base, which means that it generally takes longer to run into compatibility flaws.
For instance, Nike is more expensive than Puma. Does that mean Nike shoes are better? Of course not, it means people are more willing to buy Nike, because they perceive that the brand gives them additional values. In the world of shoes, that value is the value of conformity and fashion.
The funniest thing is you're talking about performance. Performance is how well something works when it works. When it
Lest you cite this situation as a reason why I might be wrong
Re:Easy - Buy Intel. The cost of using 2nd party.. (Score:2)
er, I meant lower possibility of latent design flaws with a large user base. A smaller user base increases the likelihood of problems existing unnoticed for an unspecified amount of time.
What if the puma 'freezes' while you run? (Score:2)
I had the same opinion about wasting money on Intel, so I bought AMD. Though I'm glad you're having a good experience with your AMD, I simply can't agree:
I really don't agree with your feeling that performance is how well something works when it works. If that were true, I could just stay home from work most of the time and kick butt when I show up. (OK, I sort of do that now, but that's another issue...)
Performance, in my book, is a judgement of how well something is doing its job. My AMD 'sometimes-kickass' workstation is not performing well in my opinion, even though when it does run, it runs great.
If I look at the system as the tool that it's supposed to be, it simply isn't giving good performance.
Let me explain:
I have a few boxes on a small network at home - My main workstation is an Asus/AMD 1200Mhz setup running RedHat 7.2 - Before that, it was an Asus/AMD 600Mhz setup. Both systems have had the same problems, even though *Every Component* has been replaced in an effort to track down the problem. This morning, it froze a few minutes after the screen saver kicked in.
Each time this happens, I have to do a hard reboot. The other day, I added that mem=nopentium option and it still has the problem.
I used to have some big drives in the PC, but they were getting thrashed by the powerdowns, so I replaced them with a single 10GB and moved the 2 60GB drives to the server in my laundry room.
The server, by the way, is an old 300Mhz IBM with an Intel chip and it happily chugs along, serving files by Samba, database stuff, CGI/Apache stuff, SSH logins, VNC logins, whatever I happen to throw at it.
This is a machine that I literally snagged from the trash, but you couldn't *pry* it from me at this point. I just ran uptime on it, just for kicks:
11:39am up 155 days, 13:45, 3 users, load average: 0.00, 0.00, 0.00
So, do I regret spending so much time and money on AMD? Yes.
Would I buy them again?
No.
To me it *is* a performance issue. The AMD system has not done its job.
YMMV,
Jim in Tokyo
Re:Easy - Buy Intel. The cost of using 2nd party.. (Score:2)
They probably do. Well, depending on size. If they're AOL-sized, I doubt they use PCs at all.
If they're smaller, they probably use AMD chips. Certainly they used celerons and other cheap technology.
There are two types of setups.
1) Very expensive server with the best (and best "name") hardware money can buy.
2) Cheap crap in a fail-over cluster.
For many things like email servers, news servers, etc, the cheap cluster is most cost efficient, easier to maintain (want to fix one? Unplug it and the others take over automatically), and easier to build.
While your ISP may not use AMD (the saving for a cheap duron + mobo vs cheap celeron + mobo aren't great when you get into motherboards with integrated video and lan) they would if it saved them any money.
There are some tasks that are hard to "fail over" and those require a sturdy server, but even then, as long as it's not rack mounted, AMD has a good reputation (with an AMD chipset).
More information (Score:5, Informative)
More information is now available at http://www.gentoo.org [gentoo.org], including an analysis of AMD's response. AMD's official response was posted to LKML, and is available at http://www.geocrawler.com/lists/3/Linux/35/175/76
There is apparently some kind of bad interaction between the AGP GART ("Graphics Address Remapping Table", I think?), speculative memory operations performed by the Athlon processor, the memory mappings used by the kernel, and cache coherency. The details are beyond me, but the practical upshot appears to be that the wrong data ends up being written back to main memory at some point.
I recommend reading the above LKML thread if you suspect you are affected by this issue. Information is still being uncovered, and it is not immediately clear how this occurs, what causes it, who is affected by it, and how to work around it.
In particular, there is some uncertainty as to whether the "mem=nopentium" option actually prevents the problem, or merely makes it less likely to occur.
Re: (Score:2)
athlon xp dissection (Score:3, Funny)
After a few minutes he took the CD out, gave it to me and said, "Take a close look at it."
To my surprise the CD was quite cold to hold and it seemed to be heavier than before. At first I could not see anything, but on the inner edge of the central hole I saw an inscription, an inscription finer than anything I had ever seen before. The inscription shone piercingly bright, and yet remote, as if out of a great depth:
12413AEB2ED4FA5E6F7D78E78BEDE820945092OF923A40E
'I cannot understand the fiery letters,' I said in a timid voice.
"No, but I can," he said. "The letters are Hex, of an ancient mode, but the language is that of Microsoft, which I shall not utter here. But in common English this is what it says:
'One OS to rule them all, One OS to find them,
One OS to bring them all and in the darkness bind them.'
It is only two lines from a verse long known in System-lore:
'Three OS's from corporate-kings in their towers of glass,
Seven from valley-lords where orchards used to grow,
Nine from dotcoms doomed to die,
One from the Dark Lord Gates on his dark throne
In the Land of Redmond where the Shadows lie.
One OS to rule them all, One OS to find them,
One OS to bring them all and in the darkness bind them,
In the Land of Redmond where the Shadows lie.'"
Similar problems on an intel P4! (Score:2)
it's not a bug with the Athlon processor, but with the motherboard
I somehow wonder if this is related! I had a P3 system with a GeForce 2head card, and everything was working fine. I replaced the motherboard with an ASUS P4B and an Intel P4 chip. Ever since, I intermittently get a BSOD (bad pool caller).
Point is, isn't this very similar to the problems that were reported on AMD Win2k systems without the patch?
Re:Similar problems on an intel P4! (Score:1)
Well, I'll be (Score:1, Informative)
I'm wondering if this issue is related (Score:2, Troll)
I also noticed that as I run programs, not all the memory used by the program is freed when the program terminates. I ran the System Monitor and it revealed to me this information. I'm not sure if this is Athlon or Windoze related. Anyways, I'm suspecting that the problem may not be limited to Linux boxes.
Re:I'm wondering if this issue is related (Score:1, Offtopic)
Does it affect other versions of Windows? (Score:1)
and all of the others mentioned, but I have yet to see any mention of any version of Windows being affected other than Windows 2000. Windows XP is not affected - I found that out here and on the AMD site (Microsoft's site, oddly enough, does not mention this). Does anyone know if the bug affects Windows 98?
so, how do I tell if I have the problem.... (Score:1)
this is not a motherboard bug either... (Score:1)
This is all caused by AGP. Once again, the race for more frames per second in quake3 has caused a stability damaging technology to become mainstream.
ugh..
john.c
Re:this is not a motherboard bug either... (Score:4, Interesting)
The AGP GART (Graphics Address Remapping Table, I believe) maps "video card memory addresses" to "main memory addresses", i.e., it's to allow the graphics card to grab textures, etc. directly from main memory without going through the CPU.
Many motherboard manufacturers use this feature to provide on-board video without any dedicated memory so they don't have to include any additional memory for the graphics card.
Of course, since this blows so massively performance-wise, it's mostly abandoned now.
Is the GART actually useful for anything except extending the video card's onboard memory? I'm not really sure...
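The remapping described above can be pictured with a small Python sketch. This is only an illustration of the idea of a remapping table, not how any real chipset or driver implements it; `ToyGart`, the aperture base `0xE0000000`, the 4 KB page size, and the physical page addresses are all assumptions made up for the example.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class ToyGart:
    """Toy remapping table: a contiguous 'aperture' address range is backed
    by scattered physical pages. Hypothetical, for illustration only."""

    def __init__(self, aperture_base, physical_pages):
        self.aperture_base = aperture_base
        self.table = list(physical_pages)  # entry i -> physical page base

    def translate(self, aperture_addr):
        offset = aperture_addr - self.aperture_base
        if not 0 <= offset < len(self.table) * PAGE_SIZE:
            raise ValueError("address is outside the aperture")
        index, within_page = divmod(offset, PAGE_SIZE)
        return self.table[index] + within_page


# Three non-contiguous physical pages appear as one 12 KB contiguous range.
gart = ToyGart(0xE0000000, [0x00300000, 0x01A00000, 0x00042000])
print(hex(gart.translate(0xE0000000 + PAGE_SIZE + 0x10)))
```

The point of the sketch is only that the graphics side sees one contiguous range while the pages behind it can live anywhere in main memory.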
Re:this is not a motherboard bug either... (Score:3, Informative)
1) Get a video card with 270+MB of memory. (Yeah, right.)
2) Snatch from main memory the portions of the texture you need. (This gets slow AND ugly if you use more than ~16MB in a single frame.)
3) Use the GART, take (less of) a performance hit, and just keep the textures in system memory.
This was the original purpose of the GART, and is still important.
Don't cache it then! (Score:4, Insightful)
Why would somebody want to cache the AGP memory? I'm pretty sure it's used 99.99% of the time as write-only memory, because it's the main output method of most computers. What's the point of caching that? It can only prevent the use of the CPU cache by some more important things, no?
Feel free to correct me if I'm wrong, I'm not very familiar with the usage of AGP memory (or GARTs).
Re:Don't cache it then! (Score:2)
T
Re:Don't cache it then! (Score:1)
Mebibytes and kibibytes. They refer to 2^20 and 2^10 bytes, respectively. (I.e., what many other people call megabytes and kilobytes.)
The scientific community had decided on SI unit prefixes like 100 years ago. "Mega" means 10^6 and "kilo" means 10^3. The computer science people came along and said "no, those will be powers of two for us." These units are a (probably futile) attempt to correct that particular stupidity.
I think it won't work out, because there's so much legacy stuff that there will always be confusion at this point about what "mega" and "kilo" mean with computers. Besides, "mebi" and "kibi" sound stupid enough that they'll probably never catch on.
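For anyone who wants the arithmetic spelled out, a quick Python sketch of the two conventions follows. The constant names and the helper `decimal_gb_to_gib` are just labels picked for this example, and the 60 GB figure is an arbitrary drive size.

```python
# Binary prefixes (powers of two)
KIB = 2**10
MIB = 2**20
GIB = 2**30

# SI prefixes (powers of ten)
KB = 10**3
MB = 10**6
GB = 10**9


def decimal_gb_to_gib(gigabytes):
    """Convert a size quoted in decimal gigabytes to binary gibibytes."""
    return gigabytes * GB / GIB


print(MIB - MB)                         # bytes of difference per "mega"
print(round(decimal_gb_to_gib(60), 3))  # a "60 GB" drive, in GiB
```

The gap grows with the prefix: about 2.4% at kilo, 4.9% at mega, and 7.4% at giga.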
Re:Don't cache it then! (Score:1)
Re:Don't cache it then! (Score:2)
T
Powers of two (Score:2)
Not to mention the fact that computers are incapable of "thinking" in anything but a power of two. You will not find a discrete quantity of 10 (or a power thereof) bytes anywhere in a computer system. This makes the SI units useless for computers. While re-defining them for use in computers was and still is an abuse, the lack of applicability of the conventional SI units makes it largely a non-issue. The only people who care are HDD manufacturers who rate drives in "millions of bytes" so they can swindle stupid customers.
Re:Don't cache it then! (Score:2)
Re:Don't cache it then! (Score:2)
The are "MibbleBytes" and "KibbleBytes" respectively.
;-)
Not necessarily the motherboard (Score:1)
is this why... (Score:1)
It is (not?) a CPU bug. (Score:2, Interesting)
Anyone have any knowledge as to how Intel treats these 4MB pages differently?
I mean, if the bug is caused by AMD's precaching of AGP GART-mapped memory, and Intel just doesn't precache that memory, then how is it NOT an AMD processor bug?
When two processors aren't equal, there has to be a reason for the difference in running software.
(Note that I prefer AMD, so I'm just looking for answers, not trolling).
Re:It is (not?) a CPU bug. (Score:5, Insightful)
Well, based on my reading of other posts, it is a simple case of AMD taking advantage of some features of AGP that are within spec that Intel is not. When the OS assumes that things are done Intel's way instead of adhering to the spec, things will show up on an AMD processor and not on an Intel.
AMD is doing things correctly, albeit differently from Intel. This is exactly how we are supposed to believe that it's not an AMD bug.
T
Re:It is (not?) a CPU bug. (Score:1)
It's unfortunate really, because it's all a matter of numbers. Since more people use intel stuff (for now anyway), whenever those people hear about a bug like this, they assume (again, and wrongly), that AMD produces inferior processors, when it is actually a case of people being out of spec to begin with.
Re:It is (not?) a CPU bug. (Score:1)
AMD's CPUs support something (speculative writing) that Intel processors don't, and the Linux kernel has a bug that is only noticeable when this feature is used.
Setting device memory as cacheable is a kernel bug, no matter what the processor does. It's a kernel bug even if you're using Intel chips.
This sort of thing is commonplace in computers. A certain piece of software or hardware doesn't follow some specification to the letter, but because the components involved don't support any features that require strict compliance, the bug isn't noticed for years. Then, one component is updated to support a new feature that relies on strict compliance. That's when all the bugs appear.
new thinkgeek item specifically designed for AMD?? (Score:2)
Is this only a Linux problem? (Score:2)
In doing extensive research on the problem, I found very large numbers of people with the same problem and very little explanation. I tried MANY different solutions and eventually found one that worked. It involved wiping everything out and installing hardware and software in a VERY specific order. It seems that if you don't install the VIA 4-in-1 drivers (which include GART) at just the right time in the system building, the drivers don't work properly and thus the random lockups.
I wonder if this is in any way related to the problem here.
-S
Re:Is this only a Linux problem? (Score:2)
Very, VERY common problem.
This is actually good... (Score:1)
Archived posting at MARC (Score:1)
VM Implications? (Score:4, Insightful)
When Linus switched to the AA VM, I got the impression that one of the key differences between the AA VM and the RvR VM is that Rik's VM is much more flexible, but with that flexibility comes complexity, which is why Linus switched to AA's VM. AA's was much simpler to understand and helped to stabilize the VM problems. Does the above quote mean that the AA VM isn't going to be able to handle the requirements to fix this bug? Is this a plug to put back RvR's VM?
I'm not trying to start a flame war here, just want to understand if I understood what the final paragraph was saying. Please mod me down if I'm way off base, but help me understand too!
Re:VM Implications? (Score:4, Interesting)
And eventually, all memory management systems will run out of memory (even with a reserved cache, the OS can still grow beyond safety margins) and either stall or kill processes. While some people feel that Rik is focusing a little heavily on the killing-processes side, it is something you have to be prepared to do, so that you kill a less useful task (a forked Apache server, not the main process, for example) instead of killing something critical to operation.
You can usually come up with a simple solution that covers 95% of the cases very well, but it'll fall apart on that last 5% in a bad way. The complex solutions often offer lower performance in everyday situations but guarantee performance will never get as bad as the easy solutions would allow.
So, I think anyone with design experience expects Rik's VM (or one like it) to go back into the kernel eventually.
Personally, I think Rik should look at the issue of having "Emergency" swap that you don't go into except for OS processes. Once main swap is filled all non-OS processes fail to allocate any new RAM. This lets the system function well enough for non-kernel code (ideally more customizable) to make a system-specific determination on how to proceed. For instance, kill any processes from
OS Bug (Score:3, Informative)
In truth, we should probably say it is a combination of a problem with the OS and a problem with the processor. After all, Intel processors don't have the same problem, simply because they work differently. So while it may not technically be the CPU's fault, the CPU does play a part.
Re:OS Bug (Score:2)
Quite frankly, it's too early to know the truth of this problem.
Sounds like a trivial thing to fix (Score:1)
The problem is that probably none of the page table code cares to distinguish between cacheable and non-cacheable pages. But anyway it shouldn't be too bad to set up such a distinction.
Anyway I haven't looked at the kernel code that relates to this yet so I am not sure if I am over-simplifying things... but I trust that someone will have a hack to fix this soon (a hack that doesn't cost anything in performance, unlike that mem=nopentium option), and a proper patch that is more beautiful would probably come out a few days after that....
However, since NVIDIA's stupid bloated drivers contain their *own* AGP GART code, we would also have to coordinate with that vendor to get them to change their GART code to behave properly. Either that or you can try using the Linux kernel's agpgart.o with NVdriver, but in my experience Very Bad Things happen when you do that!
-Calin
Bang on. (Score:2)
himi
This makes much more sense...... (Score:1)
After the upgrade the bug hit me (the computer would lock HARD every time X started). At first I just thought it was bad RAM, but replacing that didn't help. Eventually I figured out through my own troubleshooting that unchecking the AMD 761 AGP chipset fixed the bug....
Since the bug only appeared after the upgrade to a motherboard with an AMD 761 chipset, this makes a lot more sense. (Using the kernel option mem=nopentium fixes my problem, so it must be the same bug....)
It's Linux, NOT the motherboard! (Score:2, Informative)
"Our conclusion is that the operating system is creating coherency problems within the system by creating cacheable translation to AGP GART-mapped physical memory."
In the spirit of... (Score:5, Funny)
My Macintosh isn't affected by this bug due to its PowerPC processor.
Re:What is the exact parameter to pass to LILO? (Score:2, Redundant)
append=" mem=nopentium"
Make sure you aren't appending anything else. If you are, just add the mem=nopentium at the end of your existing append line.
Re:What is the exact parameter to pass to LILO? (Score:1)
Reason #20996 to use GRUB [gnu.org]: No need to use append=
Kernel parameter vs LILO config file (Score:5, Informative)
mem=nopentium
and turn off 4MB pages (which may or may not prevent the problem from manifesting -- the situation is unclear at this time). You can do this at the boot prompt like this
LILO boot: linux mem=nopentium
or by placing the configuration directive
append="mem=nopentium"
in your
See the manual page for lilo.conf for the details.
Re:Kernel parameter vs LILO config file (Score:1, Interesting)
> mem=nopentium
This does NOT help.
Only Option "NvAGP" "0" solves
TuxRacer problems. Go read the
linked docs and you will understand
why.
Re:this is something.. (Score:1, Offtopic)
Mac OSX isn't making the kind of surge they had hoped for. Pre-sales of the new iMac are low (although the machines are really cool!), and with so little market share to work with, Apple's fate is sealed.
Even Time's [time.com] review was less than glorious, and had a very ominous feel to it.
Too bad, too. I kind of like the fruity little buggers.
--SC
Re:this is something.. (Score:3, Informative)
FYI I don't own a mac, but I will purchase one next time I want a computer.
Re:this is something.. (Score:1)
Re:this is something.. (Score:2)
I vote for a +5 Informative mod. :-)
299,792,458 m/s...not just a good idea, it's the law!
Re:this is something.. (Score:2)
Re:this is something.. (Score:2)
My secret wish would be for Apple to support AMD x86-64 processors with MacOS X...not that it'll happen. That would be a great combination.
On the other hand, if they can really hit 1.5+ GHz with the G5, that'll be OK too. Just a lot more expensive.
299,792,458 m/s...not just a good idea, it's the law!
Re:this is something.. (Score:2)
Yes, Slashdot moderation is ridiculous. Period. It's the biggest problem with Slashdot, and that's why I hardly ever read it anymore.
As far as Apple and processors... the G4 is a pretty good processor, and I have a Dual 800MHz G4 and am loving every second I get to spend on it... but I also have a Dual Athlon MP 1600+ (1.4GHz) and the Dual Athlon spanks the shit out of it at all the things that really matter (Quake 3
That having been said, the G4 is a fast machine. But... It was $4,000 with monitor ($500 Sony CPD-G400), whereas with a $1,000 monitor, my Dual Athlon was only $3,000. The only differing factors are that the Athlon has a GF3 Ti500 (the PowerMac has a GF3 regular), the G4 has 128 more megs of RAM (whoopty doo), and the G4 has an extra 60 gig drive. Now I'm not complaining - I think the machine was worth what I paid for it. But for $3,500 (computer itself), you'd think it should spank the shit out of a $2,000 computer (okay, $2,100 with shipping and all).
The point is, the G4 isn't that fast. Apple really would do better to put in some AMD processors, knock the price down a LEETLE and be able to claim that it really *did* burn Pentiums (instead of just with Adobe products).
One can only hope...
Re:this is something.. (Score:5, Funny)
Re:this is something.. (Score:1)
Besides, if the rumors are true, Mac dual-processor single-GHz boxen will be available within a month...
Flame bait? (Score:2)
The original post of 'Mac users don't have to worry about this [the Athlon bug]' is flame bait; my response was a humorous way of saying 'this is why buying a Mac won't solve my problem.'
An Off-topic moderation wouldn't have bothered me since I didn't spell out my reasoning, but I do feel the flamebait call was bad.
Don't be so sure (Score:1)
While I suspect you are correct, there is/was discussion on LKML about whether or not other architectures would be vulnerable to this issue. The PPC was specifically mentioned.
Re:this is something.. (Score:1)
Best Regards,
Daniel Robbins
Re:A kernel bug -- not a motherboard bug (Score:2, Insightful)
Re:A kernel bug -- not a motherboard bug (Score:2)
Basically, it seems that someone figured that the GART shouldn't worry about the CPU potentially caching 4MB pages and simplified their circuits accordingly. Unfortunately, they forgot to tell OS developers (NB: I wonder if this affects other OSs like the now doomed Solaris/x86 or *BSD?) causing these problems.
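To put the 4MB-page angle in numbers, here's a small sketch. It's my reading of the issue, not anything from AMD's documentation, and the frame addresses are invented: one cacheable 4MB mapping covers 1024 ordinary 4KB frames, so if even one of those frames is also handed to the GART, the CPU ends up with a cacheable translation to GART-mapped memory.

```python
LARGE_PAGE = 4 * 1024 * 1024  # 4MB large page
SMALL_PAGE = 4 * 1024         # 4KB page frame


def large_pages_touching(gart_frames):
    """Base addresses of the 4MB pages that contain any GART-backed 4KB frame."""
    return sorted({addr - (addr % LARGE_PAGE) for addr in gart_frames})


# Hypothetical physical addresses of 4KB frames given to the GART
gart_frames = [0x01234000, 0x01235000, 0x02800000]

bases = large_pages_touching(gart_frames)
print([hex(b) for b in bases])   # ['0x1000000', '0x2800000']
print(LARGE_PAGE // SMALL_PAGE)  # 1024 frames share each large mapping
```

With 4KB mappings, only the three GART-backed frames would need special treatment; with 4MB mappings, two whole large pages (2048 frames) share whatever cacheability setting the OS picked.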
Re:A kernel bug -- not a motherboard bug (Score:1)
Re:A kernel bug -- not a motherboard bug (Score:2)
You could also argue that the published spec for the GART states that it shouldn't worry and the OS developers didn't read the spec and assumed that everything worked just like a Pentium *.
Thus by conforming to a specific implementation, rather than the published spec, it is an OS bug. My architecture knowledge is rusty enough to be unsure which answer is correct.
Re:AMD & Bugs (Score:1)
Bugs in firmware are more plentiful than in hardware, but less so than pure software.
As Intel and AMD race to produce faster processors, expect more hardware bugs.