Sun Microsystems

Sun Gagging Customers Damaged By Memory Problems? 156

cchuter writes "Apparently Sun has been getting its customers to stay 'mum' about a certain memory problem for as long as 18 months. The problem is assumed to be the cause of many website outages (most visibly, eBay's)."
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward on Monday August 28, 2000 @10:09AM (#821348)
    quite a number of times. Six errors that I have handled in my Sun shop of 3 admins and 50 machines. Only seen 'em on the 400 MHz/8 MB cache processors. Looks something like: panic[cpu28]/thread=0x307dbe80: CPU28 Writeback Data Parity Error: AFSR 0x00000000 00800002 AFAR 0x00000001 8104dfe0 We've found that attaching a grounding strap to all of the servers that were affected has cleaned the problem up. Haven't seen one since the straps; it's been around 3 months, where before it was once a month.
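    For anyone who wants to count how often this turns up across a fleet, here is a minimal Python sketch that pulls the CPU number, AFSR and AFAR out of a panic line shaped like the one quoted above. The regular expression is an assumption modeled on that single sample; real messages may differ in spacing or wording, and the name `parse_panic` is made up for illustration.

```python
import re

# Pattern modeled on the one sample line quoted above (an assumption;
# other systems or releases may format the message differently).
PANIC_RE = re.compile(
    r"panic\[cpu(?P<cpu>\d+)\]/thread=(?P<thread>0x[0-9a-fA-F]+):\s+"
    r"CPU\d+\s+(?P<error>.+?):\s+"
    r"AFSR\s+(?P<afsr_hi>0x[0-9a-fA-F]{8})\s+(?P<afsr_lo>[0-9a-fA-F]{8})\s+"
    r"AFAR\s+(?P<afar_hi>0x[0-9a-fA-F]{8})\s+(?P<afar_lo>[0-9a-fA-F]{8})"
)


def parse_panic(line):
    """Return a dict describing the panic line, or None if it does not match."""
    m = PANIC_RE.search(line)
    if m is None:
        return None
    # Each register is printed as two 32-bit halves; join them into one value.
    afsr = (int(m.group("afsr_hi"), 16) << 32) | int(m.group("afsr_lo"), 16)
    afar = (int(m.group("afar_hi"), 16) << 32) | int(m.group("afar_lo"), 16)
    return {
        "cpu": int(m.group("cpu")),
        "error": m.group("error"),
        "afsr": afsr,
        "afar": afar,
    }


sample = ("panic[cpu28]/thread=0x307dbe80: CPU28 Writeback Data Parity Error: "
          "AFSR 0x00000000 00800002 AFAR 0x00000001 8104dfe0")
print(parse_panic(sample))
```

    Feeding each line of a saved messages file through `parse_panic` and tallying by `cpu` would show whether the same module keeps failing.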
  • by namespan ( 225296 ) <namespan.elitemail@org> on Monday August 28, 2000 @10:10AM (#821349) Journal
    Perhaps memory problems are contagious. First, the computer displays them. Then, the customer mysteriously acquires them ("er, um, no. I don't think we had a memory problem"). Then, the vendor ("why no. I don't recall us pressuring anyone").


    Darn Jedi mind tricks...


  • > M$ products may go down a lot, but usually getting them running again isn't a problem.

    > Sun's almost never go down, but when they go down you can bet your ass that it'll be a pain to fix it.

    What?!!? Not at all! When a Sun box goes down you get on its console, can probe SCSI buses, do hardware diags from the other side of the world via the serial console, etc.
    I admin many Solaris boxes, from Ultra 5s to E6500s, and barring hardware issues, when they crash, getting them up and running is generally very easy.

  • Ultra 10's are IDE based systems with the exact same Hd's you use in a PC.
    One of the main reasons that you as an admin want Ultras on the desktop is that other than the odd disk failure, they don't require any work. One admin can easily maintain hundreds of boxes, assuming they have set up their infrastructure in a decent manner.
    I maintain hundreds of desktop U10s and I almost *NEVER* log into them or even give them a second thought because they *JUST WORK*.

  • "At Sun, we never let Microsoft tell us what to do, we just imitate them." -Don
  • I won't dispute that some companies are going to use an NDA as a buffer to buy more time. I somewhat doubt that this is Sun's motivation. It is a natural feeling to assume that companies or the government sit around and think up ways in which they can frustrate the common person, but it rarely plays out this way.

    One way to look at this is to think of it as a courtesy issue. Sun tells the customer that they are going to make fixing the bug a high priority, but they need some solitude in order to do it so they ask for some discretion. If their engineers are having to answer to middle and upper management about the issue instead of working on it (because of a media frenzy) then everyone is one step further from the solution. I think we can safely assume that Sun wants the problem resolved quickly. They just happen to also think that resolving it quietly is also quick.

  • I understand where you are coming from, and in the case of life threatening issues this is quite a different matter. The problem is that as a company you know that the press is lying in wait, like vultures, to jump on any issue and blow it out of proportion if they can claim that they are "on the scene".

    There are many very reputable organizations which believe it is in the best interest to have a low profile until something positive can be done. Take CERT for example. When they find out about a new exploit, they don't broadcast it to everyone. In fact, they discourage this because it gives dim-witted crackers a window of opportunity and causes a big scare. They advise that parties involved remain cautious and low-key so that a solution can be announced in tandem with the problem. Sometimes this doesn't work out. Sometimes CERT isn't the first to be notified or a problem pops up in too many places for them to keep it under wraps.

    I fully see your point about some benefits to people sharing knowledge; however, there are occasions (which aren't life threatening) where it makes sense to encourage a low profile.

  • They are having problems because the sun techs set up the server with the default settings and left ebay to take care of the rest.
  • Uh - No!

    SRAMs are built with several transistors in a latching arrangement (there are 6T and 8T designs running around nowadays). SRAMs are not nearly as likely to be taken out by radiation as DRAMs, which are really just capacitors storing charge. Radiation was an issue 15-20 years ago and turned out to be MOSTLY due to radiation from the plastic the RAMs were cased in! Modern methods include coatings that make this a non-issue.

    Caches, being built out of SRAMs, mostly have OTHER failure modes that are in some way design errors, like maybe electromigration, which is essentially a way ICs age.

    So - summary - it AIN'T radiation that is taking these chips out.
  • > With Microsoft losing its Imperial hold, Sun is beginning to look like a pretty shifty company, casting doubt on its commitment to its customer base.

    Can we have some facts to back that up? I'm suspicious of the notion that any successful company or product can't be good. Last time I looked, IT was the most competitive business around - I doubt Sun can afford to neglect its customers.

    If Linux gets more popular, will we start seeing everyone criticising it and vaunting FreeBSD instead?
  • A tax cut for the ruling class is not a fucking excuse for genocide.
  • Not to doubt the "facts" that you've presented here, but do you mind showing us some sources for these?

    That way when someone asks me "Where did you hear that this is CACHE issue?"

    I don't have to say "Oh. Some anonymous person on slashdot told me..."

    I don't have a problem with the anonymity: it just doesn't make for a very good reference.
  • by Anonymous Coward
    I'm posting anonymously to protect my job and my employer. We have 3 E10000 and 7 E6000 systems in our production environment. All of them have had the problem. All of them have gone through multiple exchanges of CPUs, memory and system boards. We are seriously looking at switching hardware vendors at this time. We are looking at RS6000s as a possible replacement, but it would take a considerable amount of work to recompile all of our custom apps for their environment. We also have about a dozen S-390 mainframes and have been looking at Linux under OS390 too. Either way, I think we are going to be getting rid of Sun.
  • I'm sure VA Linux make good kit - but last time I checked they weren't making systems that scaled up to the 64 processor machines eBay are using...
  • I mean yeah, he cut taxes and gave people their money back. That bastard!
  • I fully admit the statement should have been clearly marked IMO, as I was only basing it on how Sun dealt with this memory problem, especially in light of how Intel dealt with their 1.13 GHz chip problem. Two quality control issues, one dealt with in the open, one dealt with in the shadows.

    I also didn't mean to imply Sun was neglecting its customers but instead neglecting to keep its customers informed of potential problems.

  • Dunno where the Gartner Group gets its figures from.
    I believe the answer is "thin air." Or "out of their collective ass."
  • The nondisclosure agreements were apparently offered with a claim that signing them would bolster Sun's commitment to resolving the problem quickly
    How could anyone believe this? The overwhelming evidence is that nothing motivates big companies to action like the threat of impending negative publicity.

    -y

  • by ragnar ( 3268 ) on Monday August 28, 2000 @10:12AM (#821366) Homepage
    Please refer to the following article I wrote [solariscentral.org] a few months back to dispel some of the hype about the eBay problems. The article that /. cites speculates that these events are related, but to the best of my research and feedback from many parties involved, the problem lies with eBay.

    Also, do understand that these sorts of NDAs are somewhat common when dealing with potentially explosive matters like this. Certainly Sun is interested in keeping tight lips, but they also would prefer to announce a solution along with the problem. It is an engineering problem where the "more eyes on the problem" approach doesn't necessarily bring about the greatest good.

  • by empesey ( 207806 ) on Monday August 28, 2000 @10:14AM (#821367) Homepage
    Wasn't it Sun that was complaining about Microsoft forcing clients to sign all those agreements forbidding them to talk about some of Microsoft's practices? Granted, these are two separate issues, but now that Sun is having the issues, it's suddenly a different matter. Shut everyone up, and hope that no one finds out before we can remedy the problem or ship a new product.

    And why would you have to bribe people with a promise that you'll fix something quicker if they sign an NDA? That's an automatic red flag. I find it hard to believe that CEOs and other top brass fall for such nonsense. There must be more to the story than what has been disclosed.

    --
  • 1. Comparable performance to a 486
    2. Costs the same used (roughly; I can now get a low-end Pentium 90 for $100).
    3. Doesn't need massive hardware requirements for its native OS.
    4. Actually sold by lots of local vendors (I currently reside in SLC, UT); I have never seen a vendor that sold Sun hardware within 100 miles of my location.
  • The nondisclosure agreements were apparently offered with a claim that signing them would bolster Sun's commitment to resolving the problem quickly, Henkel said. Sun customers began reporting the problem as long as 18 months ago, he said.

    Wow does that sound like a bad idea! Talk about giving up your leverage. Sun must have offered some serious concessions in order to get them to sign this. I wouldn't even consider something like this without an expiration date. That would at least give the hardware vendor some incentive to focus their resources on resolving the problem.

  • I'd enter into my Matrix-esque quote, 'That sounds like a good deal, but I have a better one. I give you the finger, and you give me my phone call.'

    Wouldn't a better one be: "Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger and you fix my hardware?"
  • Temperature? I find that hard to believe with the hardware they're referring to. Sun's Enterprise class servers are quite capable of running at fairly warm temperatures. Heck, I've got an E3000 running in the corner of a room (read: not a climate controlled data center) and it runs just fine even with CPU boards averaging about 50 degrees Celsius. Yes, it's not a good idea to let this go for very long, but the hardware is quite capable of dealing with the situation without resorting to reboots. For the record, one of these CPU boards in a proper data center would average about 25 degrees Celsius.

    --
  • Well let's see what if my application works best on sun hardware and I want to continue to use them as a vendor? Maybe then.
  • You can do a quick core analysis yourself with the iscda (Initial System Crash Dump Analysis) tool available from Sun's site. I've used this to find the cause of many crashes, including CPU cache related ones. Often the cause of the crash doesn't get syslogged, but if you have "savecore /dirname" enabled in /etc/init.d/sysetup you get a system memory dump when it reboots AFTER crashing. If you can't figure out what caused the crash by looking at the iscda output, send the cores to Sun, because they definitely will.
  • Sounds like Ford and Firestone. Firestone wouldn't give Ford the data to determine if there was a problem or not. Must be great to be so dependent on a supplier that you can't risk offending them. Then again, Sun always had a bigger reputation than the reality (and you thought MS was a marketing co :-). Remember that Sun first talked about the Sparc3 back in late 1995, when they got scared by the Pentium Pro and the 4-way Pentium Pro benchmarks beat the 4-way Sparc2s until Sun turned up the compiler and tuning gimmicks (then they had the smarts to write articles accusing the competition of doing it). And shortly after that they stopped allowing the publishing of any processor-for-processor (equivalent configuration) benchmarks. Not that I blame them; the Sparc2 is now 2.5 generations old and the Sparc3 will be born as the old man of the field. Gartner is now advising customers who need performance to go with IBM (which soundly beat Sun's 64-way systems with only 24 processors, and no benchmarking gimmicks).

    Note that it isn't RAM (main memory) but the fast static memory parts used in the Level 2 cache on the Sparc2 card that are failing intermittently. What I don't understand is how Sun can be having problems with this. Their clock speed is a fraction of Intel's, their L2 cache runs at half or less of processor speed, and Intel has been shipping CPUs with L2 cache that runs at the processor clock rate since 1995 and the first Pentium Pros. Does anyone know how the failure affects production? I.e., is there SECDED or parity on the L2, so that a failure is a halt, or does data get corrupted? Does the system keep on running after detecting and isolating the processor that failed (does it recover the process like IBM, or kill it? What if the processor is in kernel mode?). Or is all this information now under NDA by the "open company?" :-)
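    To make the parity-versus-SECDED question concrete, here is a minimal Python sketch of a single even-parity bit over a data word. It is purely illustrative and says nothing about how Sun's cache is actually protected; the function names are made up. Parity alone can flag a single flipped bit but cannot say which bit it was, so the only safe response is to stop. A SECDED code would instead correct the single-bit case and detect (but not correct) a double-bit one.

```python
def even_parity(word):
    """Return the parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def check(word, stored_parity):
    """True if the word still agrees with the parity bit stored alongside it."""
    return even_parity(word) == stored_parity


data = 0b10110010
p = even_parity(data)           # computed when the word is written
one_flip = data ^ 0b00000100    # single-bit upset
two_flips = data ^ 0b00010100   # double-bit upset

print(check(data, p))        # intact word passes the check
print(check(one_flip, p))    # a single flip is detected (check fails)
print(check(two_flips, p))   # a double flip slips through undetected
```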

    If I were holding Sun stock, I'd be getting nervous. The current boom (where anyone with an internet product to sell can make money) can't last forever. I wouldn't want to recall all those UE10000 which are boatanchors already (assuming the sparc3 ships tomorrow - and you can't just plug sparc3's into today's UE10000s).

    /Don
  • According to the article, it's not a problem with the DIMMs; it's an E-cache problem on the CPU module. The article states that it affects 400 MHz 4/8 MB cache modules (i.e. Sun part number 501-5762).

    Of course, other than that I have no comment.


    _damnit_
  • This was, I thought, general knowledge. My consultancy has had numerous clients aware of these problems on the 220R and 420R with 8 MB cache.

    This has been on USENET, since February, at least.

    Suddenly, it's news to the world!

    --Jeremiah

  • (insert obvious joke about solar radiation and hits to the ram chips)

    --
  • Yes, but if everyone takes the red pill, they never have to piss anyone off, so they get to keep all their customers happy.

    If you are a manager, and the choice is take the NDA, and get good service, or risk going bust - there is no decision to make here.

    Let someone else make a stand: principles are nice, working webservers/databases are much, much nicer when your business is on the line.

    So long as everyone plays nicely for Sun, they get all their customers to sign NDAs, and their customers all get good service.

    G
  • Being a Solaris/Sun sysadmin since '95, I have also always been impressed by Sun's hardware and operating system.

    Over the last few years, this has shifted somewhat. I am no longer all that impressed. First there were the SS5-170s that would hang if a certain run of instructions was executed on the CPU (gcc happened to produce that exact instruction sequence when compiling the program 'main(){while(1);}'). In a university environment such as ours, this is entirely unacceptable. The main reason for running UNIX workstations is so that the students can experiment without hanging the machines.

    Sun was very nice about the problem, and helped us replace the faulty machines with U-10s. A later batch of U-10s now exhibits a similar problem, where the CPU will, with a certain probability, hang completely on a certain sequence of instructions (albeit not the same sequence as for the SS5s), a sequence that programs such as Netscape and Explorer execute rather frequently. At its peak, we rebooted on average five of our 150 workstations every day. That is just far too much.

    Sun appears to think that the problem is solved, since they have modified libc so that it won't trigger the bug like it did. In my view, if a user-land program can hang a UNIX workstation (even if that user-land program is written in assembler), that workstation is defective. We are still arguing with Sun, who refuse to replace the machines or otherwise compensate us.

    Then there was the problem with the U60s that would get corrupted display output (the screen image didn't stand still; it was unusable) under heavy memory accesses due to a defective frame buffer. It took Sun more than 6 months to replace the frame buffers, and they still refuse to refund or otherwise compensate us for leaving us with defective computers for 6 months.

    Also, the number of serious bugs in Solaris 8 is, half a year after it was initially released, still at an incredible level. We switched to Solaris 8 in mid-July and have already reported three kernel bugs that hang the computer when triggered (which our users manage to do all the time, just running Netscape and mpg123) and one bad patch which made our webserver unusable for a week. Sun has been quick in fixing these problems, but they should never exist in the first place in an operating system that is six months old!

    Of course, Sun is better than running PC's. Had we used Microsoft, we would certainly have had bigger problems. PC hardware also isn't that hot. But we are seriously considering throwing the Sun workstations out the door and replacing them with high-end PC's running FreeBSD or similar and only keeping Sun in the server rooms. Sure, we would have to keep separate binary trees, but it might be worth it. Sun workstations aren't exactly free.

    Sun is great, but not as great as it once was. They need to get their act together quite heavily if they want to keep the market share they have.

  • by echo8 ( 227444 ) on Monday August 28, 2000 @12:34PM (#821380) Homepage
    I certainly can't speak to why ALL the customers who signed the NDA did so. What I can speak to is why my company (a decently large telecom company which shall remain nameless) did so: Sun had a software patch that they felt might help alleviate the problem (I still can't reveal the details of what it does or why). In order to receive the software, we had to sign a non-disclosure agreement. It's that simple: we have a problem. If you want us to solve it, sign the paper. Otherwise, shut up and wait along with everyone else. As we had business-critical systems that were affected, it's not hard to understand why management did not hesitate to sign the NDA.
  • Sun usually does a decent job of fixing bugs and defects...

    Sun's track record on speed of fixes is slow vs. other companies; however, Sun does not leave customers idle like some companies...

    Apple actively denied a defect in the original Macs before Apple fixed same. Because Apple was decent about dealing with dead Macs from "Cause unknown", Apple users didn't bitch much about Apple pretending the bug didn't exist.

    Microsoft, however, behaves differently. Instead of pretending the bug doesn't exist, Microsoft just says "Yeah, sorry, just turn that feature off", "Reboot" or "Reinstall Windows". They don't ever fix it, but they don't say "It doesn't exist". They might say "It's not a bug, it's a feature", but they still admit it exists.

    Sun just fixes the bug and says "Sorry for the inconvenience"...
    In a world where we all make mistakes... and so many won't take responsibility for them... it's pretty hard to bitch when someone is willing to do everything it takes to fix the problem.
    Mistakes happen... can't avoid it...

    It's more sinister than an NDA...
    Sun actually takes action so nobody WANTS to talk about the defect...
    That's right, people... you can't sue Sun, because Sun doesn't stop them at the legal level...
    They stop them at the ethical level...
    There is simply no desire to bitch...

    That's a lesson... people complain not because there is a defect... but because you won't fix the defect...
  • I recently got hired by AAXION Software (www.aaxion.com [aaxion.com]), and I saw an UltraSPARC machine for the first time. The ironic thing is that the company makes a daemon that monitors potential instabilities. We'd better add this to the list, hehe.

    Reply Topic/Flame Fodder: I saw two Sun-built stations there, an Ultra5 and an Ultra2. Which one's better, and what are the specs? I'm just getting into this Solaris field, and so far, it's my favorite UNIX clone (the CPU and Disk meters at the bottom are cool, wish they'd do that in Win2k on the taskbar [it's already done in Task Manager]).

  • by sabat ( 23293 ) on Monday August 28, 2000 @09:53AM (#821383) Journal
    Ebay's problem was that it was running new hardware (E10000) with a very old OS (Solaris 2.5, not even 2.5.1) and a version of Oracle that had documented problems with that version of Solaris.

    It had nothing to do with RAM, although I'm sure their former IT director would love to claim that.
  • by softsign ( 120322 ) on Monday August 28, 2000 @12:37PM (#821385)
    If my computer is crashing, one of the very first things I do is get on the 'net, check www.deja.com, and see if other people are having similar problems

    Yes, I see the parallel. You're having some trouble with Samba, post to your Linux newsgroup and within a few hours you may have a few people who've experienced the same problem offering a solution.

    There's just one problem. Your computer is not an Enterprise 10000. How many people do you know that have an E10000? And out of those hundreds, how many do you know that are identically configured?

    This isn't some run-of-the-mill, I-just-installed-RH6.2-from-the-ISO-and-can't-connect-to-the-Internet problem. When people have problems with a system like the E10000 they call the people who know the E10000 best: Sun.

    You aren't going to find many employed administrators who have a habit of disclosing detailed explanations of their E10000 troubles on Usenet, hoping to find some help from their competitors.

    The reality is, if you've got an issue with an E10000 that Sun can't help you with themselves, then there ain't nobody else that's going to help you fix it, either. An NDA is really kind of redundant and I suspect it's just a legal exercise more than anything else.

    --

  • Since when has Sun turned into Apple?

    Even the samurai
    have teddy bears,
    and even the teddy bears

  • I did convince my boss not to buy Microsoft anymore. The problem is there is no way to convince all of the other employees not to buy Microsoft anymore, thus leaving us having to learn the newest mistake^H^H^H^H^H^H^Hoperating system from those lovely people in Redmond while all of the servers are quietly switched to Linux.

    I couldn't however convince him not to buy Intel anymore and no one in the company wants Sun... oh well 2 out of 3 at least.

    P.S. The servers' best uptimes:
    NT: over a year (got shut down for Y2K, never was stable again)
    Linux: just over a week (still being tweaked)

    Devil Ducky
  • Sun is way too powerful for its own good. Some people like Sun better than Microsoft because Sun has come out with good technology like Java and Solaris, but in fact, I assert that Sun is not any better as a corporation. It is just as dangerous as Microsoft, if not more so, because it's more trusted.

    I really hope that GNOME is in good hands here but I can't help but shudder at Sun's involvement. :(
  • "As soon as we reported the issue to Sun, the affected processors were replaced under service contract," he said. The company was able to resolve the problem by rearranging "our data center with the express purpose of lowering system temperatures," he said. "The systems run 10 to 15 degrees Fahrenheit cooler than before, and we haven't seen a problem since."
    It makes you wonder how hot he was running these things.
    Anyway, the NDA thing was pretty stupid. Talk about pissing off your customers. Had we experienced a problem with our Sun servers and they asked me to sign one, I think I probably would have told them to fsck off: fix it or I buy (insert Sun competitor name here).
  • by Anonymous Coward
    This is a fairly well known problem. There was even a Gartner group report on it some time ago.

    There is also some significant work going into the USIII to make sure it's not an issue there.

    I like the followup saying we should also switch to VA Linux servers. That's a nice idea, but which product can we use to replace our 28 cpu systems with 28 GB of memory?

    Really people, just because you don't know about something doesn't mean it isn't already well known in the Sun/Solaris community. This isn't news. Hey, SUNW is up $2 1/2 on this "news".

  • by barracg8 ( 61682 ) on Monday August 28, 2000 @10:27AM (#821391)
    • 'That sounds like a good deal, but I have a better one. I give you the finger, and you give me my phone call.'
    Trouble is, you would get about the same reaction Neo did.

    Think about it. There is nothing legally requiring Sun to deal with problems in the order that they are informed about them. There is nothing wrong with Sun implementing a high priority queue, of people who sign NDAs, and a low priority queue, of people who don't.

    So you face a decision: take the red pill, and you get your website back up and running. Take the blue pill, and Sun gets a bit of bad press, and you go bust.

    If you are someone like Ebay, it really comes down to that. You are your website, and you must sell your soul to keep it up 24/7 (or the best you can).

    Here's a little story:

    I know of a UK company who had a problem with Win95. It crashed every 49.7 (I think) days. So they went to M$ UK. They were told it would cost tens or hundreds of thousands of £ for M$ to look into the problem. M$ knew the company had no clout, and could not afford this, so they decided to fuck them.

    The company had some form of relationship to a larger US company, so they got them to take it to M$ in the US. This time, M$ insisted on the company signing a NDA. When they did so M$ admitted that this was a known flaw in '95. The clock didn't wrap nicely, so when you reach 2^32 milliseconds - 49.7 days (as I remember) Windows 95 (at least version A) crashes.
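    The 49.7-day figure is easy to check. Below is a minimal Python sketch, assuming (as described above) a 32-bit counter that advances once per millisecond; the helper names are made up for illustration and are not taken from any Windows API.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000   # 86,400,000 milliseconds in a day
COUNTER_RANGE = 2 ** 32            # distinct values of a 32-bit counter


def days_until_wrap():
    """Days a millisecond counter runs before a 32-bit value wraps to zero."""
    return COUNTER_RANGE / MS_PER_DAY


def tick(counter, elapsed_ms):
    """Advance a 32-bit millisecond counter, wrapping the way the hardware would."""
    return (counter + elapsed_ms) % COUNTER_RANGE


print(f"{days_until_wrap():.2f} days")

# Code that assumes "later" timestamps are always larger breaks at the wrap:
before = COUNTER_RANGE - 1
after = tick(before, 1)
print(after < before)   # True: the newer reading looks older than the previous one
```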

    M$ has since admitted it publicly.

    People like micros~1 and Sun have reputations to keep, and a great deal of power. When you are dependent on them for your business's survival, they can make you their bitches.

    Chalk it down on the 'List of Good Reasons to Use Opensource'.

    G

  • Some of my customers have run into this problem. Since it occurred while they were running some extraordinarily intense simulations on our software, they thought it was our problem. They called back to tell us that it was a Sun problem.

    On a similar topic, Solaris 8 has an interesting bug in it. It can't read HSFS CD-ROMs. This was encountered while a SUN engineer was loading a demo version of our software. He swore up and down that it was our problem. He e-mailed me this morning and confessed the true problem.

    In case you are interested:

    http://www.quantic-emc.com

    TTFN
  • Mum? Sorry, that is a word I don't understand.

    You see, in Australia, the word 'mum' is what you Americans would call 'mom'. The word 'mom' doesn't exist in Australia.
    -----------
  • Gee, they drag Micro$oft into court, and simultaneously bash Win2K about its 65,000 bugs (anyone find all of them yet? It's buggy, but not THAT buggy)... When was this? Gee, 18 months ago! Coincidence?
  • by Sangui5 ( 12317 ) on Monday August 28, 2000 @12:47PM (#821395)
    Think about it. There is nothing legally requiring Sun to deal with problems in the order that they are informed about them. There is nothing wrong with Sun implementing a high priority queue, of people who sign NDAs, and a low priority queue, of people who don't.

    Taken from the Sun website:

    (3) CUSTOMER-DEFINED PRIORITY AND RESPONSE TIME:

    When Customer's designated Contact calls for support assistance, Contact will assign a priority rating to the call: URGENT, SERIOUS, or NOT CRITICAL:

    URGENT (system unusable) - Live transfer of service request. Personnel arrive at the installation site within an average of two (2) hours of service request for on-site hardware support assistance.

    SERIOUS (system seriously impaired) - Callback within an average of two (2) hours of service request. Personnel arrive at the installation site within an average of one (1) business day for on-site hardware support assistance.

    NOT CRITICAL - Callback within an average of four (4) hours of service request. Personnel arrive at the installation site within an average of one (1) business day or at a later mutually convenient time for on-site hardware support assistance.
    ...
    (17) SYSTEM AVAILABILITY GUARANTEE: For properly configured, maintained and administered systems, Sun will commit to maintain certain levels of System Availability. System Availability Guarantees require a separate contract addendum which will contain the specific terms of the Guarantee.

    This is from the Platinum Warranty, which is standard with an E10K (what eBay runs). They have a contractual agreement with everybody they sell such a standard-configured E10K to: on-site arrival within an average of two hours on urgent calls, and even on the most minor problems, within an average of one business day unless a later time is more convenient.

    In addition, if your web site is that important to your business, you can have a separate system availability guarantee. If Sun has agreed to provide five 9's, then they get 5 minutes 15 seconds of downtime a year. Even if they only have to provide three 9's, that's still only about 8.8 hours of downtime a year.
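    The downtime budgets behind "five 9's" and "three 9's" can be worked out directly. Here is a minimal Python sketch, assuming a 365-day year and availability measured over the whole year; the function name is made up for illustration and nothing here comes from Sun's contract terms.

```python
def allowed_downtime_minutes(nines, days_per_year=365):
    """Minutes of downtime per year permitted by an availability of N nines."""
    unavailability = 10 ** -nines          # e.g. five nines -> 0.00001
    return unavailability * days_per_year * 24 * 60


for n in (3, 4, 5):
    minutes = allowed_downtime_minutes(n)
    whole_min = int(minutes)
    seconds = (minutes - whole_min) * 60
    print(f"{n} nines: {minutes / 60:.2f} hours "
          f"({whole_min} min {seconds:.0f} s) per year")
```

    Five nines comes out at a little over five minutes a year, and three nines at just under nine hours.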

    Sun makes their money by providing very reliable hardware, guaranteeing obscene quantities of uptime, charging an arm and a leg, and then delivering on all of their promises. If they don't deliver, then they will get their asses handed to them in a breach of contract lawsuit. If people agreed to an NDA, it was either Sun doing a very good job of talking fast, or promising better service than what they had contracted for. Any business which had to sign that NDA in order to stay afloat should have invested the extra money in a better warranty agreement, because if your web site is that important to you, you should spend the extra cash to get your uptime guaranteed and contracted.

    Business types don't really mind really expensive hardware/service agreements. Those are nice, fixed, predictable costs, especially if you have contracted with a reliable vendor (Sun). What they hate is having to lay out a bunch of money that they didn't plan for, because something unpredictable went wrong, and they didn't have their risks hedged. Hedging other people's risks is Sun's bread and butter.
  • Has anyone experienced these errors running Sun's Netra servers? The Netra FTs, for example, have CPUsets based on the E450 hardware architecture. I'm pretty much scared w1tl3ss that we may encounter these probs on our telco servers... Anyone out there that has, please reply!
  • Or check http://www.netcraft.com/whats/?host=www.ebay.com [netcraft.com].

    www.ebay.com is running Microsoft-IIS/4.0 on NT4 or Windows 98

    Fh

  • in these connected days, there is a concept called intellectual property, and it's given a great deal of credence by those in power (with money). With that comes all those nice DMCA thingies, and everyone wants assurances that these cool little boxes that let us communicate, function as more efficient tools to funnel money more quickly from our pockets into theirs (or for those of us with stock options, a leeeetle bit back into ours).

    We look back to the days when nobody had enough money, and people were starving and the kids weren't getting educated, and nobody could afford to fix the streets or build enough atom bombs. It was scary back then. Now we know that if we kiss-up to the money-god enough, the money-god will bless us with enough cash to do GOOD things, like feed poor people. (or make them rich enough that they can feed themselves). That justifies a LOT. When you say to yourself as you slip on your pajamas: "It will make our company more profits, which means I will be able to order Muffy (my 16 year old daughter) the Lexus with the leather seats instead of cloth, and the company's stock price will go up, bolstering investor faith in the US market and tech industry, improve the economy, broaden the tax base, allow us to build a stronger military so we can intimidate these third-world dictators into keeping the oil prices down, so I can afford to put gas in Muffy's Lexus, which improves the image of America as a strong and prosperous nation, and helps poor people get off the welfare teat, get jobs, buy themselves Lexuses, give money to those charities that help poor people in third-world countries whose dictators have their tails between their legs because of all those cruise missiles I paid for in my IRS bill - yup, no matter WHAT I do, it's good." - it really makes it easier to sleep at night.

    It's called "rationalization". Generally not the BEST replacement for a sense of ethics. But it passes.

    if it ain't broke, then fix it 'till it is!
  • If you are a manager, and the choice is take the NDA, and get good service, or risk going bust - there is no decision to make here. Let someone else make a stand: principles are nice, working webservers/databases are much, much nicer when your business is on the line.

    You don't understand. What makes you believe that signing an NDA is going to give you better service than not signing one? Just because Sun says so? Corporations say a lot of interesting things when they believe they are in trouble and not all of them are true. It has been pointed out many times here that you are giving away your leverage for some promises to do what they have to do anyway.

    So long as everyone plays nicely for Sun, they get all their customers to sign NDAs, and their customers all get good service

    Oh, you mean Sun can give good service to everyone, but chooses to give it only to those who sign NDAs? In other words, no NDA -- no service? That's at least a breach of contract and I am sure a lawyer can find plenty of other things to sue for. Legal niceties aside, it stinks to high heaven as well.

    Kaa
  • First they have a reputation for stability because they earned it.
    Now they want to keep it through legal agreements preventing people from reporting failures in the system.

    It shouldn't matter if Sun knows what the problem is or not.. if it's a commercial product (not beta) and it fails, especially at that price, the consumers have a RIGHT to know.

    I would almost think that forcing customers to not reveal flaws in your system should be illegal.. it's very anti-consumer.
  • The problem is relatively rare and it only shows up on 400mhz modules with 4MB or 8MB ecaches. If you're seeing ecache errors on that many CPUs, or if you're seeing them on other USII types, then there is something else wrong.

    Assuming you aren't just making this up, it sounds like you have flaky power or the room is _way_ too hot.
  • I have to post this as an anonymous coward because I'm still associated with some of the customers/clients/parties. But hey, might as well tell people.

    I worked at a datacenter with well over 100+ Sun boxes. About 3 years back, we started getting Suns blowing up with this self-same Writeback Cache Parity Error. I logged a support call with Sun. We had an excellent contract with Sun, so I got a person very quickly.

    This person, I guess he was new, found the problem mentioned in the database, and sent along the mail/logged ticket associated with the problem. What it said in the mail stunned me.

    I'm paraphrasing, but this is what the engineering report said:

    1. Yes, any CPUs of that speed of Revision .04 blow up.
    2. No, there's no software fix. It's in the Revision.
    3. The solution is to replace them with Revision .05 CPUs.
    4. We have not produced enough Revision .05 CPUs.
    5. SOLUTION: TELL OUR BIGGEST CUSTOMERS FIRST, THEN DO OTHER REPLACEMENTS ON A PER-INCIDENT BASIS.

    I am not fuzzing this over; this was what the report said. Essentially, when this thing breaks badly enough that a call is logged, then replace it wholesale, otherwise, let the thousands of people using this CPU revision hang. It was a level of cold you normally wouldn't get in your face, and like I said, the employee probably didn't know what he was sending to me.

    But there you have it. This isn't the first time with Sun. And it was the same goddamned error!

    Of course, back then I didn't have Slashdot to bitch to. :) Cheers.
  • Let me tell you my sun story.

    First, the background.
    I've used Sun for years in various projects and jobs. I like Sun. I know what it's like, what it's capable of.

    So.. my company needed a couple workstations. I already knew what I wanted. So. I called my local Sun office and asked for a quote.

    Then.. this sales guy *insisted* on coming to have coffee. Okay.. sure. no sweat.

    He brings his 'engineer' with him. While sipping our fresh coffee, I show them around our place, tell them about what we do... and explain to them why I need the two workstations. I show them my *already* new network room freshly populated with servers.
    What do they do? They sit down with me to give me their 'presentation' about how great Sun is and how crappy everyone else is, and keep trying to convince me to buy Sun workstations. HUNH? I think? WHAT? I already TOLD them I was going to buy them. Why are they still trying to sell them to me?
    Oh. And THEN they got on about servers. I had to cut the meeting off, saying 'Look fellows.. I do know about your servers... I just bought servers, and there is no way it's changing for the moment.'

    Of course, then they invited me to their demo center to see how their little 450 acting as a 'file server' was so much cooler than the NetApp filer that I was about to buy... okay, I thought.. I'll go see that.

    The entire meeting consisted of some guy from Sun showing me their 'NT' integration package, how it does CIFS and how it does domain control, and explaining how it was derived from actual MS source code. Whoppie, I said. I *have* NT servers to do this stuff. Does it do dynamic NIS to NT domain mappings? Oh.. no. Does it let me edit NT ACL's with vi? No.. sorry, they didn't know. Oh, and in order to make it *just* like NT, it has the same bugs in the file sharing code.
    Great, I said. Guys... I want the benefits of unix here... not just another NT box. You are offering me a file server solution that is a) more expensive b) only has software raid and c) although i'll grant it has Solaris on it, and is flexible, it still doesn't have some of the basic snapshot and backup features of the NetApp. And it's far slower at file serving.

    I pointed out (politely) that here they were, demonstrating me a product that was designed to get NT admins to get into Sun (and NOT designed to let unix admins do anything cool), even though I already explained to them that I *LOVE* solaris, and already know this crap.

    Then, they phoned and phoned and phoned.

    Now.. this is *NOT* the behavior I expect from a professional company when I am nice enough to call THEM already ready to buy something.
  • um, 8.5 years in the Tech Support industry tells me this: - in cases like this, "free consulting services" means, an engineer is flown in to try to figure out what the hell went wrong because they couldn't reliably reproduce it in-house.

    That's like Firestone sending a guy to the ditch on the side of the road where your SUV lies, twisted and burning, to take a look at your tires, scratch his head and go: "gee, that's not right, is it?".

    I wouldn't ask Sun to brag about the problem. I do, however, believe that waving a carrot on the end of a stick, and holding up an NDA is dishonest. I could see asking customers to keep mum, but requiring them to sign an NDA goes to a place that makes me nervous. It's because that's one of those things that should make that little voice inside your head that tells you you're doing something wrong, speak up and tell you you're doing something wrong.

    if it ain't broke, then fix it 'till it is!
  • Perhaps it's the joker with the HERF gun next door..
  • by Anonymous Coward
    In fact, caches are more likely to have solar radiation problems than DRAM. Large caches especially.
  • Because it is a rare instance when a corporation gets called to the carpet. Corporations are screwing up (and consumers), big time, on a daily basis. We only hear of a minuscule number of situations where the information leaks out, AND someone decides that it's newsworthy. SOP, and then send out the PR clowns when the SHTF.
  • It's all anecdotal evidence, but at my site, we have a number of E10k's, and other than the occasional kernel updates and custom drivers, our E10k's have been up almost 100%. Granted, software stability is another issue, but that's my problem :-).


    --
  • If you go back and read the article, you'll see that this isn't a system RAM issue at all... it's the CPU Cache:
    The problem involves an external memory cache on Sun's UltraSPARC II microprocessor module.

    The article later mentions that it seems confined to 400MHz CPUs with 4 or 8MB cache.
  • by Anonymous Coward on Monday August 28, 2000 @10:32AM (#821410)
    Nonsense... they are still having problems and have had constant hardware problems since they have been there. 4 hardware problems in the last few weeks. Get a UE10000 and watch it act like a yo-yo!

    User: aw@ebay.com

    Date: 08/09/00

    Time: 21:24:33 PDT

    *** TECH MESSAGE ***

    Recently we have experienced several issues that have impacted eBay's availability. We want to take a moment to update you about our situation and the things that we're doing to address the issues.

    First, over the last few weeks, we have been making a number of "headroom" improvements to the entire system to ensure the scalability of the site for the future. Normally, making these improvements should be invisible to you. Unfortunately, this was not the case.

    These changes resulted in availability issues with My eBay and Seller Search during high traffic periods. There were a number of fine-tunes that had to be made, as well as code issues that had to be addressed, to resolve this problem.

    We believe these issues have been resolved. To be sure, though, we will continue monitoring the system through a few more "prime times" (hours when traffic of the site is at its heaviest).

    Second, we have experienced three hardware failures in the last 10 days that have resulted in system downtimes, including the one tonight. During each failure, we have migrated to our backup system as quickly as possible to restore system availability.

    Later tonight and during our regularly scheduled maintenance on Friday morning, we plan to make additional improvements to the system to help address the hardware issues.

    System stability is still our number one priority. We appreciate your support.

    Regards, eBay

    User: aw@ebay.com

    Date: 08/09/00

    Time: 19:57:52 PDT

    *** SYSTEM STATUS ***

    The eBay system is currently available.

    At 19:15 PT, we experienced a hardware failure on our main server. We migrated to our backup system, and the site became available at 19:57 PT. Please accept our apologies.

    We will continue to carefully monitor the system and will inform you of any changes in its status.

    Regards, eBay

  • by bigdogs ( 90229 ) on Monday August 28, 2000 @10:32AM (#821411)
    You're partially right. They did encounter some known Oracle problems, specifically with intimate shared memory.

    They were definitely *not* running 2.5; an E10K requires >=2.5.1. Their high profile outages last year were while running 2.6.
  • Get a UE10000 and watch it act like a yo-yo!

    Funny, my company has one.. and it's pretty damn solid. There were issues when it was new, due to the same things the person you responded to said: our people were using an older solaris (2.5.1) and custom software. Once the kinks were worked out it was found to be rock solid. Makes one hell of a spam mail generator too (got paged when oncall one night because someone's script got into an infinite loop and was sending 1000's of mails every second to his account, load on the machine went over 200, but it still did its job and didn't crash).

    BTW, if the admins at eBay simply did something wrong or made a mistake, do you think they'd admit it publicly? "Hardware failure" is always an easy excuse to give to people.

  • Fair point, but there was a reason for '95:

    This was a software company. The software was a piece of server software, so customers ought to be running it on NT, but they wanted to be able to say it would run on all M$ OSes, so they were testing it on all OSes they could lay their hands on ('95, NT, and some betas, I think - this was a few years back now).

    I was not working for the company, so I cannot say any better than this. Ultimately, I think that the product remained in development for a long time, so I think it was released after '98 came out. I *think* the problem was fixed by then (not sure) so I don't think it was ever a problem in the real world, I was just describing M$'s behaviour.

    G
  • That hasn't happened in a while... ;)

    I had a hardware freeze in a brand-new E450 today (4x400 UltraSparc II's with 4 meg cache) and was trying to figure out what might've caused it when I saw the article.

    That makes sun 0/1 so far...
  • Yeah, point taken, and good information.

    Nevertheless, I would still submit that whatever you say, when the Sun engineers arrive at Ebay, the Ebay suits would cheerfully suck the engineers' dicks if they thought it would get the server back up a minute sooner.

    Okay, I'm sorry, maybe there is no need to get offensive here, but my point is, would you, honestly, not sign that piece of paper?

    G
  • As a sr solaris sysadmin, who has worked on Sun boxes for years, I have /nothing/ but praises for Sun service and support. Sun QA is top-notch, in comparison to the rest of the tech industry. I got my start in Linux, and still use it a great deal. At home, all but three of my boxes run Linux, including several PCs and a Sun 670MP. I also use various BSDs. Pretty much, so long as it's Unix, it's ok by me.
    Bearing this in mind, realize that I am capable of objective, honest review.
    Sun has done more for the free software community than anyone therein seems to want to acknowledge, even though they are threatened by Linux. They are a large company, and do have their share of corporatism, but they also get an unfairly bad rap in the Linux community, for reasons I do not comprehend. Sun hardware has always been the industry standard for rock-solid reliability, and IO bandwidth. They never have been the blazing speed machines.
    Going back to Ebay, where people were asking whether this was a problem with the cache (It is not a RAM issue, but an issue with the cache on the 400Mhz UltraSparc II processors, and I have /never/ seen it outside of 2x400 configuration in an Ultra II).. It wasn't. Ebay was a victim of bad sysadmins. Perhaps they were very good sysadmins, who had no idea of what to do with an E10k. Perhaps management made the decision for them. (This happens with eerie regularity)
    The fact of the matter is, the E10k is not a 'super-processing-power' box. It's a 'IO pumping, high-availability' box. The sysadmins at Ebay had the E10k running flat out, not partitioned (As they're meant to be run) in quadrants. They grew so fast that they put the other E10k into production in the same fashion, instead of using it as a hot standby. Each E10k was a single point of failure, with the ability to be multiply redundant internally removed. A single problem with an OS that wasn't even officially supported on the E10k running at an invalid patchlevel caused a very highly publicised downtime. Instead of blaming bad setup (Which would be disastrous for investor relations), Ebay blamed Sun.

    As to the latter part of this article, I know nothing about Sun covering up that problem, (Which I have seen before), but don't deny that Sun, being a big corporation, might do such things, as all corporations are wont to do, even the ones very popular in the Linux community. Usually that problem manifests itself in the system log long before any problem is ever seen. This problem is also listed on Sunsolve. [sun.com]
    Sunsolve is one of the most open policies I've ever seen to system-related issues. The only group of people that even come close to that level of support is Debian.

    While I know this was rather long-winded and might generate lots of flames, I do mean it. Don't bash Sun summarily, and don't bash Sun on QA. It's like talking about raising "Serious questions about Honda QA" if Honda issued a recall for defective OEM tires (A year after the vehicles with those tires were issued). Almost nobody would think to bash Honda QA over a single issue. Sun may have had a few quality issues from time to time, but so does everyone. And at least Sun is actually saying something, unlike companies that deny forever.

    Why bash Sun, and not Intel - Another /. headline for today.

    -Kysh
  • That's an entirely different issue. I can almost *guarantee* you that you had poor airflow on those Sparc 5's. Either they were on their side (not the way to set a Sparc 5) or they had so much dust and lint on the intake side that no air got in. Either of these causes heat issues and that leads to premature hard drive and memory failures (moreso than motherboard failures.) I supported a site with 2500+ Sparc 5's and that was the problem 98% of the time.
  • Imagine the pissed-off customer having to sign this. Hope the customers now have concrete plans to move over VA Linux or other such Linux servers.

    Please. As much as I think Linux is a great operating system, x86 hardware and Linux still can't touch the performance and scalability of Solaris on Sun SPARC hardware. I maintain an application which runs on dual Sun E6500s each with 10GB of memory and 30 UltraSPARC II processors. A Linux box couldn't touch this right now. The backplane is MUCH faster than any multi-CPU x86 machine and the fact that it can juggle 30 processors is something that Linux on x86 simply does not have going for it right now. Maybe someday, but certainly not now.

    --

  • >Also, do understand that these sort of NDA's are somewhat common
    >when dealing with potentially explosive matters like this.
    >Certainly Sun is interested in keeping tight lips, but
    >they also would prefer to announce a solution along
    >with the problem.

    I'm sorry, but this is total bullshit. This might have worked back in the pre-'net days, but fortunately those days are behind us. If my computer is crashing, one of the very first things I do is get on the 'net, check www.deja.com, and see if other people are having similar problems. It's good to know that other people are having similar problems! And, if anybody has gotten a satisfactory solution, you can demand that same solution. Keeping people in the dark is so neandertal I can't quite believe that anybody can defend it anymore.

    I'm sure that Firestone would have preferred that nobody talk about their tire problems, or that the makers of Rezulin would have preferred people not talk about those annoying liver-failure deaths that occurred, but it's just pigheaded of them. And, in these connected days, it will not stand.

    thad

  • by Anonymous Coward on Monday August 28, 2000 @11:00AM (#821428)

    I'm not representing Sun in this post-- just the facts. Get the facts straight:

    It affected very few customers. Sun has bent over backwards for customers to fix this. This included free consulting services to some sites, and Sun did not use NDAs to gag customers on this. The NDAs were for disclosure of strategic planning regarding future systems/products and service offerings. How can a company keep customers that have had problems without being friendly to them?

    It's an issue that only affects 8MB-cache 400Mhz processors.

    It's an issue w/ CACHE on a CPU NOT system memory.

    The problems usually crop up in systems in poorly maintained data-centers. This includes centers with large temperature fluctuations, poor voltage regulation, poor humidity controls, and improper grounding. "User-error" and "misuse" exacerbate the problem.

    The problems are limited to a particular production run of CPUs. New CPUs don't have this problem.

    Sun hasn't denied a problem. Sun hasn't bragged about the problem, either. Would you?

  • I'd like to semi-anonymously state that I work for one of the companies that has experienced this problem on E10K's. Sun has been working with our company on this for quite some time. I have to say that it seems to me that the problem is largely environmental.

    While Sun could have manufactured the part to be more environment-tolerant, the users can also virtually eliminate the problem by operating their datacenters well within spec.

    Therefore, I don't completely blame Sun for this. Even if there was no environmental influence, and it was just pure manufacturing flaw... the rates that this flaw happens at are fairly low.

  • In general, I have a sneaking suspicion Sun is worse than Microsoft. Microsoft really never tried to deny it was monolithic. Every once in awhile it would try to do something spry, but it knew it was/is a lumbering beast, carried more by the momentum of the marketplace than by its latent innovation.

    Sun, on the other hand, has laid quiet, touting its successes where and whenever possible, covering up its failures, helping demonize Microsoft (unite the people behind a common enemy), and not really living up to its promise as the superior technological company.

    With Microsoft losing its Imperial hold, Sun is beginning to look like a pretty shifty company, casting doubt on its commitment to its customer base.

  • I think everyone's missing the obvious solution for this problem. Just take the red pill, sign the NDA, then post away all you want to Slashdot as Anonymous Coward. Simple, really.
  • I'd sign the NDA in a second - If they promised to build me a Lego Office.

    Hey, boss. The bad news is we're still crashing. But just look at the pencil sharpener I built using #2371 and #1726.
    --
  • The nondisclosure agreements were apparently offered with a claim that signing them would bolster Sun's commitment to resolving the problem quickly, Henkel said.

    IANAL, but I have picked up tidbits about contract law. Basically, if you sign something away, the contract is only binding if you receive something of value in return. Be it information, cash, services, property, or whatever.

    A half-assed lawyer could convince a judge that the customers got nothing for signing the NDA, since they were (presumably) already entitled to timely fixes by warranties or service agreements.


    My mom is not a Karma whore!

  • Show me a PC that still costs that and has:
    1. Built in ability to boot off the net or _any_ other device you want, and can be set to default to that.
    2. Serial console ability out of the box.
    3. Massive online support center (sunsolve) from the vendor.
    4. If hardware is made for the system, it _will_ work, period (I've run into plenty of PC hardware that doesn't play nice on some mobos or in combination with some other cards).

    Sun hardware is expensive, and support is costly, but you damn well get what you pay for. We can get Sun here within an hour for critical issues, and the next day at the latest for non-critical. How's your local vendor and/or manufacturer on similar issues with PC hardware? I've had PC stuff fail and it takes forever to get replacements (unless the store's open, they have it in stock, and will let you return it). It's usually faster to just buy a new part.

    BTW to your #3: Which OS is native for PCs? DOS? Windows? Linux? Personally I prefer Linux, but PCs weren't designed for any specific OS. Sun hardware was. Linux runs nicely on it though. :)

    Oh yeah, can you run 64 bit on that PC? Didn't think so.

  • > Wow does that sound like a bad idea! Talk about giving up your leverage.

    N0 5417.

    I would have said "How 'bout this non-disclosure: I won't blab it all over the internet, if you have it fixed in four hours."

    --
  • I've never seen this one happen. But then again I've only got 3 machines to your 50. However the grounding strap sounds like a great idea in any case.

    WWJD -- What Would Jimi Do?

  • "I like the followup saying we should also switch to VA Linux servers. That's a nice idea, but which product can we use to replace our 28 cpu systems with 28 GB of memory?"

    What about something like this [ibm.com]? And if that's not big enough, try the RS/6000 SP...
    Or you might try going with Compaq/Alpha, who also have some pretty decent machines that can scale up to 32 [compaq.com] CPUs in a box and even more for the SC series.


    Okay... I'll do the stupid things first, then you shy people follow.

  • .. but I have a question. I know that most companies require NDAs when they are having people beta test their software, (I know I had to while I was beta testing Windows Millennium) so I'm wondering why a company would agree to not say anything about Sun's faulty memory.

    Are the companies getting anything out of this at all? I know that if I was in their position, and some corporate goon (tm) from Sun came along and said:
    'Yeah, we know you're having crashes and frequent reboots for no reason due to our memory, so we want you to sign this saying that you won't mention the problem to anybody. We won't offer anything in exchange, not even to replace the RAM, but rest assured, we're trying to fix it as fast as possible.'
    I'd enter into my Matrix-esque quote, 'That sounds like a good deal, but I have a better one. I give you the finger, and you give me my phone call.'

    --
    CitizenC
  • by Anonymous Coward
    Sun customers who have been affected by the problem are unwilling to speak openly about it because Sun has persuaded many of them to sign nondisclosure agreements... [the] nondisclosure agreements were apparently offered with a claim that signing them would bolster Sun's commitment to resolving the problem quickly

    Oh come on. All it took to get these people to sign nondisclosure agreements was a promise that Sun would work harder to fix the problem? Give me a break!

    Nobody decided to say, "yo, how abouts, you fix it now, and I won't find another server and tell all my friends about this issue."

    There must've been something else in the offering to keep these clients quiet... I wonder what it could be...

  • by Anonymous Coward
    I worked as a contractor for Sun in the first tier hardware support center in Burlington, MA from Nov 99 - Feb 01. This was a commonly known problem and was by no means treated as secret. There were a few syslog errors on the 400Mhz 4MB and 8MB cache processors that were tip offs. One, if I remember correctly, was a red state exception error and there were a few other cpu panics that were tip offs. It really was a low percentage of CPU related calls taken, but it was a known "mystery" problem. The procedure was to replace the CPUs immediately, and collect a core dump for the kernel team to analyze if save core was enabled and they actually had a dump. As techs we were not under any gag order on the phone. We could acknowledge there was a problem of which no cause or resolution was known and all we could do was replace the offending CPUs. Side note: my experience with Sun, as a contractor, was nothing but positive and they had a really great working environment in the support teams. We were always under instructions to side with customers and give service as best we could. Almost all of the reps would bend over backwards to help and give service on expired contracts, etc. Of course there were bad apples, but most people really wanted to help.
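For anyone wanting to look for the syslog "tip offs" mentioned above, here is a rough sketch of a scanner that tallies parity panics per CPU. The regular expression is an assumption based only on the panic strings quoted in this thread ("CPU28 Writeback Data Parity Error", "Ecache Writeback Data Parity Error"); it is not an official Sun message format, the function name is hypothetical, and the second sample line is made up for illustration.

```python
import re
from collections import Counter

# Assumed pattern, derived from the messages quoted in this thread only.
PANIC_RE = re.compile(
    r"CPU(?P<cpu>\d+)\s+(?:Ecache\s+)?Writeback Data Parity Error",
    re.IGNORECASE,
)


def count_parity_panics(lines):
    """Return a Counter mapping CPU number -> number of matching log lines."""
    counts = Counter()
    for line in lines:
        match = PANIC_RE.search(line)
        if match:
            counts[int(match.group("cpu"))] += 1
    return counts


sample = [
    # Panic string as quoted by a poster earlier in the thread.
    "panic[cpu28]/thread=0x307dbe80: CPU28 Writeback Data Parity Error: "
    "AFSR 0x00000000 00800002 AFAR 0x00000001 8104dfe0",
    # Hypothetical line, invented for this example.
    "CPU3 Ecache Writeback Data Parity Error",
    "some unrelated log line",
]

for cpu, hits in sorted(count_parity_panics(sample).items()):
    print(f"CPU{cpu}: {hits} parity panic(s)")
```

In practice you would feed it the lines of your own messages file instead of `sample`; a CPU that shows up more than once is presumably the one to raise with support.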
  • According to the article:

    The nondisclosure agreements were apparently offered with a claim that signing them would bolster Sun's commitment to resolving the problem quickly, Henkel said.

    I'll tell you what really bolsters a company's commitment to fixing a problem, and that is realizing they are going to lose a lot of business if they don't fix it. An NDA simply means they get to take their time, fix it on their schedule, and not suffer severe ramifications for their actions.

    A more proactive company would have jumped right out there, admitted their problem, and outlined exactly what they were doing to fix it. That prevents a leak from making them look like slimeballs.

  • Fair point, troll, but only for the Sun example.

    Half of my post was dedicated to relating a similar tale of a large IT company forcing customers to sign an NDA before receiving any help.

    And this story was a perfect example of where open source pays off. If Windows was open sourced, an external developer could have gone through the code, and spotted a fault in the system that caused it to crash.

    As a very minimum, he could have stopped wasting his own time trying to find out if his software was crashing the system. At best, he may well have written a fix for the bug, and posted it on the Internet, so other users could make their Windows boxes more reliable.

    As Windows is a closed source product, for a long time M$ publicly denied that any such problem existed, and only would reveal this to developers under the cloak of NDAs.

    G
  • THIS problem isn't RAM related, either. It's the ecache on the CPU module that's causing the problem. It isn't the DIMMs that are causing this.
  • We are the 'dot' in 'dotdotdot Why doesn't this f*cking .com site load?'
  • I've logged a kernel bug with sun for more than a year now. We opened tickets and got no real fix. After heavy threatening we got the vxfs licence for free which enabled us to work around the bug.

    As of now, I still can kill every cgi-enabled webhoster running Solaris/UFS.

    If those lame-assed, self-satisfied, over-confident bastards piss me off one more time, the exploit will be on Bugtraq.

    /ol
  • My group runs a mid-size to large server farm at a (very) major ISP and we're constantly replacing processors that crash due to "Ecache Writeback Data Parity Error"s. We were told that even replacing the processor isn't a remedy because the new one is just as likely to eat itself as the old one if you left it in place (As far as they know.)

    In addition, the latest fix is a software patch that is supposed to massage the Ecache so that it never finds itself in the condition that they believe causes the error. Remember, they're still guessing at this point. 18 months later. How many of those 400Mhz are now used up with self-checks and Ecache scrubbing?

    Ever babysit a Sun E-anything on bootup? Not only does it cost the company tons of $$ in downtime (made more extreme by the long boot process), it also costs them $$$$/hour for their engineers to sit there and watch these things POST forever.

    I think the most aggravating part is how, for all intents and purposes, Sun is now using the world's largest enterprise sites as beta testers for its products, just like M$ uses the world to test its software, except that Sun expects us all to sign our voices away with the NDA so they don't look like a bunch of ..... (something bad that you wouldn't wanna be called).

    (Non?)sequitur question: has anyone been able to get Sun ftpd to log to syslog like the man page says it can?

    *** My opinions are my own and not necessarily the same as my employer's.
  • Well, the money certainly was not thrown away on the "Loose Cannon" commercial. That one rocks! Had me completely fooled.

    However, I can do without the "Power of the Dot" space movie ripoff commercial.

    Visit DC2600 [dc2600.com]
  • by sterno ( 16320 ) on Monday August 28, 2000 @10:01AM (#821483) Homepage
    Perhaps Sun believed that the recent strong growth of Windows as a server platform meant that customers liked having sporadic reboots.

    ---

  • by TheGratefulNet ( 143330 ) on Monday August 28, 2000 @10:03AM (#821486)
    I was told by one of our hardware engrs that only very high density ram (256meg or more) really needs ECC support.

    then again, each year ram density increases so I'm not sure which density is considered safe and won't need ECC.

    for boxes that will stay up a week or more, I tend to buy ECC 'just because'. it's not much more expensive and doesn't slow things down enough to care about (even though gamers and o/c'ers will disagree).

    --

  • by Anonymous Coward on Monday August 28, 2000 @11:51AM (#821493)
    Yes, that's a classic case of the problem. I'm a Sun support engineer of sorts by trade.

    A few points - The Ecache problem should, and normally only does, happen once on a CPU. If it happens more than once, Sun replace your CPU.

    The problem is severely localised, i.e. it seems to affect systems in certain system rooms. The problem only ever occurs when a system is idle on SMP systems (who runs 400MHz 8MB Ecache UltraSPARC CPUs in a single-CPU system?).

    I've seen figures and graphs about the problem - It was a problem that has been generally sorted since about November last year, but the "one off reboot" problem appears to have persisted.

    The reason why Sun got customers to sign NDAs is that they were given in severe detail what the problem was, what Sun were doing to deal with it, and how Sun have generally sorted the problem. There's no large scale cover up, Sun gave customers extremely detailed and relevant information. This can only be a consistent problem if a customer doesn't act on the problem.

    This problem had precisely nothing to do with the eBay admins' problem with running their Oracle database. I'm posting anonymously for obvious reasons, even though I've given out far more detailed info. to customers with problems, and probably on occasion on USENET.

    Sun reacted to the problem with much vigor. I don't think you'd get the same detail from INTEL etc. Obviously Sun don't go out of their way to advertise the problem. I've heard a few "solutions" to the problem conjured by folks on USENET, including tightening a screw on your CPU.

    There was no cover up. Customers got information. Customers who wanted more info. got detailed sensitive information on the problem, if they had a good case for it. Sun aren't about to force people to sign NDAs to use StarOffice.

    I don't believe this is as severe a problem as it used to be. I've seen the stats.
  • The memory mezzanine card on the 420s (and I guess 220s, though I have none in my shop) has an unfortunate design flaw. It needs to be torqued down into its socket with a specific level of force. A special wrench is supplied with each unit, but since it's just a hex driver with a kink in it, I imagine a lot of people throw them away, not knowing what they are (like we did, until we realized how important it is to keep them!).

    As the machine operates, heat causes the mezzanine card and its socket to expand, and they are made of different materials, so the card will come unseated a little bit after a short while. Re-torquing the card makes the Ecache errors go away. After operating a few days and re-torquing, the card is finally seated permanently, and needs no further adjustments.

    It's totally a hardware problem. Thing is, the field engineers are the ones who have solved it (at least ours did!). That knowledge seems not to have trickled back up the food chain yet.

    Our Ecache errors are a thing of the past (knock on wood).

    By the way, you can make POST a lot faster by turning off the diag tests.

    Oh, and I don't use the Solaris ftpd, and I suggest you do not use it, either. Venema's ftpd from daemontools is the only ftpd I know of that has fixed the third-party-PORT bug that Sun's and WU-based ftpds suffer from.

  • Free Staroffice CDs!
  • by devphil ( 51341 ) on Monday August 28, 2000 @10:05AM (#821501) Homepage
    The existence of a problem with the 400MHz CPU with a 4 or 8MB cache has been well known on Usenet groups and a couple of online magazines. Sun engineers posting to the discussions say, "Yes, there's a problem. We think it might be foo, bar, or baz. Try the following steps..."

    I don't think I've ever heard or read anything about Sun denying a funky problem in those chips. They may still be looking for the precise cause, but every time the issue comes up, somebody from Sun generally admits to it.

    Dunno where the Gartner Group gets its figures from.
  • Something doesn't add up here - One company in this article mentioned that Sun helped them rearrange their data center to get the boxes cooler, then they stopped having problems. The other company essentially said that Sun hadn't helped at all, and had not forwarded any information.

    Why would Sun not tell customers to keep the machines cooler until a fix could be found? Really, how many bigish Sun machines are running in uncontrolled climates? I would think turning down the thermostat a few degrees wouldn't be a big deal compared to "frequent" crashes.

    Also, why would a customer sign an NDA with a vendor? Support levels should be stated in the support contract; any less and the lawyers get involved. This sounds like a device to prevent distribution of news, protecting stock price and corporate image. I always thought that should be done in testing/QA.

  • From my experience in the past few months with Sun's equipment, I've had a number of failures with the CPU modules.

    I have about 15 of the systems at a customer site and another 15 at Exodus. There have been no problems with the systems at Exodus, only the ones sitting in the customer's poorly vented and overcrowded equipment room. Of course the ambient temperature at Exodus is such that a jacket is often required if you intend to stay for long...

    From the line about also checking the customer's installation environment, my guess would be that the majority of the issues are that the chips are failing when put under environmental conditions that are near their posted maximums.

    People tend to forget that when you rack mount a server, you have to pay close attention to the airflow to that server. Sun's 4x00 series of systems are rather strange in that the airflow is right to left, not front to back, so you cannot put a 4x00 system into a 19" rack and expect it to work correctly, simply due to the airflow issue. (This is why Sun's 5x00/6x00 come in such a strange looking rack with all of that empty space on the left... )
