LiveJournal Blackout Analysis Online

Posted by CmdrTaco on Thursday January 20, 2005 @04:27PM from the when-it-all-hits-the-fan dept.

Hakubi_Washu writes "LiveJournal has posted their official analysis of what happened last Friday. Apparently someone "accidentally" pushed the emergency power off (which should keep all power off, even UPS), reset it and ran off. They had trouble coming back up quickly because of "9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly", machines not fully rebooting for analysis reasons, and a few other issues."

Lesser OS... (Score:5, Funny)

by Anonymous Coward writes: on Thursday January 20, 2005 @04:28PM (#11423679)

They should be using OpenBSD. It can run right through power failures [grub.net]

- - - - Re:Lesser OS... (Score:2, Informative)
        
        by ergo98 ( 9391 ) writes:
        
        Power failed to get to the computers. It was a power failure - whether it was the electric grid, the UPS blowing up, or all the wires in the wall, or in this case the EPO button, it's a bloody power failure.
        
        Re:Lesser OS... (Score:2, Insightful)
        
        by ghjm ( 8918 ) writes:
        
        I'm about to leave work and go home. When I do, I plan to hit the so-called "power button." When I do this, code will execute on the box that flushes cache to disk and then commands the power supply to interrupt most (but not all) of its DC output. At that time, my computer will be in a state commonly referred to as "off."
        
        By your logic, I can claim that my computer is down due to a power failure.
        
        Perhaps you would complain: But power was getting to the computer.
        
        So what about the situation where I accident
The less we've learned... (Score:5, Funny)

by geoffspear ( 692508 ) * writes: on Thursday January 20, 2005 @04:28PM (#11423686) Homepage

Don't let your clients near the Big Red Button without an escort. Preferably an armed one.

- Re:The less we've learned... (Score:2)
  
  by stupidfoo ( 836212 ) writes:
  
  And don't have it red. Have it black. People, especially kids, love pushing that damn red button, no matter how many warning signs you put around it.
  - Re:The less we've learned... (Score:2)
    
    by cdrudge ( 68377 ) * writes:
    
    Code may dictate that it needs to be red.
  - Re:The less we've learned... (Score:3, Funny)
    
    by Chris Mattern ( 191822 ) writes:
    
    "The beautiful shiny button! The jolly, candy-like button!"
    
    Chris Mattern
  - Big red buttons (Score:2)
    
    by cbr2702 ( 750255 ) writes:
    
    So make a little black button and know where it is, but also make a big red one that turns off the lights. That way you get to yell at little kids without much harm to your system.
  - Re:The less we've learned... (Score:3, Funny)
    
    by geminidomino ( 614729 ) * writes:
    
    Evil overlord list item #9: I will not include a self-destruct mechanism unless absolutely necessary. If it is necessary, it will not be a large red button labelled "Danger: Do Not Push". The big red button marked "Do Not Push" will instead trigger a spray of bullets on anyone stupid enough to disregard it. Similarly, the ON/OFF switch will not clearly be labelled as such.
- Re:The less we've learned... (Score:2)
  
  by Lispy ( 136512 ) writes:
  
  Offtopic but hey:
  The red button will eternally be linked in my brain to the one in the pool of Maniac Mansion that reads "Do not push" and which everyone I know pushed anyway. ,-)
- Photo of the button (Score:2, Insightful)
  
  by teneighty ( 671401 ) writes:
  
  Apparently this photo [kekkai.org] is an example of the button that was "accidentally" pressed.
  
  I'd love to hear the explanation for this "accident".
faulty mobo's (Score:5, Interesting)

by Lifthrasir ( 646067 ) writes: on Thursday January 20, 2005 @04:29PM (#11423697)

So, they had faulty motherboards, knew about it, and didn't do anything to fix it before they had a major outage?

- Re:faulty mobo's (Score:2)
  
  by wankledot ( 712148 ) writes:
  
  The solution is even funnier...
  To get them back up they need somebody at the NOC to plug them into a compatible switch, let them autonegotiate, then move them to their real switch.
  This is how a company with millions of paying accounts runs its data center, and they even knew about the problem!
  - Not millions of paying accounts. (Score:5, Informative)
    
    by EvilStein ( 414640 ) writes: <spamNO@SPAMpbp.net> on Thursday January 20, 2005 @04:57PM (#11424042)
    
    Actually, most of the accounts don't pay. They're just freeloading whiners.
    
    This is a paste from the Livejournal stats:
    
    * Free Account: 5713743 (98.3%)
    * Early Adopter: 14220 (0.2%)
    * Paid Account: 94857 (1.6%)
    * Permanent Account: 1632 (0.0%)
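
A quick check of the arithmetic, as a minimal sketch: it recomputes each category's share using only the four counts pasted above. The variable names are illustrative, and the denominator (the sum of these four categories) is an assumption; the stats page may have used a different total, which would explain small differences from the pasted percentages.

```python
# Account counts as pasted from the LiveJournal stats above.
accounts = {
    "Free Account": 5713743,
    "Early Adopter": 14220,
    "Paid Account": 94857,
    "Permanent Account": 1632,
}

# Assumption: the denominator is the sum of these four categories.
total = sum(accounts.values())

# Share of each category, in percent.
shares = {name: 100.0 * count / total for name, count in accounts.items()}

for name, pct in shares.items():
    print(f"{name}: {accounts[name]} ({pct:.1f}%)")
```

Against that assumed denominator, paid accounts come out at roughly 1.6% of the total, so the paying base is in the tens of thousands rather than the millions.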
    
  - Re:faulty mobo's (Score:2)
    
    by tchuladdiass ( 174342 ) writes:
    
    Also, you should never rely on autonegotiation -- there are no standards. That's what ethtool or mii-tool is for, or at a minimum specify speed/duplex in your /etc/modules.conf file.
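
As a rough illustration of the ethtool approach mentioned above, the sketch below builds a command line that turns autonegotiation off and pins speed and duplex. It only assembles the arguments and does not run anything; the helper name `ethtool_force_args` and the interface name `eth0` are assumptions, and actually applying the setting needs root and a driver that supports it.

```python
from typing import List


def ethtool_force_args(iface: str, speed: int, duplex: str) -> List[str]:
    """Build an ethtool command line that disables autonegotiation
    and pins the link to a fixed speed and duplex."""
    if duplex not in ("half", "full"):
        raise ValueError("duplex must be 'half' or 'full'")
    if speed not in (10, 100):
        # Restricted to 10/100 here; gigabit copper links normally
        # depend on autonegotiation, so forcing is not sketched for them.
        raise ValueError("speed must be 10 or 100")
    return ["ethtool", "-s", iface,
            "speed", str(speed), "duplex", duplex, "autoneg", "off"]


# "eth0" is a placeholder interface name.
print(" ".join(ethtool_force_args("eth0", 100, "full")))
```

The switch port would need the matching fixed setting; forcing only one end is what produces a duplex mismatch.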
- Re:faulty mobo's (Score:2)
  
  by BridgeBum ( 11413 ) writes:
  
  Maybe faulty, maybe not. There are a lot of incompatibilities and general "flakiness" with some network auto-negotiation interactions. It's a fairly standard precaution in large network environments that servers should not rely on auto-negotiate and instead should have their speed and duplex settings hard-coded.
  
  In reality, the only places where auto-negotiation is important are mobile devices (laptops) which may connect to a variety of network connection types or for the home user "plug-and-play" market. Ma
  - - Re:faulty mobo's (Score:2)
      
      by ignorant_newbie ( 104175 ) writes:
      
      >We have 9 machines with faulty motherboards
      >with embedded NIC
      
      so basically, they're using shite hardware because they're too cheap. bet they've noticed by now that it costs less to use good hardware than to try to fix it later when something goes wrong
503 pages (Score:2)

by Folmer ( 827037 ) * writes:

Now, if Slashdot could fix their servers, so we wouldn't get those annoying 503 sites...
I haven't seen them that much lately, but then I haven't been online that much either...
- Re:503 pages (Score:2)
  
  by Rosco P. Coltrane ( 209368 ) writes:
  
  Now, if Slashdot could fix their servers, so we wouldn't get those annoying 503 sites...
  
  You get 503 sites? I only reach one at slashdot.org
  
  Then again, you're a subscriber. Who knows what goodies you lucky few get here...
Oppsie (Score:5, Funny)

by darkstar949 ( 697933 ) writes: on Thursday January 20, 2005 @04:29PM (#11423705)

"I'll just set my coffee down here, and..."
...
"Oppsie, I hope that button wasn't anything important."

- Re:Oppsie (Score:2)
  
  by Gary Destruction ( 683101 ) * writes:
  
  You mean that big red button wasn't the coffee maker? Oops.
  - Re:Oppsie (Score:2)
    
    by superpulpsicle ( 533373 ) writes:
    
    You mean that Staples commercial with the big EASY button is not a real product? I was waiting for it to go on sale.
History Eraser Button (Score:5, Funny)

by bsd4me ( 759597 ) writes: on Thursday January 20, 2005 @04:30PM (#11423706)

Ah, the famous History Eraser Button rears its ugly head. I think that everyone who has worked in a large datacenter or lab environment with one of these has a story to tell...

- Re:History Eraser Button (Score:5, Interesting)
  
  by scribblej ( 195445 ) writes: on Thursday January 20, 2005 @05:06PM (#11424157)
  
  I'll go right ahead then. I was consulting for State Farm installing machines that were supposed to help with the Y2K problem. Hell if I know, I just got the box, went to the site, installed it and made sure it was working. Easy. I had five to do a week, and would be done by Tuesday morning and helping out other contractors on similar projects.
  
  I'll never forget my visit to the State Farm DSO in Detroit, MI. I'd just physically installed the new machine, at the bottom of a rack, and stood up.
  
  Stood up putting my shoulder right into the unprotected "History Eraser Button" on the wall. The screams of the employees working in the datacenter could be heard all the way back home in Chicago, I've no doubt.
  
  Then it turns out the fuses which will reset the systems in the datacenter are in a locked cabinet.
  
  Then it turns out no one on site has a key.
  
  Fortunately, I found that the cabinet will pop open if you kick it hard enough. Hey, I was panicking, okay?
  
  And get this. After it was all over and I realized I probably wouldn't get killed by anyone... they told me "It's okay, this happens all the time. The guy installing the A/C unit last week did it too."
  
  Maybe they should have put a cover over the damn button then. Morons.
  
  - Re:History Eraser Button (Score:2, Funny)
    
    by Local ID10T ( 790134 ) writes:
    
    I was consulting for State Farm installing machines that were supposed to help with the Y2K problem.
    
    Hey! I worked that project too... it was fun, but mind-numbing. They actually sent me to New Orleans for an install on Fat Tuesday.
    
    Mardi Gras on an expense account :)
  - - Re:History Eraser Button (Score:3, Funny)
      
      by cgenman ( 325138 ) writes:
      
      If I ever catch anyone putting a cover over a critical piece of safety equipment, like an Emergency Power Cutoff switch, I'll put their head on a pole in front of the data centre as a warning to others.
      
      You of all people should realize that putting someone's head on a pole in front of a data centre is dangerous. For one, it tends to become a disease vector, as for some mysterious reason everyone feels the need to touch it. Rats are usually attracted to the smell, and you know how rats wreak havoc on eth
- You mean "The Big Red Button" (Score:2)
  
  by rednip ( 186217 ) writes:
  
  A couple of years ago, when our server room was being 'certified', one of the specific checks was "No, big red button, check". One of the guys in the group came up with a story about how someone's kid at the end of a 'tour' thought that the 'big red button' was meant to be pushed.
- - Blame (Score:2)
    
    by bsd4me ( 759597 ) writes:
    
    Most of the time it is Stimpy's fault. The rest of the time it is Fry's fault. I think there may be a connection...
Perhaps they should answer (Score:2)

by antifoidulus ( 807088 ) writes:

/.s current poll [slashdot.org] now?
Fascinating read (Score:5, Insightful)

by Saint Aardvark ( 159009 ) * writes: on Thursday January 20, 2005 @04:31PM (#11423728) Homepage Journal

It's amazing how much you can learn from things going horribly wrong. :-)
Congrats to the LJ folks for getting things working, taking the time to do it right, and giving an admin's-eye-view into what actually happened.

- Re:Fascinating read (Score:2)
  
  by caluml ( 551744 ) writes:
  
  Agreed. I always appreciate when people explain how large scale outages happened, were able to happen, how they fix it, and what they do to prevent it happening again. It's useful (and good for your employment status) to learn from other people's mistakes rather than your own.
  So Slashdot - what are all the 500 errors about then? :)
Missing opportunities (Score:4, Funny)

by Rosco P. Coltrane ( 209368 ) writes: on Thursday January 20, 2005 @04:33PM (#11423755)

Apparently someone "accidentally" pushed the emergency power off

They had to power back on when they realized deadjournal.com [deadjournal.com] was already taken...

LJDotting: LJ user base vs Slashdot user base. (Score:5, Funny)

by TrevorB ( 57780 ) writes: on Thursday January 20, 2005 @04:33PM (#11423759) Homepage

If Mr. "I Pushed The Big Red Button"'s personal information ever gets published....

LJ's active user base is easily 10x that of Slashdot's. We'd have to come up with a new term for the internet event that pales any slashdotting that ever came before.

Auto-negotiation (Score:4, Informative)

by stilwebm ( 129567 ) writes: on Thursday January 20, 2005 @04:34PM (#11423772)

When I first moved company servers into a new colo four years ago, their engineers advised me that I should turn auto-negotiation off on every port, including our switches and host NICs. I asked why they recommended this and they replied, "trust us, auto-negotiation causes problems when you least expect it." I went ahead and fixed the port speeds everywhere. Now I understand why.

- Re:Auto-negotiation (Score:2)
  
  by Malk-a-mite ( 134774 ) writes:
  
  If you know what speed port you are plugging in to why would you need to autoneg?
  
  It's a convenience that isn't always needed.
- Re:Auto-negotiation (Score:5, Insightful)
  
  by jjgm ( 663044 ) writes: on Thursday January 20, 2005 @04:56PM (#11424030)
  
  Sounds like a classic Cisco problem. I don't know what switches LJ were plugged into, but for years most Cisco switches would autonegotiate 100/half-duplex if the NIC was locked to 100/full; conversely, sometimes, NICs would autonegotiate 100/half if the Cisco was locked to 100/full.
  They're cheeky enough to document this [cisco.com] now. It's a feature, not a bug! Honest!
  
  - Re:Auto-negotiation (Score:2, Funny)
    
    by Undertaker43017 ( 586306 ) writes:
    
    The part I like is they are claiming that everyone else is wrong, and they are right. ;)
    
    I don't buy Cisco anymore for this very reason, it's not just their switches, it's on everything they make that has a NIC.
    
    I deployed some CSS's, right after Cisco bought ArrowPoint, and they did auto correctly. Another client deployed some a couple of months ago, and auto was broken. Cisco is the Borg! ;)
  - You, sir, are an idiot. (Score:5, Informative)
    
    by Anonymous Coward writes: on Thursday January 20, 2005 @05:11PM (#11424235)
    
    Go ahead and read up on how auto-negotiation works. I'll wait...
    
    No, really. Go read up on it...
    
    Okay, since you don't bother reading up on it, and since you claim that someone's cheeky because they *document* what happens when you misconfigure a connection, I must conclude that you, sir, are indeed an idiot.
    
    (To summarize for those of you who won't bother to look it up, a NIC can sense the carrier for 100, so it can differentiate 10/100. Full and half are actively negotiated by the two sides of the connection. If side 'A' is hard set to 100/full, it won't negotiate with the other side. Hearing no negotiation, side 'B' will assume the NIC doesn't support full duplex connections and fall back to half duplex. This is the proper, standardized, documented behavior. Anything else would require the psychic interface spec that *still* hasn't been finalized.)
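
The behavior summarized above can be modeled in a few lines. This is a simplified sketch, not a model of any particular NIC or switch: the `Port` class and both function names are made up for illustration, it assumes the negotiating side supports the forced side's speed, and it ignores the case where two hard-set ports disagree on speed (in practice that link would not come up).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Port:
    autoneg: bool      # True: negotiates; False: settings are hard-set
    speed: int         # Mb/s (forced value, or best supported if negotiating)
    full_duplex: bool  # forced value, or whether full duplex is supported


def resolve(a: Port, b: Port) -> Tuple[Tuple[int, bool], Tuple[int, bool]]:
    """Return ((speed, full_duplex) used by a, (speed, full_duplex) used by b)."""
    if a.autoneg and b.autoneg:
        # Both sides advertise their abilities and pick the best common mode.
        mode = (min(a.speed, b.speed), a.full_duplex and b.full_duplex)
        return mode, mode
    if not a.autoneg and not b.autoneg:
        # Both hard-set: each side simply uses its own configuration.
        return (a.speed, a.full_duplex), (b.speed, b.full_duplex)
    forced = a if not a.autoneg else b
    # The negotiating side senses the forced side's speed from the carrier
    # but hears no negotiation, so it falls back to half duplex.
    auto_mode = (forced.speed, False)
    forced_mode = (forced.speed, forced.full_duplex)
    return (forced_mode, auto_mode) if forced is a else (auto_mode, forced_mode)


def duplex_mismatch(a: Port, b: Port) -> bool:
    """True when the two ends of the link end up with different duplex."""
    mode_a, mode_b = resolve(a, b)
    return mode_a[1] != mode_b[1]


nic = Port(autoneg=False, speed=100, full_duplex=True)    # hard-set 100/full
switch = Port(autoneg=True, speed=100, full_duplex=True)  # left on auto
print(resolve(nic, switch), duplex_mismatch(nic, switch))
```

In this model a mismatch appears only when exactly one end is hard-set to full duplex, which is why the usual advice is to configure both ends the same way.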
    
...and ran off? (Score:5, Funny)

by stratjakt ( 596332 ) writes: on Thursday January 20, 2005 @04:35PM (#11423784) Journal

What do you mean, ran off?

Ran off skipping and giggling, like a 13 year old who just put toothpaste on the toilet seat?

Or do you really mean, slunk off, like my dog does when I walk in and find her curled up on top of the remains of the remotes for the TV, TiVo, DVD player and stereo?

My dog likes remote controls more than snausages.

OT: Anyone know where (brick and mortar) to get a replacement (original) TiVo remote?

- 13 yo? :P (Score:3, Funny)
  
  by Spy der Mann ( 805235 ) writes:
  
  Ran off skipping and giggling, like a 13 year old who just put toothpaste on the toilet seat?
  
  By any chance, was his name "Zero Cool"?
- - - Re:...and ran off? (Score:2)
      
      by DrHogie ( 8093 ) writes:
      
      9thtee.com and weaknees.com should both sell replacement remotes for TiVo. After one too many drops on our living room's tile floor, it's about time we get a new one ourselves . . .
      
      http://www.weaknees.com/tivo_remotes.php
Credit (Score:5, Informative)

by XorNand ( 517466 ) writes: on Thursday January 20, 2005 @04:38PM (#11423819)

Anyone who's a paid member of LJ can get a 2-week credit here [livejournal.com].

A great article (Score:2)

by digitalgimpus ( 468277 ) writes:

I must compliment LJ for at least being honest with their system... many would lie and say "it was the datacenter's fault".

They at least admit their own systems weren't perfect... and clearly explained each fault they observed.

Good info.
Ahhhh silence is GOOOOLDEN (Score:4, Funny)

by ShatteredDream ( 636520 ) writes: on Thursday January 20, 2005 @04:41PM (#11423853) Homepage

*crickets chirping* That's the sound of millions of teenage girls not using up bandwidth and disk space talking about boys, jcrew and high school/college drama.

- Re:Ahhhh silence is GOOOOLDEN (Score:2)
  
  by eln ( 21727 ) writes:
  
  Yah, but now we have nerds talking about girls talking about boys, jcrew, and high school/college drama. I shudder to think what would happen if Slashdot had an outage like that right now.
- Re:Ahhhh silence is GOOOOLDEN (Score:4, Funny)
  
  by metalhed77 ( 250273 ) writes: <andrewvc@gmaCOUGARil.com minus cat> on Thursday January 20, 2005 @05:23PM (#11424396) Homepage
  
  So says the author of yet another political weblog whose startling impartiality and sense will pave the way for a brave new world?
  
machine failure (Score:4, Insightful)

by br00tus ( 528477 ) writes: on Thursday January 20, 2005 @04:41PM (#11423858)

"They had trouble coming back up quickly because of '9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly', machines not fully rebooting for analysis reasons, and a few other issues."
I was a sysadmin at a Fortune 100 company with thousands of servers. Every Saturday evening, we rebooted all of our servers. We almost always had several machines which would not come back up for one reason or another - so we dealt with it then, on Sunday morning, instead of during the week when a reboot of a critical machine that did not work would be much worse. Scheduled reboots are a part of good systems administration. If once a week is too often, then once every two weeks, or once a month. With this much failure, I'm almost certain they never did scheduled reboots. They had two failures - their power failed, and then their lack of planning allowed for so much to go wrong as a result of that.

- Re:machine failure (Score:5, Insightful)
  
  by rjstanford ( 69735 ) writes: on Thursday January 20, 2005 @04:47PM (#11423933) Homepage Journal
  
  One of the last steps of our standard deployment was a full hard shutdown and restore from backup. This was scheduled to happen approximately a week before bringing the machines live - after a lot of data setup had been done.
  
  Many customers - and internal staff - really, really got scared at that point. The thing is, if you don't trust your backups, what good are they? It's amazing what things got taken care of and found during double-checks the week before the backup/restoration test.
  
  Oh, and we always went with scheduled reboots as well, for very much the same reason as you mentioned. An hour a month of scheduled downtime is almost always available - usually we booted every week and had an optional downtime window on a monthly basis. And if your (talking to readers here, not parent) organization can't afford to be without a single machine for a 2-3 hour block once a month, WTF is your plan to handle a hardware failure? Prayer?
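
To put the size of such a maintenance window in perspective, here is a minimal sketch of the arithmetic. The 30-day month and the function name are assumptions made for illustration.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def availability_pct(downtime_hours: float) -> float:
    """Percentage of the month a machine is up, given scheduled downtime."""
    return 100.0 * (1 - downtime_hours / HOURS_PER_MONTH)


for hours in (1, 2, 3):
    print(f"{hours} h/month scheduled downtime -> {availability_pct(hours):.2f}% availability")
```

Under that assumption, even a full three-hour window each month leaves a single machine available for more than 99.5% of the time.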
  
- Re:machine failure (Score:3, Insightful)
  
  by gkuz ( 706134 ) writes:
  
  Every Saturday evening, we rebooted all of our servers
  Yeah, we had servers like that once, too. Ba-da-bing! Thanks, I'll be here all week.
  On a serious note, am I the only one here who thinks a world in which no one questions a policy like that is insane? We've had critical, and I mean critical, servers that have uptimes measured in years. But then again they run NetWare, or OS/400, or MVS, or.... ABW.
  Scheduled reboots are a part of good systems administration
  Yeah, scheduled, as part of a disaster re
  - Re:machine failure (Score:2)
    
    by Saeed al-Sahaf ( 665390 ) writes:
    
    Scheduled reboots are a part of good systems administration
    He's talking about Windows, where regular reboots are a good thing when they are planned, so you don't have regular reboots when they are NOT planned!
  - - Re:machine failure (Score:2)
      
      by gkuz ( 706134 ) writes:
      
      it is also will the hardware live through a power cycle
      Why should it have to? If it's a critical server, your infrastructure should be such that it never power cycles. Our computer room has "power cycled" once since the facility was built in 1984. And that incident led to spending $65k in consulting engineering services alone, to determine why it happened and develop a plan to prevent it happening again. I'm not even sure what the expenditure in hardware or electrical contracting related to that was. I gu
- - Re:machine failure (Score:3, Insightful)
    
    by TeraCo ( 410407 ) writes:
    
    You, sir, sound like a man who needs a load-balanced cluster. If you're relying on individual boxes staying up to meet your SLAs, your career is a ticking time bomb.
LOL! Kindof like when... (Score:5, Funny)

by GillBates0 ( 664202 ) writes: on Thursday January 20, 2005 @04:42PM (#11423873) Homepage Journal

...when I was on AOL and I hit the X and I couldn't talk to my AOL Buddies anymore.
And I was like OMG I shut off the internets and stuff!!1!!
And i called the AOL helpdesk and they helped turn it back on.

- Re:LOL! Kindof like when... (Score:2)
  
  by Saeed al-Sahaf ( 665390 ) writes:
  
  When you tried to turn it back on, did it go, like, "beep, beep, beep"?
And here (Score:2)

by OverlordQ ( 264228 ) writes:

everybody was blaming Internap for screwing up and running a shoddy Datacenter, when actually Internap did everything they were supposed to correctly.
- Re:And here (Score:3, Interesting)
  
  by tmhsiao ( 47750 ) writes:
  
  Aside from allowing an unaccompanied client access to the Big Red Button, perhaps?
Also, (Score:2)

by revery ( 456516 ) writes:

Apparently someone "accidentally" pushed the emergency power off (which should keep all power off, even UPS)

This also raised the all-important "Why do we even have that button?" question.
- Re:Also, (Score:3, Informative)
  
  by Scott Laird ( 2043 ) writes:
  
  "Why do we even have that button?" Because it's basically required by law. Covering them with a plastic cover doesn't seem to help either--Internap did that the *last* time someone hit the EPO button in this datacenter.
- Re:Also, (Score:2)
  
  by merlin_jim ( 302773 ) writes:
  
  This also raised the all-important "Why do we even have that button?" question.
  
  Those buttons are generally maintenance devices; it's usually less of a button and more of a keyswitch though. So the guy comes in to service something, he needs to know that no power is anywhere in there, so he removes the key and keeps it in his pocket. Now he knows he's safe.
- Re:Also, (Score:2)
  
  by Peridriga ( 308995 ) writes:
  
  It's the law. It's also in the article.
  
  EPO, by the way, stands for Emergency Power Off and it's a national fire/electrical requirement for firefighters to be able to press these big red buttons near all exits that turn off all power in the entire data center
- Re:Also, (Score:2)
  
  by revery ( 456516 ) writes:
  
  I keep forgetting that this is Slashdot. I should have put in my disclaimer:
  
  Please, do not be alarmed or reply with an explanation. This is a joke. I am joking. You have been joked with.
  
  Sigh...
- No, it did not (Score:2)
  
  by EvilStein ( 414640 ) writes:
  
  They're required by law to have it. It's a building code thing. Every data center I've ever been in has one.
  
  Also.. "EPO, by the way, stands for Emergency Power Off and it's a national fire/electrical requirement for firefighters to be able to press these big red buttons near all exits that turn off all power in the entire data center."
Button of Doom (Score:2)

by clinko ( 232501 ) writes:

Maybe they should use the Button of Doom [clinko.com] (USB) to lock the pcs down too...
Wait a second! (Score:2)

by Sialagogue ( 246874 ) writes:

"EPO, by the way, stands for Emergency Power Off and it's a national fire/electrical requirement for firefighters to be able to press these big red buttons near all exits that turn off all power in the entire data center."
"...all our DBs have redundant power supplies. we'll be plugging one side into Internap's, and the other side into our own UPS, which itself is plugged into Internap's other power grid. that way if EPO is pressed, we'll have 1-4 minutes to do a clean shutdown. (but if we do the rest of
- Re:Wait a second! (Score:3, Informative)
  
  by rah1420 ( 234198 ) writes:
  
  Technically, yes. I'm hoping that if LJ decides to implement such a scheme (let's call it "LEPO" for "Leisurely Emergency Power Off") that they run it past the fire marshal or the code inspectors first, who may have another opinion about how smart this idea is.
  
  "If it's stupid and it works, it's not stupid."
- Re:Wait a second! (Score:3, Interesting)
  
  by psykocrime ( 61037 ) writes:
  
  Isn't that circumventing the purpose of the EPO? If there's a smokey fire in there and the firefighters have to enter the room and start spraying water around, won't a few machines glowing for four minutes after the EPO was pressed put them in danger of electrocution? Or force them to wait four minutes before they can enter?
  
  It's not so much that the firefighters spraying water are worried about getting electrocuted via current conducting through the water itself... it's more about worrying about stumbling i
- - Re:Wait a second! (Score:2)
    
    by Sialagogue ( 246874 ) writes:
    
    Sorry, but you confused me.
    It seemed as though they were talking in the article about putting a separate UPS system in place for their machines, one that is independent of the EPO system. It sounds to me like that would keep their machines on for four minutes even after one or both of the facility's EPO systems have been triggered, creating an electrocution danger.
    Are you suggesting that their UPS would have a separate EPO just for it? I don't think that's the case, because they specifically men
The reason why some NICs don't auto-neg (Score:3, Informative)

by phaetonic ( 621542 ) * writes: on Thursday January 20, 2005 @04:48PM (#11423942)

I have run across this issue in data centers numerous times. This still occurs with the latest hardware, no matter what vendor or OS. I have this problem on SunFire280Rs and Compaq DL360s. What it comes down to is the switch being used in the data center and the settings in the OS. Typically, data centers set their switch to forced 100-full (unless of course they are using fibre or Gb). The OS must be set to force its NICs into the same mode, or they will drop a lot of packets. Sounds like a disconnect in communications between the NOC and the customer.

- Re:The reason why some NICs don't auto-neg (Score:3, Informative)
  
  by caluml ( 551744 ) writes:
  
  That's what Compaq Lights-Out cards are for. Lovely things. Very handy.
OOB console access is the answer. (Score:3, Insightful)

by Mordant ( 138460 ) writes: on Thursday January 20, 2005 @04:52PM (#11423981)

They ought to have out-of-band (OOB) serial-console access to their servers via a terminal server for any number of reasons, including this one; if they'd implemented OOB console access, they could've sshed into the terminal server, gotten onto the consoles of the servers in question, and used ifconfig to fix the duplex issue.

Why they don't seem to grasp this is beyond me . . . anyone running a public-facing, high-volume service should have OOB access to all servers, routers, switches, firewalls, etc. . . . it's just common sense.

HAH! (Score:2)

by rah1420 ( 234198 ) writes:

I told you so. [slashdot.org]

Looks like my "Newbie Operator" found hisself a new job.
2 accounts of the powerloss (Score:5, Funny)

by Spazholio ( 314843 ) writes: <slashdot AT lexal DOT net> on Thursday January 20, 2005 @04:56PM (#11424028) Homepage

The one [livejournal.com] they tell you about and the real [livejournal.com] one.

No! (Score:3, Insightful)

by Saeed al-Sahaf ( 665390 ) writes: on Thursday January 20, 2005 @04:57PM (#11424031) Homepage

embedded NICs...
Who in their right mind goes with the on-board NIC in a server environment?

- Re:No! (Score:3, Interesting)
  
  by juuri ( 7678 ) writes:
  
  Who in their right mind goes with the on-board NIC in a server environment?
  
  Are you kidding?
  
  How about everyone? Regardless of PC, Sun, Alpha or whatever hardware.
  - Re:No! (Score:2)
    
    by Saeed al-Sahaf ( 665390 ) writes:
    
    Does not mean it's a good idea! Not a single machine where I work uses the on-board NIC, from servers down to desktops. And all of our machines have a two year lifecycle, tops. We generally plug in a 3Com card of some type.
    - Re:No! (Score:2, Informative)
      
      by SenorChuck ( 457914 ) writes:
      
      On all of the (actual) servers I've worked with, the onboard NICs are exactly the same hardware that you get with the server-grade PCI NICs.
    - Nothing wrong with onboard NICs in "real" servers. (Score:2, Informative)
      
      by Nonesuch ( 90847 ) writes:
      
      Does not mean it's a good idea! Not a single machine where I work uses the on-board NIC, from servers down to desktops. And all of our machines have a two year lifecycle, tops. We generally plug in a 3Com card of some type.
      
      The smallest of the Sun 1U rackmount Sparc servers do not even have a PCI slot to take a NIC -- no expansion at all, but two on-board 100M interfaces are plenty for most data center deployments of these small boxes.
No UPSes before? (Score:2)

by iabervon ( 1971 ) writes:

I'm surprised that they didn't have their own little UPSes to bring the system down cleanly before. Sure, the facility is supposed to provide power at all times, even if there's a power grid interruption, but that doesn't get tested very often and isn't under your control. Furthermore, in the event that the facility's power is actually going to go out, there isn't any way for the machines to find this out and shut down cleanly.
- Re:No UPSes before? (Score:3, Informative)
  
  by Nonesuch ( 90847 ) writes:
  
  I'm surprised that they didn't have their own little UPSes to bring the system down cleanly before. Sure, the facility is supposed to provide power at all times, even if there's a power grid interruption, but that doesn't get tested very often and isn't under your control. Furthermore, in the event that the facility's power is actually going to go out, there isn't any way for the machines to find this out and shut down cleanly.
  
  Unfortunately, this would defeat the purpose of the "Big Red Button", which
Accidents happen (Score:3, Interesting)

by Migraineman ( 632203 ) writes: on Thursday January 20, 2005 @05:07PM (#11424176)

About a decade ago, we had a series of "incidents" with the EPO button in the software lab. Shortly after a serious lab upgrade (due to constantly blowing breakers,) someone decided to test the EPO switch (it was a bit of a novelty at the time.) *click* "Cool, it works. Hey, how do you reset this thing?" Turns out you needed to have a key to reset it. It took about 4 hours to find someone who had the key. That one got replaced with the Mark II resettable switch ...

About a month later, one of the managers was giving a prospective new-hire a tour. He got to the software lab, and started blathering about "don't ever push the red switch" as he put his finger on the switch ... *click*

So some einstein decided that the Big Red Switch was "dangerous" and put a plexi cover over it - the same kind that goes over the thermostat control, and the same kind that has a key lock. Yep, about six months later we had a gen-you-ine emergency. One of the HP 9000/300 monitors went crispy, and was snorting smoke and sparks. One of the software folks went to hit the Big Red Button, but was somewhat nonplussed to find a locking cover over it. She took the co-located fire bottle, sheared the cover off, pressed the button, then got to use said fire bottle on the monitor.

So the cover gets replaced again, though this time with a non-locking cover. At some point, the software server stack needed to be relocated into the corner with the Big Red Button. Another einstein discovered that it was inconvenient to slink behind the equipment rack - the cover kept bashing him in the neck or shoulder. So he removed it, thinking that accidental presses wouldn't happen because the button was obstructed by the server stack. (yep, inaccessible = useless.) Some time later, the equipment was being jockeyed for an upgrade, and one of the big SCSI cables snagged the Big Red Button and *click* ...

All these shenanigans happened in the space of one year, and I got tired of the thrash. I measured the space between the back of the switch and the faceplate - just over 3/4 inch. I cut a horseshoe shape out of 3/4 plywood, and hung it on the switch shaft. In an emergency, it's really easy (and obvious) to remove it. Gravity keeps it there otherwise. No problems since ...

LJ IS TEH LITTLE GIRL HOLE! (Score:2)

by Turn-X Alphonse ( 789240 ) writes:

Maybe people will see this and realise the LJ staff are geeks, unlike most of their fanbase, so while you may be mocking their minions they can still bring down a server looking at a single article with the rest of us slashdotters.
happened to us (Score:2)

by bwindle2 ( 519558 ) writes:

We have one of those Big Red Buttons in our datacenter (about 7 feet up on the wall, so no one could accidentally bump into it). About a year after it was installed, an electrician showed up to do something in the ceiling, and accidentally leaned his ladder up against our exposed Big Red Button.
Needless to say, we now have a cover over our Button. Funny thing is, the electrician who installed the original button is also the guy who leaned his ladder against it.
- Re:happened to us (Score:2)
  
  by TomHandy ( 578620 ) writes:
  
  Wait a minute............. that's not funny at all!
This is what happens... (Score:2)

by MsGeek ( 162936 ) writes:

...when you buy crappy [pcchipsusa.com] kit [ecsusa.com]. Next time do [asus.com] it [ibm.com] right [apple.com].
Cabling? (Score:2)

by redelm ( 54142 ) writes:

OK, this _shouldn't_ apply to a good, reputable datacenter that has structured wiring to TIA/EIA-568 running gigabit.
I most often see autoneg problems with faulty cabling (split pairs from crimps). 98% of newbies cannot get it right, and they aren't to blame because the standards are counter-intuitive unless you've worked for Ma Bell for 40+ years. I beware of all field crimps.
OTOH, I saw one example of a Crisco Crapalyst router not wanting to play with some devices. Of course they blamed the device, bu
The result. (Score:2)

by Pathetic Coward ( 33033 ) writes:

(a) Manager that pushed the "off" button gets promoted.
(b) Engineers that spent their weekends getting the system back up: off to India with your jobs!
Make the Luser pay (Score:2)

by sconeu ( 64226 ) writes:

I assume that they will have the responsible luser pay for the down time plus the 2 weeks credit plus the extra hours for the staff to bring the system up.

And what the hell was a visitor doing playing with the Big Red Button anyways?
Big Red Button (Score:2)

by prizog ( 42097 ) writes:

If it looks like this, don't push it! [livedoor.com]
It happens (Score:2)

by boodaman ( 791877 ) writes:

This happened to us last year in our datacenter.

The Facilities manager had some guys in to install shelving to store toner, cables, etc.

Our datacenter is divided into two sections, inner and outer. All CPUs, UPSs, HVAC, etc are in the inner room. The outer room is shelving, desks, CCTV (security), etc.

The EPOs are near every door, as they should be, including the outer doors. Some guy, while installing the shelves, decided to take a little break and lean against the wall, leaning on the EPO in the proce
- Re:Where was the switch? (Score:2, Informative)
  
  by grub ( 11606 ) writes:
  
  They usually are in a server room. They're for emergencies. Ours have red cages around them and a BIG RED SIGN, you have to basically punch them.
- Re:Where was the switch? (Score:2)
  
  by crimoid ( 27373 ) writes:
  
  Typically these types of devices are just inside the door to the rooms that they cut off. This way Fire & Emergency personnel can get to them quickly and easily.
  
  Generally the buttons themselves are behind plexiglass lids that easily flop up or behind breakable glass.
- Re:Where was the switch? (Score:2)
  
  by bsd4me ( 759597 ) writes:
  
  These switches are generally big round buttons about 2" in diameter, and almost always made out of bright red plastic. On top of that, the button takes some force to depress and many facilities place a hinged, clear plexiglass box over them to prevent accidental use. It is pretty hard to mistake one for a normal light switch.
- - - Re:Where was the switch? (Score:2)
      
      by irc.goatse.cx troll ( 593289 ) writes:
      
      Or your gun, unless you want to ask that kind man with a knife to wait while you dig out the key.
- Re:I want to name this file..... (Score:2, Funny)
  
  by Cocoronixx ( 551128 ) writes:
  
  uhhh 0? Well I guess 1 since I can count you now.
- Re:I want to name this file..... (Score:2)
  
  by shuz ( 706678 ) writes:
  
  That is why I try to always include "" around everything I do in any unix environment and leave off trailing /'s whenever possible.
- Re:How do you do that by *accident*???? (Score:2, Funny)
  
  by FudgePackinJesus ( 444734 ) writes:
  
  Stimpy couldn't resist "The Red, Shiny, CANDY-LIKE Button!!"
- Re:How do you do that by *accident*???? (Score:2, Funny)
  
  by AndroidCat ( 229562 ) writes:
  
  Another customer in the facility accidentally pressed the EPO button, then depressed it
  I'm trying to figure out how depressing a button reverses a press. (Since the button is depressed by pressing it.) Unpressed it?
- They're attention whoring (Score:2)
  
  by EvilStein ( 414640 ) writes:
  
  Plain and simple. People notice a "historical post" and they want to have their LJ face right up there in it.
  
  Total kissasses. I wonder how many of them are paid members vs free accounts.
  Remember, the overwhelming majority of Livejournal users are *NOT* paying customers...
  
  Account Types
  
  What type of account do people have?
  
  * Free Account: 5713743 (98.3%)
  * Early Adopter: 14220 (0.2%)
  * Paid Account: 94857 (1.6%)
  * Permanent Account: 1632 (0.0%)
- - Re:Don't forget... (Score:2)
    
    by vadim_t ( 324782 ) writes:
    
    Nonsense. I had my server up for 360 days without rebooting, with kernel 2.4. It had 360 days on the uptime counter. I only shut it down because it was too slow for the newer stuff I wanted to run.
- Re:Its a Small World... (Score:3, Funny)
  
  by radish ( 98371 ) writes:
  
  LiveJournal got hit the hardest, they had some IDE drives on their servers, doh!
  
  I was unaware that SCSI drives had the ability to run without power - thanks for the info!
