Slashdot
LiveJournal Blackout Analysis Online
Posted by
CmdrTaco
on Thu Jan 20, 2005 03:27 PM
from the when-it-all-hits-the-fan dept.
Hakubi_Washu writes "LiveJournal has posted their official analysis of what happened last Friday.
Apparently someone "accidentally" pushed the emergency power-off (which should cut all power, even UPS), reset it and ran off. They had trouble coming back up quickly because of "9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly", machines not being fully rebooted for analysis reasons, and a few other problems."
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Lesser OS... (Score:5, Funny)
They should be using OpenBSD. It can run right through power failures [grub.net]
The less we've learned... (Score:5, Funny)
Re:The less we've learned... (Score:2)
Re:The less we've learned... (Score:3, Funny)
Chris Mattern
Re:The less we've learned... (Score:3, Funny)
faulty mobo's (Score:5, Interesting)
Re:faulty mobo's (Score:2)
Not millions of paying accounts. (Score:5, Informative)
This is a paste from the Livejournal stats:
* Free Account: 5713743 (98.3%)
* Early Adopter: 14220 (0.2%)
* Paid Account: 94857 (1.6%)
* Permanent Account: 1632 (0.0%)
503 pages (Score:2)
I haven't seen them that much lately, but then I haven't been online that much either...
Re:503 pages (Score:2)
You get 503 sites? I only reach one at slashdot.org
Then again, you're a subscriber. Who knows what goodies you lucky few get here...
Oppsie (Score:5, Funny)
"Oppsie, I hope that button wasn't anything important."
Re:Oppsie (Score:2)
History Eraser Button (Score:5, Funny)
Ah, the famous History Eraser Button rears its ugly head. I think that everyone who has worked in a large datacenter or lab environment with one of these has a story to tell...
Re:History Eraser Button (Score:5, Interesting)
I'll never forget my visit to the State Farm DSO in Detroit, MI. I'd just physically installed the new machine, at the bottom of a rack, and stood up.
Stood up, putting my shoulder right into the unprotected "History Eraser Button" on the wall. The screams of the employees working in the datacenter could be heard all the way back home in Chicago, I've no doubt.
Then it turns out the fuses which will reset the systems in the datacenter are in a locked cabinet.
Then it turns out no one on site has a key.
Fortunately, I found that the cabinet will pop open if you kick it hard enough. Hey, I was panicking, okay?
And get this. After it was all over and I realized I probably wouldn't get killed by anyone... they told me "It's okay, this happens all the time. The guy installing the A/C unit last week did it too."
Maybe they should have put a cover over the damn button then. Morons.
Re:History Eraser Button (Score:3, Funny)
You of all people should realize that putting someone's head on a pole in front of a data centre is dangerous. For one, it tends to become a disease vector, as for some mysterious reason everyone feels the need to touch it. Rats are usually attracted to the smell, and you know how rats wreak havoc on eth
Perhaps they should answer (Score:2)
Fascinating read (Score:5, Insightful)
Congrats to the LJ folks for getting things working, taking the time to do it right, and giving an admin's-eye-view into what actually happened.
Re:Fascinating read (Score:2)
So Slashdot - what are all the 500 errors about then?
Missing opportunities (Score:4, Funny)
They had to power back on when they realized deadjournal.com [deadjournal.com] was already taken...
LJDotting: LJ user base vs Slashdot user base. (Score:5, Funny)
LJ's active user base is easily 10x that of Slashdot's. We'd have to come up with a new term for the internet event that pales any slashdotting that ever came before.
Auto-negotiation (Score:4, Informative)
Re:Auto-negotiation (Score:5, Insightful)
Sounds like a classic Cisco problem. I don't know what switches LJ were plugged into, but for years most Cisco switches would autonegotiate 100/half-duplex if the NIC was locked to 100/full; conversely, sometimes, NICs would autonegotiate 100/half if the Cisco was locked to 100/full.
They're cheeky enough to document this [cisco.com] now. It's a feature, not a bug! Honest!
You, sir, are an idiot. (Score:5, Informative)
No, really. Go read up on it...
Okay, since you don't bother reading up on it, and since you claim that someone's cheeky because they *document* what happens when you misconfigure a connection, I must conclude that you, sir, are indeed an idiot.
(To summarize for those of you who won't bother to look it up, a NIC can sense the carrier for 100, so it can differentiate 10/100. Full and half are actively negotiated by the two sides of the connection. If side 'A' is hard set to 100/full, it won't negotiate with the other side. Hearing no negotiation, side 'B' will assume the NIC doesn't support full duplex connections and fall back to half duplex. This is the proper, standardized, documented behavior. Anything else would require the psychic interface spec that *still* hasn't been finalized.)
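For anyone who would rather see the rule above as code, here is a rough sketch of it. This is only a toy model of the behavior described in this comment, not any vendor's implementation: the Port class, resolve() and duplex_mismatch() are made-up names, and it assumes that any port with negotiation enabled is capable of full duplex.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Port:
    """Hypothetical model of one end of an Ethernet link."""
    name: str
    autoneg: bool                        # True if the port negotiates
    forced_speed: Optional[int] = None   # Mbit/s, used when autoneg is off
    forced_duplex: Optional[str] = None  # "full" or "half", used when autoneg is off
    max_speed: int = 100                 # best speed offered when negotiating


def resolve(local: Port, peer: Port) -> Tuple[int, str]:
    """Return the (speed, duplex) that `local` ends up using."""
    if not local.autoneg:
        # A hard-set port ignores the peer and uses its configured values.
        return local.forced_speed, local.forced_duplex
    if peer.autoneg:
        # Both sides negotiate: pick the best common speed.
        # Assumption: every negotiating port here supports full duplex.
        return min(local.max_speed, peer.max_speed), "full"
    # Peer is hard-set and sends no negotiation: speed can be sensed from
    # the carrier, but duplex cannot, so the safe assumption is half.
    return peer.forced_speed, "half"


def duplex_mismatch(a: Port, b: Port) -> bool:
    """True when the two ends settle on different duplex modes."""
    return resolve(a, b)[1] != resolve(b, a)[1]


switch = Port("switch", autoneg=True)
nic = Port("nic", autoneg=False, forced_speed=100, forced_duplex="full")
print(resolve(switch, nic), resolve(nic, switch), duplex_mismatch(switch, nic))
```

In this model the hard-set NIC stays at 100/full while the negotiating switch lands on 100/half, which is the mismatch the grandparent was complaining about. Setting both ends to negotiate, or hard-setting both ends the same way, removes it.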
...and ran off? (Score:5, Funny)
Ran off skipping and giggling, like a 13 year old who just put toothpaste on the toilet seat?
Or do you really mean, slunk off, like my dog does when I walk in and find her curled up on top of the remains of the remotes for the TV, TiVo, DVD player and stereo?
My dog likes remote controls more than snausages.
OT: Anyone know where (brick and mortar) to get a replacement (original) TiVo remote?
13 yo? :P (Score:3, Funny)
By any chance, was his name "Zero Cool"?
Credit (Score:5, Informative)
A great article (Score:2)
They at least admit their own systems weren't perfect... and clearly explained each fault they observed.
Good info.
Ahhhh silence is GOOOOLDEN (Score:4, Funny)
Re:Ahhhh silence is GOOOOLDEN (Score:4, Funny)
machine failure (Score:4, Insightful)
I was a sysadmin at a Fortune 100 company with thousands of servers. Every Saturday evening, we rebooted all of our servers. We almost always had several machines which would not come back up for one reason or another - so we dealt with it then, on Sunday morning, instead of during the week when a reboot of a critical machine that did not work would be much worse. Scheduled reboots are a part of good systems administration. If once a week is too often, then once every two weeks, or once a month. With this much failure, I'm almost certain they never did scheduled reboots. They had two failures - their power failed, and then their lack of planning allowed for so much to go wrong as a result of that.
Re:machine failure (Score:5, Insightful)
Many customers - and internal staff - really, really got scared at that point. The thing is, if you don't trust your backups, what good are they? It's amazing what things got taken care of and found during double-checks the week before the backup/restoration test.
Oh, and we always went with scheduled reboots as well, for very much the same reason as you mentioned. An hour a month of scheduled downtime is almost always available - usually we booted every week and had an optional downtime window on a monthly basis. And if your (talking to readers here, not parent) organization can't afford to be without a single machine for a 2-3 hour block once a month, WTF is your plan to handle a hardware failure? Prayer?
Re:machine failure (Score:3, Insightful)
Yeah, we had servers like that once, too. Ba-da-bing! Thanks, I'll be here all week.
On a serious note, am I the only one here who thinks a world in which no one questions a policy like that is insane? We've had critical, and I mean critical, servers that have uptimes measured in years. But then again they run NetWare, or OS/400, or MVS, or.... ABW.
Scheduled reboots are a part of good systems administration
Yeah, scheduled, as part of a disaster re
Re:machine failure (Score:3, Insightful)
LOL! Kindof like when... (Score:5, Funny)
And I was like OMG I shut off the internets and stuff!!1!!
And i called the AOL helpdesk and they helped turn it back on.
The reason why some NICs don't auto-neg (Score:3, Informative)
Re:The reason why some NICs don't auto-neg (Score:3, Informative)
OOB console access is the answer. (Score:3, Insightful)
Why they don't seem to grasp this is beyond me . . . anyone running a public-facing, high-volume service should have OOB access to all servers, routers, switches, firewalls, etc. . . . it's just common sense.
2 accounts of the powerloss (Score:5, Funny)
No! (Score:3, Insightful)
Who in their right mind goes with the on-board NIC in a server environment?
Re:No! (Score:3, Interesting)
Are you kidding?
How about everyone? Regardless of PC, Sun, Alpha or whatever hardware.
Accidents happen (Score:3, Interesting)
About a month later, one of the managers was giving a prospective new-hire a tour. He got to the software lab, and started blathering about "don't ever push the red switch" as he put his finger on the switch
So some einstein decided that the Big Red Switch was "dangerous" and put a plexi cover over it - the same kind that goes over the thermostat control, and the same kind that has a key lock. Yep, about six months later we had a gen-you-ine emergency. One of the HP 9000/300 monitors went crispy, and was snorting smoke and sparks. One of the software folks went to hit the Big Red Button, but was somewhat nonplussed to find a locking cover over it. She took the co-located fire bottle, sheared the cover off, pressed the button, then got to use said fire bottle on the monitor.
So the cover gets replaced again, though this time with a non-locking cover. At some point, the software server stack needed to be relocated into the corner with the Big Red Button. Another einstein discovered that it was inconvenient to slink behind the equipment rack - the cover kept bashing him in the neck or shoulder. So he removed it, thinking that accidental presses wouldn't happen because the button was obstructed by the server stack. (yep, inaccessible = useless.) Some time later, the equipment was being jockeyed for an upgrade, and one of the big SCSI cables snagged the Big Red Button and *click*
All these shenanigans happened in the space of one year, and I got tired of the thrash. I measured the space between the back of the switch and the faceplate - just over 3/4 inch. I cut a horseshoe shape out of 3/4 plywood, and hung it on the switch shaft. In an emergency, it's really easy (and obvious) to remove it. Gravity keeps it there otherwise. No problems since.
Re:Where was the switch? (Score:2, Informative)
They usually are in a server room. They're for emergencies. Ours have red cages around them and a BIG RED SIGN, you have to basically punch them.
Re:Where was the switch? (Score:2)
Generally the buttons themselves are behind plexiglass lids that easily flop up or behind breakable glass.
Re:I want to name this file..... (Score:2, Funny)
Re:And here (Score:3, Interesting)
Re:Wait a second! (Score:3, Informative)
"If it's stupid and it works, it's not stupid."
Re:Wait a second! (Score:3, Interesting)
It's not so much that the firefighters spraying water are worried about getting electrocuted via current conducting through the water itself... it's more about worrying about stumbling i
Re:Also, (Score:3, Informative)
Re:No UPSes before? (Score:3, Informative)
Unfortunately, this would defeat the purpose of the "Big Red Button", which
Re:Its a Small World... (Score:3, Funny)
I was unaware that SCSI drives had the ability to run without power - thanks for the info!