LiveJournal Servers Go Down
Wind writes "According to any journal hosted off of LiveJournal.com, the LiveJournal data center Internap has suffered a critical power failure, leaving all of LiveJournal and its content temporarily offline and requiring the revival of 100+ servers. Perhaps Six Apart wasn't quite prepared for the responsibilities of a website of this size? Updated information is posted here."
Lights out (Score:5, Funny)
The Pain ... (Score:5, Funny)
I need my fix, man!
Re:The Pain ... (Score:2)
Re:The Pain ... (Score:4, Funny)
Re:The Pain ... (Score:2)
A great disturbance in the Force... (Score:5, Funny)
Re:A great disturbance in the Force... (Score:3, Funny)
hopefully the 'power outage' that took out lj was actually something cool like someone sneaking an emp bomb into the datacenter. and not some dipshit power company employee hooking up something wrong on a transformer outside and melting the lines.
cheers.
Re:A great disturbance in the Force... (Score:3, Funny)
>It's cute to see such naivety still on the internet. Never played any MMORPGs, huh?
This is different to MMORPGs, MMORPGs are generally a male domain, with men pretending to be women to get favours. On the other hand, blogs involve things women like doing, i.e. fucking going on and on about shit no-one cares about.
Comment removed (Score:5, Funny)
Re:The Pain ... (Score:5, Funny)
Re:The Pain ... (Score:5, Funny)
Re:The Pain ... (Score:2)
It's strange (Score:3, Funny)
Elsewhere (Score:4, Funny)
Re:Elsewhere (Score:3, Funny)
I call bull on all this (Score:3, Insightful)
Re:Elsewhere (Score:5, Funny)
In other news... (Score:5, Funny)
Re:In other news... (Score:2, Funny)
So just imagine the result of when Slashdot dies!
Tomorrow's news (Score:2, Funny)
Re:Tomorrow's news (Score:2, Funny)
Re:In other news... (Score:2)
Sucks for LiveJournal... (Score:2, Funny)
slashdot has repeated 503 errors, (Score:5, Insightful)
Re:slashdot has repeated 503 errors, (Score:5, Insightful)
Slashdot has semi-major problems almost every day. 503 errors, "nothing for you to see here" annoyances, and a search engine that goes down more than a Thai hooker.
Re:slashdot has repeated 503 errors, (Score:2, Funny)
Oh man, that would be one *fantastic* search engine! Is there a Google beta for this?
Re:slashdot has repeated 503 errors, (Score:4, Informative)
That said, the story submitter is clearly trolling himself, as neither 6A's nor LJ's staff had anything to do with the massive power failure at their co-lo.
Internap is *down*? (Score:5, Informative)
Bush just appointed Internap's CEO to his National Infrastructure Advisory Council [tmcnet.com], yet the man can't keep a co-lo facility switched on.
I'm not sure what that says of Bush or of Internap. And it certainly doesn't seem to have anything to do with SixApart.
Re:Internap is *down*? (Score:5, Interesting)
Re:Internap is *down*? (Score:4, Interesting)
Re:Internap is *down*? (Score:3, Informative)
The UPS gear in Internap's space is all top-of-the-line big datacenter grade stuff. Apparently there was some sort of wiring fault in one of the new UPSes they were bringing online that caused both building power to fail and the self-protection ci
Re:Internap is *down*? (Score:4, Insightful)
That being said, I think you didn't quite understand what I was trying to say. I really don't care whether they have "plenty of backup power", "plenty of generator capacity" and "top-of-the-line big datacenter grade stuff" (which really sounds more like a collection of buzzwords than anything else, anyway). If a wiring fault (of whatever kind) can bring down the entire UPS system as well as the "generator capacity behind that" and all the other safeguards they supposedly had in place, then it's just worthless and a waste of money - a UPS is supposed to be an *uninterruptible* power supply.
And while I admit that it's not possible to guard against *all* problems, saying that the colo facility is "one of the most solid in the state" and supposedly can't be taken offline by something "short of a direct strike from a comet" is just silly when a "wiring failure" can bring down the whole thing, and even more so when it's not the first time that happens.
Really, this just stinks of an attitude that's all too prevalent in parts of the IT industry - just piecing together the components of a reliable system won't necessarily give you one, and if you can't build one properly, then don't go advertising that you have one. Don't you think the fact that the LJ people are now planning to buy their own UPS equipment to use on top of the facility's should tell you something?
Oh, and regarding six nines of uptime - I don't think you realize how little downtime that would actually allow. It's about 30 seconds per year, and Livejournal has been down for at least 16 hours, which corresponds to an uptime of about 99.8% - only two nines left. They probably (hopefully!) won't fall down to one, but things are bad enough as it is, and I, at least, fully blame Internap for that (and, again, I'm a paying user on LJ, so I reserve the right to do just that. ^_~)
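For anyone who wants to check the arithmetic above, here is a rough sketch in Python. The function names are made up for illustration, and it assumes a plain 365-day year with the 16-hour outage being the only downtime in that year.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000; assumes a non-leap year


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for N nines of availability (6 -> 99.9999%)."""
    return SECONDS_PER_YEAR * 10 ** (-nines)


def availability_percent(downtime_hours: float) -> float:
    """Availability over one year, given the total hours of downtime."""
    return 100.0 * (1 - downtime_hours * 3600 / SECONDS_PER_YEAR)


def whole_nines(availability_pct: float) -> int:
    """Number of complete nines in an availability figure (99.8 -> 2)."""
    unavailability = 1 - availability_pct / 100.0
    # The small epsilon guards against float rounding at exact boundaries.
    return math.floor(-math.log10(unavailability) + 1e-9)


if __name__ == "__main__":
    print(f"six nines budget: {allowed_downtime_seconds(6):.1f} s/year")
    print(f"16 h outage: {availability_percent(16):.2f}% uptime")
    print(f"whole nines left: {whole_nines(availability_percent(16))}")
```

Under those assumptions the six-nines budget comes out at roughly 31.5 seconds per year, and a 16-hour outage leaves about 99.8% uptime, which matches the figures in the comment.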
Re:Internap is *down*? (Score:2)
by Anonymous Coward on Friday January 14, @10:52PM (#11370963)
It says nothing of Bush or Internap. It says everything about cheapskate blog admins who think they can run servers without paying for battery backup.*
not really. 100+ servers going down and getting major downtime because turning them on is slow is a major cockup. what would be interesting to know would be if some of those 100+ servers were indeed on some plan or another that implied they'd be on battery backup.
Re:Internap is *down*? (Score:3, Interesting)
"Our data center (Internap) lost all its power, including redundant backup power, for some unknown reason. (unknown to me, at least) We're currently dealing with bringing our 100+ servers back online. Not fun. We're not happy about this. Sorry...
Update #1, 7:35 pm PST: we're up on 'dirty' power for now (it works, but it's unreliable), and we're working to assess the state of the databases. The worst thing we could do right now
Re:Bush supporter too dumb to understand datacente (Score:3, Interesting)
You can have all the great redundant mains and backups you want, and it's for shit if you only have one power line to the system and that power bus loses juice.
What a cock (Score:5, Insightful)
Perhaps shit happens, and a blog service doesn't warrant the necessary investment to survive whatever caused this outage?
Re:What a cock (Score:2)
but internap didn't deliver.
read the site.. they were on redundant power.. which turned out to not be that redundant after all(only possible explanation really is a major cockup by someone...).
Re:What a cock (Score:3, Insightful)
Then they'd either need multi-gigabit bandwidth between the two co-los (which would probably cost for a week what they make per year), or they'd have to make separate, semi-independent communities. Google's servers don't stay in sync - you get different results according to which servers you hit, which isn't something you can do with "live" journals.
Re:What a cock (Score:2, Insightful)
THIS doesn't reflect poorly on them. their licensing scheme for movabletype does.
Internap Sucks (Score:4, Interesting)
I seem to remember that a few years back they had a similar problem (Internap lost all power) and it turned out that some idiot had hit the big red "shut down all power to the entire datacenter" emergency button. This isn't the first time this has happened, and last time it wasn't under Six Apart's management.
I'd say it's Internap's incompetence that caused this problem. If they can't keep their datacenter running even though they have multiple redundant power supplies then something is very wrong. I see from the outage page that LJ people are now planning to buy their own UPS so that they don't have to trust Internap anymore.
For power outages, my house has a better record than Internap right now, and I don't even own a UPS!
Re:Internap Sucks (Score:3, Informative)
I don't know what happened this time, but the ~2002 Internap Seattle outage was caused by an idiot Speakeasy tech who couldn't figure out how to use the exit door, so
i don't get it (Score:2, Funny)
sounds like good news (Score:4, Funny)
Was that really called for? (Score:3, Insightful)
Ok, I understand that you don't like Six Apart; I'm no fan of their new licensing scheme either. However, I really doubt that SixApart has any control over any power failures that might occur at Internap.
Re:Was that really called for? (Score:2)
Re:Was that really called for? (Score:2, Informative)
ANGST!!! (Score:2, Funny)
Oh. Slashdot.
That's what you get (Score:2, Funny)
Al Borland was nailing Heidi behind the stage when the outage occurred.
Where were the APC backups?
Please, Please, Please! (Score:4, Informative)
A disturbance I feel (Score:4, Funny)
I wonder (Score:2)
Re:I wonder (Score:2)
Melodramatic (Score:2, Funny)
You can just imagine it... (Score:2)
Actual Link (Score:2)
poor internap (Score:3, Interesting)
Seriously though, spammer-nap is a massive spam haus, see for yourself [spamhaus.org]
Disclaimer: I am Not an Electrical Engineer (Score:5, Informative)
I know nothing of how InterNap is set up. I just want to throw that out there ahead of time. Now, it's time for my patent pending "Bull Shit Theory of the Day."
Ok, here is the rant. I used to work for a Colocation facility. Nothing special, small by Telco terms. The whole facility only had about 1500 cabinets. (Though I hear they are now full, and going to be expanding.)
We had a main power draw off of the local grid. We had a backup power draw off of the *next* city's power grid. (ie, when all the offices around us went dark, we still had power.) And you don't even want to know the kind of red tape we had to go through for *that* pull. I'm still not sure how they did it. We had fly wheel kinetic electricity storage systems, battery backups, and a diesel engine from a train so large it had its own building.
We used to joke that if we lost power, we had more important things to worry about. And again, we were small time compared to some of the massiveness that is out there. *cough*AADS Chicago*cough*
So I'm kind of in agreement with the statement currently on LiveJournal. It's unknown to me how any self-respecting colo facility can say "We've had a power outage that also took our redundant systems."
I have to call bullshit on that entire train of thought. If that's true then they don't *have* any redundant systems, and I'd be looking for a new provider. The most likely thing (at least in my mind) is that someone, somewhere got mad at something specific and decided to make a point by popping the main breaker to their portion of the facility.
Oh, that was another thing, each room had several "main" breakers. It took a hell of a power surge to pop all of them, and the Liebert systems had power filters of some kind, really really big capacitors or something I think, so a surge really never made it to the other side anyway, it got stored in the cap and then trickled out like the rest of the power.
But I was a UNIX admin, not the EE that was planning the power generation aspects of the facility. So take some of it with grains of whatever white powdered spice you prefer.
Re:Disclaimer: I am Not an Electrical Engineer (Score:2, Insightful)
Re:Disclaimer: I am Not an Electrical Engineer (Score:5, Interesting)
I recently (~8 months back) did some contract work for a small company whose servers were based in some colo facility in San Francisco. One of the first things I noticed was a damn heavy UPS at the bottom of their rack. Weird, I thought -- why not rely on the colo's battery system?
Because they don't have one.
Mind you, this was also the colo that had a cardkey system that had long ago stopped being usable, so when you needed access you used a Radio Shack $29.99 wireless intercom system and someone would come to open the door, and when you checked in they carefully wrote your name on a little nametag.
I think standards have slipped, significantly. In some respects, this is likely a good thing -- it means you have more options now, because you can choose either the super duper "we hook up to two countries' power grids, have eight flywheels and a direct feed from microwaves in orbit" or the "err, here's your cabinet. We'll give you decent power until we don't" options.
So
Oh, wait...
Re:Disclaimer: I am Not an Electrical Engineer (Score:3, Informative)
Re:Disclaimer: I am Not an Electrical Engineer (Score:5, Informative)
To all the people accusing LJ of being stupid for not having UPS systems, Internap has 3 fully redundant power systems (yes, I know, didn't help much) so most people probably don't feel the need to run their own ups.
Re:Disclaimer: I am Not an Electrical Engineer (Score:3, Interesting)
I find this Risk site [ncl.ac.uk] to be very interesting reading, especially when it talks about some failure issues and scenarios.
My favourite was about the squirrel that took down the Nasdaq [ncl.ac.uk]. (I've also heard squirrels/mice/rats etc called "self propelled short circuits", but that's another story)
Now, I've been involved in systems architecture design, planning, and management for years, and I think that a lot of people drastically underestimate just how fsc
The Big Red Switch (Score:4, Funny)
That does happen. I remember working at Purolator Courier's data center in NJ back in -- oh, geez, mid-80s some time. I was a third shift print operator, helped out with the mag tape library too. One night the trouble alarm went off on the fire suppression panel. We'd been having trouble with it all week, and the alarm guy was due in in the morning. One of the newbie operators -- the only one at the console at the time, the others being on a smoke break or asleep in the tape library -- panicked and went over to the annunciator panel. He opened it as I watched him from the console area. I think he thought the halon was about to dump because he reached around the panel and instead of hitting the halon dump abort, he hit the emergency power cutoff.
BLAM! It was as if a firecracker went off as all the breakers tripped and the fans came to a sighing halt. Both on this floor -- the one with the console and the tape drives -- and the floor above, with the CPU and the disk farms. Dead as a doornail.
Now, this was Purolator COURIER. We had AIRPLANES coming in to land at Indy center and as of this moment, no way to tell the crews which gate to go to, where to unload their stuff, or how to sort it.
Not only that, but this was an IBM mainframe shop -- S/390, the Big Iron, with 3380 disk drives. You don't just flip the power switch back on. An emergency power cutoff blows breakers in the power supplies on those DASD strings. The IBM Field Engineer was duly dispatched and arrived with cases of breakers the next morning. But we were still dark when I got off shift the following morning.
The next night a brand new plexiglass cover was mounted over the Big Red Switch.
Gee, I wonder... (Score:5, Funny)
"Update #1, 7:35 pm PST: we're up on 'dirty' power for now (it works, but it's unreliable)".
Congrats to LiveJournal for assembly a coal generator in a record time.
Re:Gee, I wonder... (Score:2)
-assembly
+assemble
Re:Gee, I wonder... (Score:3, Informative)
Update (Score:3, Interesting)
Update #1, 7:35 pm PST: we're up on 'dirty' power for now (it works, but it's unreliable), and we're working to assess the state of the databases. The worst thing we could do right now is rush the site up in an unreliable state. We're checking all the hardware and data, making sure everything's consistent. Where it's not, we'll be restoring from recent backups and replaying all the changes since that time, to get to the current point in time, but in good shape. We'll be providing more technical details later, for those curious, on the power failure (when we learn more), the database details, and the recovery process. For now, please be patient. We'll be working all weekend on this if we have to.
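The "restore from recent backups and replay all the changes since" step in that update can be pictured with a toy sketch. This is not LJ's actual tooling, which the update doesn't describe; the `Change` record, the sequence numbers and the in-memory dict standing in for a database are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Change:
    seq: int               # position in the change log (hypothetical)
    key: str
    value: Optional[str]   # None means the row was deleted


def restore_and_replay(backup: Dict[str, str], backup_seq: int,
                       log: List[Change]) -> Dict[str, str]:
    """Start from the backup, then apply every logged change made after it."""
    state = dict(backup)  # never mutate the backup itself
    for change in sorted(log, key=lambda c: c.seq):
        if change.seq <= backup_seq:
            continue  # already reflected in the backup
        if change.value is None:
            state.pop(change.key, None)
        else:
            state[change.key] = change.value
    return state


if __name__ == "__main__":
    backup = {"a": "1", "b": "2"}
    log = [Change(9, "a", "old"), Change(11, "a", "3"),
           Change(12, "b", None), Change(13, "c", "4")]
    print(restore_and_replay(backup, backup_seq=10, log=log))
```

The point of the sketch is only the ordering: the backup gives a known-good starting state, and replaying the later changes in sequence brings it up to the current point in time.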
Lovely. I just bought another year's subscription for my wife, figuring the change to Six Apart wouldn't change anything for a few months at least. LJ could lose a lot of subscribers with an outage just after the takeover.
Look at me! Look at me! (Score:4, Funny)
Six Apart is hosting them already? (Score:4, Insightful)
Last post before blackout: (Score:2)
"LiveJournal Servers Go Down" (Score:5, Funny)
With thousands of teenage girls unable to ponder in an open forum whether or not to blow their boyfriends, thousands of teenage girls go down.
Where's my irony stick? (Score:5, Insightful)
Re:Where's my irony stick? (Score:3, Interesting)
If the datacenter that hosts Slashdot was to have a massive power failure how long would
That said, my company has gear in the same datacenter as LJ, and our servers were back up 10 minutes after power was restored. Then again we use Oracle on HP-UX with nice SAN RAID boxes for storage for our database. So our stuff tends to recover from a sudden power loss a little better than a MySQL derivative running on clone hardware.
No... (Score:5, Insightful)
What does Six Apart have to do with Internap? Livejournal has been using - and wanting to switch from - Internap for a long time.
Value of Livejournal - "Open Source Philosophy" (Score:5, Interesting)
Re:Value of Livejournal - "Open Source Philosophy" (Score:3, Insightful)
Oh yes. If I ever feel the need to post any of those quiz-things I make good use of the <lj-cut> tag. So if anyone on my Friends list (or a random person finding my Journal) doesn't want to see the results they don't have to.
Actually one of the more useful LJ features I know of is one that allows you to screen out images over a set size from your Friends list. So you need to view the entry in question to see the image, which is good for your bandwidth and/or narrow page layout.
Before you get all down on LJ... (Score:4, Informative)
Not related to Six Apart (Score:3, Insightful)
From the article write-up (and reflecting the thoughts of quite a few of the comments I just read):
I'd love to know what makes you think this has anything to do with Six Apart. The very first line at http://www.livejournal.com states:
They've been with Internap for years, predating Six Apart's takeover. Unless LJ staff is lying, the fault here sounds like it lies entirely with Internap.
And as far as I can tell, Six Apart didn't ditch the LJ team when they bought them out, so you probably have the exact same people working on bringing the site back up now as you would have if Six Apart had never got involved.
bigger explanation (Score:5, Insightful)
That being said, LJ's servers are back up now, but they're making sure that the databases are all in sync -- LiveJournal has one of the most massive distributed MySQL clusters in existence along with a complete caching system.
They need to make sure that the database is all synchronized before bringing it back up -- chances are they're going to rebuild the cache too. If they didn't, the initial strain on the DB servers would probably bring the site down again.
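To make the cache point concrete, here is a minimal sketch of warming a cache before opening the doors. It is a guess at the general idea, not a description of how LJ does it: the dict stands in for the cache, and `load_from_db` and the list of hot keys are hypothetical.

```python
from typing import Callable, Dict, Iterable


def warm_cache(cache: Dict[str, str], hot_keys: Iterable[str],
               load_from_db: Callable[[str], str]) -> int:
    """Preload the expected hot keys; returns how many DB reads it took."""
    loaded = 0
    for key in hot_keys:
        if key not in cache:
            cache[key] = load_from_db(key)
            loaded += 1
    return loaded


def read_through(cache: Dict[str, str], key: str,
                 load_from_db: Callable[[str], str]) -> str:
    """Normal request path: serve from cache, fall back to the DB on a miss."""
    if key not in cache:
        cache[key] = load_from_db(key)
    return cache[key]


if __name__ == "__main__":
    db_hits = []

    def fake_db(key: str) -> str:   # stand-in for a real database query
        db_hits.append(key)
        return key.upper()

    cache: Dict[str, str] = {}
    warm_cache(cache, ["a", "b"], fake_db)
    db_hits.clear()
    read_through(cache, "a", fake_db)
    print("DB hits after warm-up:", db_hits)
```

With a cold cache every first request falls through to the database at once; warming it first spreads those reads out before the traffic arrives.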
This does, however, bring up some questions about LiveJournal's network infrastructure. Danga (the creators of LJ, recently purchased by Six Apart) are heavy users of Perl and MySQL. Needless to say, they have made numerous contributions to both projects and have developed an innovative memory caching system for Linux.
The questions raised, however, come from Perl and MySQL. Both are questionable in terms of scalability. Although I'm not qualified to comment on this, I believe that the general consensus is that MySQL is one of the least efficient databases today. Livejournal has 100+ servers. I honestly don't think that a system the size of LiveJournal should require a server cluster that big. It seems that they are trying to solve their performance/reliability problems by blindly throwing hardware at it.
Of course, I love livejournal. It's simple, easy to use, and is a great tool for building communities. Just as it is simple, it can also be incredibly nerdy (there's actually a command prompt!). They're also completely open source.
Hopefully, Six Apart can make their network infrastructure more 'professional' while still maintaining the community spirit that has made it so successful.
Re:bigger explanation (Score:5, Insightful)
Sure, MySQL has its flaws -- some of them pretty big -- but we can work around them.
As for the "not needing a server cluster that big" -- do you have any clue how much data we push in an average day? We maintain so many DB clusters to improve reliability, and we maintain so many web nodes because we push a screaming shitload of traffic.
Re:bigger explanation (Score:3, Informative)
This will be bad PR... (Score:3)
Never mind common sense; it won't matter that if SixApart can be held responsible for failures at InterNAP's colocation facilities, they're a much bigger -- and more powerful -- company than most people have ever given them credit for...
Makes me wanna laugh (Score:3, Interesting)
They even bragged to me how their network uptime SLA is 100%! I mean good god, now I find out this is the SECOND time it's happened (from the livejournal update site)???
I'm glad I didn't go with them...
Like, what's wrong with you, people? (Score:3, Interesting)
I have a few "friends" there at LJ, some of them net.celebs, and I like their posts. It's a matter of whose writing you find interesting, and you are free to be completely unaware of the rest. Why all the vitriol?
Re:./ed !!!! (Score:2)
Re:./ed !!!! (Score:4, Informative)
Oh hey, Slashdot just went down as I was typing this. Smooth.
Re:./ed !!!! (Score:5, Informative)
Anyway, as of Google's last crawl of the stats page [216.239.57.104] (shortly before the outage), there were almost 6 million LJ users, a little under half of those "active." I don't know if
Besides, a lot of the DB load on Slashdot is eased tremendously by Memcached [danga.com], developed by... Danga Interactive, i.e. LJ. Wikipedia uses it too, and just started using Perlbal [danga.com]. (And I do mean "just" [danga.com]) Ditto for Audioscrobbler/Last.fm. So
Re:./ed !!!! Server Reboot Time? (Score:5, Interesting)
This is another thing that bothers me about this scenario. I can't say that I've ever adminned 100 servers, the most I've ever had was about 30, but if we had a power loss of any kind, you'd just repower them and walk away. Most of them were DEC Alpha gear running Tru64. Why would you spec out a box that has to be handheld every reboot? The only time you should have to handhold a server is during an upgrade. A power cycle without proper SIGHUP or term signals should just run fsck on its way back up. (K, so it might take an hour for the server to go live again, but still.) I mean, am I missing something here? Maybe since nothing I've adminned got the traffic these things do .... I'm just lost. Someone hit me with the clue by four.
The only thing I can even think of is they have explicit services that must be started manually ..... but why would you want that? If you have a power hiccup in the middle of the night, you want it to come back up, and be live and happy again *before* you even get the first page. I mean sure, if there was a surge, and that destroyed components, and those components have to be replaced ..... but ..... a reboot is a reboot, man. Here, smoke some source. It's the good stuff.
Re:./ed !!!! Server Reboot Time? (Score:5, Insightful)
But we intentionally don't have databases come back up on boot because if there was a blip, we want to do an integrity check first. (we run InnoDB, so it's ACID, but we're paranoid
We have clusters of 2 identical databases in separate cabinets, separate switches, separate Internap power feeds... so normally losing one database in each cluster doesn't matter: the other one gets used. But when we lose every single database, in all clusters, all at once... that's the time to be paranoid and double check stuff.
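A small sketch of the policy described above, as I read it: within a pair, use whichever database has passed its integrity check, and refuse to serve from anything that came back from a power loss unchecked. The host names, the `State` enum and the function are invented for illustration and are not LJ's code.

```python
from enum import Enum
from typing import Dict, Optional


class State(Enum):
    VERIFIED = "verified"      # integrity check passed, may serve traffic
    UNVERIFIED = "unverified"  # back after a power loss, not yet checked
    DOWN = "down"


def pick_database(cluster: Dict[str, State]) -> Optional[str]:
    """Return a verified host from the pair, or None if nothing is safe."""
    for host, state in cluster.items():
        if state is State.VERIFIED:
            return host
    return None  # every copy is suspect: stay offline until a check passes


if __name__ == "__main__":
    # Normal case: one half of the pair is lost, the other takes over.
    print(pick_database({"db1a": State.DOWN, "db1b": State.VERIFIED}))
    # Whole-facility power loss: both halves need checking first.
    print(pick_database({"db1a": State.UNVERIFIED,
                         "db1b": State.UNVERIFIED}))
```

The second case is the one in this outage: redundancy inside a cluster helps when one copy dies, but not when every copy loses power at the same moment.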
Re:./ed !!!! Server Reboot Time? (Score:3, Informative)
I'd have to agree with the AC, Brad, stop posting to slashdot and hover over that DB rebuild a bit more.
(Yes, posting to slashdot relieves tension... Whatever it takes, Brad.)
Re:./ed !!!! Server Reboot Time? (Score:5, Informative)
So lots of waiting now on the checksum validators. I don't want to put a machine back in and find out in a week there was a database page that was corrupt because the battery-backed write-back cache on the RAID card didn't work as advertised. (which happens on about 95% of RAID cards, in my experience, because they're mostly crap, even the most expensive ones...)
Also, whenever there's any doubt about something's integrity, we back up or snapshot the potentially corrupt version before operating on it. That operation can take time too.
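For the curious, the two habits mentioned here (validate page checksums, and keep a copy of the suspect version before touching it) look roughly like this in miniature. This is a toy model using CRC32 over byte strings, not InnoDB's on-disk page format or LJ's real procedure; every name in it is hypothetical.

```python
import zlib
from typing import List


def make_page(payload: bytes) -> bytes:
    """Store a 4-byte CRC32 in front of the payload (toy page layout)."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def corrupt_pages(pages: List[bytes]) -> List[int]:
    """Indexes of pages whose stored checksum no longer matches the data."""
    bad = []
    for index, page in enumerate(pages):
        stored = int.from_bytes(page[:4], "big")
        if zlib.crc32(page[4:]) != stored:
            bad.append(index)
    return bad


def repair_from(pages: List[bytes], good_copy: List[bytes]) -> List[bytes]:
    """Snapshot the suspect pages first, then overwrite the bad ones."""
    snapshot = list(pages)  # the potentially corrupt version, kept as-is
    for index in corrupt_pages(pages):
        pages[index] = good_copy[index]
    return snapshot


if __name__ == "__main__":
    good = [make_page(b"alpha"), make_page(b"beta")]
    damaged = list(good)
    damaged[1] = damaged[1][:-1] + bytes([damaged[1][-1] ^ 0xFF])
    print("corrupt pages:", corrupt_pages(damaged))
    snapshot = repair_from(damaged, good)
    print("repaired:", damaged == good)
```

Keeping the snapshot costs time and disk, but it means a botched repair can always be retried from the original damaged state.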
It's going to be a fun night.
Re:./ed !!!! Server Reboot Time? (Score:4, Funny)
if my wife can't post this weekend, i'm gonna hear about it. and not even be able to post to my lj about getting yelled at about lj being down as if i caused the power outage myself.
not really.
well maybe.
Cheers.
Re:It can happen (Score:2)
Re:It can happen (Score:2)
Re:It can happen (Score:2)
+"the data center"
You do realize that they don't own Internap, and that the *data center* died, right? They (Danga/Six Apart) probably were paying for battery backup and generator backup. It is not even *close* to their fault that the co-lo center didn't have such things, or had them improperly set up.
Re:Bad IDea. (Score:3, Informative)
The Slashdot effect is more visible because we send all our readers to one place at the same time, while LJ is highly distributed.
Re:A Good Laugh (Score:2)
more like tetanus (Score:2)
lj == lockjaw
Blog, blog, blog, blog... (Score:3, Funny)
Lovely blog!
Wonderful blog!
Blog blo-o-o-o-o-og blog blo-o-o-o-o-og blog.
Lovely blog! Lovely blog!
Lovely blog! Lovely blog!
Lovely blog!
Blog blog blog blog!
-- The Viking Blog Song
Re:Slashdot blogging for a fix (Score:3, Interesting)
But I too find it irritating that a service I use, that is supposed to be backed up (my acco