LiveJournal Servers Go Down 596

Wind writes "According to any journal hosted off of LiveJournal.com, the LiveJournal data center Internap has suffered a critical power failure, leaving all of LiveJournal and its content temporarily offline and requiring the revival of 100+ servers. Perhaps Six Apart wasn't quite prepared for the responsibilities of a website of this size? Updated information is posted here."
This discussion has been archived. No new comments can be posted.

  • Internap is *down*? (Score:5, Informative)

    by MightyTribble ( 126109 ) on Friday January 14, 2005 @11:33PM (#11370833)
    Internap *down*?
    Bush just appointed Internap's CEO to his National Infrastructure Advisory Council [tmcnet.com], yet the man can't keep a co-lo facility switched on.

    I'm not sure what that says of Bush or of Internap. And it certainly doesn't seem to have anything to do with SixApart.

  • by La Camiseta ( 59684 ) <me@nathanclayton.com> on Friday January 14, 2005 @11:39PM (#11370883) Homepage Journal
    Use the Coralized link [nyud.net]. No sense in crashing their status page. Plus it'll respond a lot quicker than loading the actual web page.
  • by Peter Cooper ( 660482 ) on Friday January 14, 2005 @11:50PM (#11370953) Homepage Journal
    To be honest, a deal was announced what.. a week ago? I seriously doubt Six Apart has control over anything at this point.
  • by ebooher ( 187230 ) on Friday January 14, 2005 @11:53PM (#11370971) Homepage Journal

    I know nothing of how InterNap is set up. I just want to throw that out there ahead of time. Now, it's time for my patent pending "Bull Shit Theory of the Day."

    Ok, here is the rant. I used to work for a Colocation facility. Nothing special, small by Telco terms. The whole facility only had about 1500 cabinets. (Though I hear they are now full, and going to be expanding.)

    We had a main power draw off of the local grid. We had a backup power draw off of the *next* city's power grid. (i.e., when all the offices around us went dark, we still had power.) And you don't even want to know the kind of red tape we had to go through for *that* pull. I'm still not sure how they did it. We had flywheel kinetic electricity storage systems, battery backups, and a diesel engine from a train so large it had its own building.

    We used to joke that if we lost power, we had more important things to worry about. And again, we were small time compared to some of the massiveness that is out there. *cough*AADS Chicago*cough*

    So I'm kind of in agreement with the statement currently on LiveJournal. It's unknown to me how any self respecting colo facility can say "We've had a power outage that also took our redundant systems."

    I have to call bullshit on that entire train of thought. If that's true then they don't *have* any redundant systems, and I'd be looking for a new provider. The most likely thing (at least in my mind) is that someone, somewhere got mad at something specific and decided to make a point by popping the main breaker to their portion of the facility.

    Oh, that was another thing, each room had several "main" breakers. It took a hell of a power surge to pop all of them, and the Liebert systems had power filters of some kind, really really big capacitors or something I think, so a surge really never made it to the other side anyway, it got stored in the cap and then trickled out like the rest of the power.

    But I was a UNIX admin, not the EE that was planning the power generation aspects of the facility. So take some of it with grains of whatever white powdered spice you prefer.

  • Re:Bad IDea. (Score:3, Informative)

    by jamie ( 78724 ) <jamie@slashdot.org> on Friday January 14, 2005 @11:57PM (#11370999) Journal
    It is. Slashdot gets about 1/10th the pageviews of LJ.

    The Slashdot effect is more visible because we send all our readers to one place at the same time, while LJ is highly distributed.

  • Re:What? (Score:1, Informative)

    by Anonymous Coward on Saturday January 15, 2005 @12:21AM (#11371140)

    Ah I always thought there was redundant power backups for just such an occasion ?

    No doubt that something went wrong there, but that doesn't change the fact that it's the data centre's responsibility to supply power, so only a complete moron would suggest SixApart were to blame.

  • by TrevorB ( 57780 ) on Saturday January 15, 2005 @12:23AM (#11371154) Homepage
    For those people who might not know, Brad Fitzpatrick is Livejournal User #1.

    I'd have to agree with the AC, Brad, stop posting to slashdot and hover over that DB rebuild a bit more.

    (Yes, posting to slashdot relieves tension... Whatever it takes, Brad.)
  • by Anonymous Coward on Saturday January 15, 2005 @12:24AM (#11371155)
    My friend's company is hosted by Internap. Today he messaged me when the power went down. It was only power to the second floor; my friend's servers (on the 3rd floor), while cut off from the internet, were still running. Internap has redundancy and backup generators (and enough fuel onsite to run for 30 days without external power). Apparently there was construction occurring on the second floor... my guess is that some dipshit contractor cut through a power cable or 3 and took the whole floor down.

    To all the people accusing LJ of being stupid for not having UPS systems, Internap has 3 fully redundant power systems (yes, I know, didn't help much) so most people probably don't feel the need to run their own UPS.
  • Actually, about a year ago, they had some months of bad performance and gave all paid members an additional 2mo (or so, I forget exactly) of paid member-level service, free of charge.
  • by mizalaina ( 849977 ) on Saturday January 15, 2005 @12:36AM (#11371232) Homepage
    I work at a co-lo facility now. The problem is probably that what people call redundant power often isn't highly available, nor is usage distributed correctly across the primary and redundant circuits. If one half of your power fails and you've mis-used or overloaded your redundant circuit then the redundant circuit is going to fail when it can't take the load that gets switched over to it. This is a result of poor planning.

    Keep in mind that often people have back-up power that's not conditioned, which is what is indicated by LJ's message. If the power were redundant and both sides were through UPSes, there would be no dirty power at all. A lot of co-lo facilities go on the cheap and their back-up power is just another circuit from a different transformer or a different Hydro company. So think about it: if the grid, transformer or power switching infrastructure fails, and you only have one back-up generator that also fails, or your UPS batteries can't take the pressure, or any of two dozen other things, your power has gone bye-bye.

    My prediction (which we are already seeing at my job) is that power and cooling are the Next Big Problems for co-lo. With blade servers demanding 220V, 30A 3-phase power and pulling 8kVA in 6U of space, no data centre as currently designed will be able to handle that on the scale we're going to see develop in the next year or two. People assumed power and cooling were unlimited resources. We were wrong. Oops!

    BTW, if what LJ is saying is true, this has little to do with Six Apart or Danga. It's Internap's fault within that particular data centre. The sales engineers/technical consultants/whatever they're called at Internap should have thought about this and pushed for audits, but they probably didn't. I doubt Danga knew enough about the potential problem to make good decisions about it: they're just a customer and assumed that the power would work. It's an infrastructure thing, and while the customer should educate themselves, they often don't. It's why I bug my customers constantly with power audits and suggestions.

    Just something to think about. :)
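The overloaded-redundant-circuit problem described in the comment above comes down to one check: if either feed fails, can the surviving feed carry the combined load? A minimal sketch in Python, assuming a simple two-feed (A/B) setup. The function name, the 80% continuous-load derating, and the example figures are illustrative assumptions, not details of Internap's facility or the poster's.

```python
def survives_single_feed_failure(load_a_amps, load_b_amps, breaker_amps, derating=0.8):
    """Return True if one feed can carry both loads after the other fails.

    Assumes two identical feeds (A and B), each behind a breaker rated
    `breaker_amps`. `derating` is the fraction of the breaker rating treated
    as safe continuous load (0.8 is an assumed rule of thumb, not a quoted spec).
    """
    usable = breaker_amps * derating
    combined = load_a_amps + load_b_amps
    return combined <= usable


# Each 30 A feed looks comfortable at 14 A, but failover puts 28 A on one feed.
print(survives_single_feed_failure(14, 14, 30))  # False: 28 A exceeds 24 A usable
print(survives_single_feed_failure(10, 10, 30))  # True: 20 A fits within 24 A usable
```

A power audit of the kind the poster mentions would run this sort of check per rack or per circuit pair, using measured rather than nameplate loads.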
  • by bradfitz ( 23252 ) on Saturday January 15, 2005 @12:39AM (#11371245) Homepage
    At this point all my whiteboards are full of boxes for each database cluster: the machines in that cluster, which have passed their checksum tests (InnoDB checksums each 16k page), which have replayed their replay/undo logs, where in the binlogs each was writing/reading/executing, etc...

    So lots of waiting now on the checksum validators. I don't want to put a machine back in and find out in a week there was a database page that was corrupt because the battery-backed write-back cache on the RAID card didn't work as advertised. (which happens on about 95% of RAID cards, in my experience, because they're mostly crap, even the most expensive ones...)

    Also, whenever there's any doubt about something's integrity, we back up or snapshot the potentially corrupt version before operating on it. That operation can take time too.

    It's going to be a fun night.
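The per-page checksum validation mentioned in the comment above can be illustrated with a toy version. This is a sketch only: the page layout (a CRC32 stored in the last 4 bytes of each 16 KiB page) and the names `make_page` and `find_corrupt_pages` are assumptions made for the example, and it is not InnoDB's actual on-disk format or checksum algorithm.

```python
import zlib

PAGE_SIZE = 16 * 1024   # 16 KiB pages, the size mentioned in the comment
CHECKSUM_SIZE = 4       # assumed layout: CRC32 in the last 4 bytes of a page


def make_page(payload: bytes) -> bytes:
    """Build one page: payload zero-padded to size, followed by its CRC32."""
    body_size = PAGE_SIZE - CHECKSUM_SIZE
    if len(payload) > body_size:
        raise ValueError("payload too large for one page")
    body = payload.ljust(body_size, b"\x00")
    return body + zlib.crc32(body).to_bytes(CHECKSUM_SIZE, "big")


def find_corrupt_pages(data: bytes) -> list:
    """Return the indexes of pages whose stored checksum does not match."""
    if len(data) % PAGE_SIZE != 0:
        raise ValueError("data is not a whole number of pages")
    bad = []
    for index in range(len(data) // PAGE_SIZE):
        page = data[index * PAGE_SIZE:(index + 1) * PAGE_SIZE]
        body, stored = page[:-CHECKSUM_SIZE], page[-CHECKSUM_SIZE:]
        if zlib.crc32(body).to_bytes(CHECKSUM_SIZE, "big") != stored:
            bad.append(index)
    return bad


# Three good pages, then flip one byte in the middle page to simulate a bad write.
pages = bytearray(b"".join(make_page(f"row {i}".encode()) for i in range(3)))
pages[PAGE_SIZE + 100] ^= 0xFF
print(find_corrupt_pages(bytes(pages)))  # [1]
```

Scanning every page this way is slow on large data files, which fits the "lots of waiting" the poster describes.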
  • by Bloodlent ( 797259 ) <iron_chef_sanjiNO@SPAMyahoo.com> on Saturday January 15, 2005 @12:50AM (#11371300)
    Just remember it's not ALL obnoxious, over-emotional teen-angst teenage girls. I use mine to showcase (non-depressing)poetry and make intelligent comments about intelligent topics. Basically, if someone makes an LJ about their own life, it sucks. If you can manage to write an LJ and make it about things that matter to more people than just you(ie, "Why Bush's Iraqi war is unjust" vs. "Why this babe I know should bang me"), and at the same time make it funny and enjoyable to read, then you have a good LJ. Most LJs DO suck, but there are some diamonds in the rough.
  • Re:Gee, I wonder... (Score:3, Informative)

    by Rie Beam ( 632299 ) on Saturday January 15, 2005 @01:15AM (#11371426) Journal
    Sir, I will fight your advice until my grave.
  • Update 2 (Score:2, Informative)

    by KinkifyTheNation ( 823618 ) on Saturday January 15, 2005 @02:17AM (#11371649) Journal
    Update #2, 10:11 pm: So far so good. Things are checking out, but we're being paranoid. A few annoying issues, but nothing that's not fixable. We're going to be buying a bunch of rack-mount UPS units on Monday so this doesn't happen again. In the past we've always trusted Internap's insanely redundant power and UPS systems, but now that this has happened to us twice, we realize the first time wasn't a total freak coincidence. C'est la vie.
  • by supersat ( 639745 ) on Saturday January 15, 2005 @02:22AM (#11371666)
    According to some LiveJournal employees, a massive UPS exploded. From IRC:

    <rahaeli> As far as we can tell, a UPS exploded.

    Their site now says that they're buying their own UPSes, because this is the second time that the entire data center has lost power. Details on the first outage can be found here [216.239.63.104] (a Google cache since LJ is down).

    For the paranoid: This has nothing to do with Six Apart buying LJ. They're still in the same "world-class" data center they've been in for years.
  • Re:./ed !!!! (Score:4, Informative)

    by Hooded One ( 684008 ) <hoodedone@gmai l . c om> on Saturday January 15, 2005 @03:02AM (#11371753) Journal
    You do realize that LiveJournal handles far more traffic [alexa.com] than Slashdot, and when Slashdot got linked on the front page of LJ, Slashdot started spewing out errors (more than normal)?

    Oh hey, Slashdot just went down as I was typing this. Smooth.
  • by andfarm ( 534655 ) on Saturday January 15, 2005 @03:14AM (#11371777)
    InnoDB *is* MySQL.
  • by ces ( 119879 ) <christopher...stefan#gmail...com> on Saturday January 15, 2005 @06:28AM (#11372275) Homepage Journal
    The co-location facility in question has plenty of backup UPS power with plenty of generator capacity behind that. Supposedly there is enough generator capacity to fully power everything in the building including the network TV station even with one generator out.

    The UPS gear in Internap's space is all top-of-the-line big datacenter grade stuff. Apparently there was some sort of wiring fault in one of the new UPSes they were bringing online that caused both building power to fail and the self-protection circuits in all of their UPSes to trip.

    IOW it was either a faulty UPS or a faulty wiring job by the electrical contractor.

    LiveJournal isn't the only one who got burned by this outage. The colocation facility in question is supposedly one of the most solid in the state and nothing short of a direct strike from a comet is supposed to be able to take it offline. My company was in the same boat as our gear is in the same facility as LiveJournal's.

    Sure, both LiveJournal and the company I work for could have hedged our bets by having redundant gear in another facility in another state, but that is a pain in the ass, especially when backend databases are involved. To tell you the truth it probably isn't really worth the bother unless you truly have a need for six nines of uptime.
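To put "six nines" in perspective, the yearly downtime budget for a given number of nines is simple arithmetic. A quick sketch, assuming a 365-day year; the function name is made up for the example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (3 means 99.9%)."""
    unavailability = 10 ** -nines
    return SECONDS_PER_YEAR * unavailability


for n in (3, 4, 5, 6):
    print(f"{n} nines: {allowed_downtime_seconds(n):,.1f} s/year")
```

Six nines works out to roughly 31.5 seconds of downtime a year, so a multi-hour outage like this one uses up that budget many times over.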
  • Re:./ed !!!! (Score:5, Informative)

    by Hooded One ( 684008 ) <hoodedone@gmai l . c om> on Saturday January 15, 2005 @08:17AM (#11372498) Journal
    The Alexa link was the only tangible example I could find. I distinctly recall seeing a post by Brad himself mentioning how much more traffic LJ handles, but obviously I can't link to it at the moment.

    Anyway, as of Google's last crawl of the stats page [216.239.57.104] (shortly before the outage), there were almost 6 million LJ users, a little under half of those "active." I don't know if /. has any stats available, but skimming through this page, the highest UID I see is in the 800,000 range. I'm not going to even attempt to guess what the relative activity level of LJ users is compared to /., or which has bigger pages or whatever, but I would offhand say that LJ probably handles more image traffic (user pictures, and now the in-testing photo hosting service). I know they used to use Akamai for that, but I seem to recall that fairly recently they switched over to doing something else. (I think they handle it themselves again, but I'm not sure.) There's also the audio files from phone posts. I'd say there's little question that LJ is the more heavily trafficked site.

    Besides, a lot of the DB load on Slashdot is eased tremendously by Memcached [danga.com], developed by... Danga Interactive, i.e. LJ. Wikipedia uses it too, and just started using Perlbal [danga.com]. (And I do mean "just" [danga.com]) Ditto for Audioscrobbler/Last.fm. So /. isn't in much of a position to pooh-pooh the technical ability of Brad/LJ.
  • by Council ( 514577 ) <rmunroe@gmaPARISil.com minus city> on Saturday January 15, 2005 @09:26AM (#11372659) Homepage
    Livejournal is something like 65:35 female:male.
  • by elemental23 ( 322479 ) on Saturday January 15, 2005 @12:06PM (#11373213) Homepage Journal
    Perhaps you're new here, but italicized text in Slashdot stories is written by the story submitter. Editorial comments, if any, are not in italics. In other words, Michael didn't say anything at all in this story.

    That said, the story submitter is clearly trolling himself, as neither 6A's nor LJ's staff had anything to do with the massive power failure at their co-lo.
  • Re:Internap Sucks (Score:3, Informative)

    by Scott Laird ( 2043 ) on Saturday January 15, 2005 @12:10PM (#11373242) Homepage
    A couple points. First, there's *nothing* that you can do about the "idiot hit the big red button" problem--you're required by law to have the button, because it's a safety issue. It has to be accessible--you can't lock it in a closet. And everyone knows that if you put a big red button on a wall, sooner or later someone's going to hit it.

    I don't know what happened this time, but the ~2002 Internap Seattle outage was caused by an idiot Speakeasy tech who couldn't figure out how to use the exit door, so he hit the Big Red Button instead.

    I worked for Internap at the time, and I spent weeks stuck inside that colo facility. It was basically the only "dot-com" grade thing that Internap built (they were usually somewhat thrifty, at least pre-2001). It sparkled. Everything was over-engineered. You had to go through multiple rounds of security to get access to anything.

    The last I heard last night, no one quite knew what'd happened yet. Apparently, multiple redundant power systems all failed at the same time. This facility was designed by a company that already had ~5 years experience running high-end colo facilities, and it was designed as the flagship facility for showing off to potential customers. This isn't a hole-in-the-wall hosting place, it's more of a bunker hiding in the shadow of the Space Needle. So, frankly, it'll be very interesting to see what happened, because no money was spared to keep this sort of thing from happening.

    (Disclaimer: I haven't worked for Internap since 2002. I still own a bit of stock, because it's not worth the hassle of selling it for what little it's worth. It's not really the same company now that it was when I started in '98, and only a handful of my former coworkers are still with the company. I'm not even going to *start* with my opinion of the current management.)
