The Internet

Power Outage Takes Wikimedia Down 577

Baricom writes "Just a few weeks after a major power outage took out well-known blogging service LiveJournal for several hours, almost all of Wikimedia Foundation's services are offline due to a tripped circuit breaker at a different colo. Among other services, Wikimedia runs the well-known Wikipedia open encyclopedia. Coincidentally, the foundation is in the middle of a fundraising drive to pay for new servers. They have established an off-site backup of the fundraising page here until power returns."
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward on Monday February 21, 2005 @10:29PM (#11741200)
    They'll turn the lights off.
  • by Faust7 ( 314817 ) on Monday February 21, 2005 @10:29PM (#11741204) Homepage
    Coincidentally, the foundation is in the middle of a fundraising drive to pay for new servers.

    "You see, guys? This is what could happen if we ever ran out of money. Now cough up some dough!"
  • What Happened. (Score:5, Informative)

    by Anonymous Coward on Monday February 21, 2005 @10:29PM (#11741206)
    What happened?
    At about 14:15 PST some circuit breakers were tripped in the colocation facility where our servers are housed. Although the facility has a well-stocked generator, this took out power to places inside the facility, including the switch that connects us to the network and all our servers.

    What's wrong?
    After some minutes, the switch and most of our machines had rebooted. Some of our servers required additional work to get up, and a few may still be sitting there dead but can be worked around.

    The sticky point is the database servers, where all the important stuff is. Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state. Attempting to bring up the master database and one of the slaves immediately after the downtime showed corruption in parts of the database. We're currently running full backups of the raw data on two other database slave servers prior to attempting recovery on them (recovery alters the data).

    If these machines also can't be recovered, we may have to restore from backup and replay log files which could take a while.
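    The cautious sequence described above (a raw backup before any recovery attempt, since recovery alters the data) can be sketched roughly as follows. Paths and file names are made up for illustration; a real MySQL data directory would be something like /var/lib/mysql.

```shell
# Stand-ins for the MySQL data directory and the backup destination.
DATADIR=$(mktemp -d)
BACKUPS=$(mktemp -d)

# Simulate the raw InnoDB files (in reality: ibdata1, ib_logfile0, ...).
touch "$DATADIR/ibdata1" "$DATADIR/ib_logfile0"

# 1. Stop the server first, e.g.:  mysqladmin shutdown
# 2. Snapshot the raw files BEFORE attempting recovery, so a failed
#    recovery can be retried from pristine data:
tar czf "$BACKUPS/innodb-raw.tar.gz" -C "$DATADIR" .

# 3. Only then restart mysqld and let InnoDB attempt crash recovery.
ls "$BACKUPS"
```

    Taking the snapshot while the server is cold is the whole point: InnoDB's recovery rewrites the data files, so an untouched copy is the only way back if recovery makes things worse.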
    • Comment removed (Score:5, Interesting)

      by account_deleted ( 4530225 ) on Monday February 21, 2005 @10:34PM (#11741252)
      Comment removed based on user account deletion
      • by YU Nicks NE Way ( 129084 ) on Monday February 21, 2005 @10:46PM (#11741322)
        There's a simple way around this: stick to PostgreSQL, MSSQL, Oracle, DB/2, or some other real database. MySQL doesn't make the grade, precisely because things like this can happen.
      • by ctr2sprt ( 574731 ) on Monday February 21, 2005 @10:50PM (#11741348)
        We have a similar problem at work. There we don't endure database corruption, we just get broken replication. It appears to be working, but it actually isn't. So we have to take the master offline (actually just acquire a write lock on the DB, it can still answer SELECTs), tar up its (massive) database, scp it to the slaves, start the master, stop the slaves, untar the database, restart the slaves, and restart replication. The entire process can take several hours and it's easy to make mistakes. We put stickers on our MySQL servers saying "DO NOT REBOOT WITHOUT CONTACTING OPS MANAGEMENT," though unfortunately faulty DIMMs are illiterate.

        I don't know if PostgreSQL has similar problems, but I very much doubt that Oracle or DB2 do. I know that improved failover support has been a target of the PSQL developers for a little while now, so while it may not be on par with Oracle and DB2 it's probably closer than MySQL. At least for now.

        I wish this had prompted management to consider alternatives to MySQL, at least for our mission-critical database servers, but unfortunately it hasn't. They don't even see that we could sell an enterprise-level RDBMS as a significant feature - we're a webhosting company - and charge through the nose for it. Oh well. They don't listen to peons like me, they just make me fix MySQL replication every two weeks.
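        The rebuild sequence described above can be sketched as a dry run. Host names, paths, and file names here are invented, and run() only records each step instead of executing it:

```shell
STEPS=$(mktemp)
run() { echo "$*" >> "$STEPS"; }   # dry run: record commands, don't execute

MASTER=db-master.example.com       # hypothetical hosts
SLAVES="db1.example.com db2.example.com"
DATADIR=/var/lib/mysql

# 1. Block writes on the master (SELECTs still succeed):
run "mysql -h $MASTER -e 'FLUSH TABLES WITH READ LOCK'"
# 2. Tar up the master's (massive) database:
run "ssh $MASTER tar czf /tmp/db.tar.gz -C $DATADIR ."
for s in $SLAVES; do
  # 3. Copy the snapshot to each slave, stop it, and unpack:
  run "scp $MASTER:/tmp/db.tar.gz $s:/tmp/"
  run "ssh $s 'mysqladmin shutdown; tar xzf /tmp/db.tar.gz -C $DATADIR'"
  # 4. Restart the slave and resume replication from the snapshot:
  run "ssh $s 'mysqld_safe & mysql -e \"START SLAVE\"'"
done
# 5. Release the write lock on the master:
run "mysql -h $MASTER -e 'UNLOCK TABLES'"

cat "$STEPS"
```

        Every step that touches a remote host is a chance to make a mistake, which is why the real process takes hours and goes wrong so easily.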

        • by Dachannien ( 617929 ) on Tuesday February 22, 2005 @12:15AM (#11741766)
          So we have to take the master offline (actually just acquire a write lock on the DB, it can still answer SELECTs), tar up its (massive) database, scp it to the slaves, start the master, stop the slaves, untar the database, restart the slaves, and restart replication.

          You forgot the part where you have to take the chicken across first, because the fox won't eat the grain if you leave them alone.

        • by TheNarrator ( 200498 ) on Tuesday February 22, 2005 @02:31AM (#11742334)
          PostgreSQL is far superior to MySQL in disaster recovery, thanks to its use of WAL (Write-Ahead Logging). I've been using PostgreSQL since version 7.0 came out and I've never had it fail to come back up on me after any power outage or reset.

          http://www.postgresql.org/docs/8.0/interactive/wal.html [postgresql.org]
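          For reference, the durability-related knobs in a PostgreSQL 8.0 postgresql.conf look roughly like this (parameter names are real; the values shown are illustrative, not a tuning recommendation):

```
fsync = true                 # flush WAL to disk at commit; turning this
                             # off trades crash safety for speed
wal_sync_method = fsync      # how WAL writes are forced to disk
wal_buffers = 8              # in-memory WAL buffers
checkpoint_segments = 3      # WAL segments between checkpoints
```

          With fsync left on, a committed transaction is on disk in the log before the client hears "OK", which is what makes the post-crash replay reliable.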

        • Unfortunately for webhosting, the demand for MySQL is higher than for the other available DBMSs, since most available open source software and gratis software that requires a database is going to have been developed originally with MySQL. I would much prefer to be using PostgreSQL for the applications I run with a hosting provider, but the apps I use don't function with it, and the hosting provider (NearlyFreeSpeech.net) doesn't offer anything else, anyway.

          I figure once the advantages of the other DBMSs

    • Re:What Happened. (Score:2, Insightful)

      by wakejagr ( 781977 )

      Kudos to Wikimedia for actually explaining what happened and not just putting up a "This page is down, please try again later" message. Many people/companies/groups/etc. would be too proud or too afraid of bad publicity to actually explain the problem.

    • Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state

      Someone please remind me again why massive databases are not yet being implemented with simple discrete file storage on ReiserFS [namesys.com]. Sure, MySQL will be faster once in memory but it sounds like the price you pay is lack of robust storage and difficult backup/recovery -- probably the most important part of running a database.

  • News Update (Score:5, Funny)

    by Anonymous Coward on Monday February 21, 2005 @10:31PM (#11741227)
    After returning from the power outage, the servers have just been slash-fried.
  • by PornMaster ( 749461 ) on Monday February 21, 2005 @10:33PM (#11741237) Homepage
    If they bought actual servers with dual power supplies and got power from multiple PDUs at their data center, they would be much better off. If this is really because of a tripped breaker, then it's pretty inexcusable, since dual power supplies fed from separate circuits would have prevented it... unlike the LJ outage which was from the power being cut to all circuits.

    But if they're going to cobble together some whitebox crap servers, and not change the architecture, they'll be right back to an outage next time it happens.
  • by Anonymous Coward on Monday February 21, 2005 @10:34PM (#11741248)
    Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state

    Ya know, I just don't understand why so many projects with such high visibility and requirements for reliability use a toy database like MySQL.

    Someone PLEASE tell me why. Because right now the only thing I can think is that people just don't know how to pronounce "Postgres".
  • Power outages suck, and a great way to protect against them is to distribute your project over a large area of electrical service.

    I know the wikimedia folks are fundraising for more servers, but I wonder if this will provide more incentive to accept Google's offer?
  • by Anonymous Coward on Monday February 21, 2005 @10:40PM (#11741281)
    I found this useful information about power outages:
    http://www.wikipedia.org/search?/power_outage
  • by mctk ( 840035 ) on Monday February 21, 2005 @10:44PM (#11741305) Homepage
    A power outage [abovetopsecret.com] has taken down [canoe.ca] wikipedia [wikipedia.com]! as [as.com] a community [vbulletin.com] we [petiteanglaise.com] must carry the torch [bonnint.net]!
  • by ral315 ( 741081 ) on Monday February 21, 2005 @10:48PM (#11741334)
    Even when the servers go back on, they'll be slashdotted.
  • by Raul654 ( 453029 ) on Monday February 21, 2005 @10:52PM (#11741368) Homepage
    As that economic genius, Eric Cartman taught us [tvtome.com]:

    1) Get something other people love
    2) Don't let them use it
    3) Profit!

    It doesn't hurt if you are running a fund drive at the same time, either.
  • by Jamesday ( 794888 ) on Monday February 21, 2005 @10:55PM (#11741386)
    So far one of our database servers has completed a successful recovery (we're working through them all). On a gigabit link it takes something between 90 minutes and 4 hours to rsync from one to another. As soon as we have two database servers working, we'll be restoring service in read only mode. Likely to be that 90 minutes to 4 hours from now as worst case.

    I'll post followups to this post later, as we're closer to being fully recovered.
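    A back-of-the-envelope check of that 90-minute-to-4-hour estimate, assuming 170 GB of data (the figure quoted elsewhere in this thread) and a 1 Gbit/s link:

```shell
GB=170                         # data to copy, in decimal gigabytes
seconds=$(( GB * 8 ))          # 170 GB * 8 bits/byte over 1 Gbit/s = 1360 s
minutes=$(( seconds / 60 ))    # ~22 minutes at full wire speed
echo "$minutes"
```

    So even at perfect wire speed the copy takes over 20 minutes; rsync checksumming, disk I/O, and protocol overhead account for the rest of the observed range.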
  • I remember my mail service provider [fastmail.fm] went offline too, a year or so back, due to a power failure, but fortunately they had diesel generators for backup power. Doesn't Wikimedia have the same facility?
  • by femto ( 459605 )
    Isn't raising money for servers a short term solution? Surely the real solution is to invest time and effort into finding a way to distribute wikipedia across the 'net?

    Google seems to have succeeded in building a distributed platform. What about something similar to seti@home, which takes a chunk of each user's disk space and bandwidth and uses them to implement a virtual computer on which wikimedia projects may be run?

    Surely someone is already working on something like this (pointers anyone??)

    • Well, distributing a wiki is a task a bit more complex than distributing a search index (async!) or seti@home (async). With async data arrays you don't care whether the packet you sent to some node is an hour or a day old. You do care about that in a wiki, because every user will be pressing the 'edit' button, and the data should be consistent everywhere. We are working on distribution.
      • Distributed caches - now majority of hits are served by caches, and some of them are offsite. It was a pilot project for a while and now we're
    • 170GB isn't that big and people routinely run far more critical stuff without any kind of exotic seti@home-like distribution. What's really inexcusable is the fact that a power failure caused database corruption that turned a 2 minute power outage into major downtime.
  • by mrpuffypants ( 444598 ) * <mrpuffypants@gm a i l . c om> on Monday February 21, 2005 @11:00PM (#11741417)
    it's as though 300,000 people cried out and were suddenly silenced ...

    and then somebody diffed the change and made them speak again
  • URI to the Rescue (Score:4, Interesting)

    by Doc Ruby ( 173196 ) on Monday February 21, 2005 @11:03PM (#11741430) Homepage Journal
    This outage, as well as our beloved slashdotting, is yet another argument for URIs, rather than just URLs. URLs are like IP#s; they're absolute pointers to specific object locations, in terms of the storage/retrieval interface of a single instance. URIs are virtual, like domain names. They are distributed in DNS, a Netwide database, updated for current lookup values for actual retrieval. URLs need the same kind of layer. Of course, some other characteristics of these objects must be reflected in the URI model that are not appropriate to IP#/domain names, like multiple identical copies, or perhaps versions.

    Just caching copies, either actively with a redirection URL, or passively in caching backbone webservers, isn't cutting it. Caching is better suited to solving performance problems, and it creates its own concurrency and identity problems. Not to mention the publication limits of "opt-in" caches, like Coral or Google, which are an afterthought (and usually unknown) to the published object itself. Google has a huge, high-performance URL lookup system. It's taken quite a bit of value from the Internet and all the content creators it rides on. It gives back quite a bit, with its simple, fast, effective interface. Google is perfectly positioned to make its name truly synonymous with an Internet revolution (not just a pinnacle of search evolution) by implementing URIs. If Google let objects get looked up by a URI code as simple as say, [A-Za-z0-9]+, it could get halfway to its namesake [google.com] in objects with just 28 "digits"; just 7 digits would cover each object instance in its database right now, dozens of times over. If Google opened up such a URI protocol to anyone on the Web running such a "DIS" server, just like DNS, they could offload much of the work, avoid accusations of trying to "own the Internet", and improve their own service immeasurably, not least by making broken links in their database a quaint old curiosity. Will they rock our world, or will another big player, like Archive.org do it, before Microsoft, desperate to distinguish MSN Search, ruins it for everyone with some kind of proprietary hack that favors MS objects?
    • URLs contain a domain name. Domain names already provide a level of indirection. Why can't we use that level of indirection for Wikipedia's problem? I don't see what URIs buy us -- if we're already not using the indirection we have, what does a second level give us?
      • Re:URI to the Rescue (Score:4, Informative)

        by Doc Ruby ( 173196 ) on Monday February 21, 2005 @11:19PM (#11741524) Homepage Journal
        Because domain names equate to a single IP# (even if that number changes) - a single instance of the object. A URI is just a unique ID across the whole Net, for an object class, which can have multiple instances. A good URI scheme will take different states of that class into account, like different versions of the object. Domain names, as implemented in DNS, can't give us the one (URI) to many (instances) mapping we obviously need to support scalability and distributed objects.
    • If you had used newlines and/or paragraphs I probably would have read it.
  • From the google cache of their hardware growth planning [64.233.161.104]:
    Question - don't you think a UPS system would also be a wise investment?
    • Re:Ironic (Score:5, Informative)

      by Jamesday ( 794888 ) on Monday February 21, 2005 @11:33PM (#11741583)
      Yes. I wrote that cached page and it's now a bit out of date. IF (and it's not certain) local fire regulations permit the use of UPS systems in the racks, we're going to be installing them. We decided on that after LiveJournal's unfortunate experience, but don't yet have them.
  • That ain't much. My older hard disk is 200 GB, and I'm planning to get a 400 GB one.

    I wonder if Wikimedia will ship the whole Wikipedia on a few bzipped DVD ISOs to people who want a not-so-up-to-date encyclopaedia. I was researching a period around 1200 AD; not much chance that data will change in the next few months.

    And I DO wonder why another database company doesn't take up a mirror of Wikipedia, just to show the reliability, speed, scalability etc. of their database.... great marketing tool especially if you own al
    • Re:170 gigs? (Score:3, Interesting)

      by brion ( 1316 )

      The vast majority of this space is taken up by revision histories (and those are compressed!) Periodic database dumps [wikimedia.org] are available for download. Image and multimedia uploads have been taking up a bigger share lately, but those are on a separate server which recovered just fine.

      A German company has published an end-user-friendly CD-ROM of material from the German-language Wikipedia, but afaik no one's published an English-language edition yet.

  • Answers.com (Score:4, Informative)

    by stevemm81 ( 203868 ) on Monday February 21, 2005 @11:30PM (#11741568) Homepage
    You can look things up on answers.com. They mirror Wikimedia, as well as other dictionaries/encyclopedias.
  • What's the name of Wikimedia's colo?

    Ron
  • by kiwidefunkt ( 855968 ) on Monday February 21, 2005 @11:33PM (#11741587) Homepage
    As soon as I saw "Power corrupts. Power failure corrupts absolutely" I thought, the damn commies [kapitalism.net] finally did it! But no, not hacked by commies...just by a renegade circuit breaker.
  • why, why, why? (Score:3, Insightful)

    by CAIMLAS ( 41445 ) on Monday February 21, 2005 @11:42PM (#11741630)
    Why were they not using battery backup on their database servers (i.e., their critical servers)? That way the servers would have the necessary 10 minutes (or whatever) to shut down the DBs and power off the systems cleanly.

    This is a negligible cost for something as integral as an active sync with the work that people have performed - for free.

    Why is this not seen as important? "The wiki users will just recreate the material"? That's somewhat presumptuous.

    Now, livejournal I can understand not doing this (as there are many clients which allow people to sync with their online journals and the material is fairly culturally worthless), but wikipedia? It's one of the better things on the Internet.
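    For what it's worth, a common setup at the time was apcupsd watching a serial-attached UPS and triggering a clean shutdown when the battery ran low. A fragment of apcupsd.conf might look like this (directive names are real; the values are illustrative):

```
UPSCABLE smart       # serial "smart" signalling cable
UPSTYPE apcsmart     # APC smart-protocol UPS
DEVICE /dev/ttyS0    # serial port the UPS is attached to
MINUTES 10           # begin a clean shutdown when ~10 minutes
                     # of battery runtime remain
```

    With something like this in place, the databases get flushed and stopped before the battery dies, instead of losing power mid-write.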
  • Oh...ok... (Score:3, Funny)

    by buffy ( 8100 ) * <buffy@p a r a p e t .net> on Tuesday February 22, 2005 @12:46AM (#11741881) Homepage
    So, _this_ is where I should be posting my outage reports! And here I've been sending them only to people who would care.

    "Slashdot...outage reports for nerds! Stuff that doesn't matter to me!"

    Lol!

    -buf
  • MySQL not ACID (Score:3, Insightful)

    by Tough Love ( 215404 ) on Tuesday February 22, 2005 @01:06AM (#11741975)
    From the wikipedia page:

    At about 14:15 PST some circuit breakers were tripped in the colocation facility where our servers are housed. Although the facility has a well-stocked generator, this took out power to places inside the facility, including the switch that connects us to the network and all our servers. (Yes, even the machines with dual power supplies -- both circuits got shut off.)

    After some minutes, the switch and most of our machines had rebooted. Some of our servers required additional work to get up, and a few may still be sitting there dead but can be worked around.

    The sticky point is the database servers, where all the important stuff is. Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state.


    (Bolding mine.) This proves that MySQL is not ACID; a power outage should never be able to corrupt a database. This is not a troll, just a simple conclusion. I really think that Wikipedia should switch to PostgreSQL, which is considerably more mature in terms of ACID compliance.
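    Whether InnoDB actually survives a power cut depends on configuration and hardware as well as the storage engine. The durability-oriented settings in a my.cnf fragment look like this (option names from MySQL of that era; shown for illustration, not a claim about Wikimedia's actual configuration):

```
[mysqld]
innodb_flush_log_at_trx_commit = 1   # fsync the InnoDB log at every commit
sync_binlog = 1                      # fsync the binary log per commit
```

    Even with both set, a disk or RAID controller with a volatile write cache can acknowledge an fsync before the data is really on the platters, which would produce exactly the kind of post-outage corruption described here.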
  • by friendscallmelenny ( 746745 ) on Tuesday February 22, 2005 @01:27AM (#11742079)
    Yesterday they had frontpage Scientology entry with Xenu stuff. I told my friend, "That site will be in trouble soon."
    He thinks I'm a god now!

    perhaps I just inadvertently reached clear

  • by iwadasn ( 742362 ) on Tuesday February 22, 2005 @01:52AM (#11742187)

    Apparently one of their MySQL databases got corrupted as well. Figures. You'd think with all that volume they'd be wise enough to use a DB that can withstand a hard power cycle without losing data.

    Just remember, friends don't let friends use MySQL for important data.

  • by shoemakc ( 448730 ) on Tuesday February 22, 2005 @02:02AM (#11742225) Homepage

    :::eyes my UPS::::

    ::::ponders for a moment::::

    :::eyes the serial cable that gracefully shuts down said computer in the event of a power failure::::

    :::ponders some more::::

    :::eyes the spare UPS sitting in the corner that used to be connected to a database server::::

    Hmm, I think I'm almost onto something here, but I just can't seem to nail it down...

    -Chris

  • by Jugalator ( 259273 ) on Tuesday February 22, 2005 @03:19AM (#11742454) Journal
    The link in the article is broken, here's the proper one:
    http://wikimedia.org/fundraising/ [wikimedia.org]
  • Latest news (Score:5, Informative)

    by saforrest ( 184929 ) on Tuesday February 22, 2005 @09:27AM (#11743602) Journal
    Posted on the mailing list wikipedia-l 32 minutes ago:

    From: Brion Vibber
    Reply-To: wikipedia-l@wikimedia.org
    To: Wikipedia-l, Wikimedia Foundation Mailing List, Wikimedia developers
    Date: Tue, 22 Feb 2005 04:47:56 -0800
    Subject: Re: [Wikipedia-l] Wiki Problems?

    Brion Vibber wrote:
    > There was some sort of power failure at the colocation facility. We're
    > in the process of rebooting and recovering machines.

    The power failure was due to circuit breakers being tripped within the colocation facility; some of our servers have redundant power supplies but *both* circuits failed, causing all our machines and the network switch to unceremoniously shut down.

    Whether a problem in MySQL, with our server configurations, or with the hardware (or some combination thereof), most of our database servers managed to glitch the data on disk when they went down. (Yes, we use InnoDB tables. This ain't good enough, apparently.)

    The good news: one server maintained a good copy, which we've been copying to the others to get things back on track. We're now serving all wikis read-only.

    The bad news: that copy was a bit over a day behind synchronization (it was stopped to run maintenance jobs), so in addition to slogging around 170gb of data to each DB server we have to apply the last day's update logs before we can restore read/write service.

    I don't know when exactly we'll have everything editable again, but it should be within 12 hours.
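    Replaying "the last day's update logs" presumably means feeding the binary logs written since the snapshot through mysqlbinlog. A dry-run sketch, with invented log file names and run() recording each step instead of executing it:

```shell
STEPS=$(mktemp)
run() { echo "$*" >> "$STEPS"; }   # dry run: record commands, don't execute

# After restoring the day-old copy, apply each binary log written since
# the snapshot, oldest first, to roll the databases forward:
for log in wiki-bin.101 wiki-bin.102 wiki-bin.103; do
  run "mysqlbinlog /var/log/mysql/$log | mysql -u root"
done

cat "$STEPS"
```

    The logs must be applied in order, which is part of why read/write service takes hours to restore on top of the 170 GB copy itself.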
