
Power Outage Takes Wikimedia Down 577
Baricom writes "Just a few weeks after a major power outage took out well-known blogging service LiveJournal for several hours, almost all of Wikimedia Foundation's services are offline due to a tripped circuit breaker at a different colo. Among other services, Wikimedia runs the well-known Wikipedia open encyclopedia. Coincidentally, the foundation is in the middle of a fundraising drive to pay for new servers. They have established an off-site backup of the fundraising page here until power returns."
This is why you don't turn Google down (Score:5, Funny)
Coincidence... ;) (Score:5, Funny)
"You see, guys? This is what could happen if we ever ran out of money. Now cough up some dough!"
Re:Coincidence... ;) (Score:5, Funny)
Re:Coincidence... ;) (Score:5, Interesting)
Seriously though, if you like wikipedia, consider donating, even if it's just 5 bucks. I think it's even tax deductible if you itemize.
Re:Coincidence... ;) (Score:5, Interesting)
You do know that Wikipedia receives something like 100 times the traffic Slashdot does, right?
Re:Coincidence... ;) (Score:3, Interesting)
Re:Coincidence... ;) (Score:5, Informative)
Re:Coincidence... ;) (Score:2, Funny)
Re:Coincidence... ;) (Score:4, Insightful)
Aaaaand... (Score:5, Funny)
Don't worry, we'll take care of your backup servers in the meantime.
Re:Coincidence... ;) (Score:3, Interesting)
What Happened. (Score:5, Informative)
At about 14:15 PST some circuit breakers were tripped in the colocation facility where our servers are housed. Although the facility has a well-stocked generator, this took out power to places inside the facility, including the switch that connects us to the network and all our servers.
What's wrong?
After some minutes, the switch and most of our machines had rebooted. Some of our servers required additional work to get up, and a few may still be sitting there dead but can be worked around.
The sticky point is the database servers, where all the important stuff is. Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state. Attempting to bring up the master database and one of the slaves immediately after the downtime showed corruption in parts of the database. We're currently running full backups of the raw data on two other database slave servers prior to attempting recovery on them (recovery alters the data).
If these machines also can't be recovered, we may have to restore from backup and replay log files, which could take a while.
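For the curious, "running full backups of the raw data prior to attempting recovery" boils down to copying the InnoDB files aside before letting crash recovery rewrite them. A minimal Python sketch of the idea (hypothetical paths, not Wikimedia's actual procedure; it assumes mysqld is shut down first):

    import shutil
    import subprocess
    from datetime import datetime

    DATADIR = "/var/lib/mysql"          # hypothetical path to the MySQL data directory
    BACKUP_ROOT = "/backup/mysql-raw"   # hypothetical destination with enough free space

    def snapshot_datadir():
        # InnoDB crash recovery rewrites the data files, so keep the
        # pre-recovery bytes around in case recovery makes things worse.
        dest = "%s/%s" % (BACKUP_ROOT, datetime.now().strftime("%Y%m%d-%H%M%S"))
        shutil.copytree(DATADIR, dest)
        return dest

    if __name__ == "__main__":
        # Make sure mysqld is really down before copying the files.
        subprocess.call(["mysqladmin", "shutdown"])
        print("raw copy saved to", snapshot_datadir())
        # Only now restart mysqld and let InnoDB attempt crash recovery.

If recovery then makes matters worse, the pristine copy can be put back and a different approach tried (a more cautious innodb_force_recovery run, or the backup-plus-log-replay route mentioned above).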
Comment removed (Score:5, Interesting)
Re:mysql bad at disaster recovery? (Score:4, Insightful)
Re:mysql bad at disaster recovery? (Score:3, Informative)
Re:mysql bad at disaster recovery? (Score:3, Insightful)
I mean, if a few servers' databases survived, that may speak more to random luck: they simply weren't in a vulnerable state when the power outage occurred, so nothing bad happened. If all of the databases had survived, that would speak to MySQL being resistant to this sort of thing.
Re:mysql bad at disaster recovery? (Score:3, Informative)
Re:mysql bad at disaster recovery? (Score:5, Interesting)
I don't know if PostgreSQL has similar problems, but I very much doubt that Oracle or DB2 do. I know that improved failover support has been a target of the PSQL developers for a little while now, so while it may not be on par with Oracle and DB2 it's probably closer than MySQL. At least for now.
I wish this had prompted management to consider alternatives to MySQL, at least for our mission-critical database servers, but unfortunately it hasn't. They don't even see that we could sell an enterprise-level RDBMS as a significant feature - we're a webhosting company - and charge through the nose for it. Oh well. They don't listen to peons like me, they just make me fix MySQL replication every two weeks.
Re:mysql bad at disaster recovery? (Score:5, Funny)
You forgot the part where you have to take the chicken across first, because the fox won't eat the grain if you leave them alone.
Re:mysql bad at disaster recovery? (Score:5, Informative)
http://www.postgresql.org/docs/8.0/interactive/wal.html [postgresql.org]
Re:mysql bad at disaster recovery? (Score:3, Informative)
Unfortunately, for webhosting the demand for MySQL is higher than for the other available DBMSes, since most of the available open source and gratis software that requires a database was originally developed against MySQL. I would much prefer to be using PostgreSQL for the applications I run with a hosting provider, but the apps I use don't function with it, and the hosting provider (NearlyFreeSpeech.net) doesn't offer anything else anyway.
I figure once the advantages of the other DBMS s
Re:mysql bad at disaster recovery? (Score:5, Interesting)
Regardless, the difficulty of the task is not the main issue. The main issue is that we are dealing with north of 1GB of data here, and on busy servers on a busy network that means restarting replication takes an hour or longer. So not only is performance reduced by 33% when we take the slaves offline one at a time, it is reduced further by the tar/scp traffic running in the background. Not to mention that, because we hold a lock on the master's DB for the duration, you can't even consider the DB cluster fully functional.
Re:mysql bad at disaster recovery? (Score:3, Insightful)
As far as the script, yes, it does have locks, and rightly so. It's not terribly tough to write a lock aware script. In my opinion, the replication setup is extremely easy to script. I'd much rather script it than sit in front of the console. Once I see it work, I know it will work every time, and I won't worry about something like me or a peer mist
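For anyone wondering what such a lock-aware re-seed script might look like, here's a rough Python sketch (hypothetical host names and credentials, the bulk copy left as a stub, and the MySQLdb driver assumed to be installed): freeze the master, record its binlog coordinates, copy the data, then point the slave at that position.

    import MySQLdb   # the MySQL-python driver; assumed installed

    def copy_data(master_host, slave_host):
        # Stub: in real life this is the slow part (tar + scp, rsync, or mysqldump).
        pass

    def reseed_slave(master_host, slave_host, user, passwd):
        master = MySQLdb.connect(host=master_host, user=user, passwd=passwd)
        slave = MySQLdb.connect(host=slave_host, user=user, passwd=passwd)
        m, s = master.cursor(), slave.cursor()

        # 1. Freeze writes on the master and record its binlog coordinates.
        m.execute("FLUSH TABLES WITH READ LOCK")
        m.execute("SHOW MASTER STATUS")
        log_file, log_pos = m.fetchone()[:2]

        # 2. Copy the data to the slave while the lock (and this connection) is held.
        copy_data(master_host, slave_host)

        # 3. Release the master, then point the slave at the recorded position.
        m.execute("UNLOCK TABLES")
        s.execute("STOP SLAVE")
        s.execute("CHANGE MASTER TO MASTER_HOST='%s', "
                  "MASTER_LOG_FILE='%s', MASTER_LOG_POS=%d"
                  % (master_host, log_file, int(log_pos)))
        s.execute("START SLAVE")

    # reseed_slave("db-master.example", "db-slave2.example", "repl_admin", "secret")

The detail that matters is keeping the connection holding the read lock open until the copy is done; the recorded coordinates are only valid for the data as it stood under that lock.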
Re:mysql bad at disaster recovery? (Score:3, Informative)
Nice post. I'd just like to add that Wikipedia deals with north of 170 GB, not counting images.
Re:mysql bad at disaster recovery? (Score:3, Interesting)
No. It can't. We have two concrete examples in this very page - one provided by Wikimedia, one provided by me - which directly contradict your statement. Maybe under some circumstances MySQL can handle reboots, but it's been proven already that it can't always do so. Perhaps your MySQL experience is not with high-load applications (at least not the level of load Wikimedia and my employer see).
I don't mean to diminish what you guys do, or question your abilities. I simply want to offer my perspective bec
Re:Easy, brain-dead sql db recovery (if possible) (Score:4, Interesting)
Keep a COMPLETE log of all the SQL statements that were applied to the database, IN the order they were used. This is obtained by having the application log each SQL statement to the SQL log file AFTER the statement is successfully executed.
When a database failure occurs, stop everything and 'replay' the backed-up SQL logfile (that's on a separate backup system) against an empty copy of the DB there. TADA! You are back in business, right up to the point of failure!
Read the Wikipedia page. That's exactly what they've done, but because the MySQL database got corrupted, instead of just falling back a few minutes, they may have to go right back to a full backup and replay the log since then, which takes a lot more time than replaying a few transactions.
The solution is to switch to a database that actually implements ACID (the second letter stands for "Consistency" and the last letter stands for "Durability", which is what failed here).
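The log-and-replay scheme described above is easy to sketch. In the toy Python below (made-up table and statements, SQLite standing in for the real database), each statement is logged durably before it is applied (before rather than after, unlike the suggestion above, so a crash can never leave an applied-but-unlogged change), and recovery simply replays the log against an empty copy:

    import os
    import sqlite3

    LOG = "statements.log"

    def apply_and_log(conn, statement):
        # Write-ahead: get the statement onto stable storage before applying it.
        with open(LOG, "a") as f:
            f.write(statement + "\n")
            f.flush()
            os.fsync(f.fileno())
        conn.execute(statement)
        conn.commit()

    def recover(db_path):
        # Rebuild an empty database by replaying the log, statement by statement.
        conn = sqlite3.connect(db_path)
        for statement in open(LOG):
            conn.execute(statement)
        conn.commit()
        return conn

    if __name__ == "__main__":
        live = sqlite3.connect("live.db")
        apply_and_log(live, "CREATE TABLE IF NOT EXISTS pages (title TEXT, body TEXT)")
        apply_and_log(live, "INSERT INTO pages VALUES ('Power outage', 'revision 1')")
        # ...pretend the power fails here...
        rebuilt = recover("rebuilt.db")   # a fresh copy, current to the point of failure

Real write-ahead logs do essentially this at the page level; the slow part in Wikimedia's case is that "replay the log" means a day's worth of edits, not a handful of statements.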
Re:mysql bad at disaster recovery? (Score:2)
It's not SATA (Score:3, Informative)
Every main database server had corrupt database pages. That is, three systems with battery-backed write-caching controllers and SCSI drives, and two SATA systems with write-caching SATA controllers (one battery-backed, the other not), from two different SATA disk drive makers.
Involved:
ACID (Score:3, Informative)
Re:mysql bad at disaster recovery? (Score:2)
Re:mysql bad at disaster recovery? (Score:3, Informative)
You would know this too if you had r
Re:mysql bad at disaster recovery? (Score:5, Informative)
Easily. See what those saying that MySQL can't do what MySQL does are promoting. :)
LiveJournal found that it had some disk systems which lied about having committed writes. They have a preliminary tool which copies what it's writing to disk to a networked system, and then compares the post-power-off, post-recovery state against what the disk system claimed it had committed. They plan to make it available to the community as time allows.
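The LiveJournal tool isn't public yet, but the idea is simple enough to sketch in Python (hypothetical logger host; MD5 used only as a block fingerprint): write blocks, fsync each one, report to another machine which blocks the disk claimed were durable, then compare the survivors against those claims after the power comes back.

    import hashlib
    import os
    import socket
    import struct

    BLOCK = 4096
    REMOTE = ("write-logger.example.org", 9999)   # hypothetical box that survives the outage

    def writer(path, n_blocks):
        # Write blocks, fsync each one, then tell the remote machine the disk
        # claims the block is durable.  A drive that lies about fsync will be
        # caught when the post-reboot contents don't match these claims.
        sock = socket.create_connection(REMOTE)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT)
        for i in range(n_blocks):
            data = os.urandom(BLOCK)
            os.pwrite(fd, data, i * BLOCK)
            os.fsync(fd)
            sock.sendall(struct.pack("!I", i) + hashlib.md5(data).digest())
        os.close(fd)

    def verify(path, acknowledged):
        # acknowledged: list of (block_number, md5_digest) recovered from the
        # remote log after the power comes back.
        lost = []
        with open(path, "rb") as f:
            for i, digest in acknowledged:
                f.seek(i * BLOCK)
                if hashlib.md5(f.read(BLOCK)).digest() != digest:
                    lost.append(i)
        return lost   # non-empty => the disk acknowledged writes it never kept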
I expect we're going to find the same at Wikipedia. Here's a pretty typical error log, this one from the server which was the master database server:
050222 5:11:12 InnoDB: Database was not shut down normally.
InnoDB: Starting recovery from log files...
InnoDB: Starting log scan based on checkpoint at
InnoDB: log sequence number 303 1283776146
InnoDB: Doing recovery: scanned up to log sequence number 303 1289018880
InnoDB: Doing recovery: scanned up to log sequence number 303 1294261760
InnoDB: Doing recovery: scanned up to log sequence number 303 1299504640
InnoDB: Doing recovery: scanned up to log sequence number 303 1304747520
InnoDB: Doing recovery: scanned up to log sequence number 303 1309990400
InnoDB: Doing recovery: scanned up to log sequence number 303 1315233280
InnoDB: Doing recovery: scanned up to log sequence number 303 1320476160
InnoDB: Doing recovery: scanned up to log sequence number 303 1325719040
InnoDB: Doing recovery: scanned up to log sequence number 303 1330961920
InnoDB: Doing recovery: scanned up to log sequence number 303 1336204800
InnoDB: Doing recovery: scanned up to log sequence number 303 1341447680
InnoDB: Doing recovery: scanned up to log sequence number 303 1346690560
InnoDB: Doing recovery: scanned up to log sequence number 303 1347688389
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 14 row operations to undo
InnoDB: Trx id counter is 1 935480064
050222 5:11:13 InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 8617985.
InnoDB: You may have to recover from a backup.
050222 5:12:20 InnoDB: Page dump in ascii and hex (16384 bytes):
Observe that the database engine went back to its last checkpoint, noticed the partial transaction and undid it, and was rolling forward through the write-ahead log when it encountered a database page which failed its checksum test. That failed checksum test is why I think it's a problem with the disk system lying about what was written. You can get that when a database page spans two drives in a stripe set and one has committed the update while the other hasn't.
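A toy illustration of why a torn page like that trips the checksum (an invented 16KB page layout with a trailing CRC, not InnoDB's actual on-disk format):

    import zlib

    PAGE = 16384        # InnoDB-sized page, but the layout below is invented
    BODY = PAGE - 4     # in this toy format the last 4 bytes are a CRC of the body

    def make_page(payload):
        body = payload.ljust(BODY, b"\0")[:BODY]
        return body + zlib.crc32(body).to_bytes(4, "big")

    def page_ok(page):
        return zlib.crc32(page[:BODY]).to_bytes(4, "big") == page[BODY:]

    old = make_page(b"old row data")
    new = make_page(b"new row data")

    # Torn write: the first stripe member committed the new half of the page,
    # the second still holds the old half (including the old checksum).
    torn = new[:PAGE // 2] + old[PAGE // 2:]

    print(page_ok(new))    # True  -- a fully written page verifies
    print(page_ok(torn))   # False -- the mixed page fails, much like the log above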
In more typical situations MySQL simply applies the updates and all is well. I've had a server set up so that it exceeded RAM with swap turned off, got killed every ten minutes for hours, and recovered every time.
Just to be complete:
Re:What Happened. (Score:2, Insightful)
Kudos to Wikimedia for actually explaining what happened and not just putting up a "This page is down, please try again later" message. Many people/companies/groups/etc. would be too proud or too afraid of bad publicity to actually explain the problem.
Re:What Happened. (Score:2)
Someone please remind me again why massive databases are not yet being implemented with simple discrete file storage on ReiserFS [namesys.com]. Sure, MySQL will be faster once in memory but it sounds like the price you pay is lack of robust storage and difficult backup/recovery -- probably the most important part of running a database.
Re:What Happened. (Score:2, Insightful)
Re:What Happened. (Score:5, Funny)
Oh, maybe one, out at the guard's desk.
Re:What Happened. (Score:3, Informative)
Re:What Happened. (Score:4, Interesting)
One that complies with building and safety codes, for starters. In every jurisdiction with which I'm familiar -- admittedly not even close to all of them -- it's actually against the law to have a battery unit inside a data center cage. It's a violation of the safety code. When fire and rescue personnel go into a commercial building, they have to be sure that the power is really off. If there's a battery lying around somewhere, shorting to ground through a desk or door frame for instance, it can cause big problems.
Ask around. I bet you'll find that your data center explicitly forbids customer-installed battery units.
News Update (Score:5, Funny)
They should ask for more... (Score:3, Informative)
But if they're going to cobble together some whitebox crap servers, and not change the architecture, they'll be right back to an outage next time it happens.
Re:They should ask for more... (Score:2, Insightful)
Re:They should ask for more... (Score:4, Insightful)
Re:They should ask for more... (Score:2)
Re:They should ask for more... (Score:2)
Re:They should ask for more... (Score:2)
Re:They should ask for more... (Score:4, Insightful)
Re:They should ask for more... (Score:3, Insightful)
Re:They should ask for more... (Score:3, Informative)
Re:They should ask for more... (Score:5, Interesting)
Another indictment of MySql (Score:5, Insightful)
Ya know, I just don't understand why so many projects with such high visibility and requirements for reliability use a toy database like MySQL.
Someone PLEASE tell me why. Because right now the only thing I can think is that people just don't know how to pronounce "Postgres".
Re:Another indictment of MySql (Score:5, Interesting)
Barring a couple of extreme exceptions, of course a modern database system should protect integrity in the case of a power failure, or any other sudden system failure (kernel panic, GPF, whatever). In the case of the much maligned SQL Server, you can hit the power button all you want mid-transaction and you're going to get a blister on your finger before the database is corrupted.
Re:Another indictment of MySql (Score:5, Insightful)
This is false. SQL Server 2000 (yeah, I know, instant mod-down) has a transaction log and so does Oracle and I'm sure every other half-decent database. ALL committed transactions are preserved and the data is in a consistent state.
MySQL does not have this, and the developers don't seem to care much about it. This is the problem with open source in general: if someone is just doing it for fun, they aren't going to spend any time on the stuff they don't care about personally.
Re:Another indictment of MySql (Score:5, Informative)
Think again. Techniques to do this have been around for years -- it's called stable storage. You just keep redundant copies of data that's changing, and use a neat and simple procedure to ensure that either they both get updated by a transaction, or the original data can be recovered. Certainly the most recent data might be lost, but there's no reason for the database to be corrupted or even in an inconsistent state.
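A bare-bones Python sketch of that two-copy scheme (the textbook version, not any particular database's implementation): write copy A and force it to disk before touching copy B, and on read take the first copy whose checksum verifies.

    import hashlib
    import os

    def _write_copy(path, data):
        with open(path, "wb") as f:
            f.write(hashlib.sha1(data).digest() + data)   # checksum prefix + payload
            f.flush()
            os.fsync(f.fileno())

    def stable_write(name, data):
        # Update copy A and force it to disk before touching copy B, so a
        # crash can tear at most one of the two copies.
        _write_copy(name + ".a", data)
        _write_copy(name + ".b", data)

    def stable_read(name):
        for suffix in (".a", ".b"):
            try:
                raw = open(name + suffix, "rb").read()
            except IOError:
                continue
            digest, data = raw[:20], raw[20:]
            if hashlib.sha1(data).digest() == digest:
                return data          # first copy that verifies wins
        raise IOError("both copies damaged -- data genuinely lost")

    stable_write("record", b"balance=100")
    print(stable_read("record"))

A crash can tear at most one copy, so recovery always finds either the new value (A intact) or the old one (A torn, B intact); the only thing at risk is the very last update.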
Re:Another indictment of MySql (Score:3, Informative)
For the rest, we'll see as we get to them and, for any that fail, look at whether it was the disk controller or the disk drive lying about having written the data to battery-backed RAM or to the disk surface.
Wikipedia hasn't suffered a day of downtime yet for this reason and looks to be d
Re:Another indictment of MySql (Score:5, Insightful)
But one didn't. That's a much more informative data point.
Re:Another indictment of MySql (Score:3, Insightful)
I am inclined to ask the database server vendor to see if they can find ways to protect against it and I've briefly discussed that already.
Re:Another indictment of MySql (Score:5, Informative)
I just love stupid trolls that can't even use Google.
Tsearch2 - full text extension for PostgreSQL [sai.msu.su]
DevX: Implementing Full Text Indexing with PostgreSQL [devx.com] - about Tsearch2.
Tsearch2 is included in the postgresql-contrib package of at least Debian and Novell/SuSE. Is that "out of the box" enough for a clueless MySQL user?
Power outages suck. (Score:2, Interesting)
I know the wikimedia folks are fundraising for more servers, but I wonder if this will provide more incentive to accept Google's offer?
More information here... (Score:5, Funny)
http://www.wikipedia.org/search?/power_
Re:More information here... (Score:5, Funny)
http://www.answers.com/topic/power-outage-1 [answers.com]
Re:More information here... (Score:2)
Join me, my friends! (Score:4, Funny)
Oh, great... (Score:3, Funny)
There's a lesson to be learned here (Score:3, Funny)
1) Get something other people love
2) Don't let them use it
3) Profit!
It doesn't hurt if you are running a fund drive at the same time, either.
ETA for read only service is now 2-4 hours. (Score:5, Informative)
I'll post followups to this post later, as we're closer to being fully recovered.
Mod parent up! (Score:2)
Re:ETA for read only service is now 2-4 hours. (Score:5, Informative)
Backup power supply? (Score:2)
Re:Backup power supply? (Score:4, Informative)
Distributed Wikipedia? (Score:2, Interesting)
Google seems to have succeeded in building a distributed platform. What about something similar to seti@home, which takes a chunk of each user's disk space and bandwidth and uses them to implement a virtual computer on which wikimedia projects may be run?
Surely someone is already working on something like this (pointers anyone??)
Re:Distributed Wikipedia? (Score:3, Insightful)
Re:Distributed Wikipedia? (Score:3, Insightful)
Re:Integrity? (Score:4, Interesting)
lame quotes rule (Score:5, Funny)
and then somebody diffed the change and made them speak again
URI to the Rescue (Score:4, Interesting)
Just caching copies, either actively with a redirection URL or passively in caching backbone webservers, isn't cutting it. Caching is always better suited to solving performance problems, and it creates its own concurrency and identity problems. Not to mention the publication limits of "opt-in" caches, like Coral or Google, which are an afterthought (and usually unknown) to the published object itself.
Google has a huge, high-performance URL lookup system. It's taken quite a bit of value from the Internet and from all the content creators it rides on, and it gives back quite a bit with its simple, fast, effective interface. Google is perfectly positioned to make its name truly synonymous with an Internet revolution (not just a pinnacle of search evolution) by implementing URIs. If Google let objects get looked up by a URI code as simple as, say, [A-Za-z0-9]+, it could get halfway to its namesake [google.com] in objects with just 28 "digits"; just 7 digits would cover each object instance in its database right now, dozens of times over.
If Google opened up such a URI protocol to anyone on the Web running such a "DIS" server, just like DNS, they could offload much of the work, avoid accusations of trying to "own the Internet", and improve their own service immeasurably, not least by making broken links in their database a quaint old curiosity. Will they rock our world? Or will another big player, like Archive.org, do it before Microsoft, desperate to distinguish MSN Search, ruins it for everyone with some kind of proprietary hack that favors MS objects?
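A back-of-the-envelope check on the parent's digit counts, assuming the 62-character [A-Za-z0-9] alphabet and an index on the order of 8 billion objects (roughly what Google claimed in late 2004):

    import math

    ALPHABET = 62                          # A-Z, a-z, 0-9
    GOOGLE_OBJECTS = 8e9                   # rough size of Google's index, late 2004

    print(math.log10(ALPHABET ** 28))      # ~50.2: 28 characters spans about 10**50 ids,
                                           # i.e. "halfway" (in digits) to a googol (10**100)
    print(ALPHABET ** 7 / GOOGLE_OBJECTS)  # ~440: 7 characters covers the index many times over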
Re:URI to the Rescue (Score:3, Interesting)
Re:URI to the Rescue (Score:4, Informative)
Re:URI to the Rescue (Score:2)
Ironic (Score:2)
Re:Ironic (Score:5, Informative)
170 gigs? (Score:2)
I wonder if Wikimedia will ship the whole of Wikipedia on a few bzipped DVD ISOs to people who want a not-so-up-to-date encyclopaedia. I was researching a period around 1200 AD; not much chance that data will change in the next few months.
And I DO wonder why another database company doesn't take up a mirror of Wikipedia, just to show the reliability, speed, scalability, etc. of their database.... great marketing tool, especially if you own al
Re:170 gigs? (Score:3, Interesting)
The vast majority of this space is taken up by revision histories (and those are compressed!) Periodic database dumps [wikimedia.org] are available for download. Image and multimedia uploads have been taking up a bigger share lately, but those are on a separate server which recovered just fine.
A German company has published an end-user-friendly CD-ROM of material from the German-language Wikipedia, but afaik no one's published an English-language edition yet.
Answers.com (Score:4, Informative)
What's the Name of Wikimedia's Colo? (Score:2)
Ron
Absolute power corrupts. (Score:3, Funny)
why, why, why? (Score:3, Insightful)
This is a negligible cost for something as integral as an active sync of the work that people have performed - for free.
Why is this not seen as important? "The wiki users will just recreate the material"? That's somewhat presumptuous.
Now, livejournal I can understand not doing this (as there are many clients which allow people to sync with their online journals and the material is fairly culturally worthless), but wikipedia? It's one of the better things on the Internet.
Oh...ok... (Score:3, Funny)
"Slashdot...outage reports for nerds! Stuff that doesn't matter to me!"
Lol!
-buf
MySQL not ACID (Score:3, Insightful)
At about 14:15 PST some circuit breakers were tripped in the colocation facility where our servers are housed. Although the facility has a well-stocked generator, this took out power to places inside the facility, including the switch that connects us to the network and all our servers. (Yes, even the machines with dual power supplies -- both circuits got shut off.)
After some minutes, the switch and most of our machines had rebooted. Some of our servers required additional work to get up, and a few may still be sitting there dead but can be worked around.
The sticky point is the database servers, where all the important stuff is. Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state.
(Bolding mine.) This proves that MySQL is not ACID; there is no way a power outage should be able to corrupt a database. This is not a troll, this is a simple conclusion. I really think that Wikipedia should switch to PostgreSQL, which is considerably more mature in terms of ACID compliance.
Taken down by CO$, coincidence or not! (Score:3, Funny)
He thinks I'm a god now!
perhaps I just inadvertently reached clear
Notice thier Database worries (Score:3, Funny)
Apparently one of their MySQL databases got corrupted as well. Figures. You'd think with all that volume they'd be wise enough to use a DB that can withstand a hard powercycle without losing data.
Just remember, friends don't let friends use MySQL for important data.
:::eyes UPS under table::: (Score:5, Funny)
:::eyes my UPS::::
::::ponders for a moment::::
:::eyes the serial cable that gracefully shuts down said computer in the event of a power failure::::
:::ponders some more::::
:::eyes the spare UPS sitting in the corner that used to be connected to a database server::::
Hmm, I think I'm almost onto something here, but I just can't seem to nail it down...
-Chris
Proper fundraising link (Score:4, Informative)
http://wikimedia.org/fundraising/ [wikimedia.org]
Latest news (Score:5, Informative)
From: Brion Vibber
Reply-To: wikipedia-l@wikimedia.org
To: Wikipedia-l, Wikimedia Foundation Mailing List, Wikimedia developers
Date: Tue, 22 Feb 2005 04:47:56 -0800
Subject: Re: [Wikipedia-l] Wiki Problems?
Brion Vibber wrote:
> There was some sort of power failure at the colocation facility. We're
> in the process of rebooting and recovering machines.
The power failure was due to circuit breakers being tripped within the colocation facility; some of our servers have redundant power supplies but *both* circuits failed, causing all our machines and the network switch to unceremoniously shut down.
Whether a problem in MySQL, with our server configurations, or with the hardware (or some combination thereof), most of our database servers managed to glitch the data on disk when they went down. (Yes, we use InnoDB tables. This ain't good enough, apparently.)
The good news: one server maintained a good copy, which we've been copying to the others to get things back on track. We're now serving all wikis read-only.
The bad news: that copy was a bit over a day behind synchronization (it was stopped to run maintenance jobs), so in addition to slogging around 170gb of data to each DB server we have to apply the last day's update logs before we can restore read/write service.
I don't know when exactly we'll have everything editable again, but it should be within 12 hours.
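Rough arithmetic behind that estimate (the link speed and overhead figures are my assumptions, not Brion's): copying roughly 170GB to each database server dominates the time, followed by a day's worth of log replay.

    DATA_GB = 170

    def copy_hours(data_gb, wire_mbit, efficiency=0.5):
        # Very rough: data volume divided by an effective wire speed.
        mb_per_s = wire_mbit / 8.0 * efficiency
        return data_gb * 1024 / mb_per_s / 3600

    print(round(copy_hours(DATA_GB, 1000), 1))   # ~0.8 hours per server on gigabit
    print(round(copy_hours(DATA_GB, 100), 1))    # ~7.7 hours per server on 100Mbit
    # Seeding several servers from the one good copy, then replaying a day of
    # update logs, makes a "within 12 hours" estimate look plausible.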
Not exactly (Score:2)
Homer: Me Homer, I'm running from PBS.
Re:Arghh (Score:2)
Re:Where are you guys hosting from? (Score:2, Informative)
Re:Stupid question... (Score:3, Informative)
Re:Stupid question... (Score:3, Insightful)
Re:Xenu Strikes Again! (Score:5, Informative)
Gee, you just had to mention the X-word! Now this thread won't load for most Scientologists because the keyword filters they were forced to install by their Church will see "Xenu" and block the site. After all the mere sight of the word could cause "pneumonia and death" if you haven't paid the Church of Scientology for the proper preparation.
Wikipedia's Xenu article has an interesting history if you look, as I did the other night when it was featured. Scientologists vandalize it regularly. You're supposed to pay them a half million (or some absurd sum of money) to find out about Xenu. After you find out, you're too embarrassed to admit to anybody that you paid a half million to learn that your problems are caused by bad science fiction, when you could have bought a house in Silicon Valley instead. So they obviously don't want a Wikipedia article giving away their half-million-dollar "trade secret" for free.
One trick I saw was to use HTML entities to spell out insults at the top of the article- like "only an idiot would believe this" or something. In the editor window, the entities weren't rendered and each letter appeared as a hex code.
A more effective attack took a different approach. The vandal in this case changed "Scientologists" to "Muslims", "Scientology" to "Islam", and inserted a boring-sounding sentence at the end of the first paragraph claiming that "Xenu" is another name that Muslims use for "Allah". It completely discouraged you from reading further. If you didn't know better you wouldn't find out how "Allah" distributed the thetans around volcanoes on various planets and blew them up with hydrogen bombs, and how their blown-up spirits cause problems in your personal life today.
This is OT, but what the hell, why not whack a beehive? Additional information on Xenu:
Operation Clambake [xenu.net] (Hubbard maintained that humans are descended from clams)
The Xenu leaflet [xenu.net] (all about Xenu- this information can save you lots of $$$$$)
The road to Xenu [chello.nl] (authored by a woman who got suckered)
The Google cache of Wikipedia's Xenu article [216.239.57.104] is also a must read.
I'm wondering if I'll get a lot of freaks, downmoderations, and hostile AC replies after I post this. After all, that's the kind of thing that Hubbard called "fair game". [keithhenson.org] If it sinks below default visibility I'll repost it again with my karma bonus, so you theta-clear-wannabes out there can save your points for someone else.
Re:Xenu Strikes Again! (Score:3, Interesting)
Oh well.. Slightly OT