Major Outage At Amazon Web Services
ralphart writes "The Northern Virginia datacenter for Amazon Web Services appears to be having a major outage that affects EC2 services. The Amazon Forums are full of reports of problems. Latest update from the status page: 2:49 AM PDT We are continuing to see connectivity errors impacting EC2 instances, increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region, and increased error rates affecting EBS CreateVolume API calls. We are also experiencing delayed launches for EBS backed EC2 instances in affected availability zones in the US-EAST-1 region. We continue to work towards resolution."
No Way! (Score:5, Funny)
Re: (Score:3)
Re: (Score:2)
But it's not supposed to happen, because "if" (when!) it does, the impact is HUMONGOUS. "You're welcome to store all your data in our fast, easy and safe cloud storage. Downtime? Don't worry, it'll only experience hour-long outages intermittently." Yeah, that's how they sold it in the first place, isn't it?
This will become quite the event in data warehouse circles I bet, because the cost of 'being in the cloud' just doubled; it's not enough to buy storage from one provider. The "always there" quality that's supposedly the benefit of cloud storage is a facade.
Re: (Score:3, Insightful)
This will become quite the event in data warehouse circles I bet, because the cost of 'being in the cloud' just doubled; it's not enough to buy storage from one provider. The "always there" quality that's supposedly the benefit of cloud storage is a facade.
You can buy from one provider -- every major cloud provider has multiple availability zones. But yes, lots of people buy in only one zone because it's cheaper, and then suffer for that mistake -- in situations just like this.
Re:No Way! (Score:5, Insightful)
This will become quite the event in data warehouse circles I bet, because the cost of 'being in the cloud' just doubled; it's not enough to buy storage from one provider. The "always there" quality that's supposedly the benefit of cloud storage is a facade.
The cloud doesn't have to be perfect - it just has to be as good in the eyes of VPs as the contractors they'd otherwise hire to run their internal datacenter. What's the value of an IT guy in the eyes of an MBA? Yeah, this sort of reality check won't faze them at all.
Re: (Score:2)
We use a number of providers which means that even if Amazon fell over completely our systems would be fine -- it looks like a lot of sites (reddit, for instance) don't bother to do this.
Re: (Score:2)
It'd be an increase in traffic but not necessarily a huge increase in load (since they wouldn't be generating HTML at the second site unless they're in failover mode).
I don't know whether the increased reliability would be worth the extra load in their case, however, since I doubt they lose that much money from downtime (given how frequently they're down).
Re: (Score:2)
But how can this be possible? It's The Cloud . This sort of this simply doesn't happen.
To be fair to Amazon - on a good cloud (incl. Amazon's) you can launch instances in completely different data centers, so your most critical services have somewhere to fail over to.
Though, personally I'd feel even better if my nodes were distributed across two different clouds; to avoid the single-point-of-failure of the Amazon account itself. For example, despite running in both their East and West data centers, I'm still vulnerable to a sales/billing miscommunication that freezes my whole account.
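Rough sketch of what that two-region setup looks like, for the curious -- this uses the modern boto3 library purely for illustration (it didn't exist when this thread was written), and every AMI ID and instance type below is a placeholder, not anything from the post above:

```python
# Sketch: launch one node in each of two AWS regions so a regional outage
# leaves at least one copy of the service running. All IDs are placeholders.
import boto3

REGION_AMIS = {
    "us-east-1": "ami-east-placeholder",   # hypothetical AMI in the East region
    "us-west-1": "ami-west-placeholder",   # hypothetical AMI in the West region
}

def launch_redundant_nodes(instance_type="t3.micro"):
    launched = {}
    for region, ami in REGION_AMIS.items():
        ec2 = boto3.client("ec2", region_name=region)
        resp = ec2.run_instances(
            ImageId=ami,
            InstanceType=instance_type,
            MinCount=1,
            MaxCount=1,
        )
        launched[region] = resp["Instances"][0]["InstanceId"]
    return launched

if __name__ == "__main__":
    print(launch_redundant_nodes())
```

You still need something in front of both copies (DNS failover, your own front end, etc.) to make the redundancy useful, but the per-region launch itself really is that simple.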
Re: (Score:2)
Each data center also has independent zones.
It looks like in this case, only one zone in one data center was affected-- that's bad, but that's not "end-of-the-world" bad. If sites are going down, they should have been more careful to distribute redundant servers in different zones.
(Where this is a problem is if you're a small shop with a single DB server, and the zone holding your DB server goes down-- in that case you're kind of SOL.)
Re: (Score:2)
As an example, we run our production servers on EC2 East; they have load balancers failing them over between zones. The database and web servers are fine, and have been fine today.
The dev servers do not have load balancers running on them, and they have been choking in a miserable hell all morning.
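For reference, the load-balancer-across-zones setup described above boils down to something like this -- a hedged sketch with boto3 and a classic ELB, where the LB name, zones, and instance IDs are all made up for illustration:

```python
# Sketch: a classic ELB spread across two Availability Zones, so requests
# keep flowing if one zone goes dark. Names, zones, and IDs are placeholders.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

# Create a load balancer that lives in two zones at once.
elb.create_load_balancer(
    LoadBalancerName="prod-web-lb",
    Listeners=[{
        "Protocol": "HTTP",
        "LoadBalancerPort": 80,
        "InstanceProtocol": "HTTP",
        "InstancePort": 80,
    }],
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)

# Register one web server per zone behind it.
elb.register_instances_with_load_balancer(
    LoadBalancerName="prod-web-lb",
    Instances=[
        {"InstanceId": "i-0webserver1a"},  # hypothetical instance in 1a
        {"InstanceId": "i-0webserver1b"},  # hypothetical instance in 1b
    ],
)
```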
Re: (Score:2)
(Where this is a problem is if you're a small shop with a single DB server, and the zone holding your DB server goes down-- in that case you're kind of SOL.)
IMHO the main beauty of a cloud is that you're NOT SOL.
For one of the sites I manage, I am a small shop.
The beauty of a cloud is that with Amazon's $0.02/hr micro instances, and $0.007 spot-priced micro instances I can *still* do things right (failover to remote data center, backups in different data center), even for clients that can only afford under $50/month in hosting.
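If anyone's curious what that looks like in practice, here's a minimal sketch of requesting one of those cheap spot micro instances as a standby in a second region -- written with the modern boto3 library for illustration; the bid price, AMI, and instance type are placeholders, not current Amazon figures:

```python
# Sketch: bid on a one-time spot micro instance in a second region to use as
# a cheap standby/backup node. All values below are placeholders.
import boto3

ec2_west = boto3.client("ec2", region_name="us-west-1")

ec2_west.request_spot_instances(
    SpotPrice="0.007",          # maximum hourly bid, in dollars
    InstanceCount=1,
    Type="one-time",            # run once; don't re-request if outbid
    LaunchSpecification={
        "ImageId": "ami-standby-placeholder",  # hypothetical standby AMI
        "InstanceType": "t1.micro",
    },
)
```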
Re: (Score:2)
This sort of this simply doesn't happen.
Now we know: All it takes is one admin screwing up and replacing an "ng" with an "s".
Re: (Score:2)
But how can this be possible? It's The Cloud . This sort of this simply doesn't happen.
Yay, cloud!
Re:No Way! (Score:5, Informative)
A major outage on most professional cloud setups means it is down for a few hours. A major outage at work means the full day. It is like saying driving my car is so much safer than flying because I never got into an accident.
Last time I remember a day-long outage at work was 1994, and that was because the license server failed so we couldn't run our own software (we couldn't recompile it to remove the DRM because the compiler also needed a license to run).
I seem to remember that the Mac guys at the company also had a long outage when they couldn't connect to one of their Mac servers, but eventually someone actually went to the server room and discovered that it had been stolen.
Back on topic, I just don't see all these day-long outages that apparently seem to happen all the time in companies that haven't moved their servers to The Cloud(tm).
Re: (Score:3, Insightful)
"Back on topic, I just don't see all these day-long outages that apparenty seem to happen all the time in companies that haven't moved their servers to The Cloud(tm)."
You must not get out much. At Atlantic.net I had an 11-hour outage due to the staff not understanding how to update a Cisco router. Then a 4-hour outage when they screwed up billing and shut down our service with no warning. Then there was that time they didn't like our DNS traffic and shut down DNS with no warning or notice. That was a fun hour o
Re: (Score:3)
We were out for a good portion of the day Monday after a bird flew into the telephone pole outside our office and then caused a critical server to go wonky after the UPS battery ran out and we didn't have the auto-shutdown settings correct.
Re: (Score:3)
But when it's your gear, you have some control over the situation. When it's "in the cloud", you sit and get yelled at by the CXO and sweat if you'll still have a job while cloud provider X works to fix the problem (and their liability? whatever you paid for the service).
Re: (Score:2)
Annoying > Shitcanned.
Severe weather in Virginia likely the culprit (Score:3, Informative)
Re:Severe weather in Virginia likely the culprit (Score:5, Informative)
Re: (Score:2)
it's true: http://www.examiner.com/progressive-in-richmond/surry-power-station-under-repair-the-aftermath-of-tornado [examiner.com]
Tornado was Saturday. I live on the other side of the James River from Surry.
Re: (Score:2)
Yeah, that may be true, but it has nothing to do with anything going on in Northern Virginia. Surry is in Southeastern Virginia, over 150 miles away.
Re: (Score:2)
Well, that just about sums up the attitude of Northern Virginia towards the rest of the state.
Re: (Score:2)
There's a "rest of the state?" :)
(Also in NoVA, no outages or severe weather here)
Re: (Score:3)
Those news reports do not rule out the possibility that he's in a place in Northern Virginia without severe weather or a power outage. How do you conclude that he is wrong?
Re: (Score:2)
N. Va is not really that big. All the articles cited talk about VA, not NoVA.
Re: (Score:2, Informative)
First: Please look at a map. Surry County is east of Richmond on the way to VA Beach. An outage at Surry Power Station would not affect a data center over in Dulles, VA. That power station does not serve this area at all.
Second: Read the news. Every comment above is wrong in one way or another. Here is a local news article about what happened down there, if you are curious:
http://www.examiner.com/progressive-in-richmond/surry-power-station-under-repair-the-aftermath-of-tornado
You people know nothing,
Re: (Score:3)
Re: (Score:3)
Amazon's comments on the outage do not mention weather as a cause: http://status.aws.amazon.com/ [amazon.com]
"8:54 AM PDT We'd like to provide additional color on what were working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, whi
Re: (Score:2)
Re: (Score:2)
Your cloud computer is a Xen instance in Virginia, and your "EBS block storage" is an iSCSI target. Magic it ain't.
Re: (Score:2)
Re: (Score:2)
Not really half-assed from an implementation perspective, but from a marketing perspective. Amazon likes people to think it's magic, which would be fine if it worked flawlessly all the time. But it doesn't, because it's just a technical solution for a specific problem. Unless you run instances in multiple zones, use redundant EBS volumes, and your entire app is built to handle global redundancy, it's just not going to be 100% uptime out of the box. I fault Amazon for lying to technical-enough people.
Re: (Score:2)
"Essentially half-cloudassed clouding."
EC2 is just tools. It's as cloudassed as you make it.
I can take ESX and use a NetApp for data storage, and if my NetApp cluster takes a dive, I can't fail over to anything since my data is down.
On the other hand I can take EC2 and run apps and clustered DBs across the east and west coasts and put ELB in front of it. If the east coast takes a nuke, everything will keep on running.
Re: (Score:2)
Your cloud computer is a Xen instance in Virginia, and your "EBS block storage" is an iSCSI target. Magic it ain't.
There is no room for accurate or useful specifications in the flamboyant misrepresentation of marketing. Please enjoy the cuddly puppies and warm fuzzies.
Re: (Score:2)
Severe weather hit the area. They shut down Surry Power Station in Surry County, Virginia after a tornado took out the power that feeds the power station.
Of course we all know that the not-cloud would have been impervious to that.
Re: (Score:2)
But the scanner says their power level is Over 9000!
Re: (Score:2)
Re: (Score:2)
it was probably a distribution station, not a power generation facility.
Re: (Score:2)
Re: (Score:2)
after a tornado took out the power that feeds the power station.
Does not compute. Once it's running, why can't a power station use its own power?
Because you tend to want to have power available to cool nuclear fuel even if you decide to stop producing power for whatever reason (maintenance, mechanical failure, tornado, earthquake, tsunami, Nazi zombie attack).
Re: (Score:2)
In which case being unable to use a secondary source (self-generated power) would be a bad thing, no?
Re: (Score:2)
Re: (Score:2)
News reports are spotty, but I imagine that the plant tripped the turbines offline after the tornado damaged the power distribution equipment.
When it's generating 1GW of power and suddenly the load goes down to 0GW, the turbines have to trip offline automatically and immediately to prevent damage.
This may have also triggered a shutdown of the nuclear reactor, and it may take days or longer to bring it online after an emergency shutdown.
bean counters screw us again! (Score:2)
Re: (Score:2)
No, they're not (see Fukushima, Japan). Basically, you don't just flip a switch and have a power plant go dark; you have to follow a shutdown procedure that takes both time and power. I don't know the requirements for coal or natural gas plants, but US nuclear plants are required to have multiple backup power sources (IIRC at least two independent diesel generator systems as well as off-site power). If the plant loses one backup power source for more than a certain period, it is required to shut down. I
Re: (Score:2)
Only due to cost savings. The tsunami wall was half the height required (6m instead of 12m). Naturally, a 10m high tsunami hit. And no placement of the generators would've helped (they were in the basement, and that got flooded, but if they were outside, they could've gotten washed away).
Re: (Score:3)
The real problem everywhere...and I do see it everywhere...is that the people paid to be the people that 'know' simply don't know, or have no sense of creativity or foresight. I mean come on, they built a tsunami wall because they have a high probability of tsunamis, and then they go and put the most missio
Lucky (Score:2)
Re: (Score:2)
Re: (Score:2)
Really? What's the purpose of that? Some kind of half-assed based-on-human-psychology load balancing?
My servers are in us-east-1d as well, and they didn't go down, but maybe that's just dumb luck as my 1d is your 1b.
I can't do a really redundant setup, though, because I need a MS SQL instance and we don't have the budget for a second one to mirror to, so ... if the zone with our MS SQL instance goes down, our app is sunk regardless of how distributed the web servers are.
Re: (Score:2)
The dark side of outsourcing (Score:3)
Slashdot and Digg have one-day traffic surges because Reddit is down. I'm getting way too much done today not being distracted by the GoneWild girls. This productivity must cease at once!
Does go to show what can happen when your business depends on an outsource provider. Everyone has to depend on service providers to some extent, but sometimes it's a good exercise to see how many of your company eggs are in one basket. Redundancy is expensive, but so is losing business. Even Google has had Gmail interruptions, lost some customer data and experienced slowdowns.
Give me my Reddit back! (Score:2)
Emergency Plan (Score:5, Interesting)
I didn't even realize that one of our partners was using Amazon AWS until suddenly they were down all day. Amazon is really stable historically, but it's frustrating when you're out of business and all you can do is wait and see if Amazon will fix it soon.
In the "old school" thinking, smart companies have a redundant data center somewhere, humming along and waiting to be switched on if the main data center ever goes down. "The cloud" was supposed to solve that - massive redundancy within Amazon's services were supposed to protect you from outages. Not the case, apparently, since it looks like Amazon is going to fall below their promised 99.95% uptime (4.38 hours per year downtime).
I think the answer is to have redundant cloud services online, so you could switch from Amazon to Google or DevGrid if you had issues. The problem is, there's nothing quite like Amazon right now, it's not easy to switch from Amazon to some random service. This might be the biggest argument against virtual services - lack of standardization makes it hard to move from one to another, and hard to set up backup services in case of emergency.
Re: (Score:3, Insightful)
Re: (Score:3)
Actually, I'm more concerned about the *organization* as a single point of failure. If you rely on, say, Oracle (ugh), and Oracle goes bankrupt or a court orders them to stop selling their database or they simply decide to stop supporting some feature, you're still in business, and have a pretty good shot at moving to some similar database management system.
If you built a mission critical system on Amazon's cloud services, a single court order not aimed at you could put you out of business. If Amazon was
Re: (Score:3)
Just using Amazon West as well as Amazon East would have saved customers from this outage.
I think Amazon actually does great at covering all the technological single-points-of-failure.
The only reason I'd want a second cloud vendor is for the sales/account related single-point-of-failure of the Amazon account being frozen due to a sales miscommunication or an MPAA/RIAA takedown notice, etc.
Re: (Score:2)
"In the "old school" thinking, smart companies have a redundant data center somewhere, humming along and waiting to be switched on if the main data center ever goes down. "
The problem is that gets really really expensive and it's actually quite hard to do properly.
You can do this with EC2 though, just have your application cross various geographical zones. Things like ELB even make this somewhat easier. But you still have to solve all the application problems that exist when your data stores exist across la
Re: (Score:2)
Re:Emergency Plan (Score:4)
This is why banks or large businesses, for instance, have two or more data centers they always keep synchronized and have at least 50 kilometers between them. Thinking "well it's in one AWS data center so it's safe" is wrong, and this incident is a fine example of that.
Re: (Score:2, Informative)
50km is not a far enough distance. I witnessed this firsthand at the employer I worked for on the Gulf Coast during Katrina. That storm jacked up about 120 miles and took down our primary AND failover sites.
Re: (Score:2)
Amazon have complete isolation between Regions and good isolation between Availability Zones.
At work we'd recommend people use 2 cloud providers for their important services (which could be 2 Amazon regions or it could be Amazon and Rackspace) to prevent this sort of failure taking your business offline. You can't rely on any particular cloud provider to be reliable but it's a reasonably safe bet that a selection of cloud providers won't have significant overlapping downtime
It's also worth pointing out tha
Re: (Score:3)
It's also worth pointing out that all cloud SLAs are basically useless: if Amazon falls below their advertised uptime they'll refund you some of your charges - but they'll never refund more than what you've paid them: they don't compensate you for all the money you're losing (and the AWS charges are likely pocket change compared to this)
FYI, I don't think this outage even falls under EC2's SLA. The Region was still technically online. Only EBS was down.
Granted, many customers depend heavily on EBS, but the SLA doesn't cover an outage in just one specific EC2 feature. That being said, I wonder if AWS will honor SLA claims anyway, as a PR move. This outage is just so clearly Amazon's fault: a network hiccup causes EBS to overload in one Availability Zone, which cascades into all Availability Zones in the Region.
Personally, I think that they
Not so bad.. (Score:2)
Judgement Day (Score:2)
6 weeks before the AWS summit 2011 (Score:4, Interesting)
Gotta wonder what kind of flak Amazon is going to take for this one. I've had a couple of clients looking into cloud services, including moving to AWS, and have already had one of them call me and cancel a meeting about it. While I understand stuff happens, the entire sales pitch for AWS was redundancy and build as you grow. Redundancy has obviously not worked in this case. While I usually support cloud services, this is definitely going to be a hard example to counter when trying to sell it to potential customers.
Re: (Score:2)
"Redundancy has obviously not worked in this case"
Only 1 region is affected. If your app was set to work with multiple zones then it likely wouldn't be impacted by this outage.
The thing with EC2 is it gives you the tools to build complex clusters. It doesn't do it for you.
Re: (Score:3)
Only 1 region is affected. If your app was set to work with multiple zones then it likely wouldn't be impacted by this outage.
Not true. My application works just fine in multiple Availability Zones, yet it was knocked out yesterday due to an entire Region getting knocked offline.
And before you tell me that the application should have been multi-Region, I'm not buying it. AWS has always maintained that deploying an app across multiple AZs is HA. AZs are supposed to be considered as separate datacenters: separate power, separate uplink, etc. And yes, separate EBS infrastructure (you can't attach an EBS volume to an instance that was
Re:6 weeks before the AWS summit 2011 (Score:5, Informative)
It's not short-sighted at all. When someone else runs your gear, all you can do is sweat until they get things back online, and they can take their time under what's known as "commercially reasonable" SLAs. When you own your own gear, your own colo, etc., how much effort you use to get back up and running is up to you.
"The Cloud" for mission critical businesses is a joke.
Re: (Score:2)
For a small or medium size business, it could very well take massive amounts of effort and cost to keep your servers going full time. For many people it probably makes sense to outsource that function to dedicated engineers, rather than having to hire and manage your own.
Re: (Score:2)
Re: (Score:2)
Dude, I used to help run a Tier-1 CMS data facility for the LHC. I've done IT for the better part of 14 years. I know exactly what the fuck is going on here. Amazon sells people on the fact that you "put everything in the cloud" and you won't have any problems. Then problems occur and it's all *shrugs, shit happens*.
Fark. off.
Re: (Score:2)
I understand that but it still makes it a hard sell in the short-term until this all blows over.
Soo... (Score:2)
Is anybody else suffering from Reddit withdrawal?
Inappropriate metaphor - the cloud (Score:2)
Re: (Score:2)
TMZ is one site. (Co worker's main news site) (Score:2)
Their error page is rejected by Firefox. So I wgetted it to see why.
At the bottom is a script from RUSSIA (in my best Max Headroom voice) (the src is addonrock.ru/Templatel.js)
So perhaps AWS is hacked?
to the cloud! (Score:2)
Coincidence, PSN? (Score:2)
Does Sony's PSN sublet capacity on Amazon's cloud? PSN is down for "a day or two" according to stuff on Google.
Re:Reddit is down because of this (Score:5, Funny)
Re: (Score:2)
You're posting on Slashdot, so I believe you already found the answer.
Yeah but maybe he's hungry for news.
Re: (Score:2)
It's not the same! I want atheists being smug about not believing in god (and refusing to capitalise), and liberal lefties telling each other that the government needs to be more liberal while Libertarians accuse them of worshipping Obama, photoshopped pictures that have been debunked dozens of times before, people claiming to be things that they aren't and answering questions, and hero worshipping of Ron Paul!
You just don't get enough of that here.
You said it ... we can't post pictures here (for which those of us here in the Goatse spam days were quite thankful).
Re: (Score:2)
Productivity in Offices will reach record levels today.
Re: (Score:2)
They took Digg down last year and replaced it with this horrible monstrosity they called 'v4' or something. It's a shame they just took such a popular site offline and haven't provided a decent replacement.
Re:Reddit is down because of this (Score:5, Informative)
Re: (Score:2)
That even happens if you just click on posts. Not to mention that the comment scores are sometimes hidden randomly and you have to do all the clicking till you see them.
Re: (Score:2)
There's an easy fix for that: block javascript and turn on classic discussion mode. Not only will /. actually work, it'll feel 10 times faster!
Re: (Score:2)
People still go to digg? Oh, I see what you did there.
I actually went to Digg this morning since Reddit is down. I haven't been back in months, since I removed them from my RSS reader. All I have to say is "ouch". Front page stories with a whopping 5 comments? It's pretty sad.
Re: (Score:2)
This is the first time I've been back here in a while. I decided to try it when I realized reddit's downtime is probably going to be a while. I still feel a reverence for this place. It sort of reminds me of going back and visiting my university.
Digg can rot in hell.
Re: (Score:2)
Digg never had much comment activity when compared to similar sites (Slashdot, Reddit, etc.). Which is a shame, because the comments are usually more entertaining than the actual links.
Re: (Score:2)
You'd get an upvote, but I haven't seen mod points in a long time...
Re: (Score:2)
In about the time it took you to write that message I spun up a standby deployment in another data center, smart guy.
Re: (Score:2)
How long does it take you to have the IP addresses rerouted?
Re: (Score:2)
EIPs move in seconds, but in my use case I do not need EIPs since a front end is handing off the requests to the cloud systems.
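For anyone wondering what that seconds-long EIP move actually is, it's a single API call -- a sketch using the modern boto3 library for illustration, with placeholder IDs:

```python
# Sketch: re-point an Elastic IP at a healthy standby instance after the
# primary dies. The allocation and instance IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.associate_address(
    AllocationId="eipalloc-0example",   # the Elastic IP's allocation ID
    InstanceId="i-0standbyexample",     # the healthy standby instance
    AllowReassociation=True,            # take the EIP away from the dead node
)
```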
Re: (Score:2)
Really? Wow. Perhaps you should let major sites like Reddit know. They've been down for *hours*.
The cloud works if you don't care about having control over when your business is down.
Re: (Score:2)
Really? Wow. Perhaps you should let major sites like Reddit know. They've been down for *hours*.
The cloud works if you don't care about having control over when your business is down.
Last time I had a physical DC go down it was a cooling failure. Didn't have much control over that either.
Moreover, with a cloud vendor I can have servers in multiple sites with different power, connectivity, and geographic location without massive investment in each.
Re: (Score:2)
Amazon has "availability zones" for a reason, as do other cloud vendors.
If your infrastructure isn't resilient against everything in a zone suddenly disappearing, you're Doing It Wrong.
Re: (Score:2)
I understand the need for physical availability zones, but the whole idea behind the cloud is that you, the end user, need not care about those details. It is up to the cloud provider to figure it out. The cloud represents a black box, of sorts. If they are having trouble in one zone, everything should automatically migrate to another without anyone outside of the operation knowing it.
I'm not saying Amazon's solution is bad, but I'm not sure it is in the spirit of what I would consider real cloud hosting. R
Re: (Score:2)
Scalability: yes.
Cheap: yes.
Reliability: they don't say they are 100% fail-safe. I think the figure is still in the 90s, though, which is pretty good.
If anyone tries to sell you 100% they are liars.