Cloud

Major Outage At Amazon Web Services (247 comments)

Posted by CmdrTaco
from the but-the-cloud-fixes-everything dept.
ralphart writes "The Northern Virginia datacenter for Amazon Web Services appears to be having a major outage that affects EC2 services. The Amazon Forums are full of reports of problems. Latest update from the status page: 2:49 AM PDT We are continuing to see connectivity errors impacting EC2 instances, increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region, and increased error rates affecting EBS CreateVolume API calls. We are also experiencing delayed launches for EBS backed EC2 instances in affected availability zones in the US-EAST-1 region. We continue to work towards resolution."
This discussion has been archived. No new comments can be posted.

  • No Way! (Score:5, Funny)

    by Frosty Piss (770223) * on Thursday April 21, 2011 @11:48AM (#35894908)
    But how can this be possible? It's The Cloud . This sort of this simply doesn't happen.
    • by alphatel (1450715) *
      It didn't happen. The cloud can erase history in a planck!
    • by pdbaby (609052)
      Jokes aside, if people use The Cloud (I'm using this tongue in cheek...) rather than a cloud this thing doesn't happen.
      We use a number of providers which means that even if Amazon fell over completely our systems would be fine -- it looks like a lot of sites (reddit, for instance) don't bother to do this.
    • by ron_ivi (607351)

      But how can this be possible? It's The Cloud . This sort of this simply doesn't happen.

      To be fair to Amazon - on a good cloud (incl. Amazon's) you can launch instances in completely different data centers, so your most critical services have somewhere to fail over to.

      Though, personally I'd feel even better if my nodes were distributed across two different clouds; to avoid the single-point-of-failure of the Amazon account itself. For example, despite running in both their East and West data centers, I'm still vulnerable to a sales/billing miscommunication that freezes my whole account.

      • by Blakey Rat (99501)

        Each data center also has independent zones.

        It looks like in this case, only one zone in one data center was affected-- that's bad, but that's not "end-of-the-world" bad. If sites are going down, they should have been more careful to distribute redundant servers in different zones.

        (Where this is a problem is if you're a small shop with a single DB server, and the zone holding your DB server goes down-- in that case you're kind of SOL.)

        • by watanabe (27967)

          As an example, we run our production servers on EC2 East; they have load balancers failing them over between zones. The database and webservers are fine, and have been fine today.

          The dev servers do not have load balancers running on them, and they have been choking in a miserable hell all morning.

        • by ron_ivi (607351)

          (Where this is a problem is if you're a small shop with a single DB server, and the zone holding your DB server goes down-- in that case you're kind of SOL.)

          IMHO the main beauty of a cloud is that you're NOT SOL.

          For one of the sites I manage, I am a small shop.

          The beauty of a cloud is that with Amazon's $0.02/hr micro instances, and $0.007 spot-priced micro instances I can *still* do things right (failover to remote data center, backups in different data center), even for clients that can only afford under $50/month in hosting.

    • by dkleinsc (563838)

      This sort of this simply doesn't happen.

      Now we know: All it takes is one admin screwing up and replacing an "ng" with an "s".

    • But how can this be possible? It's The Cloud . This sort of this simply doesn't happen.

      Yay, cloud!

  • by stopacop (2042526) on Thursday April 21, 2011 @11:50AM (#35894944) Homepage
    Severe weather hit the area. They shut down Surry Power Station in Surry County, Virginia after a tornado took the power out that powers the power station.
    • by getagrip (86081) on Thursday April 21, 2011 @11:59AM (#35895128) Homepage
      I am in Northern Virginia. There is no power outage or severe weather here.
      • by Wornstrom (920197)

        it's true: http://www.examiner.com/progressive-in-richmond/surry-power-station-under-repair-the-aftermath-of-tornado [examiner.com]

        Tornado was Saturday. I live on the other side of the James River from Surry.

        • by Drathos (1092)

          Yeah, that may be true, but it has nothing to do with anything going on in Northern Virginia. Surry is in Southeastern Virginia, over 150 miles away.

      • Well, that just about sums up the attitude of Northern Virginia towards the rest of the state.

        • Well, that just about sums up the attitude of Northern Virginia towards the rest of the state.

          There's a "rest of the state?" :)

          (Also in NoVA, no outages or severe weather here)

    • by pdbaby (609052)
      Amazon's Availability Zones are designed to have separate power, cooling and network so I don't think this is the issue. It was (is) a problem with their disk subsystem in multiple availability zones so I suspect they were in the process of pushing out some new storage controller code and some bug didn't appear until the later stages of their rollout. From their status log it looks like they're manually correcting the issue with each disk.
    • Amazon's comments on the outage do not mention weather as a cause: http://status.aws.amazon.com/ [amazon.com]

      "8:54 AM PDT We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, whi

    • by alphatel (1450715) *
      So they can't fail over like a normal ESX instance? So my cloud computer is actually just a rack in Virginia?
      • Your cloud computer is a Xen instance in Virginia, and your "EBS block storage" is an iSCSI target. Magic it ain't.

        • by alphatel (1450715) *
          Essentially half-cloudassed clouding.
          • Not really half-assed from an implementation perspective, but from a marketing perspective. Amazon likes people to think it's magic, which would be fine if it worked flawlessly all the time. But it doesn't, because it's just a technical solution for a specific problem. Unless you run instances in multiple zones, use redundant EBS volumes, and your entire app is built to handle global redundancy, it's not just going to be 100% uptime out of the box. I fault Amazon for lying to technical-enough people.

          • by Synn (6288)

            "Essentially half-cloudassed clouding."

            EC2 is just tools. It's as cloudassed as you make of it.

            I can take ESX and use a NetApp for data storage, and if my NetApp cluster takes a dive, I can't fail over to anything since my data is down.

            On the other hand I can take EC2 and run apps and clustered DBs across the east and west coast and put ELB in front of it. If the east coast takes a nuke, everything will keep on running.

        • by tunapez (1161697)

          Your cloud computer is a Xen instance in Virginia, and your "EBS block storage" is an iSCSI target. Magic it ain't.

          There is no room for accurate or useful specifications in the flamboyant misrepresentation of marketing. Please enjoy the cuddly puppies and warm fuzzies.

    • Severe weather hit the area. They shut down Surry Power Station in Surry County, Virginia after a tornado took the power out that powers the power station.

      Of course we all know that the not-cloud would have been impervious to that.

    • But the scanner says their power level is Over 9000!

    • by Jawnn (445279)
      Wow, then it's understandable. Good thing they weren't running a nuclear power plant or something.
  • My instance is on us-east-1d which is still up.
    • by pdbaby (609052)
      Their API gives different names for the availability zones for each user (so your us-east-1d could be my us-east-1a), which complicates talking about issues (since all you can say is "two availability zones are experiencing problems"), especially when your system uses multiple accounts.
      • by Blakey Rat (99501)

        Really? What's the purpose of that? Some kind of half-assed based-on-human-psychology load balancing?

        My servers are in us-east-1d as well, and they didn't go down, but maybe that's just dumb luck as my 1d is your 1b.

        I can't do a really redundant setup, though, because I need a MS SQL instance and we don't have the budget for a second one to mirror to, so ... if the zone with our MS SQL instance goes down, our app is sunk regardless of how distributed the web servers are.

        • by pdbaby (609052)
          Yeah, I think that's what they're trying to do. I suppose it makes sense in a way, they want to make sure load is evenly distributed across their availability zones. But it seems to me they could have prevented that through better API design (e.g. users expressing a constraint that 2 resources should be in the same zone where that's meaningful but otherwise not permitting the selection of a specific zone).
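The per-account zone naming described in this sub-thread can be illustrated with a small sketch. This is not Amazon's actual mechanism, which the thread does not describe; it only assumes that each account sees a stable, account-specific shuffle of the same physical zones. The physical zone labels, the account IDs, and the function names `zone_map` and `same_physical_zone` are all hypothetical.

```python
import hashlib
import random

# Hypothetical physical zones; the real identifiers are not public in this thread.
PHYSICAL_ZONES = ["zone-1", "zone-2", "zone-3", "zone-4"]
# The logical names every account sees.
LOGICAL_NAMES = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]


def zone_map(account_id: str) -> dict:
    """Return a stable logical-name -> physical-zone mapping for one account.

    Assumption: the mapping is a shuffle seeded by the account ID, so it is
    the same on every call but generally differs between accounts.
    """
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    shuffled = PHYSICAL_ZONES[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(LOGICAL_NAMES, shuffled))


def same_physical_zone(account_a: str, name_a: str,
                       account_b: str, name_b: str) -> bool:
    """Check whether two accounts' logical names point at the same zone."""
    return zone_map(account_a)[name_a] == zone_map(account_b)[name_b]


if __name__ == "__main__":
    # Two made-up accounts comparing their "us-east-1d".
    for account in ("account-alice", "account-bob"):
        print(account, zone_map(account))
    print(same_physical_zone("account-alice", "us-east-1d",
                             "account-bob", "us-east-1d"))
```

Under this assumption, two users saying "my us-east-1d is fine" tells you nothing about whether they share a zone, which is the communication problem raised above.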
  • by HangingChad (677530) on Thursday April 21, 2011 @12:03PM (#35895224) Homepage

    Slashdot and Digg have one day traffic surges because Reddit is down. I'm getting way too much done today not being distracted by the GoneWild girls. This productivity must cease at once!

    Does go to show what can happen when your business depends on an outsource provider. Everyone has to depend on service providers to some extent, but sometimes it's a good exercise to see how many of your company eggs are in one basket. Redundancy is expensive, but so is losing business. Even Google has had Gmail interruptions, lost some customer data and experienced slowdowns.

  • Else I don't know what to do? I almost went to Digg! so please amazon guys, work on your stuff!
  • Emergency Plan (Score:5, Interesting)

    by sycorob (180615) on Thursday April 21, 2011 @12:11PM (#35895392)

    I didn't even realize that one of our partners was using Amazon Web Services until suddenly they were down all day. Amazon is really stable historically, but it's frustrating when you're out of business and all you can do is wait and see if Amazon will fix it soon.

    In the "old school" thinking, smart companies have a redundant data center somewhere, humming along and waiting to be switched on if the main data center ever goes down. "The cloud" was supposed to solve that - massive redundancy within Amazon's services was supposed to protect you from outages. Not the case, apparently, since it looks like Amazon is going to fall below their promised 99.95% uptime (4.38 hours per year downtime).

    I think the answer is to have redundant cloud services online, so you could switch from Amazon to Google or DevGrid if you had issues. The problem is, there's nothing quite like Amazon right now, it's not easy to switch from Amazon to some random service. This might be the biggest argument against virtual services - lack of standardization makes it hard to move from one to another, and hard to set up backup services in case of emergency.

    • Re: (Score:3, Insightful)

      by MariusBoo (883340)
      Actually in the case of EC2 the smart thing would have been to have your instances spread over different availability zones...
      • by hey! (33014)

        Actually, I'm more concerned about the *organization* as a single point of failure. If you rely on, say, Oracle (ugh), and Oracle goes bankrupt or a court orders them to stop selling their database or they simply decide to stop supporting some feature, you're still in business, and have a pretty good shot at moving to some similar database management system.

        If you built a mission critical system on Amazon's cloud services, a single court order not aimed at you could put you out of business. If Amazon was

    • by ron_ivi (607351)

      Just using Amazon West as well as Amazon East would have saved customers from this outage.

      I think Amazon actually does great at covering all the technological single-points-of-failure.

      The only reason I'd want a second cloud vendor is for the sales/account related single-point-of-failure of the Amazon Account being frozen due to a sales miscommunication or an MPAA/RIAA takedown notice, etc.

    • by Synn (6288)

      "In the "old school" thinking, smart companies have a redundant data center somewhere, humming along and waiting to be switched on if the main data center ever goes down. "

      The problem is that gets really really expensive and it's actually quite hard to do properly.

      You can do this with EC2 though, just have your application cross various geographical zones. Things like ELB even make this somewhat easier. But you still have to solve all the application problems that exist when your data stores exist across la

      • by mikeytag (1835928)
        Nail on the head here. We were affected today and while I have full offsite backups of everything we don't have a second datacenter to switch on because of cost and complexity. It's not too difficult to have webservers span different parts of the globe, but DB servers like MySQL are a whole different story and usually very crucial.
    • by Alarash (746254) on Thursday April 21, 2011 @12:31PM (#35895772)
      Even by using only AWS you can set up redundancy across multiple North American regions. Even across continents, with one data center in Ireland and one in Singapore. But obviously it costs extra as they bill you for the bandwidth between the regions. That's how you use The Cloud (c) (tm) (R). Using a single data center to set up redundancy is dumb because it's not redundancy. You need high availability for your VMs, but also for your data center.

      This is why banks or large businesses, for instance, have two or more data centers they always keep synchronized and have at least 50 kilometers between them. Thinking "well it's in one AWS data center so it's safe" is wrong, and this incident is a fine example of that.
      • Re: (Score:2, Informative)

        by Anonymous Coward

        50km is not a far enough distance. I witnessed this first hand for the employer I worked for on the Gulf Coast during Katrina. That storm jacked up about 120 miles, took down our primary AND failover sites.
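The argument for spreading across zones, regions or data centers can be put in numbers with a rough sketch. It assumes that each site fails independently of the others, which is exactly the assumption the Katrina story above (and this outage) shows can break down, so treat the result as an optimistic upper bound. The function name `combined_availability` and the sample figures are illustrative only.

```python
def combined_availability(per_site: float, sites: int) -> float:
    """Probability that at least one of `sites` sites is up.

    Assumption: every site has the same availability `per_site` (0..1)
    and sites fail independently. Correlated failures, such as one storm
    or one bad rollout hitting several sites, make the real number lower.
    """
    if sites < 1:
        raise ValueError("need at least one site")
    if not 0.0 <= per_site <= 1.0:
        raise ValueError("per_site must be between 0 and 1")
    return 1 - (1 - per_site) ** sites


if __name__ == "__main__":
    # Illustrative only: one, two and three sites at 99.95% each.
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.9995, n):.10f}")
```

The formula is just "one minus the chance that every site is down at once"; the interesting engineering question raised in the thread is how far apart the sites must be before the independence assumption is even roughly true.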

    • by pdbaby (609052)

      Amazon have complete isolation between Regions and good isolation between Availability Zones.
      At work we'd recommend people use 2 cloud providers for their important services (which could be 2 Amazon regions or it could be Amazon and Rackspace) to prevent this sort of failure taking your business offline. You can't rely on any particular cloud provider to be reliable but it's a reasonably safe bet that a selection of cloud providers won't have significant overlapping downtime

      It's also worth pointing out tha

      • It's also worth pointing out that all cloud SLAs are basically useless: if Amazon falls below their advertised uptime they'll refund you some of your charges - but they'll never refund more than what you've paid them: they don't compensate you for all the money you're losing (and the AWS charges are likely pocket change compared to this)

        FYI, I don't think this outage even falls under EC2's SLA. The Region was still technically on line. Only EBS was down.

        Granted, many customers depend heavily on EBS, but the SLA doesn't cover an outage in just one specific EC2 feature. That being said, I wonder if AWS will honor SLA claims anyway, as a PR move. This outage is just so clearly Amazon's fault: a network hiccup causes EBS to overload in one Availability Zone, which cascades into all Availability Zones in the Region.

        Personally, I think that they
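The 99.95% uptime figure quoted earlier in this thread, and the "4.38 hours per year" that goes with it, can be checked with a few lines of arithmetic. This sketch assumes a 365.25-day year and says nothing about how Amazon's SLA actually measures downtime; the function names are made up for illustration.

```python
# Average year length including leap years: 365.25 days = 8766 hours.
HOURS_PER_YEAR = 365.25 * 24


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Hours per year a service can be down and still meet the target."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


def achieved_uptime_percent(downtime_hours: float) -> float:
    """Uptime percentage over one year given total hours of downtime."""
    return (1 - downtime_hours / HOURS_PER_YEAR) * 100


if __name__ == "__main__":
    print(f"99.95% allows {allowed_downtime_hours(99.95):.2f} hours/year")
    # A hypothetical full day of downtime, for comparison.
    print(f"24 hours down gives {achieved_uptime_percent(24):.3f}% uptime")
```

A single full-day outage already uses up several years' worth of a 99.95% budget, which is why commenters above expect the target to be missed.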

  • I was wondering why it took longer to start up my hadoop cluster this morning on EC2, but it still beats the living hell out of buying and configuring large numbers of machines for short term testing.
  • Hmmmm... today *is* Judgement Day... perhaps Skynet's first target is AWS's East-Coast data center. Coincidence? I think not.
  • by grapeape (137008) <mpope7NO@SPAMkc.rr.com> on Thursday April 21, 2011 @12:26PM (#35895676) Homepage

    Gotta wonder what kind of flak Amazon is going to take for this one. I've had a couple clients looking into cloud services including moving to AWS and have already had one of them call me and cancel a meeting about it. While I understand stuff happens, the entire sales pitch for AWS was redundancy and build as you grow. Redundancy has obviously not worked in this case. While I usually support cloud services, this is definitely going to be a hard example to counter when trying to sell it to potential customers.

    • by Synn (6288)

      "Redundancy has obviously not worked in this case"

      Only 1 region is affected. If your app was set to work with multiple zones then it likely wouldn't be impacted by this outage.

      The thing with EC2 is it gives you the tools to build complex clusters. It doesn't do it for you.

      • Only 1 region is affected. If your app was set to work with multiple zones then it likely wouldn't be impacted by this outage.

        Not true. My application works just fine in multiple Availability Zones, yet it was knocked out yesterday due to an entire Region getting knocked offline.

        And before you tell me that the application should have been multi-Region, I'm not buying it. AWS has always maintained that deploying an app across multiple AZs is HA. AZs are supposed to be considered as separate datacenters: separate power, separate uplink, etc. And yes, separate EBS infrastructure (you can't attach an EBS volume to an instance that was

  • Is anybody else suffering from Reddit withdrawal?

  • It means inclement weather; it rains; it pours; it delays air traffic; it's gloomy. You can look up at it and see whatever you can imagine, but it is not real. It goes away when you most need it. It is all wet.
    • My guess is "cloud" is used because networking diagrams have historically used a cloud icon for the internet to mean it was nebulous and alien to the network.
  • Their error page is rejected by firefox. So I wgetted it to see why.
    At the bottom is a script from RUSSIA (in my best Max Headroom voice) (the src is addonrock.ru/Templatel.js)
    So perhaps AWS is hacked?

  • just like the ad...
  • Does Sony's PSN sublet capacity on Amazon's cloud? PSN is down for "a day or two" according to stuff on Google.
