How a Microsoft Cloud Outage Hit Millions of Users Around the World (reuters.com) 50

An anonymous reader shares Reuters' report from earlier this week: Microsoft Corp said on Wednesday it had recovered all of its cloud services after a networking outage took down its cloud platform Azure along with services such as Teams and Outlook used by millions around the globe. Azure's status page showed services were impacted in the Americas, Europe, Asia Pacific, the Middle East, and Africa. Only services in China and its platform for governments were not hit. By late morning Azure said most customers should have seen services resume after a full recovery of the Microsoft Wide Area Network (WAN).

An outage of Azure, which has 15 million corporate customers and over 500 million active users, according to Microsoft data, can impact multiple services and create a domino effect as almost all of the world's largest companies use the platform.... Microsoft did not disclose the number of users affected by the disruption, but data from outage tracking website Downdetector showed thousands of incidents across continents.... Azure's share of the cloud computing market rose to 30% in 2022, trailing Amazon's AWS, according to estimates from BofA Global Research.... During the outage, users faced problems in exchanging messages, joining calls or using any features of Teams application. Many users took to Twitter to share updates about the service disruption, with #MicrosoftTeams trending as a hashtag on the social media site.... Among the other services affected were Microsoft Exchange Online, SharePoint Online, OneDrive for Business, according to the company's status page.

"I think there is a very big debate to be had on resiliency in the comms and cloud space and the critical applications," Symphony Chief Executive Brad Levy said.

From Microsoft's [preliminary] post-incident review: We determined that a change made to the Microsoft Wide Area Network (WAN) impacted connectivity between clients on the internet to Azure, connectivity across regions, as well as cross-premises connectivity via ExpressRoute.

As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this re-computation process, the routers were unable to correctly forward packets traversing them. The command that caused the issue has different behaviors on different network devices, and the command had not been vetted using our full qualification process on the router on which it was executed....

Due to the WAN impact, our automated systems for maintaining the health of the WAN were paused, including the systems for identifying and removing unhealthy devices, and the traffic engineering system for optimizing the flow of data across the network. Due to the pause in these systems, some paths in the network experienced increased packet loss from 09:35 UTC until those systems were manually restarted, restoring the WAN to optimal operating conditions. This recovery was completed at 12:43 UTC.
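
The failure mode Microsoft describes, routers unable to forward traffic while they recompute adjacency and forwarding tables, is a classic convergence problem. As a rough illustration only (a toy four-router ring, not Microsoft's WAN or its actual routing software), the sketch below models the window in which traffic is blackholed until every router has rebuilt its table:

```python
# Toy sketch of routing (re)convergence, for illustration only; the topology,
# names, and behavior are invented and are not Microsoft's WAN software.
from collections import deque

def build_forwarding(topology, source):
    """BFS over the topology; returns {destination: next_hop} for `source`."""
    table, visited = {}, {source}
    queue = deque((nbr, nbr) for nbr in topology[source])
    while queue:
        node, first_hop = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        table[node] = first_hop
        queue.extend((n, first_hop) for n in topology[node])
    return table

# A tiny WAN: four routers in a ring.
topology = {"r1": ["r2", "r4"], "r2": ["r3", "r1"],
            "r3": ["r4", "r2"], "r4": ["r1", "r3"]}
forwarding = {r: build_forwarding(topology, r) for r in topology}
recomputing = set()  # routers currently rebuilding their tables

def forward(dst, router):
    """A router that is mid-recomputation drops traffic instead of forwarding."""
    if router in recomputing:
        return "dropped"
    return forwarding[router].get(dst, "dropped")

# The bad command: every router is told to recompute at once.
recomputing = set(topology)
print(forward("r3", "r1"))   # "dropped" -- the outage window

# Convergence: each router rebuilds its table and resumes forwarding.
for r in topology:
    forwarding[r] = build_forwarding(topology, r)
    recomputing.discard(r)
print(forward("r3", "r1"))   # "r2" -- traffic flows again
```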

Thanks to Slashdot reader bobthesungeek76036 for submitting the story.

Comments Filter:
  • Cloud 9 (Score:2, Funny)

    by devslash0 ( 4203435 )

    Cheaper and more reliable as advertised.

    • Cheaper and more reliable as advertised.

      Everyone is "five nines or better". Until that day comes along that they aren't. Shit Happens isn't an immunity clause. It's a reality.

      And "cheaper" comes down to how much a minute or hour of downtime actually costs your business. Those that had their entire organization affected were worried about losing customers, not pennies.

  • Smells like the fucking new guy/gal, trained and certified, no experience. I'm sure we've all been that person before, but damn, Microsoft, you need experience in that spot; you can afford to get this right. Did you really set a noob loose on the command line to mess with BGP?

  • by Ritz_Just_Ritz ( 883997 ) on Saturday January 28, 2023 @12:01PM (#63246883)

    Service disruptions within Azure have been an ongoing problem. You'd think they would be getting better at this over time. A lot of companies have bought into the whole O365 thing and when it coughs up a hairball it impacts millions of people.

    • The best part is, when you open a support ticket for a service issue (a server returning 500), they act like it's a user issue. You report the 500 error, including the error info (server name, IP, user info, timestamp). Then it's "please run Steps Recorder and reproduce the issue there." Now please install this Fiddler man-in-the-middle attack on your computer, compromise your TLS connections, and send us a trace of the issue. The support folks are not allowed to look at server logs and cannot ask for help until elim

    • And you know what? Life went on. We were impacted this week, but if your entire company can't survive a few hours without Outlook and some Teams meetings then frankly your company is too weak and should suffer the commercial equivalent of Darwin.

      Azure does have its moments; all I can say, though, is that those moments are fewer than what we had when we were managing our own infrastructure, and even if we did we don't have the capabilities to roll out the kind of software and services we now use as employees our

  • We use Teams for work, and I've not had access to the 'application' side of the channels for about a week now. I wonder if this has had anything to do with it, as only a select few other members of the company haven't had access either. Is anyone aware of how Teams backend works? Would some users be pulling information from a different server that is mirrored?
  • by klubar ( 591384 ) on Saturday January 28, 2023 @12:09PM (#63246895) Homepage

    Have we reached the state where everything is centralized and infrastructure is too complex to manage?

    I'm sure things like a bad update happen all the time, but they used to only impact a single company or group. Now with highly centralized infrastructure, although screw-ups are much less common, they have bigger impact.

    I'm guessing that overall infrastructure has improved as Microsoft, AWS, etc. are much better at managing their infrastructure... it's just that mistakes have a more noticeable impact.

    Also, have we reached the stage that the network/computing/storage/backup/security is just too complex? Maybe we were better off in a distributed model where it was easier to understand.

    • Probably. Cloud providers don't come close to the uptime you can get with a bit of planning and redundancy.

      There are probably lots of reasons for it, but the added complexity of additional failure scenarios is probably a major contributing factor. No longer do you have twice as many servers as you need, each able to handle the expected load and largely isolated from other machines. Now, your storage is another server, and your assets are on other servers, and your application load balancer

    • Re: (Score:3, Insightful)

      Have we reached the state where everything is centralized and infrastructure is too complex to manage?

      No, this is a symptom of companies no longer allowing customers to buy software. That was a one-shot payment and you owned it forever. The suits have wised up and realized that renting software is the way to go. You no longer own a copy of Office. You pay a perpetual monthly fee for the privilege of use. Same with video games, music, and movies. Because we all know the cloud is so big it can never completely fail. Right.

      • You can still buy a permanent copy of Office, a video game on a disc, and a CD (or a digital song, permanently).

        I do agree that companies would prefer you subscribe so they can milk you permanently, but it's up to the customer to decide, as they still have a choice.

        • by fullgandoo ( 1188759 ) on Saturday January 28, 2023 @01:14PM (#63247017)
          As far as games are concerned, I don't know of any AAA games that you can play offline, even if you're playing against the computer. Also, there are multiple updates throughout the year, with some having a size of tens of GB. So the original DVD that you bought is pretty much useless.
          • As far as games are concerned, I don't know of any AAA games that you can play offline, even if you're playing against the computer. Also, there are multiple updates throughout the year, with some having a size of tens of GB. So the original DVD that you bought is pretty much useless.

            You're conflating two issues.
            a) There are plenty of offline AAA games. You seem to focus only on the multiplayer ones. Elden Ring worked offline just fine, as do Control, Metro Exodus, and Hitman (though you don't get to claim achievements in an online account for obvious reasons, but you can progress the story mode). But the fact of the matter is quite a few AAA titles are primarily multiplayer, making the point moot.

            b) The DVD point is irrelevant. Updates are only applied when you're online. If you're offlin

      • by Bongo ( 13261 )

        Squeeze as a Service

        Although maybe IT departments are also to blame. To up their own importance they created lots of red tape to ensure everything IT-related had to go via them, then they got bored and lazy and sent it all to the cloud so they could promote themselves as service managers. Then the centralised thing that's so big you no longer have any influence on it goes down, and everyone is like, what's the point of you there in the IT dept?

      • We need to reverse this process. It starts with writing software for the desktop again, not just webapps. Then we need to be able to buy our music, movies, books, etc., but from places that understand those items.
    • Have we reached the state where everything is centralized and infrastructure is too complex to manage?

      In some ways yes and in some no, as usual.

      Without knowing a lot about the internals of Azure's management it's difficult to know what actually happened, why, and what can/could/should have been done to prevent it. It might have been a symptom of overcentralization. The appeal of the cloud is supposed to be that it's fault tolerant because it's distributed. This makes it sound like Microsoft was able to push one change and wreck multiple cloud hosting sites, which arguably shouldn't even be possible by desig

      • WAN errors can happen; even a small ISP with a bad BGP setting can lead to bigger issues outside of its own network.

        • Yes, but there are technologies which exist to solve that problem, and they're just mostly not being used.
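
          One of the technologies alluded to here is RPKI route origin validation: resource holders publish ROAs stating which AS may originate a prefix, and routers that validate announcements against them can drop a misoriginated route instead of propagating it (it doesn't stop every kind of leak, but it stops a common one). A rough sketch of the validation logic (invented prefixes and AS numbers; an illustration, not a production validator) might look like this:

          ```python
          # Toy sketch of RPKI-style route origin validation; the ROAs below
          # are invented for illustration and this is not a real validator.
          from ipaddress import ip_network

          # ROAs: (authorized prefix, maximum length, origin AS) published by
          # the holders of the address space.
          ROAS = [
              (ip_network("203.0.113.0/24"), 24, 64500),
              (ip_network("198.51.100.0/22"), 24, 64501),
          ]

          def validate(prefix_str, origin_as):
              """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
              prefix = ip_network(prefix_str)
              covered = False
              for roa_prefix, max_len, roa_as in ROAS:
                  if prefix.subnet_of(roa_prefix):
                      covered = True
                      if origin_as == roa_as and prefix.prefixlen <= max_len:
                          return "valid"
              return "invalid" if covered else "not-found"

          print(validate("203.0.113.0/24", 64500))  # valid: matches the ROA
          print(validate("203.0.113.0/24", 64999))  # invalid: drop, don't propagate
          print(validate("192.0.2.0/24", 64500))    # not-found: no covering ROA
          ```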

      • by Bongo ( 13261 )

        A network is not distributed, it's one thing.
        And it has surprises.

        • A network is not distributed, it's one thing.

          But with the internet[work] involved, it is both things, because it is a network of networks by definition.

      • This is a distributed model.

        The cloud is absolutely NOT distributed, despite what the sales departments tell you. The cloud is a single, centralized point of massive failure, as is demonstrated by this very event.

        When klubar asked about returning to a distributed model, I think he was talking about companies having the good sense to manage their own infrastructure. In a distributed model, nobody cares if Microsoft is Microsofting. Microsoft can mismanage their networks to their hearts' content, and no one else is going to notice.

        • The cloud is absolutely NOT distributed, despite what the sales departments tell you. The cloud is a single, centralized point of massive failure, as is demonstrated by this very event.

          Yeah, but it shouldn't be, and that's not how it's sold. Corporations are definitely using the cloud because they think it reduces the number of single points of failure.

          When klubar asked about returning to a distributed model, I think he was talking about companies having the good sense to manage their own infrastructure.

          Most companies didn't have distributed infrastructure, though; they had all their eggs in one or more baskets they couldn't afford to lose. Even with relatively cloudy infrastructure (e.g. using a job queuing system like DQS) they would typically have single points of failure like the qmaster, and all their colocated resources would be located i

    • It's cheaper on a per-quarter basis than buying once.
      And it is better for revenue streams to actually have a stream.

      > Have we reached the state where everything is centralized and infrastructure is too complex to manage?

      Yes, we have: layers upon layers of fake and leaky abstractions, all depending on a bunch of OSS software that is "for free" and falling into dis-maintenance. Obligatory: https://xkcd.com/2347/ [xkcd.com]
      Same as with physical infrastructure, building new and shiny is always more fun than maintaining e

    • by eth1 ( 94901 )

      I think we've reached a point where the software running the infrastructure is too complex.

      When I started in IT in the late 90s, major outages were ~10% hardware/software bugs, and 90% human error. Today, it's the other way around, and the constant pushing of patches to fix things just makes it more likely you'll get hit with some kind of regression issue.

      NONE of the major vendors are reliable anymore. Cisco, EMC, Palo Alto Networks, CheckPoint - we always have multiple support tickets working that are wait

    • Now with highly centralized infrastructure, although screw-ups are much less common, they have bigger impact.

      The impact may be bigger in one go, but is the impact bigger overall? E.g. when Chevron and Aker are both unable to work for 3 hours, is this a bigger calamity than Chevron being unable to work for 3 hours on Monday and Aker being unable to do it on Tuesday? That's where we were. We hear about cloud outages now because their combined impact makes them big news. But just because we didn't hear about hundreds of smaller outages in the past doesn't mean the overall impact is now bigger.

    • by gweihir ( 88907 )

      Nope. Or at least not in this instance. BGP-storms are well-known to happen. It is just MS that is, as usual, clueless and does not have adequate testing and is ignoring history and best practices.

    • by King_TJ ( 85913 )

      Wow! I was going to post this exact same question! I think yes it is!

      I know even at the relatively small business I work at, we've got a pretty complex WAN setup with multiple VLANs for various services/devices to use and a number of firewall rules imposed for VPN traffic on top of that. They've been going through a two-year-long process now of migrating sites over from Windstream too, so we've had two sets of IP scopes essentially running in parallel to manage as circuits get cut and moved over.

      It's honestly

  • I must be the odd one out because I didn't experience any sort of outage or performance issues this week.

  • by bb_matt ( 5705262 ) on Saturday January 28, 2023 @12:48PM (#63246963)

    We all know HOW such giant systems end up gaining traction and being adopted - cost.

    Exactly how it happened historically is complex, but I'm sure a lot of it is down to sales techniques - and I guess it looked like a no-brainer to most companies.
    "Pay a monthly fee and you don't need to maintain your own complex systems."
    "We're the experts - we do what we are good at, you can do what you are good at - plus you can fire half your system engineers! - win win!"

    However, we are now rapidly approaching a position where single points of failure impact millions of people at the same time.

    Sure, a company maintaining its own office services - comms, documents, etc. - can and does (did?) go down - but it's then isolated.
    The impact to the business could range from minor to profound - but it was _just_ a single entity and there were ways and means to mitigate it - resorting to private emails etc.

    When you end up in a position where a company has no access to these services and neither do their clients, because EVERYONE is using them, that is NOT a good place to be.

    But it's all about the money, right?

    I'm sure some heads will roll, but give it a few weeks, it will all be forgotten about ... until the next time ... and the next time ... as we rush headlong into ALL the eggs being in ONE basket.

    • by ksw_92 ( 5249207 )

      Let's apply this to other critical infrastructure that's required to run most businesses:

      Electricity: Hmm...fair play there as we seek to decentralize a bit via solar. BUT...wind and other large-scale energy sources still require a network to distribute power. Do you really want micro-nukes or windmills on every single building? Or maybe go back to the industrial revolution-era campus power plant design?

      Water: Let's have every building on its own well. Oh, wait...there's not enough ground water in some are

      • Let's apply this to other critical infrastructure that's required to run most businesses:

        We can rephrase that starting with, "which one of these is completely unlike the others?"

        They are not comparable, as one is completely unregulated, and the others are highly regulated. One is not even remotely close to looking like a utility, while the others are very much regulated like the utilities they are.

        The cloud is an IQ test being failed by millions of people, and which is completely unnecessary for most companies, while utilities are necessities which, by their very nature, support life.

    • When you end up in a position where a company has no access to these services and neither do their clients, because EVERYONE is using them, that is NOT a good place to be.

      That's a big leap of logic. A more realistic view would be that most companies in the world are largely isolated from each other, and the impact of inter-company emails and meetings going down for a few hours is incredibly minor in the grand scheme of any corporate operations.

      Incidentally we were in just such a situation this week. A detailed project review with Worley. They couldn't dial in. We couldn't dial in. Teams was down, Outlook was down, Onedrive was down. So we picked up the phone, confirmed that

  • "Black Swan Services"

  • by Miles_O'Toole ( 5152533 ) on Saturday January 28, 2023 @01:56PM (#63247079)

    There is no "Cloud". There is only other people's computers, which you don't own or control.

    • There is no "Cloud". There is only other people's computers, which you don't own or control.

      This does have an advantage: when cloud email goes down you can look straight into the eyes of the Director (who insisted we cloudify it in the first place.. we still have on-prem Exchange for some deep legacy stuff) and tell them, "Nothing we can do, we just have to wait for Microsoft and ride it out"

      And I don't have to spend my own cycles trying to fix it.. but just seeing the bigwig's eyes bulge when he realizes there's literally nothing anyo

      • The only time this wasn't funny was with a Time and Attendance outage.. I'm not gonna name the company that caused the outage but I bet y'all know who it was. This was last year. IT and Finance had to burn the candle at both ends to "creatively engineer" a solution for timekeeping..

        Same here. It was a little primitive, but it worked [youtube.com]

      • I'd have given a lot to be in the room with that Director when he finds out how screwed he is.

    • Re: (Score:2, Insightful)

      by thegarbz ( 1787294 )

      There is no "Cloud". There is only other people's computers, which you don't own or control.

      It needs to be said about as many times as: "those other people are better at running computers than you". You're begging the question with your post. The story here is MS had a cloud outage. Nowhere does it say their customers would have had better uptime if they didn't run on that other person's computer.

      • Keep it up with that No True Scotsman logical fallacy. Maybe some day you'll find somebody who falls for it.

        • I feel like you don't understand what a no true Scotsman fallacy is. This is a simple question of reliability. Everyone says "someone else's computer" as if that is a bad thing, while in reality this is no different than paying any other service in any other service industry.

          If you want to do absolutely everything yourself, go work for Apple. For everyone else, they largely lack the expertise to achieve the same uptime as larger cloud providers, especially given the complexity of modern services. Fuck man m

  • Microsoft Sweden had planned a live webinar about cyber resiliency - what to do when all your stuff is in a random cloud and then some crisis happens..

    The outage hit just before the webinar started. The webinar itself actually held up, but the comments from attendees were not ones of satisfaction..

    Oh the irony..

  • by gweihir ( 88907 )

    MS is running infrastructure it does not understand. It is doing this without adequate testing. And then it replicates problems that are well-known.

    Why are these incompetent fuckups even still in business?
