Azure Incident Retrospective: Storage issues in Central US, July 2024 (Tracking ID: 1K80-N_8)

Published: Aug 01, 2024 Duration: 00:50:27

- Welcome to our Azure Incident Retrospective. I'm David Steele. - I'm Sami Kubba. We work in the Azure Communications team. - In addition to providing a written post-incident review after major outages, we also now host these retrospective conversations. You're about to watch a recording of a live stream where we invited impacted customers through Azure Service Health to join our panel of experts in a live Q&A. - We had a conversation about reliability as a shared responsibility. So this includes our learnings at Microsoft, as well as guidance for customers and partners to be more resilient. - We hope you enjoy the conversation that follows. - So with that, I'd like to bring up our panel of speakers that are gonna be talking us through the incident today. So who do we have up first? - Well, first, we have Tom Jolly, Vice President of our Azure Storage Service. Welcome. Thank you for joining us, Tom. - Thank you, Tom. Storage will be a big part of the story today. Next up we have Asad Khan, who's our Vice President of Azure SQL Database. If you've read the post-incident review, you'll be familiar with the ways in which SQL handled or didn't handle different parts of this incident. - Super, we have Kirill Gavrylyuk, our Vice President for Azure Cosmos DB, our globally distributed database. - Very good. Glad that you could be here, Kirill. I think you've done one of these before. And next up, new to our Incident Retrospective series, we have Scott Dallamura. Scott is a principal software engineer for Azure DevOps. We're gonna talk a little bit about how impact to Azure DevOps was a bit broader than some of the other services that were impacted. And last but not least? - Last but certainly not least, we have Dale Churchward, our service engineering manager for Microsoft M365. He's here to represent the M365 side of the house, which had impact. To kick things off first, over to you, Tom. Before we get into the incident itself, we mentioned something about the allow list, and the importance of the allow list and how it played a crucial role in this. What do we need to understand? What is the storage allow list and why is it so crucial? - Sure, so actually, just before I answer the question, I really just want to say, you know, I want to apologize honestly to our customers here for the incident. It was a very long incident with a lot of impact for customers, and we fell, you know, very short of our goals and expectations for, you know, the service that we provide to our customers. So I just wanted to start out with that. To answer your question, so the allow list, it's a sort of security defense-in-depth measure. So in Azure, when you have a virtual machine running on a VM host, the disk is typically hosted by storage clusters. And we want to make sure, as a sort of defense-in-depth measure, that we only accept requests for disk I/O from IP addresses that are known to belong to a VM host machine. So we have this configuration that we're just calling the allow list here, which is published to every storage cluster in a region. And whenever a disk request comes into that storage cluster, it will check the source IP against that list. And if it's not coming from a known VM compute host, the request will be failed and dropped.
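To make the mechanism Tom describes a little more concrete, here is a minimal Python sketch of a source-IP allow-list check on a storage front end. The class, the CIDR ranges, and the request shape are illustrative assumptions, not Azure's actual implementation.

```python
import ipaddress

class AllowList:
    """IP ranges known to belong to VM compute hosts in the region."""
    def __init__(self, cidr_blocks):
        # e.g. ["10.20.0.0/16", "10.21.4.0/24"] -- hypothetical ranges
        self.networks = [ipaddress.ip_network(block) for block in cidr_blocks]

    def permits(self, source_ip: str) -> bool:
        addr = ipaddress.ip_address(source_ip)
        return any(addr in net for net in self.networks)

def handle_disk_io(request: dict, allow_list: AllowList) -> dict:
    # Requests whose source IP is not on the allow list are failed and
    # dropped -- which is why a partial list caused widespread disk I/O errors.
    if not allow_list.permits(request["source_ip"]):
        return {"status": 403, "reason": "source IP not on allow list"}
    return {"status": 200, "body": "disk I/O served"}
```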
So obviously we build hardware all the time, we decommission hardware all the time, and whenever a new rack of compute hardware is brought live in the data center, the storage clusters all need to know about the IP ranges that those hosts are gonna be using. And that means that every storage cluster in the region at that point gets a configuration update, to add that set of IPs to the allow list. And this happens kind of in the workflow for, you know, deploying new compute hardware currently. And in any nontrivial region this is happening pretty much every day, right? Hardware's being added and removed pretty much every day of the year. So that's kind of the background there. - Great, that helps us to understand what the allow list is and why we use it. I wanna ask you what broke, but I know, Tom, during the post-incident review, we were talking, I really wanted us to add a lot more detail into the post-incident review about the specific networking hiccup that got in the way of generating this allow list. And I really liked the point that you made: it actually didn't matter why this specific thing got in the way of the allow list. It sounds like your repair items, that we'll get to later, are really to make sure that the allow list, you know, can be generated no matter what. But could you talk us through what broke and why that usual process of updating the allow list didn't happen as expected in Central US on this fateful day? - Yeah, so the workflow that generates the list obviously pulls that information from some of our kind of data center, you know, source-of-truth systems. And on that particular day, the system where the workflow generating the allow list was running had a problem reaching that source of information. And it resulted in it generating a list that was partial. So the list that was generated by that workflow at that time only contained about half of the VM host IPs in the region. And then it was that partial list that was then deployed to the storage clusters, causing them to drop VM I/O requests from, you know, a lot of the VM hosts in the region. Now, that particular failure mode was related to some network configuration on some backend infrastructure where we host that, but that's not really the key thing here, right? The key thing really is that things are going to break; there might be other things that can go wrong. And the real hole here was the safe deployment gaps that we had in how we apply this list to the storage clusters in the region. - So Tom, for our customers who don't know, just very quickly, our safe deployment is usually when we deploy a feature or a service or we add something. We don't flip a bit and turn it on globally across Azure, because we know that software contains bugs and there are issues. And so we do this in stages and we do it very systematically. Now, I always typically think about safe deployment being applied to features or services or when we enhance something. Are you saying that we should have been using safe deployment even when we decommission and even when we do buildout, when it's not software, when it's kind of hardware-related as well? - Yeah, absolutely. I mean, configuration and code, we treat them as the same thing in general. So whether we're changing a setting, you know, it might be to tune performance or it might be to enable new capabilities, we treat that like code. And it's a deployment. We consider it a deployment.
And it should follow safe deployment practices. And, I'm sure people have heard this before, but that means, you know, sequencing changes through, for example, regions, within a region through availability zones, and for storage, even within an availability zone, we have sort of a risk hierarchy of different storage types. For example, we'll start often by applying a change to a standard storage geo-secondary cluster. And at the other end of the scale for risk or impact are our premium storage disk clusters, and also our ZRS, zone-redundant, clusters. And then of course, also, the pace of deploying a change, right? You start slow and small, and then as you build confidence with a configuration change or code, you can gradually sort of fan it out. But even while you're doing that, you need to have health checks at all steps to make sure that there's no signal telling you that, hey, something is not working the way we expect here. We're impacting customers, or otherwise, you know, something is looking worse than it did before we applied this change. So we do that, obviously, not in this case, which I'll come to in a moment, but we do that: all of our deployment systems work that way for configuration and code. In this case, the big miss here was that this update of the allow list is really a region-wide storage configuration change; essentially, that's what it is. And this was not using one of our standard configuration rollout mechanisms. It was a very specific update flow that was living in this compute buildout workflow, and it did not have any of these protections in it. So when we had the problem we talked about briefly earlier where there was a backend infrastructure problem and we generated this list that was partial, unfortunately, this workflow then deployed it across all the storage clusters in the region without any sequencing through zones, storage types, and without health checks. And that's sort of, you know, there's no way to justify this or explain it. It's a really, really bad failure on our part, right, that we had this piece of workflow that did not have SDP and was not using a normal configuration deployment system. - Yeah. Thank you, Tom. I appreciate you breaking it down. And exactly as you say, right, we're not justifying what happened; we're explaining what happened, and I appreciate you explaining the room for improvement there from a safe-deployment perspective. Before we get to those learnings and repair items, I did want to ask, I'm not sure if you were on call yourself, but I know your team was heavily involved in this incident. Yeah, what did that look like as far as our detection and investigation and mitigation? How did we respond and how did we know what was happening? - Hmm. Yeah, so one of the challenging parts of this one was that because the impact was region-wide and it affected a lot of VMs, it also affected some of our own infrastructure that we use, typically, when we're trying to troubleshoot a live site incident. So we did lose some of that: not in terms of backend management APIs and portals, but some things we use for analysis weren't available. That slowed us down a little bit. And then I think the other thing was because it was so widespread, and initially, there were some networking deployments identified that had gone out to compute hosts at around the same time, and that was the first suspect for the cause of the problem.
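Stepping back to the safe-deployment sequencing and health gating Tom outlines above, here is a minimal Python sketch of the idea. The stage names, bake times, and health signals are illustrative assumptions, not Azure's actual rollout tooling.

```python
import time

# Lowest-risk stages first, highest-impact last, per the risk hierarchy
# described above (stage names are assumptions, not real cluster labels).
ROLLOUT_STAGES = [
    "standard-geo-secondary-clusters",
    "standard-primary-clusters",
    "premium-disk-clusters",
    "zone-redundant-clusters",
]

def apply_change(change: dict, stage: str) -> None:
    """Placeholder for pushing the configuration change to one stage."""

def stage_is_healthy(stage: str) -> bool:
    """Placeholder for health checks: error rates, availability, customer impact."""
    return True

def roll_out(change: dict, bake_time_s: int = 3600) -> None:
    for stage in ROLLOUT_STAGES:
        apply_change(change, stage)
        time.sleep(bake_time_s)  # let the change bake before fanning out further
        if not stage_is_healthy(stage):
            # Halt immediately; do not keep fanning out a bad change.
            raise RuntimeError(f"health check failed at stage {stage!r}; rollout halted")
```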
So there were a lot of people involved in this, but initially that was the belief, that that was causing the problem, that somehow the compute host IPs may have been changed, meaning that they didn't match what was configured on storage. So it did take a bit of time before we realized that actually, no, that's not the problem. The problem is the storage-side configuration has changed and is just wrong and missing a huge number of the IPs. So it did take a while to get to the bottom of it, to, you know, realize that this other deployment that had happened for networking on the VM hosts was actually not the cause, and it was actually a storage-side configuration problem. And this is another gap we had here; we didn't have alerting on the storage side for, you know, rejecting requests because the IPs are not in the allow list. We had metrics for that; we did not have alerting on that. And one of the systems that would've told us that was the problem, we have automatic triage and analysis that looks at a lot of different things in the system and usually points our engineers in the right direction. Unfortunately, because of the scale of the impact here, that was one system that was affected by the outage itself. So it did actually identify exactly, you know, correctly the problem, but it didn't do that until after we'd mitigated the outage, because it was not able to work. So that was another kind of repair. And the other thing that slowed us down a little bit, ironic really, is that the tooling we needed to mitigate this, we needed to apply a configuration change to all the storage scale units in the region again, right? And we didn't have tooling ready that would do that at the speed that we really wanted to in this case. 'Cause usually we don't want to do things fast; we don't want to apply changes across the whole region quickly. So that's another bit of a learning here is that, you know, sometimes we really need to be able to do that, and we need a safe and well-understood way to do that in these rare situations where we may want to sort of blast a change to mitigate an incident. - Thank you, Tom. So in addition to this, we spoke about the alerting being down as well, because we had a circular dependency on systems that needed to be up, and in addition to this, we're talking about- - Sorry, the alerting wasn't down. What was down was... We have an auto analysis, auto triage system, which looks at lots of things and basically says, hey, it thinks this is the problem: that was down. Alerting was working, but we had missing alerting on this failure metric that, you know, would've enabled us to figure out immediately, hey, this is the problem, most likely, rather than having to look across networking and everywhere to try and figure out what was going on. Sorry, go ahead. - No, that's a helpful clarification. Thank you. And then of course, the change of making sure that we follow a safe deployment, an SDP kind of process, as well. Are there any other learnings that storage is taking away considering it was such a big event? - Yeah, I mean, there are learnings and repairs, right? I mean, in terms of repairs, we've deployed a lot of repair items already. So for example, we now have alerting on the allow list rejection rate. So that will immediately, you know, call a storage engineer if there are any requests failing because of the allow list not matching the VM host source.
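As a rough illustration of the alerting repair Tom describes, here is a tiny Python sketch of a rule that pages when the allow-list rejection rate spikes. The threshold and numbers are assumptions for illustration only, not the actual metric or alert definition.

```python
def should_page_storage_oncall(rejected: int, total: int,
                               threshold_ratio: float = 0.001) -> bool:
    """Page if more than 0.1% of disk I/O is being dropped by the allow list.

    The threshold is an illustrative assumption; the point is that the
    rejection metric now has an alert attached to it, not just a chart.
    """
    if total == 0:
        return False
    return (rejected / total) > threshold_ratio

# Roughly the shape of this incident: about half of the VM host IPs missing,
# so the rejection ratio blows far past any sane threshold.
assert should_page_storage_oncall(rejected=50_000, total=100_000)
```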
And then in terms of the workflow that was deploying this allow list change, that was the big hole here, the big failing. We've already updated that. It now has full SDP sequencing across zones, across storage clusters and types of storage cluster. The rollout time is extended as well. Previously, it was just going as fast as it could. So I think it went through Central US in around an hour. Now this happens over a 24-hour period. And in addition to all the sequencing, we have the health checks in the job, in the workflow here, that is monitoring for these failures all the time. And it will halt the job if there were to be any failures. And then of course the other piece is making sure that we're not generating a partial list. So the checks for any missing source information are now in place as well. So even if we had a repeat of the backend infrastructure, you know, network accessibility issue that caused this job to generate a partial list in the outage, now we would stop the workflow in response to any failures in kind of getting the source information. Longer term, we want to move this configuration job out of the buildout workflow. You know, ideally, it wouldn't be there and be so time critical. So that's a longer-term repair, is to look at a different mechanism here. And then in terms of other learnings, we are doing, or have done, another scrub of areas like buildout, for example, to really make sure there's nothing else that has somehow snuck in there that should be going through a different deployment mechanism or has missing SDP checks. So we've obviously done a lot of investigation there to make sure we're not missing anything else in this class. - That's great. Thank you, Tom. I appreciate you kind of running us through that. I know you've got specific ETA dates for, you know, other repair items in the post-incident review there. And if you are watching us live, I'll encourage you, if you've got any follow-up questions on anything that Tom's mentioned, any element of the storage incident, any of the repair items, we've got storage subject matter experts standing by to answer your questions. So please take advantage of that Q&A panel on the top right corner. Tom, thank you, that helps to cover the kind of storage side of things here. So next up I'd love to move over to Asad. Asad Khan runs our SQL Database team. Now, obviously, Asad, SQL has a dependency on storage. Those databases have to live somewhere. Could you help us to understand what this incident felt like to different customers? I understand it varied depending on how they'd configured their SQL? - Yeah, thank you. So as you said, the impact was pretty wide, because, obviously, SQL depends on the storage and the VMs. And it was a region-wide impact. And as a result, both the management operations as well as the connectivity had an impact and the databases were unavailable. Now, SQL had its own alerting mechanism, so we were able to detect within the first few minutes that the connectivity was dipping. And we reached out to the customers, we sent the comms. But since this was a region-wide impact across all AZs, the only guidance we could give was that if customers have their DR strategy in place, they should execute on that. And then there's a part of it where we have a feature where you can also choose the Microsoft-enabled option, where Microsoft will do the failover for them.
Now, it has some downsides, and we can go into the discussions for that as well, but as per the documentation, we did the failover for those customers after the first hour. - Super, Asad, it's a great talking point, because customers have the option to say, "Hey, I would like Microsoft to manage my failover." And they don't really get a say in what happens. And so you'll fail customers over, as per your playbook, but I understand that can be a heavy lift for customers, and it can introduce problems depending on the customer. Could you talk a little bit more about, you know, the advantages and disadvantages of the managed failover that Microsoft does? - Right, yeah, no, absolutely. So for disaster recovery, there are two parts. One is that within the same region you can choose to run your database across multiple availability zones. And when you do that, there is no burden on the customer. If one availability zone goes down, the database will continue to be available and we will be using the other availability zones. Now, if there is a region-wide outage, which happened in this case, and all availability zones are impacted, then the option is that you have to set up a DR in a peer region. And obviously, this you have to do ahead of time. And in that case you have, think of it, you have an async copy of that database that is in a different region, and we are continuously doing the sync between the primary and the secondary. Now, you can also choose to have it as a read replica, and there are other options as well, but in case of the disaster, a customer can choose to say, okay, the database is unavailable for whatever reason. It could be region outage, it could be one database issue, it could be SQL engine issue, connectivity issue through the gateway, anything, but my database is unavailable so I want to do a failover. When you do the failover, the secondary becomes your new primary, and what was originally your primary becomes the secondary. And that is how you get back the availability for the database. Now, originally, we were not very clear in terms of which one is a more preferred option. We were leaving it to the customer to decide whether they want to do the failover or they want Microsoft to automatically do the failover. And our documentation says that if there is an outage and it is longer than one hour, then we will do the failover. Now, the thing is that it is a failover group, which means it is not about a single database; it is a group of databases. So there is some gray area, like, should we do a failover if half of the databases are unavailable, or should we do the failover even if a single database is not available? The second thing is that when we do the failover, we will do the friendly failover, which means we will ensure that the secondary has fully caught up with the primary; but if that does not go through, we will do a forced failover, because at that point we are like, customer is saying the availability takes precedence, even if not all data is fully saved. So now you can see, like, we are making decisions on behalf of the customers, and it is better for customers to make those decisions themselves. The last part is that obviously an app contains a lot of components, and just failing over the database means that it might be that your compute is running in one region and your database is in a different region, which may not be ideal for many reasons. Obviously, the latency itself plays a big role.
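As a rough sketch of the "friendly first, then forced" policy Asad describes for a failover group, here is a small Python example. The `group` object, its method names, and the timings are hypothetical assumptions used only to illustrate the decision flow, not the actual Azure SQL implementation.

```python
import time

def microsoft_managed_failover(group, grace_period_s: int = 3600,
                               friendly_timeout_s: int = 600) -> str:
    """Illustrative decision flow for a Microsoft-managed failover group."""
    # Documented policy: Microsoft-managed failover only kicks in once the
    # outage has lasted longer than an hour.
    if group.outage_duration_s() < grace_period_s:
        return "wait"

    # Friendly failover: let the geo-secondary fully catch up so that no
    # committed transactions are lost.
    deadline = time.monotonic() + friendly_timeout_s
    while time.monotonic() < deadline:
        if group.secondary_caught_up():
            group.promote_secondary(allow_data_loss=False)
            return "friendly-failover"
        time.sleep(5)

    # Forced failover: availability takes precedence over the last
    # unreplicated writes if the secondary cannot catch up in time.
    group.promote_secondary(allow_data_loss=True)
    return "forced-failover"
```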
And the last part is that even when the original database comes back online or the region has fully recovered, then we don't fail back. Because now, as I said, like, it might be that you are running the entire stack in a new region. So in summary, what we have seen is that customer-managed DR is always preferred. We still provide you the Microsoft-managed DR option, but more and more through documentation and the product experience, we are guiding customers on how to set up their DR and how they can take control of that. Because, as I said, like, it's a very rare operation that you have to do, and it's best if the customer makes the decision on when to do the failover. - That's great, thank you for clarifying that, Asad. I'm interested, since we're talking about the failover and the failback anyway, we mentioned in the post-incident review that at one point most of the databases came back online, but there was some extra manual work needed to recover them. Since you said we don't do the failback, I presume that was for the initial failover. Could you help us to understand what kind of hiccups were faced and how widespread that was? - Yeah, that was the initial hiccup that we hit. And just to be clear, and we put it in the document as well, that for the Microsoft-managed failover, the number of databases which had that setup was 0.01%. So it was a very, very small group. And for those databases, a subset of the databases, when we did the failover, the connection was still pointing towards the original primary. And that is something which is in the repair items, which we will be finishing in the next couple of months. But that is something, as you called out, it was an additional issue. And then when the region did come back, we were able to recover 80% of the databases in the first two hours. Then in the next hour we had 98% of the databases recovered. And then there was some long tail in terms of getting to 100%. - So thank you very much, Asad. I know Tom went into detail, storage being the trigger of the incident, on a lot of the learning and repair items; from a SQL perspective, I'm interested to know, our customers are interested to know, what have you learned from this and what are the repair items that you'll be implementing carrying forward? - Yeah, I think the key thing is, on our side, a couple of things. One is that, in terms of guidance to the customer, I would say multi-AZ is still the most important thing. I know it did not help in this case, but trust me, like in 99% of the cases, that is the biggest savior. The second thing is that the DR strategy has to be put in place, whether you do it yourself or even if you ask us to do it; that one, it is super critical. On our side, a couple of the repair items: one is more clarity, which I was just describing, in terms of how the Microsoft-managed DR works. The other thing is the redirection that you pointed out, that for a subset of the databases, when the failover happened, the redirection did not happen at the connection level. And that is a super critical one for us. That one has no excuse, like why the redirection should not happen right away. The third one is that we are working very closely with Azure Core in terms of how the OS images are put on the storage. As a result, the storage outage also had an impact on the compute side, and how ephemeral disks can help over there. And that is something SQL will take benefit of right away as Azure Core delivers that capability. - That's great.
Thank you very much, Asad. I'll now turn our attention over to Cosmos DB. Kirill is here from Cosmos DB, and I wanted to start with you. Kirill, I understand most people use Cosmos DB precisely because it's multiregional and because of the benefits that provides here. But I saw in the PIR that the impact of this outage on different customers depended on how they configured it. So could you help us to understand kind of which customers felt the most pain relative to their configuration? - Absolutely. First of all, I wanted to join my colleagues in apologizing for this really unprecedented outage. In my memory, there was no such outage on Azure earlier. And when it comes to Cosmos DB, customers sometimes wonder, given the documentation states that the data is stored on local SSDs, why was Cosmos DB even impacted by this? It's a great question. And as Asad alluded to a little bit, while the data is stored on local drives, the VMs that serve the data run on Azure VM scale sets and use storage to store the OS drive. And that's why half of Cosmos DB nodes went down during this outage. And that's something that Azure Virtual Machines is working on. And we will deliver the capability where the OS drive can be cached locally on the host so that even if storage goes down next time, services like Cosmos DB that don't use storage other than through virtual machines will not be impacted anymore. Now, when it comes to the global database, yes, Cosmos DB has multiple configurations. The gold standard, the true global database configuration, is active/active. This is a configuration where the application can write and read in any of the regions enabled for the database. In this case, even if one region goes down, the application continues to write and read in the remaining regions. Typically, it's a good idea to architect your application itself as active/active as well. Not every application can be done easily this way, and it does require some work. But we have customers, like Walmart, for example, that run fully active/active, right? And so a single region outage does not impact Walmart operations. These customers were not impacted by the Central US outage. Now, there are other ways to configure Cosmos. If your database was enabled with multiregion writes, the application automatically would redirect writes and reads to healthy regions. For customers who were configured with active/passive, where reads are global but writes go to one region, if that region happened to be Central US, those customers lost write availability. In this case, either Microsoft or the customer has to perform an operation to move the write region somewhere else, or, as we call it, offline the region. Typically, Microsoft does it. Recently we've learned, as Asad alluded to, that in many cases we are making decisions on behalf of the customer, and that's not always optimal. And so we've already rolled out to some customers the capability to offline regions themselves. And it's proven to be a better strategy. It's always a better strategy if the customer knows exactly what's good for them. And in this particular case, it was actually a better strategy because customers were able to do this more efficiently. Now, some customers did it; for many customers, Microsoft failed over or offlined the Central US region for these active/passive accounts.
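To illustrate the contrast Kirill draws between active/active and active/passive, here is a small Python sketch of the write path in each case. The region names, the client object, and its methods are illustrative assumptions, not the Cosmos DB SDK's actual API.

```python
# Regions enabled for a hypothetical database account, in preference order.
REGION_PRIORITY = ["Central US", "East US 2", "West US 2"]

def write_item(client, item: dict, multi_region_writes: bool) -> dict:
    if multi_region_writes:
        # Active/active: every enabled region accepts writes, so the client
        # can simply fall through to the next healthy region.
        for region in REGION_PRIORITY:
            if client.region_is_healthy(region):
                return client.write(item, region=region)
        raise RuntimeError("no healthy write region available")

    # Active/passive: writes only go to the current write region. If that
    # region is down, Microsoft or the customer must first "offline" it to
    # move the write region elsewhere before writes succeed again.
    return client.write(item, region=client.current_write_region())
```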
We first had to offline, to fail over, our control plane, and we started offlining the Central US region for active/passive accounts within the first 40 minutes. So it was very quick. The outage was caught within a minute of the impact, and the process flow is semiautomatic, and we started offlining the regions for impacted databases. Now, because some of our internal automation was also impacted, it was really an unprecedented outage, this automatic detection of whether a database was impacted or not had flaws. As a result, we made a decision to prioritize database accounts that were reported through support. Now, that strategy also has flaws, because sometimes customers cannot even reach support in these situations, right? It's not always possible because, you know, the portal may be down or other things can happen in these outages. So customers who were able to offline regions themselves were in the best position. We offlined, and 95% of those offlines that we performed went smoothly. It takes roughly 10 minutes on average to offline the database account. Then of course there was some subset of database accounts for which the operation had to be rerun. And I think the document describes two configurations. It was a fairly small subset of customers, but there were two configurations: one was some MongoDB API customers where we had to redo the offline, and then, when we failed back, some of the private endpoint customers were affected during failback, which required us to redo the failback. Of course, if customers do not use the global database capabilities of Cosmos DB and are in a single region, for a region-wide outage they were impacted and didn't really have much recourse until the availability of the region came back, the storage and the VMs came back, and we restored availability of Cosmos DB. - That's great. Thank you, Kirill. I was gonna ask you more about the failover and the failback, but I'd say you've covered that really well. I wanted to ask about the learnings and repairs, particularly because there's one that I had to read three times, and I'm hoping you can help me to explain it here. One of the learnings was adding automatic per-partition failover for multiregion active/passive accounts. Could you explain that so that my mom could understand it, and any other kind of learnings or repairs from a Cosmos DB side? - Absolutely. Generally, for globally distributed databases, when the data plane decides to mitigate, it's always the best, because it scales, it has no impedance, it can make local decisions about what is best for this particular node, for this particular partition, versus the control plane, a separate service looking at the data plane, with lots of failure modes in between. So per-partition automatic failover effectively allows us to only fail over parts of your database, only those partitions that are impacted, automatically, based on the view that a quorum of observers has for that partition, which is a lot more precise than us guessing based on telemetry or even the customer guessing based on what they see in the portal, et cetera. It's immediate, it's real time. We have an exact view, a quorum view, of what's going on with the partition, and we can fail over.
So that happens automatically without any intervention from the customers, and it happens only for the subset of the data that is impacted, without having to fail over the entire set of accounts. Again, this capability only makes sense for this active/passive set of accounts: it's best if you just use active/active. It used to be that there was some cost impedance; with the dynamic autoscale capability, that cost impedance has practically gone away. It's a fairly cost-effective way to achieve the five nines resiliency and never have this headache. But if your application really requires the semantics of active/passive, this per-partition automatic failover is gonna dramatically change the experience during, hopefully never happening, outages like this, where it's done directly by the data plane, directly by the nodes based on the quorum, and the application doesn't have to do anything. - Super. Thank you very much, Kirill. It's fascinating to hear about Cosmos DB. It's a globally distributed database; if you configure it the right way, you can be immune to a lot of outages. That's great to hear. Thank you. I'd like to turn my attention to Scott who works with Azure DevOps. Scott, we have Azure DevOps, or ADO, running in multiple regions; and in this case the impact was to Central US, but was there impact beyond Central US for Azure DevOps? - Yeah, unfortunately, yeah, there absolutely was impact to customers outside of Central US. So Azure DevOps, it's like a global service, right? It's available pretty much everywhere. Central US isn't really like a central point of failure, although it kind of felt like it for some customers here. There's some historical reason for that; like, DevOps has been around for a long time. It's gone through a lot of growth. It started out in the US only; actually, it started out on-premises, and we migrated to the cloud a long time ago. But now it's available worldwide, and we have scale units I think in like eight geographies or so. And without getting into too much sausage making, an Azure DevOps organization, like when you use Azure DevOps, like, you have a hosting location, which is basically where like the portal service lives, this is the web app that you see when you use like pipelines or Git or work items, that's hosted in a particular region. And outside of that we have a bunch of other supporting services. Some of these are critical, some of them are not, that serve the overall experience. And for some customers, you know, for historical reasons, right, like as we've grown, not all the data has migrated out of some of these things, so not all the data is in the region where your portal service is; some of it might still be in Central US. In other cases we have architectural concerns. Like, we specifically store your profile data in either the US, the EU, or UK. And if you're not in the US, EU, or UK, your profile data basically defaults to the United States; which means, for example, customers in Brazil have a pretty good chance of being impacted when a US region is down, because their profile data goes to the US. So considering all that, it's not really impossible for a full outage in something like Central US to be felt by customers in another region. Most of the impact that we saw outside of the US was in Brazil, but there were definitely customers in Europe that felt this too.
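A tiny Python sketch of the dependency Scott describes, assuming a hypothetical mapping for where profile data is hosted (the geo set and default are illustrative, not Azure DevOps' actual routing logic):

```python
# Profile data is only hosted in the US, EU, or UK; everything else
# defaults to the US in this illustrative model.
PROFILE_GEOS = {"US", "EU", "UK"}

def profile_store_geo(customer_geo: str) -> str:
    return customer_geo if customer_geo in PROFILE_GEOS else "US"

# A Brazil-hosted organization still reads profile data from the US,
# so a US-region outage can surface in its experience:
assert profile_store_geo("Brazil") == "US"
```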
The way that impact manifests depends on, you know, which one of these supporting services might be in that failed region or which scenario you're looking at, right, 'cause, you know, some scenarios don't touch certain data, but it's definitely possible. Again, it's not super likely; like, the last time I ran the numbers, I think like less than 5% of our customers that were housed or hosted outside of Central US had some kind of impact. But yeah, that doesn't really help; it doesn't feel unlikely when it happens to you, right? And I'd like to apologize too, by the way. You know, I've been using Azure DevOps for like 12 years. Our whole team has been using it for a long time. We love the product. We feel it when, you know, when it doesn't work for customers. You know, we use it day to day, so it's not great when it goes down. - Right. Yeah, thank you, Scott. Thanks for explaining that. So like you say, a lot of customers didn't feel the pain, but for the ones who did, this was really impactful. So can you help us to understand some of the learnings or repair items from the ADO side? I saw in the post-incident review we mentioned migrating metadata. I wasn't sure if that's on us, as Microsoft, to do the migration. It sounded like it's also possible for customers to migrate that metadata themselves? - So this would be something that Microsoft would do, right? Like, we have, you know, we have a list; we know what data's in the wrong spot, right? So we've been working on this. We call it a reconciliation: you know, moving this data out of where it is, whether it used to be there or it got placed there, you know, for whatever reason. But we're reconciling those. It's an ongoing process. I'm not super clear on the ETA. It's gonna be a couple months, probably. We'll speak to that more, I think, in the PIR when we finally publish it. We will have recommendations for customers. And like it is possible for you to change, you know, your hosted region in Azure DevOps, but it's not a trivial option. It may incur downtime depending on how big you are. I wouldn't recommend that as like a mitigation for this particular scenario. I think the better thing to do is for Azure DevOps to actually make sure that we do the right thing, and, you know, have customer data where they expect it to be, you know, to eliminate some of these dependencies we have on certain services that may or may not be critical for certain scenarios. You know, in some cases, like, for example, profile data, right? It's not distributed everywhere, right? So that might be something that we look at too, is, you know, maybe spinning up some more scale units of some of these services so that they can better serve other regions. And then, as part of all the deep dives we've done so far, digging into what a certain impact looks like, we definitely have opportunities to, you know, just behave better in the case of an outage, especially for things that aren't critical. A lot of cases where customers saw, you know, not necessarily failures but just delays in the experience were basically because we have some spots where we'll do an exponential backoff for something that isn't really critical. If we could have just, you know, flipped a circuit breaker and skipped that for now, it would've provided a much better experience for a lot of customers.
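For readers unfamiliar with the pattern Scott mentions, here is a minimal, self-contained Python sketch of a circuit breaker: after repeated failures of a non-critical dependency, calls are short-circuited to a fallback instead of retrying with backoff while the dependency is down. The thresholds and structure are illustrative assumptions, not Azure DevOps' actual code.

```python
import time

class CircuitBreaker:
    """Skip a flaky, non-critical dependency instead of retrying it while it is down."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While the circuit is open, return the fallback immediately rather
        # than making the user wait on a dependency we know is failing.
        if self.opened_at is not None and time.monotonic() - self.opened_at < self.reset_after_s:
            return fallback()
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            return fallback()
        self.failures, self.opened_at = 0, None    # a healthy call closes it
        return result
```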
So those are the repair items we're looking at. - Super, thank you very much, Scott. Dale, if I could bring you into the mix now. Thank you for waiting so long. Dale, there was impact to M365, which is our SaaS solution that, you know, is running on Azure. And with SaaS, in my mind, there should be minimal impact, but I'm interested to understand the impact there was to M365. And could you please tell us how your systems responded and how your people responded when you have an outage caused by Azure? - No, thank you, Sami. So I would say, you know, we saw varying levels of impact within M365, and the reason for that is, there are a few things. First and foremost of course is that we have dependencies within M365 on kind of foundational services that we run behind the scenes that help support the main services customers see. In addition, of course, we also have dependencies on things like Cosmos DB and SQL and Azure, and of course you've already kind of heard what some of the impacts there were. But fundamentally, because our services, you know, are architected in different ways and serve customers through some different methodologies, we did see quite a variety of impacts. So just to kind of give some specific examples, you know, if you look at something like Viva Engage, that was probably the thing where we saw the broadest level of impact. And the reason for that is that Viva Engage is hosted generally out of the North American and European regional data centers, and so this issue had global impact. Of course, if you were hosted in the European data centers, you wouldn't have seen impact at all. However, we do have customers in the Americas, Japan, Australia, and other Asia Pacific countries that would have experienced the impact, not only because their Viva Engage is run in US data centers, but also because of the fact that it was during their business hours. And so that was probably the single broadest impact we saw. But also, you know, we did see some fairly significant impact in Teams as well, primarily because users were unable to join or initiate calls or meetings. Some of the impacted functionality also related to being in a call or a meeting. So examples of not being able to mute or unmute people, raise hands, remove users; also, presence information was affected. And even once Teams recovered, we actually continued to see some residual impact with presence information. And so that's something that the engineering teams are looking at to try and resolve. The admin center was affected, although that one's a little bit different. I mean, you basically had some users who were unable to access the admin center. But I do wanna call out that even at the height of the issue, we were still seeing an 85% success rate of people being able to connect to the portal and be able to use things within the portal. And so certainly not to minimize the impact there, but we were still seeing most transactions to the portal succeeding. And so one issue that we did see there specifically as well, though, is that we as a communications team failed to articulate when the admin portal had actually been recovered. And so, you know, for customers, that's not a great experience, because you're gonna, you know, not be looking at the admin portal, you're gonna believe it's still affected, you're not gonna be in there getting the updated communications on the SHD, or making the changes that you might want to be making to support your services, that sort of thing.
And then for the final big service, and then I'll call out some of the other ones, we saw some impact within SharePoint and OneDrive. Now, this impact was primarily very regional. It was very much located in Central US. Basically, we had about 25 scale units or so that were affected. However, there is a good story on this, which is that our telemetry did detect the issue and began automatically failing over the service. So SharePoint and OneDrive saw about a 45-minute impact. And that's obviously longer than we'd like, but it still was able to self-recover to a large extent, which was great. I do not want to minimize that there were impacts in other M365 services: you know, Intune, OneNote, Defender, Power BI are all examples of other services. And if you look at the post-incident review, you can see some of the timeframes on those and their various recoveries. And like I said, you know, a lot of this is dependent on which, you know, foundational shared services are being used, you know, how quickly we have an ability to perform a failover, being able to perform a failover in a safe way, that sort of thing. In terms of the people reaction, M365 engineers very quickly did respond to the issues; each of the product teams got engaged quickly. You know, in some cases, like with SharePoint and OneDrive, to make sure that the automated failover tasks were taking place; in other cases, like Teams and admin center, being able to actually manually take some actions to help mitigate the impact as quickly as possible. So I think there was a lot of good work there, but also we do have a number of repair items that we're calling out for each of the services. - Got it. That's really helpful to understand, Dale. Thank you for running us through those different impacts. I did have a question around your kind of learnings and repairs. It sounded like some of your services will have some architectural repairs, but you also mentioned the incident comms team. Are there process repairs associated with that as well? - Yeah, I'll start with the incident comms part. I mean, so first and foremost, whenever we have a multiservice issue, you know, one of the challenges but also opportunities is to adequately and accurately explain what the status of each of the individual services is. And so in this case we've got some items that we're gonna try to address, to try and make sure that we're really keeping updated statuses going out, and marking things as mitigated in the SHD or on our Twitter feed, X feed, sorry, whenever we have something that gets resolved. You know, that's an ongoing learning and something that we will continue to focus on to try and make sure that we're giving an accurate state of affairs within the service. - Thank you very much. And Dale, just as you are kind of representing Microsoft 365 incident comms, I happen to be standing next to the head of our Azure Incident Communications team. So at this point I generally turn my attention to my co-host and ask Sami: "How did Azure do? Did we communicate? Did we keep our customers up to date?" - It's interesting, and it's a mixed story. We have our brain system, which is our automatic communication system to impacted customers. This fired pretty early, and we started communicating to impacted customers using certain services: not all of them, but a vast majority. One of the challenges we have with brain is that it's not great at correlating impact.
So it says, okay, these virtual machines are impacted, these SQL databases are impacted, these Azure Databricks are impacted, Cosmos is impacted, and it sends them separately. So there are some customers who are using multiple services who have received multiple tracking IDs, and they weren't strung together. When the comms team come in and they wanna update it, it's very difficult for our systems to update all of these tracking IDs and link them at the same time. So there would've been some customers who would've had a tracking ID or an event; they may have had the first communication, and then, as bad as it sounds, only the PIR at the very end: like, that tracking ID wasn't updated in between. For the most part, we did try to keep customers updated. It was around 10:56 when we went to the status page, 'cause we realized that impact was growing and we weren't able to communicate to everybody effectively. Brain was still working, but we went to the status page. As we noted in the PIR, at about 45 minutes past midnight we knew what the issue was, but we didn't communicate this until 1:20 or so. And even when we communicated that we knew what the issue was, we didn't detail what it was: it was almost like we kept it a secret. Part of this is fog of war, part of this is getting information. When we have such a broad outage with so many services impacted, it's important that we're trying to relay information that's helpful to customers as much as we can. The really important thing to note is we need to do a better job of telling customers what we know and what we don't know. I thought it was great that we had paused deployments, as Tom Jolly mentioned earlier. We started looking at networking deployments. We paused everything. We ruled out networking deployments, we turned our attention somewhere else. It's important that we convey this to the customer during the outage. It's important that they understand what we ruled out and what we're still investigating. In addition to this, we should tell customers what we're doing about it. Are we rolling back? Are we updating our allow list? Are we failing forward? This will allow customers to make decisions as well. And then lastly, we need to be better at sharing what we're seeing as a result of our actions. Are we seeing signs of recovery? Does it mean customers can have a long lunch? Should they send people home early? Should they pack up for the day? All of these things, all of these decisions customers can make when they're informed. And so while we did tell customers, at times it was laggy, and there was a lot we didn't know: there was a fog of war because of the scale of this outage. We have learnings to go back and say our systems need to be able to communicate in a more coherent way, but we need to do a better job of telling our customers everything we know at the time. And so there are some pieces. There are other pieces that are a little bit tricky and nuanced. Azure DevOps has its own status page which it leverages. And the Azure status page has a line for DevOps as well. And a lot of the time, because a lot of Azure DevOps customers don't use subscriptions, for the most part, they don't use the portal, which means that Azure Service Health and all the benefits that come with using Azure Service Health aren't relevant for Azure DevOps. But the idea of having two sources of truth, or two places where customers go to look, is a problem.
And so we're thinking that on the Azure status page, we should just signpost to Azure DevOps, and then customers can make their decision there. So overall I think there are lots of learnings, lots of repairs, both from a systematic point of view and from a cultural point of view, in being able to share information as it comes along. Saying that, it's easy in hindsight, and hindsight is always 20/20. But looking back, these are some of the issues. - Thank you for watching this Azure Incident Retrospective. At the scale at which our cloud operates, incidents are inevitable. Just as Microsoft is always learning and improving, we hope our customers and partners can learn from these too, and we provide a lot of reliability guidance through the Azure Well-Architected Framework. - To ensure that you get post-incident reviews after an outage and invites to join these livestream Q&A sessions, please ensure that you have Azure Service Health alerts set up. We are really focused on being as transparent as possible and showing up and being accountable after these major incidents. - Whether it's an outage, a security event, or a privacy event, Microsoft is investing heavily in these events to ensure that we earn, maintain, and at times rebuild your trust. Thanks for joining us.
