Azure Incident Retrospective: Storage issues in Central US, July 2024 (Tracking ID: 1K80-N_8)
Published: Aug 01, 2024
Duration: 00:50:27
- Welcome to our Azure Incident Retrospective. I'm David Steele. - I'm Sami Kubba. We work in the Azure Communications team. - In addition to providing a written post-incident review after major outages, we also now host these
retrospective conversations. You're about to watch a
recording of a live stream where we invited impacted customers through Azure Service Health to join our panel of
experts in a live Q&A. - We had a conversation about reliability as a shared responsibility. So this includes our
learnings at Microsoft, as well as guidance for
customers and partners to be more resilient. - We hope you enjoy the
conversation that follows. - So with that, I'd like to
bring up our panel of speakers that are gonna be talking us
through the incident today. So who do we have up first? - Well, first, we have Tom Jolly, Vice President of our
Azure Storage Service. Welcome. Thank you for joining us, Tom. - Thank you, Tom. Storage will be a big
part of the play today. Next up we have Asad Khan, who's our Vice President
of Azure SQL Database. If you've read the post-incident review, you'll be familiar with the ways in which SQL handled or didn't handle different parts of this event. - Super, we have Kirill Gavrylyuk, our Vice President for Azure Cosmos DB, our globally distributed database. - Very good. Glad that
you could be here, Kirill. I think you've done one of these before. And next up, new to our
Incident Retrospective series, we have Scott Dallamura. Scott is a principal software engineer for Azure DevOps. We're gonna talk a little bit about how impact to Azure
DevOps was a bit broader than some of the other
services that were impacted. And last of all? - Last but certainly not
least we have Dale Churchward, our service engineering
manager for Microsoft 365. He's here to represent the M365 side of the house, which was also impacted. To kick things off
first, over to you, Tom. Before we get into the incident itself, we mentioned something
about the allow list, and the importance of the allow list and how it played a crucial role in this. What do we need to understand? What is the storage allow
list and why is it so crucial? - Sure, so actually, just
before I answer the question, I really just want to say, you know, I want to apologize honestly
to our customers here for the incident. It was a very long incident with a lot of impact for customers, and we fell, you know,
very short of our goals and expectations for, you know, the service that we
provide to our customers. So I just wanted to start out with that. To answer your question,
so the allow list, it's a sort of security
defense in-depth measure. So in Azure, when you
have a virtual machine running on a VM host, the disk is typically
hosted by storage clusters. And we want to make sure, as a defense-in-depth measure, that we only accept requests for disk I/O from IP addresses that are known to belong to a VM host machine. So we have this configuration that we're just calling the allow list here, which is published to every storage cluster in a region. And whenever a disk request comes into that storage cluster, it will check the source IP against that list. And if it's not coming from a known VM compute host, the request will be failed and dropped.
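To make that concrete, here is a minimal sketch of the kind of check being described, with made-up IP ranges and names; it is illustrative only, not Azure's actual storage code.

```python
from ipaddress import ip_address, ip_network

# The allow list published to every storage cluster in a region: the IP ranges
# known to belong to VM compute hosts (example ranges, not real ones).
ALLOW_LIST = [ip_network("10.1.0.0/16"), ip_network("10.2.0.0/16")]

def accept_disk_io(source_ip: str) -> bool:
    """Accept a disk I/O request only if its source IP is in a known VM host range."""
    addr = ip_address(source_ip)
    return any(addr in net for net in ALLOW_LIST)

# A request from an IP missing from the list is failed and dropped, which is what
# happened region-wide once a partial list was deployed.
assert accept_disk_io("10.1.4.7") is True
assert accept_disk_io("10.9.0.1") is False
```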
So obviously we build hardware all the time, we decommission hardware all the time, and whenever a new rack of compute hardware is brought live in the data center, the storage clusters all need to know about the IP ranges that those hosts are gonna be using. And that means that every storage cluster in the region at that point gets a configuration update to add that set of IPs to the allow list. And this happens as part of the workflow for deploying new compute hardware currently. And in any nontrivial region, this is happening pretty much every day, right? Hardware's being added
and removed every day, pretty much every day of the year. So that's kind of the background there. - Great, that helps us to understand what the allow list is and why we use it. I wanna ask you what broke, but I know, Tom, during
the post-incident review, we were talking, I really wanted us to
add a lot more detail into the post-incident review about the specific networking hiccup that got in the way of
generating this allow list. And I really liked the
point that you made: it actually didn't matter
why this specific thing got in the way of the allow list. It sounds like your repair items, that we'll get to later, are really to make sure that
the allow list, you know, can be generated no matter what. But could you talk us through what broke and why that usual process
of updating the allow list didn't happen as expected in
Central US on this fateful day? - Yeah, so the workflow
that generates the list obviously pulls that information from some of our kind of data center, you know, source-of-truth systems. And on that particular day, the system where the workflow generating the allow list was running had a problem reaching
that source of information. And it resulted in it generating
a list that was partial. So the list that was
generated by that workflow at that time only contained about half of the VM
host IPs in the region. And then it was that partial list that was then deployed
to the storage clusters, causing them to drop VM I/O
requests from, you know, a lot of the VM hosts in the region. Now, that particular failure mode was relating to some network configuration on some backend infrastructure
where we host that, but that's not really the
key thing here, right? The key thing really is that
things are going to break; there might be other
things that can go wrong. And the real hole here was
the safe deployment gaps that we had in how we apply this list to the storage clusters in the region. - So Tom, for our
customers who don't know, just very quickly, our safe deployment is usually when we deploy
a feature or a service or we add something. We don't flip a bit and turn
it on globally across Azure, because we know that
software contains bugs and there are issues. And so we do this in stages and we do it very systematically. Now, I always typically
think about safe deployment being applied to features or services or when we enhance something. Are you saying that we should have been using safe deployment even when we decommission and even when we do buildout, when it's not software, when it's kind of hardware-related as well? - Yeah, absolutely. I mean, configuration and code we treat as the same thing in general. So whether we're changing
a setting to, you know, might be to tune performance or it might be to enable new capabilities, we treat that like code. And it's a deployment. We
consider it a deployment. And it should follow safe
deployment practices. And those, I'm sure people
have heard this before, but that means looking, you know, sequencing changes through,
for example, regions, within a region through
availability zones, and for storage, even
within an availability zone, we have sort of a risk hierarchy
of different storage types. For example, we'll start
often by applying a change to a standard storage
geo-secondary cluster. And at the other end of the scale, for kind of risk or impact
is our premium storage disks, clusters, and also our ZRS,
zone-redundant clusters. And then of course, also, the pace of deploying a change, right? You start slow and small, and then as you build confidence with a configuration change or code, you can gradually sort of fan it out. But even while you're doing that, you need to have health
checks at all steps to make sure that there's
no signal telling you that, hey, something is not working
the way we expect here, that we're impacting customers, or that otherwise something is looking worse than it did before we applied this change. So we do that, obviously, not in this case, which I'll come to in a moment, but all of our deployment systems work that way for configuration and code.
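As a rough illustration of that sequencing-and-health-check idea, a rollout loop might look like the following sketch; the stage names and helpers are hypothetical, not the real deployment system.

```python
# Hypothetical sketch of safe deployment for a config change: walk a risk hierarchy,
# let each stage bake, and halt the rollout if health signals regress.
import time

# Storage risk hierarchy, lowest risk first.
STAGES = [
    "standard-geo-secondary",
    "standard-az1",
    "standard-az2",
    "premium-disk",
    "zrs",
]

def apply_config(stage: str, config: dict) -> None:
    print(f"applying {config} to {stage}")

def health_ok(stage: str) -> bool:
    # Stand-in for real signals: error rates, availability, customer-impact metrics.
    return True

def rollout(config: dict, bake_seconds: int = 3600) -> None:
    for stage in STAGES:
        apply_config(stage, config)
        time.sleep(bake_seconds)      # fan out gradually instead of blasting the region
        if not health_ok(stage):      # halt on any regression instead of continuing
            print(f"halting rollout at {stage}")
            return

# rollout({"allow_list_version": 42})
```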
In this case, the big miss was that this update of the allow list is really a region-wide storage configuration change; essentially, that's what it is. And this was not using one of our standard configuration rollout mechanisms. It was a very specific update flow that was living in this compute buildout workflow, and it did not have any of these protections in it. So when we had the problem we
talked about briefly earlier where there was a backend
infrastructure problem and we generated this
list that was partial, unfortunately, this
workflow then deployed it across all the storage
clusters in the region without any sequencing through zones, storage types, and without health checks. And that's sort of, you know, there's no way to justify
this or explain it. It's a really, really bad failure on our part, right, that we had this piece of workflow that did not have SDP and was not using a normal
configuration deployment system. - Yeah. Thank you, Tom. I appreciate you breaking it down. And exactly as you say, right, we're not justifying what happened; we're explaining what happened, and I appreciate you explaining the room for improvement there from a safe-deployment perspective. Before we get to those
learnings and repair items, I did want to ask, I'm not sure if you were on call yourself, but I know your team was heavily
involved in this incident. Yeah, what did that look
like as far as our detection and investigation and mitigation? How did we respond and how did
we know what was happening? - Hmm. Yeah, so one of the
challenging parts of this one was that because the
impact was region-wide and it affected a lot of VMs, it also affected some of
our own infrastructure that we use, typically, when we're trying to troubleshoot
a live site incident. So we did lose some, not in terms of backend
management APIs and portals, but some things we use for analysis we didn't have available. That slowed us down a little bit. And then I think the other thing was because it was so
widespread, and initially, there was some networking
deployments identified that had gone out to compute
hosts at around the same time, and that was the first suspect
for the cause of the problem. So initially, there were a lot
of people involved in this, but initially that was the belief, that that was causing the problem, that somehow the compute host
IPs may have been changed, meaning that they didn't match what was configured on storage. So it did take a bit of
time before we realized that actually, no, that's not the problem. The problem is the storage-side
configuration has changed and is just wrong and missing
a huge number of the IPs. So it did take a while
to get to the bottom of, you know, realize that
this other deployment that had happened for
networking on VM host was actually not the cause, and it was actually a storage-side
configuration problem. And this is another gap we had here; we didn't have alerting on the storage side for, you know, rejecting requests because the IPs are not in the allow list. We had metrics for that; we did not have alerting on that. And one of the systems
that would've told us that was the problem, we have
automatic triage and analysis that looks at a lot of
different things in the system and points our engineers
usually in the right direction. That was, unfortunately,
because of the scale of the impact here, that was one system that was affected by the outage itself. So it did actually
identify exactly, you know, correctly the problem, but it didn't do that until
after we'd mitigated the outage, because it was not able to work. So that was another kind of repair. And the other thing that
slowed us down a little bit, ironic really, is that
the tooling we needed to mitigate this, we needed to
apply a configuration change to all the storage scale units
in the region again, right? And we didn't have tooling
ready that will do that at the speed that we really
wanted to in this case. 'Cause usually we don't
want to do things fast; we don't want to apply changes across the whole region quickly. So that's another bit of a
learning here is that, you know, sometimes we really need
to be able to do that, and we need a safe and well-understood way to do that in these rare situations where we may want to
sort of blast a change to mitigate an incident. - Thank you, Tom. So in addition to this, we spoke about how the alerting was down as well, because we had a circular dependency on the systems that were up, and in addition to this, we're talking about- - Sorry, the alerting wasn't down. What was down was... We have an auto analysis,
auto triage system, which looks at lots of
things and basically says, hey, it thinks this is the
problem: that was down. Alerting was working, but we had missing alerting
on this failure metric that would've, you know, would've enabled us to
figure out immediately, hey, this is the problem, most likely, rather than having to look
across networking and everywhere to try and figure out what was going on. Sorry, go ahead. - No, that's a helpful
clarification. Thank you. And then of course, making sure that when we make changes we follow a safe deployment or SDP kind of process as well. Are there any other learnings
that storage is taking away considering it was such a big event? - Yeah, I mean, there are
learnings and repairs, right? I mean, in terms of repairs, we've deployed a lot of
repair items already. So for example, we now have alerting on the
allow list rejection rate. So that will immediately, you know, call a storage engineer if there are any requests failing because of the allow list not matching the VM host source.
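A minimal sketch of what that alert amounts to, with a hypothetical threshold and paging helper rather than the real monitoring pipeline:

```python
# Illustrative only: page an engineer as soon as allow-list rejections appear,
# instead of only recording them as a metric.
def page_oncall(severity: int, summary: str) -> None:
    print(f"SEV{severity}: {summary}")

def evaluate_allow_list_rejections(rejections_per_minute: int, threshold: int = 1) -> None:
    if rejections_per_minute >= threshold:
        page_oncall(
            severity=2,
            summary=f"{rejections_per_minute}/min disk I/O requests rejected by allow list",
        )
```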
And then in terms of the workflow that was deploying this allow list change, that was the big hole here, the big failing: that we've already updated. It now has full SDP sequencing across zones, across storage clusters, across types of storage cluster. The rollout time is extended as well. Previously, it was just
going as fast as it could. So I think it went through
US Central in around an hour. Now this happens over a 24-hour period. And in addition to all the sequencing, we have the health checks in the job, in the workflow here, that is monitoring for
these failures all the time. And it will halt the job if
there were to be any failures. And then of course the
other piece is making sure that we're not generating a partial list. So the checks for any
missing source information are now in place as well. So even if we had a repeat of the backend infrastructure network accessibility issue that caused this job to generate a partial list in the outage, now we would stop the workflow in response to any failures in getting the source information.
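The shape of that completeness check, sketched with hypothetical names and thresholds (the real workflow has far more context than this):

```python
# Hypothetical guard: refuse to publish a new allow list if the source-of-truth
# inventory could not be read completely, or if the candidate list shrinks
# suspiciously compared with what is currently deployed.
from typing import Optional

def build_allow_list(inventory_ranges: Optional[list[str]], current: list[str]) -> list[str]:
    if inventory_ranges is None:
        # Any failure to get the source information stops the workflow outright.
        raise RuntimeError("source of truth unreachable; halting allow-list workflow")
    if len(inventory_ranges) < 0.9 * len(current):
        # A list dramatically smaller than the deployed one is almost certainly partial.
        raise RuntimeError("candidate allow list looks partial; halting allow-list workflow")
    return inventory_ranges
```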
Longer term, we want to move this configuration job out of the buildout workflow entirely. You know, ideally, it wouldn't be there and be so time critical. So that's a longer-term repair, to look at a different mechanism here. And then in terms of other learnings, we are doing, or we have done, another scrub of areas like buildout, for example, to really
make sure there's nothing else that has somehow snuck in
there that should be going through a different deployment mechanism or has missing SDP checks. So we've obviously done a
lot of investigation there to make sure we're not missing
anything else in this class. - That's great. Thank you, Tom. I appreciate you kind of
running us through that. I know you've got specific
ETA dates of, you know, other repair items in the
post-incident review there. And if you are watching us live, I'll encourage you, if you've
got any follow-up questions on anything that Tom's mentioned, any element of the storage incident, any of the repair items, we've got storage subject
matter experts standing by to answer your questions. So please take advantage of that Q&A panel in the top right corner. Tom, thank you, that helps us to cover the storage side of things here. So next up I'd love to move over to Asad. Asad Khan runs our SQL Database team. Now, obviously, Asad, SQL
has a dependency on storage. Those databases have to live somewhere. Could you help us to understand what this incident felt
like to different customers? I understand it varied depending on how they'd
configured their SQL? - Yeah, thank you. So as you said, the
impact was pretty wide, because, obviously, SQL depends
on the storage and the VMs. And it was a region-wide impact. And as a result, both
the management operations as well as the connectivity had an impact and the
databases were unavailable. Now, SQL had its own alerting mechanism, so we were able to detect it
within the first few minutes that the connectivity is dipping. And we reached out to the
customers, we sent the comms. But since this was region-wide
impact across all AZs, the only guidance we could give was that if customers have
their DR strategy in place, they should execute on it. And then there's a part of it where we have a feature where you can also do a Microsoft-enabled failover, where Microsoft will do the failover for them. Now, it has some downsides, and we can go into the
discussions for that as well, but as per the documentation, we did the failover for those customers in the first one hour. - Super, Asad, it's a great talking point, because customers have the option to say, "Hey, I would like Microsoft
to manage my failover." And they don't really get
a say in what happens. And so you'll fail customers
over, as your playbook, but I understand that can be
a heavy lift for customers, and it can introduce problems
depending on the customers. Could you talk a little
bit more about, you know, the advantages and
disadvantages of the managed failover that Microsoft does for the customer? - Right, yeah, no, absolutely. So from the disaster recovery side, there are two parts. One is that within the same region you can choose to run your database across multiple availability zones. And when you do that, there
is no burden on the customer. If one availability zone goes down, the database will continue to be available and we will be using the
other availability zones. Now, if there is a region-wide outage, which happened in this case, and all availability zones are impacted, then the option is that
you have to set up a DR in a peer region. And obviously, this you
have to do ahead of time. And in that case you have, think of it, you have an async copy of that database that is in a different region, and we are continuously doing the sync between the primary and the secondary. Now, you can also choose to
have it as a read replica, and there are other options as well, but in case of a disaster,
a customer can choose to say, okay, the database is
unavailable for whatever reason. It could be region outage, it
could be one database issue, it could be SQL engine issue, connectivity issue through
the gateway, anything, but my database is unavailable
so I want to do a failover. When you do the failover,
this becomes your secondary, and the primary becomes what was originally your secondary. And that is how you get
back the availability for the database. Now, originally, we were not very clear in terms of which one is
a more preferred option. We were leaving it to
the customer to decide whether they want to do the failover or they want Microsoft to
automatically do the failover. And our documentation says
that if there is an outage and it is longer than one hour, then we will do the failover. Now, the thing is that
it is a failover group, which means it is not
about a single database; it is a group of databases. So there is some gray area, like, should we do a failover if half of the databases are unavailable, or should we do the failover even if a single database
is not available? The second thing is that when we do the failover, we will do the friendly failover, which means we will ensure that the secondary has fully caught up with the primary; but if that does not go through, we will do a forced failover, because at that point the customer is saying that availability takes precedence, even if not all data is fully saved.
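Sketched in Python with hypothetical helpers (this is not the actual Azure SQL implementation), the friendly-then-forced decision being described looks roughly like this:

```python
# Illustrative sketch: attempt a "friendly" failover that waits for the secondary to
# catch up, and fall back to a forced failover if availability must win.
import time

def secondary_caught_up(group: str) -> bool:
    # Stand-in for checking replication lag on the failover group.
    return False

def failover(group: str, allow_data_loss: bool) -> None:
    print(f"failing over {group} (allow_data_loss={allow_data_loss})")

def mitigate_outage(group: str, grace_seconds: int = 300) -> None:
    deadline = time.time() + grace_seconds
    while time.time() < deadline:
        if secondary_caught_up(group):
            failover(group, allow_data_loss=False)   # friendly failover, no data loss
            return
        time.sleep(10)
    # Secondary never caught up: availability takes precedence over the last writes.
    failover(group, allow_data_loss=True)             # forced failover, may lose recent data
```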
So now you can see, we are making decisions on behalf of the customers, and it is better for customers to make those decisions themselves. The last part is that obviously an app contains a lot of components, and just failing over the database means that it might be that your compute is running in a different region and your database is
in a different region, may not be ideal for many reasons. Obviously, the latency
itself plays a big role. And the last part is that even
when the original database comes back online or the
region has fully recovered, then we don't failback. Because now, as I said, like, it might be that you
are running the entire stack in a new region. So in summary, what we have seen is that customer-managed
DR is always preferred. We still provide you the
Microsoft-managed DR thing, but more and more through documentation and the product experience, we are guiding customers
how to set up their DR and how they take control of that thing. Because, as I said, like,
it's a very rare operation that you have to do, and it's best if the
customer makes the decision on when to do the failover. - That's great, thank you
for clarifying that, Asad. I'm interested, since we're
talking about the failover and the failback anyway, we mentioned in the post-incident review that at one point most of the
databases came back online, but there was some extra manual
work needed to recover them. Since you said we don't do the failback, I presume that was for
the initial failover. Could you help us to understand what kind of hiccups were faced
and how widespread that was? - Yeah, that was the
initial hiccup that we hit. And just to be clear, and we put it in the document as well, for the Microsoft-managed failover, the number of databases
which had that setup, was 0.01%. So it was a very, very small group. And for those databases, a subset of the databases, when we did the failover, the connection was still pointing towards the original primary. And that is something which
is in the repair item, which we will be finishing
in the next couple of months. But that is something, as you called out, it was an additional issue. And then when the region did come back, we were able to recover
80% of the databases in the first two hours. Then in the next hour we had
98% of the databases recovered. And then there was some long tail in terms of getting to 100%. - So thank you very much, Asad. I know Tom went into detail, I mean storage being the
trigger of the incident, into a lot of the
learning and repair items; from a SQL perspective,
I'm interested to know, our customers are interested to know, what have you learned from this and what are the repair items that you'll be implementing
going forward? - Yeah, I think the key
thing is, on our side, a couple of things. One is that from the guidance
to customers, I would say multi-AZ is still
the most important thing. I know it did not help in this case, but trust me, like in 99% of the cases, that is the biggest savior. The second thing is that the DR strategy has to be put in place, whether you do it yourself or
even if you ask us to do it; that one is super critical. On our side, a couple of repair items: one is more clarity, which I was just describing, in terms of how the Microsoft-managed DR works. The other thing is the redirection that you pointed out: there was a subset of the databases where, when the failover happened, the redirection did not happen
on the connection level. And that is a super critical one for us. That one has no excuse, like why the redirection
should not happen right away. The third one is that we
are working very closely with the Azure Core in
terms of how the OS images are put on the storage. And as a result, the storage outage also had an impact on the compute side, and how the ephemeral
disks can help over there. And that is something SQL will
take advantage of right away as Azure Core delivers that capability. - That's great. Thank you very much, Asad. I'll now turn our attention
over to Cosmos DB. Kirill is here from Cosmos DB, and I wanted to start with you. Kirill, I understand
most people use Cosmos DB precisely because it's multiregional and the benefits that it provides here. But I saw in the PIR that
the impact of this outage on different customers depended
on how they configured it. So could you help us to understand kind of which customers felt the most pain relative to their configuration? - Absolutely. First of all, I wanted
to join my colleagues in apologizing for this really unprecedented outage. I don't know of any such outage on Azure earlier, in my memory. And when it comes to Cosmos DB, customers sometimes wonder, given the documentation states that the data is stored on local SSDs, why was Cosmos DB even impacted by this? It's a great question. And I think Asad alluded to this a little bit: while the data is stored on local drives, the VMs that serve the data run on Azure VM scale sets and use storage to store the OS drive. And that's why half of Cosmos DB nodes went down during this outage. And that's something that
Azure Virtual Machines is working on. And we will deliver the capability where the OS drive can be
cached locally on the host so that even if storage
goes down next time, services like Cosmos DB
that don't use storage other than through virtual machines will not be impacted anymore. Now, when it comes to
global database, yes, Cosmos DB has multiple configurations. The golden standard, the true global database configuration, is active/active. This is a configuration where the
application can write and read into any of the regions
enabled for the database. In this case, even if
one region goes down, the application continues to write and read into the remaining regions. Typically, it's a good idea to architect your application as active/active as well. Not every application can be done easily this way; it does require some work. But we have customers, like Walmart, for example, that run fully active/active, right? And so a single region outage does not impact Walmart operations. These customers were not impacted by the Central US outage. Now, there are other ways to configure Cosmos. If your database was enabled with multiregion writes, the application automatically would redirect writes and
reads into healthy regions. For customers who were
configured with active/passive, where reads are global but writes go to one region, if that region happened to be Central US, those customers lost write availability. In this case, either Microsoft or the customer has to perform an operation to move the write region somewhere else, or, as we call it, to offline the region.
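To make the difference between these configurations concrete, here is a small, purely illustrative model (hypothetical types, not the Cosmos DB SDK) of what happens to read and write availability when a region goes down and when it is offlined:

```python
from dataclasses import dataclass, field

@dataclass
class CosmosAccount:
    regions: list[str]
    write_regions: list[str]          # active/active => every region accepts writes
    down: set[str] = field(default_factory=set)

    def can_read(self) -> bool:
        return any(r not in self.down for r in self.regions)

    def can_write(self) -> bool:
        return any(r not in self.down for r in self.write_regions)

    def offline_region(self, region: str) -> None:
        # The "offline the region" operation: move writes away from the failed region.
        self.write_regions = [r for r in self.regions if r != region and r not in self.down]

active_passive = CosmosAccount(["Central US", "East US 2"], write_regions=["Central US"])
active_passive.down.add("Central US")
print(active_passive.can_write())          # False: write availability lost
active_passive.offline_region("Central US")
print(active_passive.can_write())          # True: writes now go to East US 2
```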
Typically, Microsoft does it. Recently we've learned, as Asad alluded to, that in many cases we are making decisions on behalf of the customer, and that's not always optimal. And so we already rolled
out to some customers the capability to offline the regions themselves. And it's proven that it's a better strategy. It's always a better strategy if the customer knows exactly what's good for them. And in this particular case, it was actually the better strategy because customers were able
to more efficiently do this. Now, some customers did it; for many customers, Microsoft failed over or offlined Central US region for these active/passive accounts. We first had to offline, failover our control plane, and we started offlining Central US region for active/passive accounts
within the first 40 minutes. So it was very quick. The outage was caught within
a minute of the impact, and the process is semiautomatic, and we started offlining the
regions for impacted databases. Now, because some of
our internal automation was also impacted, because it was really an unprecedented outage, this automatic detection of whether a database is impacted or not had flaws. As a result, we made a decision to
prioritize database accounts that were reported through support. Now, that strategy also has flaws, because sometimes customers
cannot even reach support in these situations, right? It's not always possible because, you know, the portal may be down or some other things can happen in these outages. So customers who were able to offline the regions themselves were in the best position. We offlined, and 95% of those offlines that we performed went smoothly. It takes roughly 10 minutes on average to offline a database account. Then of course there was some subset of database accounts for which the operation had to be rerun. And I think the document
describes two configurations. It was a fairly small subset of customers, but there were two configurations: one was some MongoDB API customers, where we had to redo the failover. And then when we failed back, some of the private endpoint customers were affected during failback, which required us to redo the failback. Of course, if customers do not use the global database capabilities of Cosmos DB and are in a single region,
for a region-wide outage, they were impacted and didn't
really have much recourse until the availability of the region came back, the storage came back and VMs came back, and we restored availability of Cosmos DB. - That's great. Thank you, Kirill. I was gonna ask you more about the failover and the failback, but I'd say you've covered that really well. I wanted to ask about the
learnings and repairs, particularly because there's one that I had to read three times, and I'm hoping you can
help me to explain it here. One of the learnings was adding automatic per partition failover for multiregion active/passive accounts. Could you explain that so that
my mom could understand it, and any other kind of learnings or repairs from a Cosmos DB side? - Absolutely. Generally, for global
distributed databases, when the data plane decides to mitigate, it's always the best, because it scales, it has no impedance, and it can make local decisions about what is best for this particular node, for this particular partition, versus a control plane, a separate service looking at the data plane, with lots of failover modes in between. So per partition automatic failover effectively allows us to only fail over parts of your database, only those partitions that are impacted, automatically, based on the view that a quorum of observers
has for that partition, which is a lot more precise than us guessing based on telemetry or even customer guessing based on what they see
in the portal, et cetera. It's immediate, it's real time. We have exact view, a quorum
view of what's going on with the partition, and we can failover. So that happens automatically without any intervention
from the customers, and it happens only for
those subset of the data that is impacted, without having to failover
the entire set of accounts. Again, this capability only makes sense for the active/passive set of accounts: it's best if you all just use active/active. And more and more you can; it used to be that there was some cost impedance, but with the dynamic autoscale capability, that cost impedance has practically gone away. It's a fairly cost-effective way to achieve five-nines resiliency and never have this headache. But if your application does require the semantics of active/passive, this per partition automatic failover is gonna dramatically change the experience during, hopefully never happening, outages like this, where it's done directly by the data plane, directly by the nodes based on the quorum, and the application doesn't have to do anything.
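A toy sketch of that idea, with hypothetical data shapes rather than the actual Cosmos DB service: a quorum of observers votes per partition, and only the partitions a majority sees as unhealthy get their writes moved.

```python
def quorum_unhealthy(votes: list[bool]) -> bool:
    """True if a majority of observers report the partition as unhealthy."""
    return sum(votes) > len(votes) // 2

def plan_failovers(partitions: dict[str, list[bool]], passive_region: str) -> dict[str, str]:
    """Return only the partitions whose write region should move; the rest stay put."""
    return {
        partition_id: passive_region
        for partition_id, votes in partitions.items()
        if quorum_unhealthy(votes)
    }

# Only partition "p2" is failed over; "p1" stays where it is.
observer_votes = {"p1": [False, False, True], "p2": [True, True, True]}
print(plan_failovers(observer_votes, "East US 2"))   # {'p2': 'East US 2'}
```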
- Super. Thank you very much, Kirill. It's fascinating to hear about Cosmos DB. It's a globally distributed database; if you configure it the right way, you can be immune to a lot of outages. That's great to hear. Thank you. I'd like to turn my attention to Scott, who works with Azure DevOps. Scott, we have Azure DevOps, or ADO, running in multiple regions; and in this case the
impact was to Central US, but was there impact beyond
Central US for Azure DevOps? - Yeah, unfortunately, yeah, there absolutely was impact to customers outside of Central US. So Azure DevOps, it's like
a global service, right? It's available pretty much everywhere. Central US isn't really like
a central point of failure, although it kind of felt like
it for some customers here. So some historical reason for that, like, DevOps has been
around for a long time. It's gone through a lot of growth. It started out in the US only; actually, it started out on-premises, and we migrated to the
cloud a long time ago. But now it's available worldwide, and we have scale units I think in like eight geographies or so. And without getting into
too much sausage making, an Azure DevOps organization, like when you use Azure DevOps, like, you have a hosting location, which is basically where like
the portal service lives, this is the web app that you see when you use like pipelines
or Git or work items, that's hosted in a particular region. And outside of that we have a bunch of other
supporting services. Some of these are critical,
some of them are not, that serve the overall experience. And for some customers, you know, for historical reasons,
right, like as we've grown, not all the data has migrated
out of some of these things, but not all the data is in the region that your portal services; some of it might still be in Central US. In other cases we have
architectural concerns. Like, we specifically
store your profile data in either the US, the EU, or UK. And if you're not in the US, EU, or UK, your profile data basically
defaults to the United States; which means, for example, customers in Brazil have
a pretty good chance of being impacted when
a US region is down, because their profile data goes to the US. So considering all that, it's not really impossible
for a full outage in something like Central US to be felt by customers in another region. Most of the impact that we saw outside of the US was in Brazil, but there were definitely
customers in Europe that felt this too. The way that impact manifests
depends on, you know, which one of these supporting services might be in that failed region or which scenario you're
looking at, right, 'cause, you know, some scenarios
don't touch certain data, but it's definitely possible. Again, it's not super likely, like the last time I ran the numbers, I think like less than 5% of our customers that were housed or hosted
outside of Central US had some kind of impact. But yeah, that doesn't really help, but it doesn't feel unlikely
when it happens to you, right? And I'd like to apologize too, by the way. You know, I've been using
Azure DevOps for like 12 years. Our whole team has been
using it for a long time. We love the product. We feel it when, you know, when it doesn't work for customers. You know, we use it day to day, so it's not great when it goes down. - Right. Yeah, thank you, Scott. Thanks for explaining that. So like you say, for a lot of customers, they didn't feel the pain, but for the ones who did, that this was really impactful here. So can you help us to understand some of the learnings or
repair items from the ADO side? I saw in the post-incident
review we mentioned, migrating metadata. I wasn't sure if that's
on us, as Microsoft, to do the migration. It sounded like it's also
possible for customers to migrate that metadata themselves? - So this would be something
that Microsoft would do, right? Like, we have, you know, we have a list; we know what data's in
the wrong spot, right? So we've been working on this. We call it a reconciliation, you know moving this data out of, you know, where it is, for whatever reason, whether it used to be there
or it got placed there, you know, for whatever reason. But we're reconciling those.
It's an ongoing process. I'm not super clear on the ETA. It's gonna be a couple months, probably. We'll speak to that more I think in the PR when we finally publish it. We will have recommendations
for customers. And like it is possible for
you to change, you know, your hosted region in Azure DevOps, and it's not a trivial option. It may incur downtime
depending on how big you are. I wouldn't recommend
that as like a mitigation for this particular scenario. I think the better thing
to do is for Azure DevOps to actually make sure that
we do the right thing, and, you know, have customer data where they expect it to be, you know, to eliminate
some of these dependencies we have on certain services that may or may not be
critical for certain scenarios. You know, some cases, like, for example, profile data, right? Like, you know, it's not
distributed everywhere, right? So that might be something
that we look at too, is, you know, maybe spinning
up some more scale units of some of these services so that they can better
serve other regions. And then, like, you know, we also, you know, as part of all the
deep dives we've done so far, digging into like, you know, what a certain impact looks like, you know, we have definitely
have opportunities to like, you know, just behave better in the case of an outage, for things especially
that aren't critical. A lot of customers who saw like, you know, not necessarily failures but like just delays in the experience, were basically because we have some spots where like we'll do like
an exponential backoff for something that isn't really critical. Something that we could
have just like, you know, flipped a circuit breaker,
skipped that for now, you know, would've provided
a much better experience for a lot of customers. So those are the repair
items we're looking at. - Super, thank you very much, Scott. Dale, if I could bring
you into the mix now. Thank you for waiting so long. Dale, there was impact to M365, which is our SaaS solution that, you know, that's running in Azure. And SaaS in my mind, there
should be minimal impact. But I'm so interested to know
that there was impact to M365. And could you please tell us
how did your systems respond and how did your people respond when you have an outage caused by Azure? - No, thank you, Sami. So I would say, you know, we saw varying levels
of impact within M365, and the reason for that
is, there's a a few things. First and foremost of course is that we have dependencies within M365 on kind of foundational services that we run behind the scenes that help support the main
services customers see. In addition, of course,
we also have dependency on things like Cosmos DB
and SQL Server and Azure, which of course you've
already kind of heard what some of the impacts there were. But fundamentally, because
our services, you know, are architected different ways and serve customers in some
different methodologies, we did see quite a variety of impacts. So just to kind of give some
specific examples, you know, if you look at something
like Veeva Engage, that was probably the thing where we saw the broadest level of impact. And the reason for that
is that Veeva Engage is hosted generally out
of the North America and European regions,
regional data centers, and so this issue had global impact because of course if you were hosted in the European data centers, you wouldn't have seen impact at all. However, we do have customers
in the Americas, Japan, Australia, and other
Asia Pacific countries that would have experienced the impact, not only because of Veeva Engage being run in US data centers, but also because of the fact that it was during their business hours. And so that was probably the
single broadest impact we saw. But also, you know, we did see some fairly significant
impact in Teams as well, primarily because users
were unable to join or initiate calls or meetings. Some of this functionality was also related to if you
were in a call or a meeting. So examples of not being able
to mute or unmute people, raise hands, remove users; also, presence information was affected. And even once Teams recovered, we actually continued to
see some residual impact with presence information. And so that's something
that the engineering teams are looking at to try and resolve. The admin center was affected, although that one's a
little bit different. I mean, you basically had some users who were unable to
access the admin center. But I do wanna call out that even at the height of the issue, we were still seeing an 85% success rate of people being able to
connect to the portal and be able to use
things within the portal. And so certainly not to
minimize the impact there, but we were still seeing most transactions to the portal succeeding. And so one issue that we did see there specifically as well though, is that we as a communications team failed to articulate when the admin portal had actually been recovered. And so, you know, for customers, that's not a great experience, because you're gonna, you know, not be looking at the admin portal, you're gonna believe it's still affected, you're not gonna be there, you're getting the updated
communications on the SHD, you're making the changes that
you might want to be making to support your services,
that sort of thing. And then the final big service, and then I'll call out
some of the other ones, is we saw some impact within
SharePoint and OneDrive. Now, this impact was
primarily very regional. It was very located in the Central US. Basically, we had about
25 scale units or so that were affected. However, there is a good story on this, is that our telemetry did detect the issue and it began automatically
failing over the service. So SharePoint and OneDrive
saw about a 45-minute impact. And that's obviously
longer than we'd like, but it still was able to
self-recover to a large extent, which was great. I do not want to minimize that there were impacts
in other M365 services: you know, Intune, OneNote,
Defender, Power BI are all examples of other services. And if you look at the
post-incident review, you can see some of
the timeframes on those and their various recovery. And like I said, you know, a lot of this is dependent
on which, you know, foundational shared services
are being used, you know, how quickly we have an
ability to perform a failover, being able to perform a
failover in a safe way, that sort of thing. In terms of people reaction, M365 engineers very quickly
did respond to the issues; each of the product teams
got engaged quickly. You know, in some cases, like
with SharePoint, OneDrive, to make sure that the
automated failover tasks were taking place; in other cases, like Teams and admin center, being able to actually
manually take some actions to help mitigate the impact
as quickly as possible. So I think there was a
lot of good work there, but also we do have a
number of repair items that we're calling out
for each of the services. - Got it. That's really
helpful to understand, Dale. Thank you for running us
through those different impacts. I did have a question around your kind of learnings and repairs. It sounded like some of your services will have some architectural repairs, but you also mentioned some
of the incident comms team. Are there process repairs
associated with that as well? - Yeah, I'll start with
the incident comms part. I mean, so first and foremost, whenever we have a
multiservice issue, you know, one of the challenges
but also opportunities is to adequately and accurately explain what is the status of each
of the individual services. And so in this case we've got some items that we're gonna try to address, to try and make sure that we're really keeping
updated statuses going out and mitigating in the SHD
or in our Twitter feed, X feed, sorry, that
whenever we have something that gets resolved. You know, that's an ongoing learning and something that we
will continue to focus on to try and make sure that we're giving accurate state of affairs
within the service. - Thank you very much. And Dale, just as you
are kind of representing Microsoft 365 incident comms, I happen to be standing next to the head of our Azure Incident Communications team. So at this point I generally
turn my attention to my co-host and ask Sami: "How did Azure do? Did we communicate? Did we keep our customers up to date?" - It's interesting,
and it's a mixed story. We have our brain system, which is our automatic
communication system to impacted customers. This fired pretty early, and we started communicating
to impacted customers using certain services: not all, but for a vast majority. One of the challenges we have with brain, it's not great at correlating impact. So it says, okay, these
virtual machines are impacted, these SQLs are impacted, these Azure Databricks are impacted, Cosmos is impacted, and
it sends them separately. So there are some customers
who are using multiple services who have received multiple tracking IDs, and they weren't strung together. When the comms team come in
and they wanna update it, it's very difficult for our systems to update all of these tracking IDs and link them at the same time. So there would've been some customers who would've had a
tracking ID or an event; they may have had the first communication and then, as bad as it sounds, only the PIR at the very end: that tracking ID wasn't updated in between. For the most part, we did try
to keep customers updated. It was around 10:56 when
we went to the status page, 'cause we realized that impact was growing and we weren't able to communicate
to everybody effectively. Brain was still working, but
we went to the status page. In the PIR, we spoke about how, at about 45 minutes past midnight, we knew what the issue was, but we didn't communicate this until 1:20 or so. And even when we communicated that we knew what the issue was, we didn't detail what the issue was, and it was almost like
we kept it a secret. Part of this is fog of war, part of this is getting information. When we have such a broad outage with so many services impacted, it's important that we're
trying to relay information that's helpful to customers
as much as we can. The really important thing to note is that we need to do a better
job in telling customers what we know and what we don't know. I thought it was great that
we had paused deployments, as Tom Jolly mentioned earlier. We started looking at
networking deployments. We paused everything. We ruled out networking deployments, we turned our attention somewhere else. It's important that we convey this to the customer during the outage. It's important that they
understand what we ruled out and what we're still investigating. In addition to this, we should tell customers
what we're doing about it. Are we rolling back? Are
we updating our allow list? Are we failing forward? This will allow customers
to make decisions as well. And then lastly, we need to be better at sharing what we're seeing as
a result of our actions. Are we seeing signs of recovery? Does it mean customers
can have a long lunch? Should they send people home early? Should they pack up for the day? All of these things, all of these decisions customers can make when they're informed. And so while we did tell
customers, in time it was laggy; there was a lot we didn't know, there was a fog of war because of the scale of this outage. We have learnings to go back and say our systems need
to be able to communicate in a more coherent way, but we need to do a better
job of telling our customers everything we know at the time. And so there are some pieces. There are other pieces that are a little bit tricky and nuanced. Azure DevOps has its own
status page which it leverages. And the Azure status page as
well has a line into DevOps. And a lot of those times, because a lot of Azure DevOps customers don't use subscriptions,
for the most part, they don't use the portal, which means that Azure Service Health and all the benefits that come with using Azure Service Health isn't relevant for Azure DevOps. But the idea of having
two sources of truth or two places where customers
go to look at it is a problem. And so we're thinking on
the Azure status page, we should just signpost to Azure DevOps, and then customers can
make their decision there. So overall I think there
are lots of learnings, lots of repairs, both from a systematic point of view, from a cultural point of view and being able to share
information as it comes along. Saying that, it's easy in hindsight, and hindsight is always 20/20. But looking back, these
are some of the issues. - Thank you for watching this
Azure Incident Retrospective. At the scale at which our cloud operates, incidents are inevitable. Just as Microsoft is always learning and improving, we hope our customers and partners can learn from these too. We provide a lot of reliability guidance through the Azure
Well-Architected Framework. - To ensure that you get
post-incident reviews after an outage and invites to join these
livestream Q&A sessions, please ensure that you have Azure Service Health alerts set up. We are really focused on being
as transparent as possible and showing up and being accountable after these major incidents. - Whether it's an outage, a
security or a privacy event, Microsoft is investing
heavily in these events to ensure that we earn, maintain, and at times rebuild your trust. Thanks for joining us.