Show Notes
Azure AD utilizes keys to support the use of OpenID and other Identity standard protocols for cryptographic signing operations. As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use. Over the last few weeks, a particular key was marked as “retain” for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that “retain” state, leading it to remove that particular key.
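To make the bug concrete, here is a minimal sketch of what a time-based key cleanup that honors a "retain" flag could look like. This is not Microsoft's automation: the `SigningKey` fields, the `keys_to_remove` helper, and the 30-day idle threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SigningKey:
    kid: str             # key identifier (hypothetical field name)
    last_used: datetime  # last time the key signed anything
    retain: bool = False # operator override: keep the key even if idle


def keys_to_remove(keys, now, max_idle=timedelta(days=30)):
    """Return keys idle longer than max_idle, skipping any marked retain.

    The 30-day default is an assumption for this sketch only.
    """
    return [
        key for key in keys
        if not key.retain and now - key.last_used > max_idle
    ]
```

Dropping the `not key.retain` check reproduces the failure mode described above: an idle key that was deliberately held back gets swept up with the rest.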
Metadata about the signing keys is published by Azure AD to a global location in line with Internet Identity standard protocols. Once the public metadata was changed at 19:00 UTC on 15 March 2021, applications using these protocols with Azure AD began to pick up the new metadata and stopped trusting tokens/assertions signed with the key that was removed. At that point, end users were no longer able to access those applications.
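The reason a metadata change locks users out can be sketched from the application's side. In JWT-based protocols, a token header carries a `kid` (key ID), and the published key set is a JSON document with a `keys` list. The `find_signing_key` helper below is a hypothetical illustration of the lookup step only; it does not verify signatures, and real applications would normally rely on an identity library rather than hand-rolled code.

```python
def find_signing_key(token_header, published_key_set):
    """Return the published key whose kid matches the token header.

    Raises LookupError when the key is absent from the metadata, which is
    the point where an application stops trusting the token.
    """
    kid = token_header.get("kid")
    for key in published_key_set.get("keys", []):
        if key.get("kid") == kid:
            return key
    raise LookupError(f"no published key with kid={kid!r}")


# Illustrative data only: the kid values are made up.
before_removal = {"keys": [{"kid": "key-1"}, {"kid": "key-2"}]}
after_removal = {"keys": [{"kid": "key-2"}]}
header = {"alg": "RS256", "kid": "key-1"}

find_signing_key(header, before_removal)  # succeeds
# find_signing_key(header, after_removal) would raise LookupError
```

Once an application refreshes its cached copy of the metadata, every token signed with the removed key fails this lookup, even though the token itself is otherwise valid.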
Service telemetry identified the problem, and the engineering team was automatically engaged. At 19:35 UTC on 15 March 2021, we reverted deployment of the last backend infrastructure change that was in progress. Once the key removal operation was identified as the root cause, the key metadata was rolled back to its prior state at 21:05 UTC.
This is the second time in six months that Azure AD has gone down.
The previous outage happened 6 months ago. These are growing pains for Microsoft's cloud endeavors, and the ops teams involved need #hugops. Part of Microsoft being the "safe bet" for enterprises is being stable, and two enterprise outages in 6 months is a lot.