Redmond tries the "turn it off and turn it on again" trick after UX improvement update breaks M365 Portal
Isn't it ironic?
Users hosted in Africa, Europe, and the Middle East were struggling to access the Microsoft 365 Admin Portal today amid widespread issues triggered by a buggy update making it to production. Microsoft said it was having to manually restart infrastructure to help roll back the update that appeared to have caused the issues.
"The revert of the update is taking longer than expected to complete. We're continuing our efforts to resolve the issue by also manually restarting the affected infrastructure to expedite the recovery" the company said shortly after 11am on February 3 as customers queried the regular time-outs that they were seeing on the portal.
An earlier Microsoft update had noted, in a line worthy of a well-known Alanis Morissette song, that a "recent service update designed to improve user experience is causing impact". That line was rapidly updated.
"While we're focused on remediation, admins can use the following: Azure portal for User management, Licence management. Exchange admin center for Mailbox migration and Distribution lists. Exchange admin center, Teams admin center and other service specific admin centers can also be accessed directly through their links" Microsoft added in an update at 11:46 GMT. Reports of issues stretched back to early evening on Wednesday.
It's been a bumpy start to the year for Microsoft on the resilience front: recent outages include a January issue stemming from Azure Resource Manager, blamed by the company on "a code modification which started rolling out on 6 Jan 2022 [that] exposed a latent defect in the infrastructure used to process long running operations... over the course of hours, the job executions shifted entirely away from the regions that had received the new code to their backup paired regions. The impact spread to backup paired regions as the new code was deployed, resulting in job queue up, latency delays, and timeouts. In some cases, the jobs executed with such prolonged delays that they were unable to succeed, and customers will have seen failures in these cases.
"As a result of the way that the job execution infrastructure was implemented, the compounding failures were not visible in our telemetry -- leading to engineer’s mis-identifying the cause initially and attempting mitigations which did not improve the underlying health of the service," Microsoft admitted last month.
Just yesterday (February 2, 2022), meanwhile, Microsoft admitted that "customers using Azure AD may have experienced failures when performing any service management operations", blaming (quelle surprise) "a recent change to a dependent service [that] caused the impact", which it promptly rolled back.
A 2021 RCA from Microsoft suggests that, in theory, this shouldn't be happening.
"Normally, changes initially target a validation ring that contains no customer data, followed by an inner ring that contains Microsoft only users, and lastly our production environment" it notes. It blamed a "latent defect" on that occasion too (this time "in our SDP system had corrupted the deployment metadata, and we had to resort to manual rollback processes".) We're not sure why this keeps happening either...
Just how broken is Microsoft's QA? Or are we being unfair? Share your views...