Patch Lady – what gives?

    There is something I don’t get.

    I do get that there are still a lot of people running Windows 7.

    I do get that there is a fair amount of discontent in the technology communities surrounding Microsoft.  I see many complain about the lack of quality in updates, the inability to know exactly what Microsoft is tracking, and the inability to know for certain whether your device will survive a feature update.  All of these are tied to what I’m going to call the traditional desktop model of Microsoft.

    And yet Wall Street, which always has a hyper-focused view of the future, is saying that everything is rosy.  And yet the future of Microsoft is still based on the code that we run in Windows 10.  Granted, it’s a much more slimmed-down, less bloated version of what we run, but it’s still prone to issues.  Case in point: the multi-factor authentication issues of the last few days.

    There were three independent root causes discovered. In addition, gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes which caused an extended mitigation time. 
    The first two root causes were identified as issues on the MFA frontend server, both introduced in a roll-out of a code update that began in some datacenters (DCs) on Tuesday, 13 November 2018 and completed in all DCs by Friday, 16 November 2018. The issues were later determined to be activated once a certain traffic threshold was exceeded which occurred for the first time early Monday (UTC) in the Azure West Europe (EU) DCs. Morning peak traffic characteristics in the West EU DCs were the first to cross the threshold that triggered the bug. The third root cause was not introduced in this rollout and was found as part of the investigation into this event.

    Let me translate:  We installed a software update and it caused an issue.  We weren’t paying attention and it wasn’t until our customers were impacted that we realized we had a problem.
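
    To make that failure mode concrete, here’s a minimal, purely hypothetical sketch (in Python; this is not Microsoft’s code, and the queue size and threshold are invented for illustration) of how a defect can lie dormant until traffic crosses a threshold: the kind of bug that sails through a staged rollout and then falls over at Monday-morning peak in the busiest region.

        # Hypothetical illustration only: a frontend buffer sized for the load
        # seen in testing, with a latent defect that activates above that load.
        from collections import deque

        MAX_PENDING = 1000   # capacity chosen for "typical" traffic during testing
        pending = deque()    # challenges waiting on the MFA back-end

        def handle_auth_request(request_id: int) -> str:
            """Queue an MFA challenge; only misbehaves once the queue is full."""
            if len(pending) >= MAX_PENDING:
                # Latent defect: instead of shedding load gracefully, the
                # frontend gives up entirely once peak traffic is reached.
                raise RuntimeError("frontend overwhelmed")
            pending.append(request_id)
            return f"challenge sent for request {request_id}"

        # Below the threshold everything looks healthy, so the rollout proceeds...
        for i in range(MAX_PENDING):
            handle_auth_request(i)

        # ...and the defect only surfaces when peak traffic finally crosses the line.
        try:
            handle_auth_request(MAX_PENDING)
        except RuntimeError as exc:
            print("outage begins:", exc)

    The specific data structure doesn’t matter; the point is that low-volume validation can’t catch a failure that only exists above a load level the tests never reach.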

    Gentlemen… that’s what you promise when we move to the cloud.  That YOU are in charge of the updating and can fully monitor and ensure that nothing like this happens.  Yet you blew it.  With a piece of software/policy (multi-factor authentication) that is a must-have for anyone running anything on cloud services.

    Then a few days later you blew it again:

    As described above, there were two stages to the outage, related but with separate root causes.

    • The first root cause was an operational error that caused an entry to expire in the DNS system used internally in the MFA service. This expiration occurred at 14:20 UTC, and in turn caused our MFA front-end servers to be unable to communicate with the MFA back-end.
    • Once the DNS outage was resolved at 14:40 UTC, the resultant traffic patterns that were built up from the aforementioned issue caused contention and exhaustion of a resource in the MFA back-end that took an extended time to identify and mitigate. This second root cause was a previously unknown bug in the same component as the MFA incident that occurred on 19 of Nov 2018. This bug would cause the servers to freeze as they were processing the backlogged traffic.

    Let me translate again:  Someone or something probably sent out a wrong PowerShell command, causing the internal domain name system to fail, which in turn caused the MFA system to fail.  Then you had a second, software-induced bug that wasn’t properly diagnosed until your customers were impacted.
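
    In the same spirit, here’s an equally hypothetical sketch of that second stage (again Python, again not the real MFA back-end; the slot count and backlog size are made up): once the DNS problem was fixed, the pent-up sign-ins arrived all at once, and a fixed back-end resource that is never contended under normal traffic suddenly was.

        # Hypothetical illustration only: backlogged traffic exhausting a fixed
        # back-end resource after a short upstream outage.
        import threading

        BACKEND_SLOTS = 8                      # a fixed resource in the back-end
        slots = threading.Semaphore(BACKEND_SLOTS)

        def process_backlogged_request(request_id: int) -> bool:
            """Try to grab a back-end slot; report failure rather than wait forever."""
            if not slots.acquire(blocking=False):
                # Under normal traffic a slot is always free.  When the backlog
                # from the DNS outage arrives all at once, every slot is held and
                # the remaining requests pile up: the "freeze" customers saw.
                return False
            # The slot is deliberately never released here, mimicking the
            # previously unknown bug that kept servers stuck on the backlog.
            return True

        backlog = list(range(50))              # sign-ins queued up during the outage
        results = [process_backlogged_request(r) for r in backlog]
        print(f"{results.count(False)} of {len(backlog)} backlogged requests stuck")

    However you model it, the pattern is the same: the first failure creates a traffic shape the system has never seen, and the second, previously unknown bug only exists under that shape.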

    To me, these two events in close sequence indicate that for all the telemetry that is deemed so effective at allowing Microsoft to monitor, control, and contain issues… it really isn’t as good as it should be.  I’ve always said about telemetry that if it does what it’s supposed to do… allow our vendors to better understand how hard it is to maintain their software… bring it on.  Do more of it.  Disclose to me what you are looking at.  But stop using me as your beta tester, and learn ahead of time not to blow me up.

    Authentication has to be rock solid.  Multi-factor authentication, even more so.  And communication regarding the impact could have been better: I saw many saying that they had a hard time finding information about this outage.  Bottom line, Microsoft blew it.  Investors may think things are wonderful, but on the technology side, this wasn’t a good week for Microsoft.

    If you were impacted and want to provide feedback on how they should make communication better, take the survey.