• Bad antivirus definition triggers shutdowns


    • This topic has 51 replies, 25 voices, and was last updated 10 months ago.
    #2689055

    ISSUE 21.29.1 • 2024-07-20 By Susan Bradley It was a really bad day for IT admins. Late Thursday night, the security protection company CrowdStrike se
    [See the full post at: Bad antivirus definition triggers shutdowns]

    Susan Bradley Patch Lady/Prudent patcher

    Viewing 31 reply threads
    • #2689069

      Why were so many Microsoft cloud services affected? Do they all use CrowdStrike?

    • #2689084

      Hopefully respondents to this Bad antivirus definition triggers shutdowns thread will focus on the technical issues related to the recovery from the CrowdStrike outage vs. a free-for-all pointing fingers at Microsoft and others impacted by the disaster.

      1 user thanked author for this post.
    • #2689090

      Kudos to Susan! This post is absolutely the best explanation of the several articles I’ve read.

      Her first paragraph explains in simple terms what happened — something that countless other writers couldn’t do. She got past the mumbo jumbo of corporate speak, which other technical and mainstream writers seemed to accept.

      9 users thanked author for this post.
      • #2689221

        Totally agree. It was good to read Susan’s article as it was from somebody I trust to explain what happened.

    • #2689098

      Susan,

      Excellent, clear, concise, cogent. I wish I had more adjectives to describe how well the article is written.

      Please give consideration to making this article available to all. I’d love to be able to share it. It would also be a good advertisement for plus membership IMHO.

      Thanks again!

      May the Forces of good computing be with you!

      RG

      PowerShell & VBA Rule!

      6 users thanked author for this post.
    • #2689060

      My company’s call center took eight times the usual number of calls for a Friday (nearly 800 calls).

      Some devices responded after multiple restarts if the device was connected to the Internet and a fix was automatically pushed to it. Selecting the last known good system restore point also worked for some customers.

      The difficult part was walking users through getting a device launched into safe mode with networking. Admin credentials usually were needed to delete the offending file if the device was managed.
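For readers wondering what "the offending file" was: CrowdStrike's published workaround was to delete the channel file matching C-00000291*.sys from the CrowdStrike drivers folder. The sketch below only illustrates that file match and lists candidates without deleting anything. The function name is made up for illustration, and in practice the step was done by hand from Safe Mode or the recovery environment, not with a script like this.

```python
from __future__ import annotations

from pathlib import Path

# Pattern and folder from CrowdStrike's published workaround.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")


def find_offending_files(directory: Path = DEFAULT_DIR) -> list[Path]:
    """Return files matching the faulty channel-file pattern (dry run only)."""
    if not directory.is_dir():
        # Folder missing: CrowdStrike is not installed here, or wrong path.
        return []
    return sorted(directory.glob(CHANNEL_FILE_PATTERN))


# Example use: list the candidates and leave deletion to a human.
# for path in find_offending_files():
#     print(path)
```

Listing first and deleting by hand keeps a mistyped pattern from removing the wrong driver files.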

      I think it may take a week or more for tech departments to restore all the affected machines.

      2 users thanked author for this post.
    • #2689075

      Thank you, Susan. This is extremely helpful information for friends of mine who are suffering the BSOD but don’t have corporate IT departments to help them out of it.

    • #2689109

      I was about to post a reminder of the McAfee-related Windows meltdown from several years ago. A Google search turned up this tidbit from yesterday:

      McAfee-caused PC meltdown and Microsoft-CrowdStrike outage have a common connection

      CrowdStrike boss previously worked for McAfee

      The Microsoft-CrowdStrike outage impacted major banks, media outlets, and airlines, per reports. Likewise, hundreds to thousands of PCs within single companies were affected during the McAfee issue. Adding to the scrutiny, George Kurtz, co-founder and CEO of CrowdStrike, who is at the center of this current issue, was also the CTO of McAfee during its notorious 2010 glitch with Windows XP. This connection further intensifies concerns over the recurring nature of such critical technical failures under his leadership.

      Moderator Note: Post caught in spam bucket. Also, the fact regarding George Kurtz has already been pointed out in the other thread about the Crowdstrike outage here.

    • #2689114

      From the security people I have spoken with, the problem was not in the initial testing. The original code had information left in for testing that would have compromised their whole product (think security key, password, or the like).

      When that data was discovered, they very quickly removed it and then pushed the update. Supposedly, it was this version that went out unchecked. This emergency cover-my-a$$ move by the team could be what caused this mess.

      4 users thanked author for this post.
    • #2689121

      Susan Bradley Patch Lady/Prudent patcher

      2 users thanked author for this post.
      • #2689124

        Looks fishy:

        6. Once in Safe Mode, right-click Start, click Run, type cod

        2 users thanked author for this post.
        • #2689137

          Better still :

          If the screen asks for a BitLocker recovery key, use your phone and log on to https://aka.ms/aadrecoverykey. Log on with your Email ID and domain account password to find the BitLocker recovery key associated with your device.

          Select the name of the device where you see the BitLocker prompt. In the expanded window, select View BitLocker Keys. Go back to your device and input the BitLocker key that you see on your phone or secondary device.

        • #2689189

          Looks fishy: 6. Once in Safe Mode, right-click Start, click Run, type cod

          Well, it’s no longer fishy, they’ve fixed their tyop which might very well have been the product of autocorrect thinking it knew better than the article’s author!  😉

          3 users thanked author for this post.
    • #2689165

      This has troubling implications for how their antivirus product is implemented, even leaving aside how they released a botched definition update.

      Specifically, for that to cause a BSOD:

      1. The code in their product that parses a definition file to create an in-memory data structure representing a set of virus signatures was not sufficiently validating an outside input, to wit, said definition file. So it didn’t merely go “this entry is syntactically invalid” and move on to the next, maybe leaving a single new virus not detectable but everything else still working.

      2. Worse, that code didn’t even have guardrails against bad memory accesses, going past the ends of arrays, or the like, either provided by the development environment (as in languages like Java, C#, Lisp, Smalltalk) or provided manually (for (int i = 0; i < BUF_MAX && (t = *(p + i)) != END_TOKEN; i++)… rather than just (t = *(p + i)) != END_TOKEN). So a munged definition could result in that code triggering a seg fault or similar exception, and thus an app crash.

      3. And, worse still, the parser ran in freaking kernel mode! Rather than running as normal user-mode code and the kernel mode components only ever seeing well-formed post-parsing data structures. This isn’t just turning an app crash into an OS crash, it’s a potential security hole (maybe a malicious definition update could cause arbitrary code execution … in kernel mode!) and it’s just plain *bad* (you’re doing *blocking IO* in *kernel mode*?! Are you freaking nuts?! What happens if it’s reading definition updates from a flash drive and the user ejects it? What if there’s a bad block where the file got written? *tears hair out*)
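To make points 1 and 2 concrete, here is a minimal sketch of a parser that rejects bad entries and keeps going instead of falling over. It assumes a made-up text format of one `name:hexpattern` entry per line. The real CrowdStrike file format isn't described in this thread, and `Signature`, `parse_definitions`, and the two limits are all invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

MAX_ENTRIES = 1000       # hypothetical cap on entries per file
MAX_PATTERN_LEN = 64     # hypothetical cap on pattern size in bytes


@dataclass(frozen=True)
class Signature:
    name: str
    pattern: bytes


def parse_definitions(text: str) -> tuple[list[Signature], list[str]]:
    """Parse 'name:hexpattern' lines, skipping bad entries instead of crashing."""
    good: list[Signature] = []
    rejected: list[str] = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are fine
        if len(good) >= MAX_ENTRIES:
            rejected.append(f"line {lineno}: entry limit reached")
            break  # explicit upper bound, like the BUF_MAX check above
        name, sep, hexpat = line.partition(":")
        if not sep or not name:
            rejected.append(f"line {lineno}: missing name or separator")
            continue
        try:
            pattern = bytes.fromhex(hexpat)
        except ValueError:
            rejected.append(f"line {lineno}: pattern is not valid hex")
            continue
        if not 0 < len(pattern) <= MAX_PATTERN_LEN:
            rejected.append(f"line {lineno}: pattern length out of range")
            continue
        good.append(Signature(name, pattern))
    return good, rejected
```

A file full of garbage then yields an empty signature list and a pile of rejection messages, which is annoying but survivable. Only the validated `Signature` objects would ever be handed to the privileged component.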

      TL;DR: This *premium product* is *badly designed*. It’s got mistakes in it that one learns to avoid in Programming 101 courses. Mistakes that, in this instance, will cost billions of dollars to clean up after.

      I sincerely hope the impacted businesses rethink their choice of antivirus vendor … and maybe OS vendor as well. I will note that my home PC with W10 and plain ol’ Windows Defender for its antivirus hummed away merrily through all of this with nary a hiccup. Nor a virus.

      6 users thanked author for this post.
      • #2689355

        There’s a HUGE difference between consumer level security software and enterprise level security software that you’re missing.

        Consumer level security software normally scans for and detects/deflects/alerts for intrusions/infections after-the-fact because it’s not directly monitoring the running processes and I/O in real-time. It can’t because it’s not running in kernel mode with kernel level access.

        And that’s just fine for most “individual” users.

        Enterprise level security software has to defend a “large number” of different PCs/servers and detect/deflect/alert for any intrusions/infections in real-time as they happen.

        Waiting for your security software to alert you about something after-the-fact at a business level is way too late: it’s already spread to other PCs/servers, may have caused serious problems, and will be a major PITA to clean up.

        In order to do real-time detection/deflection/alerting, the software must be able to monitor all running processes and I/O interactions while they’re happening and that means kernel mode with full kernel level access.

        Yes, there’s a HUGE security risk in allowing that, but it’s necessary if you want effective real-time protection!

        It also means you need to have absolute trust in the vendor who’s providing the software and, in this case, Crowdstrike just shot themselves in the foot… with a 12 gauge shotgun!!

        2 users thanked author for this post.
        • #2689461

          Consumer level security software normally scans for and detects/deflects/alerts for intrusions/infections after-the-fact because it’s not directly monitoring the running processes and I/O in real-time.

          We run a consumer level security software suite with a large installed base that runs in real-time. The suite includes 12 device drivers running in kernel mode. The file system, network stack, etc., are hooked. Every packet, file write, etc., is checked in real time without any noticeable performance impact. On my own system, I have seen malware caught at the first attempt of infection in real time. Our systems have never had a successful infection.

          Virus signatures etc are updated daily. Unlike some other vendors, we have trusted this consumer vendor to auto update whenever it wants, for the past 20 years.

          Moderator’s Note: Retrieved from spam filter.

          Windows 10 22H2 desktops & laptops on Dell, HP, ASUS; No servers, no domain.

          2 users thanked author for this post.
    • #2689176

      We currently estimate that CrowdStrike’s update affected 8.5 million Windows devices.

      Helping our customers through the CrowdStrike outage — Microsoft Blog

      The number given by Microsoft means it is probably the largest ever cyber-event, eclipsing all previous hacks and outages.

      The closest to this is the WannaCry cyber-attack in 2017 that is estimated to have impacted around 300,000 computers in 150 countries.

      CrowdStrike IT outage affected 8.5 million Windows devices, Microsoft says

      3 users thanked author for this post.
    • #2689175

      Thank you Susan for the excellent run down on what happened and what is happening to get things back in order.

      Much appreciated!

    • #2689198

      Only slightly off-topic:

      As interconnected world-wide systems become ever more complex and those systems rely on ever more centralized, critical, out-sourced sub-systems (for example CrowdStrike), the whole system becomes increasingly prone to widespread collapse. Sort of a ‘house of cards’. Like the recent Linux kernel scare.

      Bad actors are already increasingly focusing on Supply Chain Attacks – as even CrowdStrike themselves document: https://www.crowdstrike.com/cybersecurity-101/cyberattacks/supply-chain-attacks/  (irony noted!)

      How many other ‘weak spots’ are lurking in critical IT infrastructure that could take down such a large part of the modern world in one fell swoop as this single virus definition file did? I hate to consider what an adversarial nation, terrorist organization or anti-technologist subversive could do if something like this was available as their opening salvo.

      I understand that the cost of resilience, redundancy, security and testing all adversely impact the bottom line yet I’d say more consideration and investment in these areas are needed. Besides, wasn’t the internet originally designed with resilience and redundancy as founding goals?

      Win10 Pro x64 22H2, Win10 Home 22H2, Linux Mint + a cat with 'tortitude'.

      1 user thanked author for this post.
      • #2689518

        Like the recent Linux kernel scare.

        What Linux kernel scare? I am a Linux user and have not been informed of anything out of the ordinary in this regard.

        -- rc primak

    • #2689291

      What? All the social media sites are wrong? They are all posting it was because someone at microsoft pulled a USB drive without shutting it down properly.

      4 users thanked author for this post.
    • #2689304

      Not that I’m suggesting using Windows 3.1/Windows 95…but…

      (from Yahoo!Tech, July 19 2024)

      A Windows version from 1992 is saving Southwest’s butt right now

      Nearly every flight in the U.S. is grounded right now following a CrowdStrike system update error that’s affecting everything from travel to mobile ordering at Starbucks — but not Southwest Airlines flights. Southwest is still flying high, unaffected by the outage that’s plaguing the world today, and that’s apparently because it’s using Windows 3.1.

      Yes, Windows 3.1 — an operating system that is 32 years old. Southwest, along with UPS and FedEx, haven’t had any issues with the CrowdStrike outage. In responses to CNN, Delta, American, Spirit, Frontier, United, and Allegiant all said they were having issues, but Southwest told the outlet that its operations are going off without a hitch.

      3 users thanked author for this post.
      • #2689317

        I’m not convinced that wasn’t a joke. I also saw a BSOD on a Samsung fridge, and that Family digital bulletin board doesn’t run Windows. Be careful with articles written based on Twitter posts. There was a post going into detail about the bug that was dead wrong.

        Susan Bradley Patch Lady/Prudent patcher

        1 user thanked author for this post.
    • #2689316

      Bitlocker recovery by Delta

      Delta appear to still be the most impacted

      Susan Bradley Patch Lady/Prudent patcher

      1 user thanked author for this post.
    • #2689320

      Microsoft : Helping our customers through the CrowdStrike outage

      ..We’re working around the clock and providing ongoing updates and support. Additionally, CrowdStrike has helped us develop a scalable solution that will help Microsoft’s Azure infrastructure accelerate a fix for CrowdStrike’s faulty update. We have also worked with both AWS and GCP to collaborate on the most effective approaches…

      We currently estimate that CrowdStrike’s update affected 8.5 million Windows devices, or less than one percent of all Windows machines. While the percentage was small, the broad economic and societal impacts reflect the use of CrowdStrike by enterprises that run many critical services. ..
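The scale figures quoted in this thread can be sanity-checked with a few lines of arithmetic. This sketch uses only numbers that appear in the posts (8.5 million devices, "less than one percent" of Windows machines, and around 300,000 computers for WannaCry). The variable names are just for illustration.

```python
affected_devices = 8_500_000   # Microsoft's estimate for the CrowdStrike update
wannacry_devices = 300_000     # WannaCry estimate quoted earlier in the thread

# "Less than one percent" means the install base is more than 100x the affected count.
min_install_base = affected_devices * 100

# How much larger this event was than WannaCry, by machine count.
ratio = affected_devices / wannacry_devices

print(f"Implied Windows install base: more than {min_install_base:,} devices")
print(f"CrowdStrike vs. WannaCry: roughly {ratio:.0f}x as many machines")
```

So the "less than one percent" framing implies an install base of over 850 million Windows machines, and the outage hit roughly 28 times as many computers as WannaCry did.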

      1 user thanked author for this post.
    • #2689333

      There is something tragicomically ironic about the fact that an A/V company was responsible for this train wreck. The very people who are supposedly protecting you throw a potato into your jet intake.

      It makes me shiver a bit sometimes when I consider what our current level of technological complexity has come to, when such a small chunk of data can bring 8 million users to a halt. (See my tagline.)

      I just can’t wait for something like this to happen with a widespread AI integration…predicted so long ago:

      https://www.youtube.com/watch?v=qjGRySVyTDk

      Win7 Pro SP1 64-bit, Dell Latitude E6330 ("The Tank"), Intel CORE i5 "Ivy Bridge", 12GB RAM, Group "0Patch", Multiple Air-Gapped backup drives in different locations. Linux Mint Newbie
      --
      "The more kinks you put in the plumbing, the easier it is to stop up the pipes." -Scotty

      5 users thanked author for this post.
    • #2689334

      There is something tragicomically ironic about the fact that an A/V company was responsible for this train wreck

      https://www.askwoody.com/forums/topic/crowdstrike-agent-got-updated-and-may-be-triggering-bsods/page/3/#post-2689315

      There is something tragicomically ironic that an OS can’t handle a faulty file and crashes.
      That is how it should have been done

      1 user thanked author for this post.
      • #2689519

        There is something tragicomically ironic that an OS can’t handle a faulty file and crashes.

        It depends on what the file is. A kernel-level security driver is pretty critical on some systems. The McAfee fiasco with Windows XP, SP3 many years ago was a result of a single file being quarantined. The file was svchost.exe. Pretty critical, I think. Linux has kernel level files which are just as critical.

        -- rc primak

    • #2689354

      There is something tragicomically ironic about the fact that an A/V company was responsible for this train wreck

      https://www.askwoody.com/forums/topic/crowdstrike-agent-got-updated-and-may-be-triggering-bsods/page/3/#post-2689315

      There is something tragicomically ironic that an OS can’t handle a faulty file and crashes.
      That is how it should have been done

      Microsoft tried to change this in x64 but Symantec and McAfee lost the plot and accused them of trying to lock them out in favour of Defender. So Microsoft had to open up APIs to allow them to continue their kernel-patching ways. If that hadn’t been forced on Microsoft, then this may not have happened.

      3 users thanked author for this post.
      • #2689500

        So Microsoft had to open up APIs to allow them to continue their kernel-patching ways. If that hadn’t been forced on Microsoft, then this may not have happened.

        The ability to write device drivers for Windows using kernel mode APIs has been a feature of Windows since the beginning. This allows third-party devices and robust third-party software applications to run on Windows.

        Windows device drivers can be device specific and/or software only. Here is a partial list of third-party vendors from a couple of our systems:

        • Intel (devices and software)
        • NVIDIA (device)
        • Macrium Reflect (backup software)
        • Realtek (device)
        • VeraCrypt (encryption software)
        • Qualcomm (device)

        By eliminating kernel mode access for third-party signed drivers, you make things worse, not better, for hardware and software availability on Windows.

        Windows 10 22H2 desktops & laptops on Dell, HP, ASUS; No servers, no domain.

        3 users thanked author for this post.
    • #2689404

      Are major retailers and grocery stores like Walmart, Target, Home Depot, ShopRite, Safeway, and Giant still experiencing IT problems related to the CrowdStrike outage?

      Are others still experiencing problems? If so, who?

      I anticipate that, in addition to problems at checkout, merchants experienced disruptions of their logistics operations: in-store inventory tracking, preparing to restock shelves at the warehouse, as well as transportation issues.

      Are store shelves fully stocked?

      • #2689521

        Are major retailers and grocery stores like Walmart, Target, Home Depot, ShopRite, Safeway, and Giant still experiencing IT problems related to the CrowdStrike outage?

        Some Home Depot stores and some grocery stores did get hit.

        -- rc primak

    • #2689433

      Blue Screens Everywhere Are Latest Tech Woe for Microsoft

      ..CrowdStrike’s bug was so devastating because its security software, called Falcon, runs at the most central level of Windows, the kernel, so when an update to Falcon caused it to crash, it also took out the brains of the operating system. That is when the blue screen of death appeared…

      In 2020, Apple told developers that its MacOS operating system would no longer grant them kernel-level access.

      That change was a pain for Apple’s partners, but it also meant that a blue screen-style problem couldn’t happen on Macs..

      A Microsoft spokesman said it cannot legally wall off its operating system in the same way Apple does because of an understanding it reached with the European Commission following a complaint. In 2009, Microsoft agreed it would give makers of security software the same level of access to Windows that Microsoft gets

      3 users thanked author for this post.
    • #2689442

      “How many other ‘weak spots’ are lurking in critical IT infrastructure that could take down such a large part of the modern world in one fell swoop”

      Cloudflare.

      They’re huge, they’re ubiquitous, and they’re mostly invisible except when they screw up. So far that’s just been isolated random error pages or CAPTCHAs from false-positive “bot” detections … so far. If their infrastructure were ever to go TU, so would half the web.

      Windows and Android.

      One single botched OS patch that caused a reboot loop on a significant percentage of machines would wreak havoc on the economy. Especially if it also trashed My Documents or Bitlocker keys. There already have been W10 updates that screwed with My Documents, so it’s not beyond imagining.

      Root certificate authorities and root DNS.

      A bad enough screwup there could basically knock out much of the Internet, maybe even worse than a worst case catastrophe at Cloudflare. Every site built since when Geocities was a happening place could error out, like the last 30 or so years just never happened. (If only … we could use a do-over on climate change. And US politics.)

      Google

      Hypothetical worst case botched Android updates aside, something like 70% of websites have Google Tag Manager, Google Analytics, Google AdWords, or some other Google cruft integrated into them, usually for reasons of greed. A botched update to any of the associated script files *could* just disable some of the ad or tracking functionality. Unless it had a “gain of function” mutation and it crashed people’s browsers, or made them hemorrhage RAM, or scribbled all over the including web page with an errant document.write(), or …

      JQuery

      Almost as ubiquitous as Google, but only sometimes loaded from a centralized site rather than a local copy.

      There are undoubtedly more, most of them widely included scripts like Google Analytics and JQuery.

      As for the need to run some code in kernel mode, a) I am fully aware of that, b) most consumer AV products do that to do real-time monitoring too, and c) it doesn’t explain (let alone excuse) *parsing unvalidated data* in kernel mode rather than doing that part in user mode and passing only validated, well-formed data structures to the kernel mode components.

      1 user thanked author for this post.
    • #2689497

      Win10 Pro x64 22H2, Win10 Home 22H2, Linux Mint + a cat with 'tortitude'.

      2 users thanked author for this post.
    • #2689530

      ” this faulty data file inserts itself into the Windows kernel”

      Is the file itself faulty in some way apart from said insertion?
      Or is its faultiness in that it inserts?  (i.e. is all insertion always prohibited?)

      1 user thanked author for this post.
      • #2689532

        The file was in the right place but faulty.

        1 user thanked author for this post.
    • #2689630

      You absolve Microsoft from responsibility for OS shutdown because it ‘behaved as required.’
      I disagree. If the OS gets a bad file, why can’t it quarantine the file, go into safe mode, and notify the admin of the problem? The actual response (a BSOD) makes the OS very sensitive (and a preferred target for hackers). Looks like we will see a lot more of these attacks until Microsoft changes its response.

      • #2689637

        That is what it was doing. That recovery mode is its safe mode. Actually, BSODs are not wanted by attackers. They want to be silent on a functional system and steal information and credentials. BSODs are noisy and flag to us admins that there is a problem.

        The reason vendors are allowed into the kernel in the first place is due to past EU action.

        Susan Bradley Patch Lady/Prudent patcher

        5 users thanked author for this post.
    • #2689645

      The Crowdstrike outage and global software’s single-point failure problem

      https://www.cnbc.com/2024/07/20/crowdstrike-outage-and-global-softwares-single-point-failure-problem.html

      The CrowdStrike fail and next global IT meltdown already in the making

      https://www.cnbc.com/2024/07/20/the-crowdstrike-fail-and-next-global-it-meltdown-already-in-the-making.html

      Windows 10 22H2 desktops & laptops on Dell, HP, ASUS; No servers, no domain.

      1 user thanked author for this post.
    • #2689652

      The reason vendors are allowed into the kernel in the first place is due to past EU action

      The EU doesn’t mandate Microsoft’s behavior in the rest of the world.
      Just like there are N and KN… Windows versions for the EU, Microsoft could have done an ‘Apple’.

    • #2689676
      Windows - commercial by definition and now function...
      1 user thanked author for this post.
    • #2689690

      If the OS gets a bad file, why can’t it quarantine the file, go into safe mode, and notify the admin of the problem?

      Michael, let’s try a variation of your fix idea, more generic and applicable to more cases. When you have a blue screen, offer a rollback option.

      The end user will be told to contact IT and ask permission to rollback. The end user will select yes or no and reboot.

      Upon rebooting, the previous restore point will be restored, if the answer was yes. (All kernel mode updates will require the taking of a restore point first.)

      This Crash Rollback option will be useful for certain classes of blue screens [or other failures], to help isolate or temporarily workaround these Windows crashes. (Before first rebooting, the end user will select the number of hours to suspend [kernel mode] auto updates and enter the number recommended. Those products with access to kernel mode will be required to honor the suspension of auto updates.)
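The proposed flow can be sketched as a small decision function. Everything here is hypothetical: Windows has no "Crash Rollback" prompt as described above, and names like `CrashRollbackChoice`, `BootPlan`, and `plan_next_boot` are invented purely to illustrate the idea.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CrashRollbackChoice:
    rollback_approved: bool   # the user's yes/no after checking with IT
    suspend_hours: int        # hours to pause kernel-mode auto updates


@dataclass
class BootPlan:
    restore_point: str | None                 # restore point to apply, if any
    updates_suspended_until: datetime | None  # when auto updates may resume


def plan_next_boot(choice: CrashRollbackChoice,
                   last_restore_point: str | None,
                   now: datetime) -> BootPlan:
    """Decide what the next boot should do after a blue screen."""
    if choice.suspend_hours < 0:
        raise ValueError("suspend_hours must not be negative")
    # The suspension applies whether or not the user rolls back.
    suspended_until = (now + timedelta(hours=choice.suspend_hours)
                       if choice.suspend_hours else None)
    # Only roll back if the user said yes and a restore point exists.
    restore = last_restore_point if choice.rollback_approved else None
    return BootPlan(restore, suspended_until)
```

The sketch depends on the proposal's requirement that a restore point is taken before every kernel mode update. Without one, `last_restore_point` is `None` and there is nothing to roll back to.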

      Windows 10 22H2 desktops & laptops on Dell, HP, ASUS; No servers, no domain.

      1 user thanked author for this post.
    • #2689789

      https://doublepulsar.com/what-i-learned-from-the-microsoft-global-it-outage-d6138c06ebdb

      I would also recommend reading that.

      Reading it hasn’t convinced me that Microsoft is not to blame.

      Of course Microsoft can develop 2 versions of API for a/v software, one for EU one for the rest of the world.

      • #2689985

        “Of course Microsoft can develop 2 versions of API for a/v software, one for EU one for the rest of the world.”  What about a computer that is portable and travels between the EU and the rest of the world?  What version of the software would it need to run?  And the Windows Update process would have to determine what version is running to enable the correct version of a patch to be installed.

    • #2690009

      What about a computer that is portable and travels between the EU and the rest of the world?

      That is no problem. The Windows version has nothing to do with your changing locations.
      You will get updates according to the country and language in your settings.

      Apple has already solved that with Apple Intelligence, third-party app stores…
      It has one system for the EU and another for the rest of the world.

      1 user thanked author for this post.