CrowdStrike IT Outage Explained by a Windows Developer

Published: Jul 20, 2024 Duration: 00:13:40 Category: Science & Technology

Hey, I'm Dave. Welcome to my shop. I'm Dave Plummer, a retired software engineer from Microsoft going back to the MS-DOS and Windows 95 days, and thanks to my time as a Windows developer, today I'm going to explain what the CrowdStrike issue actually is, the key difference in kernel mode, and why these machines are blue-screening, as well as how to fix it if you come across one.

Now, I've got a lot of experience waking up to blue screens and having them set the tempo of my day, but this Friday was a little different, however. First off, I'm retired now, so I don't debug a lot of blue screens, and second, I was traveling in New York City, which left me temporarily stranded as the airlines sorted out the digital carnage. But that downtime gave me plenty of time to pull out the old MacBook and figure out what was happening to all the Windows machines around the world.

As far as we know, the CrowdStrike blue screens that we've been seeing around the world for the last several days are the result of a bad update to the CrowdStrike software. But why? So today I want to help you understand three key things: first, why the CrowdStrike software is on the machines at all; second, what happens when a kernel driver like CrowdStrike fails; and finally, we'll look at precisely why the CrowdStrike code faults and brings the machines down, and how and why this update caused so much havoc.

As systems developers at Microsoft in the 1990s, handling crashes like this was part of our normal bread and butter. Every dev at Microsoft, at least in my area, had two machines. For example, when I started in Windows NT, I had a Gateway 486 DX2/50 as my main dev machine and then some old 386 box as a debug machine. Normally you'd run your test or debug bits on the debug machine while connected to it as the debugger from your good machine. On nights and weekends, however, we did something far more interesting: we ran a process called NT stress. Now, NT stress was a bundle of tests that would automatically download to the test
machines and run under the debugger, and so every night every test machine, along with all the machines in the various labs around campus, would run NT stress and put it through the gauntlet. The stress tests were normally written by our test engineers, who were software developers specially employed back in those days to find and catch bugs in the system. So as an example, they might write a test to simply allocate and use as many GDI brush handles as possible. If doing so caused the drawing subsystem to become unstable or caused some other program to crash, then it would be caught and stopped in the debugger immediately.

The following day, all of the crashes and assertions would be tabulated and assigned to an individual developer based on the area of code in which the problem occurred. As the developer responsible, you would then use something like Telnet to connect to the target machine, debug it, and sort out what went wrong. All this debugging was done in assembly language, whether it was Alpha, MIPS, PowerPC, or x86, and with minimal symbol table information, so it's not like we had Visual Studio connected. Still, it was enough information to sort out most crashes, find the code responsible, and either fix it or at least enter a bug to track it in our database.

The hardest issues to sort out were the ones that took place deep inside the operating system kernel, which executes at ring zero on the CPU. You see, the operating system uses a ring system to bifurcate code into two distinct types: kernel mode for the operating system itself, and user mode, where your applications run. Kernel mode does tasks such as talking to the hardware and the devices, managing memory, scheduling threads, and all of the really core functionality that the operating system provides. Application code never runs in kernel mode, and kernel code never runs in user mode. Kernel mode is more privileged, meaning it can see the entire system memory map and what's in memory at any physical page at any instant. User
mode only sees the memory map pages that the kernel wants you to see. So if you're getting the sense that the kernel is very much in control, that's an accurate picture. Even if your application needs a service provided by the kernel, it won't be allowed to just run down inside the kernel and execute it. Instead, your user thread will reach the kernel boundary and then raise an exception and wait. A kernel thread on the kernel side then looks at the specified arguments, fully validates everything, and then runs the required kernel code. When it's done, the kernel thread returns the results to the user thread and lets it continue on its merry way.

There is one other substantive difference between kernel mode and user mode: when application code crashes, the application crashes; when kernel mode crashes, the system crashes. It crashes because it has to. Imagine a case where you had a really simple bug in the kernel that freed memory twice. When the kernel code detects that it's about to free already-freed memory, it can detect that this is a critical failure, and when it does, it blue-screens the system, because the alternatives could be worse. Consider a scenario where this double-free code is allowed to continue, maybe with an error message, maybe even allowing you to save your work. The problem is that things are so corrupted at this point that saving your work could do more damage, erasing or corrupting the file beyond repair. Worse, since it's the kernel system that's experiencing the issue, application programs are not protected from one another in the same way. The last thing you want is Solitaire triggering a kernel bug that damages your Git enlistment. And that's why, when an unexpected condition occurs in the kernel, the system is just halted.

This is not a Windows thing by any stretch. It is true for all modern operating systems, like Linux and macOS as well. In fact, the biggest difference is the color of the screen when the system goes down: on Windows it's blue, but on Linux it's black, and on
macOS it's usually pink. But as on all systems, a kernel issue is a reboot at a minimum.

Now that we know a bit about kernel mode versus user mode, let's talk about what specifically runs in kernel mode, and the answer is very, very little. The only things that go in kernel mode are things that have to, like the thread scheduler and the heap manager, and functionality that must access the hardware, such as the device driver that talks to a GPU across the PCIe bus. And so the totality of what you run in kernel mode really comes down to the operating system itself and device drivers, and that's where CrowdStrike enters the picture with their Falcon sensor.

Falcon is a security product, and while it's not just simply an antivirus, it's not that far off the mark to look at it as though it's really anti-malware for the server. But rather than just looking for file definitions, it analyzes a wide range of application behavior so that it can try to proactively detect new attacks before they're categorized and listed in a formal definition. And to be able to see that application behavior from a clear vantage point, that code needed to be down in the kernel. Without getting too far into the weeds of what CrowdStrike Falcon actually does, suffice it to say that it has to be in the kernel to do it, and so CrowdStrike wrote a device driver, even though there's no hardware device that it's really talking to. But by writing their code as a device driver, it lives down with the kernel in ring zero and has complete and unfettered access to the system data structures and the services that they believe it needs to do its job.

Now, everybody at Microsoft, and probably at CrowdStrike, is aware of the stakes when you run code in kernel mode, and that's why Microsoft offers the WHQL certification, which stands for Windows Hardware Quality Labs. Drivers labeled as WHQL certified have been thoroughly tested by the vendor and then have passed the Windows Hardware Lab Kit testing on various platforms and
configurations, and are signed digitally by Microsoft as being compatible with the Windows operating system. By the time a driver makes it through the WHQL lab tests and certifications, you can be reasonably assured that the driver is robust and trustworthy, and when it's determined to be so, Microsoft issues that digital certificate for that driver. As long as the driver itself never changes, the certificate remains valid.

But what if you're CrowdStrike and you're agile, ambitious, and aggressive, and you want to ensure that your customers get the latest protection as soon as new threats emerge? Every time something new pops up on the radar, you could make a new driver and put it through the Hardware Quality Labs, get it certified and signed, and release the updated driver. And for things like video cards, that's a fine process. I don't actually know what the WHQL turnaround time is like, whether that's measured in days or weeks, but it's not instant, and so you'd have a time window where a zero-day could propagate and spread simply because of the delay in getting an updated CrowdStrike driver built and signed.

What CrowdStrike opted to do instead was to include definition files that are processed by the driver but not actually included with it. So when the CrowdStrike driver wakes up, it enumerates a folder on the machine looking for these dynamic definition files, and it does whatever it is that it needs to do with them. But you can already perhaps see the problem. Let's speculate for a moment that the CrowdStrike dynamic definition files are not merely malware definitions but complete programs in their own right, written in a p-code that the driver can then execute. In a very real sense, then, the driver could take the update and actually execute the p-code within it in kernel mode, even though that update itself has never been signed. The driver becomes the engine that runs the code, and since the driver hasn't changed, the cert is still valid for the driver. But the update changes the way
the driver operates by virtue of the p-code that's contained in the definitions, and what you've got then is unsigned code of unknown provenance running in full kernel mode. All it would take is a single little bug, like a null pointer reference, and the entire temple would be torn down around us. Put more simply, while we don't yet know the precise cause of the bug, executing untrusted p-code in the kernel is risky business at best and could be asking for trouble.

We can get a better sense of what went wrong by doing a little postmortem debugging of our own. First, we need to access a crash dump report, the kind you used to get in the good old NT days but that are now hidden behind the happy-face blue screen. Depending on how your system is configured, though, you can still get the crash dump info, and so there was no real shortage of dumps around to look at. Here's an example from Twitter, so let's take a look. About a third of the way down, you can see the offending instruction that caused the crash. It's an attempt to move data to register 9 by loading it from a memory pointer in register 8. Couldn't be simpler. The only problem is that the pointer in register 8 is garbage. It's not a memory address at all but a small integer of 9C hex, which is likely the offset of the field they're actually interested in within the data structure. But they almost certainly started with a null pointer, then added 9C to it, and then just dereferenced it.

Now, debugging something like this is often an incremental process where you wind up establishing, okay, so this bad thing happened, but what happened upstream beforehand to cause the bad thing? And in this case, it appears that the cause is the dynamic data file downloaded as a .sys file. Instead of containing p-code or a malware definition or whatever was supposed to be in the file, it was all just zeros. We don't know yet how or why this happened, as CrowdStrike hasn't publicly released that information yet. What we do know, to an almost certainty at this point,
however, is that the CrowdStrike driver that processes and handles these updates is not very resilient and appears to have inadequate error checking and parameter validation. Parameter validation means checking to ensure that the data and arguments being passed to a function, and in particular to a kernel function, are valid and good. If they're not, it should fail the function call, not cause the entire system to crash. But in the CrowdStrike case, they've got a bug they don't protect against, and because their code lives in ring zero with the kernel, a bug in CrowdStrike will necessarily bug-check the entire machine and deposit you into the very dreaded recovery blue screen.

Now, even though this isn't a Windows issue or a fault with Windows itself, many people have asked me why Windows itself isn't just more resilient to this type of issue. For example, if a driver fails during boot, why not try to boot next time without it and see if that helps? And Windows in fact does offer a number of facilities like that, going back as far as booting NT with the Last Known Good registry hive. But there's a catch, and that catch is that CrowdStrike marked their driver as what's known as a boot driver. A boot driver is a device driver that must be installed to start the Windows operating system. Most boot drivers are included in driver packages that are in the box with Windows, and Windows automatically installs these boot-start drivers during the first boot of the system. My guess is that CrowdStrike decided they didn't want you booting at all without the protection provided by their system, but when it crashes, as it does now, your system is completely borked.

Fixing a machine with this issue is fortunately not a great deal of work, but it does require physical access to the machine. To fix a machine that's crashed due to this issue, you need to boot it into safe mode, because safe mode only loads a limited set of drivers and mercifully can still contend without this boot driver. You'll still be able to
get into at least a limited system. Then, to fix the machine, use the console or the file manager and go to the path Windows, and then System32\drivers\CrowdStrike. In that folder, find the file matching the pattern C, then a bunch of zeros, 291, .sys, and delete that file or anything that's got the 291 in it with a bunch of zeros. When you reboot, your system should come up completely normal and operational. The absence of the update file fixes the issue and does not cause any additional ones. It's a fair bet that update 291 won't ever be needed or used again, so you're fine to nuke it.

If you found today's episode to be any combination of informative or entertaining, remember, I'm mostly in this for the subs and likes, so I'd be honored if you'd consider subscribing to my channel and leaving a like on this video. And if you're already subscribed, thank you. Please consider sending this video to a friend if you think it covered the subject well, and please do check out the free sample of my new book on Amazon, The Nonvisible Part of the Autism Spectrum. It's intended for folks that don't have ASD but who suspect they might have a few characteristics that put them somewhere on the autism spectrum. It's everything I know now about living a successful life on the spectrum that I wish I'd known long ago. Check it out at the link in the video description. In the meantime and in between time, hope to see you next time, right here in Dave's Garage.
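
The parameter-validation point made in the transcript (bad input should fail the call, not take the whole system down) can be sketched roughly as follows. This is a minimal Python illustration, not CrowdStrike's code and not kernel code: the file layout, the `DEF1` magic value, and the names `load_definition`, `try_load`, and `InvalidDefinition` are all hypothetical, invented here purely to show the idea of rejecting an all-zero file before anything tries to interpret it.

```python
import struct

# Hypothetical file layout, invented for illustration only:
# 4-byte magic, 4-byte little-endian payload length, then the payload.
MAGIC = b"DEF1"
HEADER = struct.Struct("<4sI")


class InvalidDefinition(ValueError):
    """Raised when a definition file fails validation."""


def load_definition(data: bytes) -> bytes:
    """Validate a definition blob and return its payload.

    Fails the call with InvalidDefinition instead of letting bad
    input reach the code that would interpret it.
    """
    if len(data) < HEADER.size:
        raise InvalidDefinition("file too short for header")
    if not any(data):
        # Every byte is zero, the condition described in the transcript.
        raise InvalidDefinition("file is all zeros")
    magic, length = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise InvalidDefinition("bad magic")
    payload = data[HEADER.size:]
    if length != len(payload):
        raise InvalidDefinition("length field does not match payload")
    return payload


def try_load(data: bytes) -> str:
    """Caller-side pattern: skip a bad file and keep running."""
    try:
        load_definition(data)
        return "loaded"
    except InvalidDefinition as exc:
        return f"skipped: {exc}"
```

Under these assumptions, `try_load(bytes(64))` reports the file as skipped rather than raising past the caller, which is the behavior the transcript argues a resilient driver should have: one bad update file is ignored and everything else keeps running.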
