What Exactly Happened with the Worldwide IT Outage – and What Can Aviation Do?
On Friday 19 July 2024, some 8.5 million Windows machines crashed to the infamous ‘Blue Screen of Death’. The aviation industry was momentarily plunged into chaos: thousands of flights were cancelled or delayed, with the disruption spread unevenly across the network, on what should have been one of the busiest days of the summer for Europe’s airports. As the smoke clears, the questions being asked are why it happened, how it happened, and what can be done to stop it from happening again. This article sets out to answer those questions, based on what we know so far.
In short, a defective software update from cybersecurity company CrowdStrike crashed Microsoft Windows machines running its software. CrowdStrike have now issued a Preliminary Post Incident Review, giving a brief technical breakdown and an analysis of the recommended fix and the update provided. This makes it possible to work backwards and paint a picture of where the software failed.
To begin, it is essential to lay the groundwork and understand why cybersecurity software is even capable of causing such catastrophic outages in the first place. This comes down to the way computers manage their resources: memory, CPU, and devices such as keyboards and screens. At the heart of a computer’s operating system is a piece of software called the kernel, which manages these resources, allowing programs to run without interfering with each other. The kernel acts as “the boss”, having privileged access to system resources. For instance, when multiple applications want to use the keyboard, or save something to memory, the kernel decides which of them is granted access, and when. Applications such as browsers and word processors do not have the same levels of privilege as the kernel – they must ask the kernel for permission to do things. For most software, this works well, and presents plenty of benefits to the user. Firstly, if an application becomes buggy, or is maliciously infected, the kernel can block – or even terminate – the application: a built-in layer of security by architectural design. Secondly, even if one application crashes, the rest of the system is usually able to continue operating. If your word processor crashes, your computer rarely crashes altogether – as the kernel remains operative and separate from such software.
Security software adds a layer of complication. Cyber threats are not only growing in number, but becoming increasingly complex. New methods of attack are being invented constantly, and for critical infrastructure such as airports, cybersecurity software that can proactively monitor such threats is essential. Such software requires access to the kernel itself, which gives it comprehensive visibility of the entire system and allows it to monitor threats more effectively. CrowdStrike’s Falcon platform operates in this manner. Allowing third-party software, like CrowdStrike’s, access to the kernel is essential for that software to work properly – but it comes with high risks. Kernels can crash too, just like applications: except when the kernel crashes, so does the entire system. This is a safety feature. The kernel is responsible for such important actions within the system that modern computers will shut down when the kernel is no longer able to operate. Doing so reduces the chance of catastrophic data loss or of the machine becoming permanently inoperable. Users typically recognise this shutdown as the ‘Blue Screen of Death’ on Windows computers.
What exactly went wrong in this case? While the full Root Cause Analysis has yet to be released by CrowdStrike, their preliminary investigation can be summarised as follows. On Friday, July 19, 2024 at 06:09 CEST, CrowdStrike released a small update relating to a recently discovered method of cyberattack. CrowdStrike’s software is updated frequently: the company continuously monitors for novel attack techniques and, once one is discovered, updates its software to recognise it in future. One file within this update, Channel File 291, contained a mistake: it referred to a memory location that doesn’t exist – much like a boarding pass stating gate 48, when the airport in question only has 40 gates. This is known as an out-of-bounds memory read, and it triggers an exception (an error). Because the file operates within the kernel, and the kernel was unable to handle the error gracefully (that is, to recover from it), the kernel shut down as a safety measure – bringing the entire system down with it.
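The boarding-pass analogy can be made concrete with a few lines of code. The sketch below is purely illustrative: Python stands in for kernel-level code, and the names GATES, gate_name_unchecked and gate_name_checked are hypothetical, not taken from CrowdStrike's software. It contrasts a lookup that trusts its input with one that validates the input first.

```python
from typing import Optional

# Hypothetical airport with gates numbered 1 to 40.
GATES = [f"Gate {n}" for n in range(1, 41)]


def gate_name_unchecked(gate_number: int) -> str:
    """Trust the boarding pass and read the gate directly."""
    # For gate 48 this reads past the end of the list and raises IndexError.
    return GATES[gate_number - 1]


def gate_name_checked(gate_number: int) -> Optional[str]:
    """Validate the gate number before reading; return None if it is invalid."""
    if 1 <= gate_number <= len(GATES):
        return GATES[gate_number - 1]
    return None


if __name__ == "__main__":
    print(gate_name_checked(48))  # invalid gate is rejected, nothing crashes
    try:
        gate_name_unchecked(48)
    except IndexError as exc:
        # An application can catch this error and carry on. Inside a kernel,
        # an invalid read that nothing handles halts the whole machine.
        print("out-of-bounds read:", exc)
```

In an ordinary application, the error raised by the unchecked version can be caught and the program keeps running. The point of the analogy is that the equivalent invalid read inside the kernel has no such safety net.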
As a result, any Windows machine running CrowdStrike’s software that was online during the roughly 80 minutes before the update was withdrawn downloaded the corrupt file and experienced an outage. What made the CrowdStrike outage particularly long-lasting is that it could not be fixed remotely. CrowdStrike issued a remote fix that prevented any further computers from being taken out of service, but machines that were already down required in-person, manual assistance. Each affected computer had to be rebooted in safe mode (another safety procedure of modern computers, in which a computer boots with only the most essential files running), and the corrupted file then had to be located and deleted by hand on each of the approximately 8.5 million devices that had downloaded it.
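To give a sense of what "locate and delete" meant on each machine, here is a minimal sketch of the step. It rests on assumptions: the folder and the file pattern follow CrowdStrike's public remediation guidance as we understand it and should be checked against the vendor's current instructions, and the function name remove_faulty_channel_files is hypothetical. In reality the step was carried out by hand in safe mode or the Windows recovery environment, not by a script like this.

```python
from pathlib import Path
from typing import List

# Assumption: location and naming pattern as given in CrowdStrike's public
# remediation guidance; verify against the vendor's current instructions.
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(
    directory: Path = DEFAULT_DIR, dry_run: bool = True
) -> List[Path]:
    """Find Channel File 291 variants in `directory` and optionally delete them.

    Returns the matching paths. With dry_run=True nothing is deleted, so the
    matches can be reviewed before any file is removed.
    """
    matches = sorted(directory.glob(CHANNEL_FILE_PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

The dry-run default is a deliberate design choice: when deleting files from a system driver folder, it is safer to list what would be removed before removing it.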
As with any large-scale catastrophe, the question of who is to blame quickly arises. CrowdStrike have recognised that the faulty file came from them and have pledged to improve their testing procedures, as, remarkably, the faulty file in question passed all the tests they currently run. They have also pledged, as a matter of internal policy, to allow customers greater control over when updates are implemented in their systems. However, many are looking beyond the specific fault and asking how such a fault could occur in the first place, and why Microsoft even allows third-party access to the kernel. Such access, as mentioned above, is inevitably risky: the low-level, core privileges needed to catch the most complex security threats bring with them the risk of pulling an entire system down if the kernel is affected.
In a statement to the Wall Street Journal, Microsoft blamed the European Commission for this outage, stating that the groundwork was laid through an agreement made between them in 2009 regarding access to the kernel. Prior to this, Microsoft did not allow third-party software to have access to the kernel – this was kept exclusively for Microsoft’s own products. This offered increased safety, as no third-party developers could introduce faulty code – or code at all – into the kernel as a matter of policy. The European Commission argued that this was a monopolistic practice, and after years of negotiation, Microsoft agreed to allow kernel access to third-party developers. Apple did not make this agreement and does not allow such kernel access, which makes this specific failure impossible for their software (explaining why Macs were not affected by this outage).
Regardless, lessons have already been learned from the outage. It is likely that all cybersecurity firms will invest in more rigorous testing of updates and further develop the frameworks they use to test new code before it ever reaches customers. However, from a customer – particularly an airport – perspective, actions can already be taken. Internal policy can act as a robust, additional layer of security against such outages. Having technically trained engineers on site was essential to fixing the CrowdStrike outage. As more of our IT infrastructure becomes remote, so too does technical assistance: in this case, it was the engineers physically able to reach the affected machines who made recovery possible. The use of redundant systems, together with regular backups of all critical systems and data, can contribute to a quicker restoration of operations when effectively implemented. An organisation-wide policy of not permitting automatic updates of software components may also become more commonplace.
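One way an update policy of this kind might look in practice is a staged rollout: an update goes to a small, low-risk group of machines first and only proceeds to operationally critical systems once that group is confirmed healthy. The sketch below is a hypothetical illustration of the idea, not a description of any vendor's or airport's actual process. Ring, staged_rollout, apply_update and is_healthy are all invented names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    """A group of machines that receive an update at the same stage."""
    name: str
    hosts: List[str]


def staged_rollout(
    rings: List[Ring],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Update one ring at a time, stopping at the first unhealthy ring.

    Returns the names of the rings that were updated and passed the health
    check. Later rings are never touched once a ring fails.
    """
    completed: List[str] = []
    for ring in rings:
        for host in ring.hosts:
            apply_update(host)
        if not all(is_healthy(host) for host in ring.hosts):
            break  # halt the rollout; remaining rings keep the old version
        completed.append(ring.name)
    return completed


if __name__ == "__main__":
    rings = [
        Ring("test bench", ["lab-01"]),
        Ring("back office", ["office-01", "office-02"]),
        Ring("operations", ["checkin-01", "gate-display-01"]),
    ]
    updated: List[str] = []
    # Simulate an update that crashes the test-bench machine.
    done = staged_rollout(rings, updated.append, lambda host: host != "lab-01")
    print(done, updated)
```

In the simulated run, the faulty update reaches only the test-bench machine, and the check-in and gate systems are never touched. The trade-off is that security updates reach critical systems later, which is exactly the balance between resilience and protection that each organisation has to strike.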
As aviation systems become increasingly digital and interconnected, the sector must find a fair balance between resilience and adequate protection from cyber threats and software vulnerabilities. The CrowdStrike incident demonstrates the potential for even minor technical issues to escalate into major operational disruptions, affecting thousands of flights and causing widespread chaos. It serves as both a warning and an opportunity for the aviation industry to reassess and fortify its digital defences, ensuring safer and more reliable operations across the entire chain.
This article was reviewed by the ACI EUROPE Cybersecurity Committee.
Amy Leete is ACI EUROPE’s Communications Manager. She is also a part-time student of the University of York’s MSc in Computer Science & Artificial Intelligence programme, with a particular interest in algorithms, data structures, and the potential of quantum computing in traditional algorithmic encryption methods.