The Day the Computers Turned Blue. (This Issue Is Not the Result of or Related to a Cyberattack. We Did It to Ourselves.)
As the only thing in the universe that writes software, we are the best at it. But is what we write any good? Do we make poor software? And is it just because it's cheaper?
I don’t think it’s too early to call it: this will be the largest IT outage in history - Troy Hunt, https://x.com/troyhunt/status/1814198760387596407
Last week, millions of computers around the world crashed with the "Blue Screen of Death". I would imagine that anyone who has used a Windows computer has experienced the BSOD–the Blue Screen of Death. It's really quite a horrible-looking thing. Just a big old blue expanse of a screen–like floating in the ocean after a massive storm, holding onto the last piece of your boat, sharks circling.
What happened?
A cybersecurity company called CrowdStrike released a buggy update that caused computers to crash. What was interesting was the breadth of the bug: CrowdStrike's software is embedded in millions of computers, and the update caused many of them to stop working.
Several airlines and other large organizations–and of course many smaller ones–rely on the Windows operating system to run the software that runs their businesses, and so they had a difficult time...well, running their businesses.
It's hard to say exactly what the technical reason for the crash is. Ultimately, we'll have to wait for a hopefully robust and detailed report from CrowdStrike on the faulty code and deployment process. What's already clear is that the updated piece of software was not what CrowdStrike considers code, but rather what they consider "configuration," which is apparently less rigorously tested, even though configuration changes are just as likely to cause downtime as anything else.
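To make the point about configuration concrete, here is a minimal Python sketch of treating a configuration update with the same suspicion as code: validate it before applying it, and keep the last known-good version if validation fails. Everything in it (the `validate_config` and `apply_update` names, the JSON format, the `version` and `rules` fields) is hypothetical and made up for illustration. It is not a description of how CrowdStrike's software actually works.

```python
import json


class ConfigError(ValueError):
    """Raised when a configuration update fails validation."""


def validate_config(raw: str) -> dict:
    """Parse and sanity-check a (hypothetical) JSON configuration update."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ConfigError(f"not valid JSON: {exc}") from exc
    if not isinstance(config, dict):
        raise ConfigError("top level must be an object")
    version = config.get("version")
    # bool is a subclass of int in Python, so rule it out explicitly.
    if not isinstance(version, int) or isinstance(version, bool) or version < 1:
        raise ConfigError("'version' must be a positive integer")
    rules = config.get("rules")
    if not isinstance(rules, list) or not rules:
        raise ConfigError("'rules' must be a non-empty list")
    return config


def apply_update(current: dict, raw_update: str) -> dict:
    """Return the new config if it validates, otherwise keep the current one."""
    try:
        return validate_config(raw_update)
    except ConfigError as exc:
        print(f"rejected update, keeping version {current['version']}: {exc}")
        return current


# Hypothetical last known-good configuration and two candidate updates.
known_good = {"version": 1, "rules": ["block-known-bad"]}
after_bad = apply_update(known_good, '{"version": 2, "rules": []}')
after_good = apply_update(
    known_good, '{"version": 2, "rules": ["block-known-bad", "new-rule"]}'
)
```

The design choice worth noticing is that a bad update is rejected and the old configuration stays in place, rather than the program crashing on input it did not expect.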
Software is Hard
Wake up. If a single bug can take down airlines, banks, retailers, media outlets, and more, what on earth makes you think we are ready for AGI? The world needs to up its software game massively. We need to invest in improving software reliability and methodology... - https://garymarcus.substack.com/p/dont-look-up-the-massive-microsoftcrowdstrike
I like reading Gary Marcus' posts because he's hell-bent on making sure we don't mess up our use of AI. I'm not in the same spot on the "AI is dangerous" spectrum (or matrix) as he is, but I like his posts because he takes a pretty antagonistic position, which at the very least is fun. So keep in mind that the quote above comes from that perspective. But he also suggests that we're just not that good at building software. That...that I agree with. This is a pretty common refrain, the idea that we build "poor software." Of course, it's a subjective opinion: as far as we know, human beings are the only things in the universe that create software, so we are the best and the only, and we can't really be bad at it. But still.
- Software is written by human beings
And we make mistakes.
- Software is very complex
There is too much complexity for us to deal with easily. Even "two pizza teams" are probably running out of space from a complexity standpoint.
- Software always has bugs
While we do our best to keep our software running safely over time, what we create almost always has bugs. Even software that has been around for a long time and has been tested a lot still has bugs. And many bugs can turn into security problems. So we also need to update software to fix the known bugs.
- Software must change over time
Software never stops changing. When it stops changing, it usually stops working. Software that doesn't run isn't very useful. But these changes are a double-edged sword, because while they keep the software working, they can also introduce new bugs. So while we are constantly changing our software, sometimes to keep it working, sometimes to add new features, and sometimes to fix security problems...we are always writing more code, not less.
- We re-introduce bugs we already fixed
But we can also re-introduce bugs that we fixed before! These are quite common, and we call them regressions. An example is the latest OpenSSH security issue.
In our security analysis, we identified that this vulnerability is a regression of the previously patched vulnerability CVE-2006-5051, which was reported in 2006. A regression in this context means that a flaw, once fixed, has reappeared in a subsequent software release, typically due to changes or updates that inadvertently reintroduce the issue. This incident highlights the crucial role of thorough regression testing to prevent the reintroduction of known vulnerabilities into the environment. This regression was introduced in October 2020 (OpenSSH 8.5p1). - https://blog.qualys.com/vulnerabilities-threat-research/2024/07/01/regresshion-remote-unauthenticated-code-execution-vulnerability-in-openssh-server
- Configuring software is hard
To run software systems we have to configure them. This is not easy. Making changes to those configurations is even harder.
- Even distributing software is hard
We know that software has bugs. So we have to test it. Then we have to distribute it, i.e. "roll it out" in a safe way. This is also difficult, especially on a large scale. And we have to write software to do it...and guess what that software has? Bugs.
- For the foreseeable future we will keep writing vast amounts of (poor?) software
This isn't going to end any time soon. GenAI will speed it up too. Even I'm writing software and I'm bad at it! Go humanity!
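Several of the points above (software always has bugs, configuration is hard, distribution is hard) are the reason staged rollouts exist: ship a change to a small slice of machines first, and widen only if they stay healthy. Here is a minimal Python sketch of that idea. The stage percentages, the `is_healthy` check, and the host names are all made up for illustration. Real rollout systems are far more involved, and this rollout code can of course have bugs too.

```python
from typing import Callable, Iterable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_percents: Iterable[int] = (1, 10, 50, 100),
) -> List[str]:
    """Deploy to growing slices of hosts; stop after the first unhealthy stage.

    Returns the hosts that received the update. A caller can compare the
    result's length against len(hosts) to tell whether the rollout completed.
    """
    updated: List[str] = []
    for percent in stage_percents:
        # Cumulative target for this stage, always at least one host.
        target = max(1, len(hosts) * percent // 100)
        batch = hosts[len(updated):target]
        for host in batch:
            deploy(host)
            updated.append(host)
        # Health gate: check the batch before widening the blast radius.
        if not all(is_healthy(host) for host in batch):
            return updated
    return updated


# Hypothetical fleet of 200 machines, one of which breaks after the update.
fleet = [f"host-{i}" for i in range(200)]
deployed: List[str] = []
broken = {"host-5"}
result = staged_rollout(fleet, deployed.append, lambda h: h not in broken)
```

In this made-up run the rollout stops after the second stage, so 20 machines get the bad update instead of all 200. That is still 20 broken machines, but it is a much smaller problem.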
Software Monoculture
As I mentioned earlier, millions of Windows computers were affected by this event. Linux and Mac computers were not (they were spared this time).
Is what happened with CrowdStrike an example of a software monoculture?
“We can no longer tolerate solutions or architectures that risk crumbling from a single point of failure.” That speaker was CrowdStrike’s vice president and counsel for privacy and cyber policy Drew Bagley, who gave a talk sponsored by the Austin, Tex., company at a Washington Post “Securing Cyberspace” event June 6. - https://newrepublic.com/article/184053/weve-seen-crowdstrike-windows-outage-beforeand-will
Note that the person quoted above is CrowdStrike's own vice president, Drew Bagley.
The New Republic article goes on to say:
CrowdStrike does not have close to a Microsoft-esque lock on the market—it holds only 18.5% of the endpoint-security market in the second quarter of 2023, per data from the market-research firm Canalys. But that still represents a nontrivial chunk of the IT market. Experts are already calling Friday’s incident “the largest IT outage in history.”
Monoculture is a business decision, and often a good one, as we tend to achieve significant cost savings through things like "interoperability, standardization, and scale" (see New Republic article). But it is a trade-off.
Perhaps more interesting is the fact that a small number of companies have collectively accumulated a massive attack surface. If a malicious actor were to gain access to even a few of these companies, they would have a tremendously valuable tactical position.
According to a recent report by supply-chain security firm SecurityScorecard, scans of internet-accessible devices show 90% of the global external attack surface is concentrated in products and services from just 150 firms. Just 15 companies accounted for a full 62%. - https://www.itbrew.com/stories/2024/05/23/just-150-companies-have-90-of-global-attack-surface-report-finds
While this CrowdStrike issue is probably not an example of a software monoculture, it is certainly a reminder of how much of the world runs on plain old Windows servers and clients, how much of the EDR market CrowdStrike owns, and how we continue to consolidate power (and thus a massive attack surface) into a smaller and smaller cadre of companies.
There Is No (Current) Economic Advantage to High Quality Software
If the Consortium for Information and Software Quality is correctly estimating in its report, The Cost of Poor Software Quality in the US: A 2020 Report, that the total cost of poor software quality in the US was $2.08...trillion - https://www.kroll.com/en/insights/publications/cyber/economics-secure-software-development
We can always write more secure software. It's a choice. But it's ultimately a choice based on economics. It costs more to write secure software, and the more "secure" we try to make it, the more expensive it gets. Also, security is all about tradeoffs, and those tradeoffs often come in the form of making the software harder to use, or making it take longer to get into the hands of demanding customers, which can be a death knell for software companies that need to ship software quickly.
So it usually (always?) makes financial sense not to write secure software and not to build secure distribution mechanisms–never mind that these things are hard to do and require a certain amount of specialized experience that companies are often unwilling to invest in, mostly because no one is demanding it as part of the product.
Overall, we are more comfortable with the occasional massive problem like this, and the fact that almost everything is hackable, than we are with the costs associated with better software and systems.
CrowdStrike made certain economic decisions around writing and distributing software, and its stock is down about 30% at the time of writing.
December 13 [2020] SolarWinds begins notifying customers, including a post on its Twitter account, “SolarWinds asks all customers to upgrade immediately to Orion Platform version 2020.2.1 HF 1 to address a security vulnerability.” - https://www.csoonline.com/article/570537/the-solarwinds-hack-timeline-who-knew-what-and-when.html
SolarWinds also experienced a large cybersecurity event that didn't hurt its stock much at the time. In fact, the stock hit an all-time high after the event, though it has been much lower and trading sideways in the years since.
The SolarWinds event was actually much smaller than what just happened with CrowdStrike. Over the long term, SolarWinds may have suffered considerable reputational damage. It's difficult to say exactly how the company ended up where it is now, but one could imagine it being in a better place had the event not occurred.
Equifax also had a massive breach in 2017, but its stock has done quite well since then. It seems there was no lasting reputational damage there...
This Issue Is Not the Result of or Related to a Cyberattack.