CrowdStrike outage lessons learned: Questions to ask vendors
In light of the recent CrowdStrike outage, security teams should ask their vendors 10 key questions to ensure they're prepared should a similar event occur.
The July 2024 CrowdStrike Channel File 291 incident was a major event for many security practitioners. While the number of affected devices was relatively small -- estimated at about 8.5 million, less than 1% of all hosts running Windows -- the impact was significant. The outage led to air traffic delays and hospitals going offline, as well as issues in banking, retail, entertainment and beyond.
Anytime there is an event of this magnitude, it behooves security practitioners -- including CISOs like myself -- to evaluate how we do what we do and how we can better protect our organizations, considering lessons learned from the event. The bigger and more impactful the incident, the more questions we should ask as a result -- and the more important the lessons.
What can we learn from the Channel File 291 incident to minimize the likelihood of future outages? Let's examine questions to ask vendors -- and, in some cases, ourselves -- to prevent similar incidents going forward.
Question 1. When can a vendor initiate changes?
Any organization that offers a service via the internet understands the value of placing change control protections around the production environment. Most of us have complex release and delivery processes in place to ensure engineers can't just make production changes willy-nilly and must instead go through some level of rigor to effect change.
In the CrowdStrike case, though, the Channel File 291 update went directly into the production environments of organizations across the world. Further, it did so without the benefit of those organizations' testing, staging, change control and other processes designed to vet the safety of changes in critical production environments.
Understanding the circumstances under which a vendor can make changes to its software lets us know when there's potential for an unexpected production change, which, in turn, influences the risk associated with bringing that software in-house.
Question 2. Can updates and changes be controlled, halted or gated?
While the previous question tells us when a vendor can change its products, that's only half the battle. The other important thing to know is whether we can decide when those changes are released into controlled environments, such as production environments and industrial networks.
In some cases, we might want to rapidly release certain enhancements -- for example, a channel file -- to production and other high-criticality environments. Being able to control the cadence of those releases is beneficial, however. It enables us, for example, to review them in lower environments before they release to production or to test them on sensitive devices or legacy platforms before they release in full.
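One way to picture this kind of gating is a promotion rule: an update reaches a given environment only after it has run cleanly in the environment below it for a minimum period. The following is a minimal sketch under stated assumptions, not a description of any vendor's actual controls. The environment names, the `SoakResult` record, the `may_promote` function and the 24-hour default are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable

# Hypothetical environment order, lowest criticality first.
ENVIRONMENTS = ("lab", "staging", "production")


@dataclass(frozen=True)
class SoakResult:
    """Record of one update deployment in one environment (illustrative)."""
    environment: str
    deployed_at: datetime
    passed_checks: bool


def may_promote(
    target: str,
    history: Iterable[SoakResult],
    now: datetime,
    min_soak: timedelta = timedelta(hours=24),  # assumed policy value
) -> bool:
    """Return True if an update may be released into `target`.

    The lowest environment is always allowed. Any other environment
    requires a passing deployment in the environment directly below it
    that has been running for at least `min_soak`.
    """
    index = ENVIRONMENTS.index(target)
    if index == 0:
        return True
    previous = ENVIRONMENTS[index - 1]
    return any(
        result.environment == previous
        and result.passed_checks
        and now - result.deployed_at >= min_soak
        for result in history
    )
```

Whether a rule like this can be enforced at all depends on the product. The question for the vendor is which update types, if any, bypass it.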
Question 3. Can we stagger updates?
Even when we understand when a vendor can make changes (question 1) and can gate those changes for testing (question 2), it's a smart idea to stagger changes over a period of time.
For example, if an organization has a server farm or cluster, it might choose to release an update to only a percentage of them at a time. That way, if there is an undesirable effect, redundant systems are still in place that are not impacted. It is beneficial to know the extent to which vendors support this. Some do; some don't.
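The server farm example can be sketched as a staged rollout that updates a growing share of hosts and halts as soon as a wave looks unhealthy. This is an illustration only. The wave percentages and the `plan_waves` and `staged_rollout` names are assumptions, and `apply_update` and `is_healthy` stand in for whatever deployment and monitoring hooks an organization or vendor actually provides.

```python
from typing import Callable, List, Sequence, Tuple


def plan_waves(
    hosts: Sequence[str],
    cumulative_percents: Sequence[int] = (10, 50, 100),  # assumed schedule
) -> List[List[str]]:
    """Split hosts into waves so that each wave brings the updated share
    of the fleet up to the next cumulative percentage (rounded up)."""
    waves: List[List[str]] = []
    start = 0
    for percent in cumulative_percents:
        end = min(len(hosts), (len(hosts) * percent + 99) // 100)
        if end > start:
            waves.append(list(hosts[start:end]))
            start = end
    return waves


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    cumulative_percents: Sequence[int] = (10, 50, 100),
) -> Tuple[List[str], bool]:
    """Update hosts wave by wave and stop after the first unhealthy wave.

    Returns the hosts that received the update and whether the rollout
    completed. Hosts in later waves stay on the old version if it halts.
    """
    updated: List[str] = []
    for wave in plan_waves(hosts, cumulative_percents):
        for host in wave:
            apply_update(host)
            updated.append(host)
        if not all(is_healthy(host) for host in wave):
            return updated, False
    return updated, True
```

With 10 hosts and the default schedule, the waves contain one, four and five hosts. A failure detected in the second wave leaves half the farm untouched and still serving traffic.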
Question 4. Under what circumstances can a vendor access our environments?
There are sometimes real and important reasons to allow vendor access into our environments, including to help with support, upgrades, issue resolution and configuration. For example, purpose-built equipment -- such as healthcare diagnostic imaging platforms, DNA sequencers, industrial control systems and telecommunication routing equipment -- is often specialized with operational controls and telemetry not easily interpreted by nonspecialists.
Much like with the above questions, though, we as customers should understand the circumstances under which access is required, how we will be notified and how such access will occur.
Question 5. What records are kept of actions taken?
Just as we want to know under what circumstances and how vendor access to our systems can occur, we also want to ensure a record is kept of that access.
For example, ensure log files or access logs contain information about who accessed the systems and what actions they took. This is important for two reasons. First, we can track actions back to specific, individually identifiable personnel at the vendor if something goes awry or if that access is misused. Second, in the event of an investigation, information about access helps rule out or identify impacts to critical platforms.
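As a rough illustration of what such a record might contain, the sketch below writes each vendor action as one JSON line tied to a named individual, then filters those lines by system for an investigation. The field names and the `VendorAccessRecord`, `record_access` and `actions_on` helpers are hypothetical and not drawn from any particular product's log format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class VendorAccessRecord:
    """One vendor action against one system (illustrative fields)."""
    timestamp: str  # ISO 8601, UTC
    vendor: str
    person: str     # individually identifiable user, not a shared account
    system: str
    action: str
    ticket: str     # support case or change reference authorizing access


def record_access(
    vendor: str,
    person: str,
    system: str,
    action: str,
    ticket: str,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one access event as a single JSON line."""
    moment = now or datetime.now(timezone.utc)
    record = VendorAccessRecord(
        moment.isoformat(), vendor, person, system, action, ticket
    )
    return json.dumps(asdict(record), sort_keys=True)


def actions_on(system: str, log_lines: Iterable[str]) -> List[dict]:
    """Return every recorded action against `system`, oldest first."""
    events = [json.loads(line) for line in log_lines]
    matching = [event for event in events if event["system"] == system]
    return sorted(matching, key=lambda event: event["timestamp"])
```

The first function supports attribution to a specific person. The second supports the investigative use, where an empty result helps rule a platform out.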
Question 6. What's the customer notification and alert process?
Let's face it: We aren't always great at keeping on top of product updates. Think about all the critical business applications in your organization. Have you read all the recent product support bulletins, patch notes, security vulnerability alerts, customer notification emails, product blogs and newsletters, and other sources of information associated with all those products and tools from the various vendors that supply them? Chances are the answer is no.
In the event of a critical issue, sustained outage or critical security event, it is valuable to know which channels vendors will use to contact us -- and which we can use to contact them -- so that we have a process to receive that information without having to chase it down.
Question 7. What are potential barriers to resumption?
Sometimes, situations arise where we use tools or products together in ways that have unexpected consequences and can make recovery from a situation like an outage more difficult.
One factor that added complexity to the CrowdStrike Channel File 291 incident was the presence of BitLocker, Windows' drive encryption feature, on affected devices. BitLocker required additional recovery steps -- for example, obtaining and entering recovery keys that often weren't readily available because of the outage.
It's worth spending some time examining and questioning what factors might make recovery more difficult or require extra steps to resolve when looking at vendors and risks.
Question 8. What is critical, and how critical is it?
This question is targeted toward internal teams, not suppliers or vendors. The answer is important all the same -- specifically, knowing which applications, processes and systems are critical and who owns them.
Pretty much all technology practitioners understand the importance of business continuity planning efforts, including instruments such as business impact analyses (BIAs) designed to elicit the above data points. Sometimes, in the crunch to get things done, however, these tasks can slip.
Practitioners must keep on top of BIA efforts to ensure we can react quickly and efficiently in the event of an outage.
Question 9. What's the internal emergency communication channel?
This is also an internally focused question. It's important to have a clear internal communication strategy for outage-related issues. This should go without saying, but too many organizations have learned the hard way that the time to forge a new communication path isn't in the middle of a sustained and widespread outage.
Use this as an opportunity to revisit your crisis communications strategy for sufficiency.
Question 10. What's the threat model?
This is another question to ask internally rather than externally. It is useful to have an attacker's-eye view of systems and software brought into critical environments -- specifically by creating a threat model. Many of us are used to employing threat modeling for application security efforts. We can apply the same or similar process to services and tools that we don't necessarily write ourselves.
Namely, use threat modeling processes to create risk control strategies around vendors and business risk applications you bring in-house.
Ed Moyle is a technical writer with more than 25 years of experience in information security. He is currently CISO at Drake Software.