The mass system outage which swept the globe on Friday 19 July is shaping up to be one of the biggest (and most costly) IT disruptions in history. The outage, caused by a defective CrowdStrike update which took down an estimated 8.5 million Windows PCs and servers globally, grounded planes, paralysed hospitals, shut down payment processors and forced broadcasters off air on multiple continents.
For many organisations, recovery meant a long and stressful night for tech teams. For others, however, full recovery will likely take weeks or even months.
The true cost of the incident itself and the contagion risk which followed will be nearly impossible to estimate, and there will be more extensive analysis to come. However, as the fallout of the incident settles and the media cycle quickly moves on, there are a number of immediate lessons security leaders can and must apply to their day-to-day operations to set up their teams for the future.
Regardless of the impact on enterprise security programmes – whether direct, indirect or near non-existent – the outage serves as a crucial reminder of a number of universal principles which apply to all organisations.
Here are four key takeaways from the Human Risks team:
One: It’s Time to Review Your ‘Known’ Risks
For many of those who woke up to significant disruptions on Friday morning, a major third-party system outage will have been sitting in the top left-hand corner of their risk registers for years.
It was an accepted or ‘known’ risk – largely dismissed, or even left in the ‘too hard’ basket, while teams focused on other things.
However, as Friday’s outage aptly illustrated, simply identifying a risk as low likelihood / high impact changes nothing. Systemic events do occur, are often underrepresented on risk registers, and are in need of critical attention.
The same can be said for a number of critical security threats, which often sit underrepresented on risk registers for years before wake-up calls (or worse) eventuate.
For IT leaders, the time to review critical supplier risks was before the incident occurred. For the wider risk and resilience community, the time to determine what is truly acceptable amongst your known risks, and to ensure effective controls are in place, is now.
Two: If You’re Not Already Undertaking Business Impact Analysis – Do So
Clearly documenting the impacts of Friday’s outage on your organisation, including the downstream impacts across your security protocols, is crucial to organisational learning for the future.
How did the outage impact the overall security posture of your organisation? What critical dependencies did it highlight across your key management processes? Did your controls fail, and why?
Effective business impact analysis should not be done simply because operational resilience regulations such as DORA may require it. It matters because direct knowledge of an incident often sits within the heads of those who responded, limiting the opportunity for insights to be shared with (and implemented by) wider teams.
Ensuring all impacts are well documented to inform scenario planning for the future is crucial for organisational resiliency. And the value isn’t in a well-crafted executive report – it’s in the process of the analysis itself, and the resilience culture which stems from it.
Three: Risk Assessments Can’t Be Static
Systemic outages highlight the critical interdependencies between sites and assets across any organisation. While a major event may have been managed centrally by one team, the frontline impacts are felt by many.
Critical controls fail, undertreated threats are exposed, and the organisational security posture in a time of disruption is laid bare.
In short: annual risk assessments and exercises simply don’t cut it in today’s interconnected environment. Teams and champions across your network need to be continually raising questions, sharing insights, and reviewing the protocols they have in place.
If you’re a Human Risks customer, you’re already set up to ensure that risk assessments remain living documents closely connected to site teams. Ensure that proactive reviews and tasks are assigned effectively – and as always, reach out to us for any support needed.
And for all security leaders – ensure that your teams are equipped to easily review the assessments they have in place and share new insights. Is a critical security control less effective in a time of crisis than previously identified? How many sites have that control in place, and are they aware of this?
Strategically and operationally, risk assessments only add value if they evolve as context changes. They can’t sit static in the top drawer until the next annual compliance review.
Four: Never Let a Good Crisis Go to Waste
Friday’s CrowdStrike outage is a wake-up call for executive teams globally.
For all those championing more effective proactive security risk management efforts, operational resiliency planning, and comprehensive business impact analysis – use it.
The most valuable crisis for any organisation is the one it learns from, whether to reassess the size and scope of resiliency efforts or to bolster controls for the next inevitable event.
So if a proactive review of resiliency measures hasn’t already been pushed for, now is your time to take the lead as an internal advocate, strengthen your security posture, and set your teams up for success in the future.
Interested in Learning More About Human Risks?
Human Risks works with organisations globally to enable security teams and reinforce operational resilience. Reach out to the team for a demo – and subscribe for future updates on how we’re working to make security risk management smarter.