Written by Bla Sweeney, Keysight Technologies
The CrowdStrike outage was a vivid reminder of how interconnected the world’s systems are – and how dramatically any organization can be affected. In the aftermath of the outage, every C-suite needs to be asking: how do we mitigate the risk of systemic software failure in our organization?
Before answering, let’s look at how and why we became so reliant on third-party software.
The Original Software Development Lifecycle
Not too many years ago, the software development lifecycle (SDLC) took months, if not years.
Software was installed on-premises, and we would deploy it only after extensive and exhaustive testing.
Each time we wanted to upgrade the software to a newer, stable release, we’d go through the same process again. Some organizations did this each year. Many more waited several years because the investment in time and money was too much to consider on an annual basis.
The process was incredibly costly and inefficient. On the other hand, we were in control of our own destinies when something went wrong.
What changed?
Every Company is a Tech Company
Well, everything changed.
In the past, an organization might have had a single ERP system.
Today, multiple software tools underpin every business. We have the core IT we depend on to ‘keep the lights on’. We also have the IT that each individual department or business unit uses – manufacturing apps, product design tools, customer support portals and so on, each one likely talking to the others.
As the saying goes, every company is a tech company now.
The CIO is still expected to have oversight of all the tools in use, but the size of that task has changed beyond all recognition. At the same time, the CIO is asked each year to deliver all three of better, faster, and cheaper.
In today’s world, time- and resource-intensive ways of working are no longer viable. It simply isn’t feasible to keep the same monolithic update processes we used when we relied on only a handful of systems.
SaaS Helps Us Achieve Better, Faster, and Cheaper
The software as a service (SaaS) model provided the answer that was needed, allowing us to outsource software maintenance and updates to third parties.
It has helped CIOs achieve the seemingly impossible and deliver the holy grail of better, faster, and cheaper. By 2022, a typical organization used 130 SaaS applications.
And because updates are rolled out monthly, weekly, or even daily, organizations are always harnessing the best the technology has to offer.
On the other hand, when something goes wrong, we can no longer fix it by going down to the basement and ‘turning it off and on again’.
When there’s a bug, the IT team is dependent on the third party to find it, fix it, and roll out a revised update.
When an upgrade causes conflicts with other systems, IT teams are forced to be reactive, developing a solution that will resolve the clash.
When a tool used by a large percentage of the world goes wrong, chaos ensues, as we saw recently.
The SaaS model undoubtedly brings huge benefits. The modern world of business couldn’t and wouldn’t exist without it. But organizations are no longer as proactive or in control as they would like to be.
So what’s the answer?
Resilience is the Answer
We need to focus on resilience in IT and put in place the processes that allow us to take back control. Here are four ways to think about it.
Fail Forward
Firstly, it’s a fact of life that in a world where we’re all dependent on software there will always be bugs. The only variable is how serious they are.
The task, therefore, is to understand what our options are if something fails.
If you’re a company that releases software, what’s your testing strategy? What’s your rollout strategy? How do you revert to a previous stable release if something goes wrong?
If you’re a company that relies on software, what are your options when something fails? What’s your fallback position?
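For teams that release software, one common way to answer these questions is a staged rollout: the update goes to a small group first, and every group is reverted to the last stable release if a health check fails. The sketch below is only an illustration of that idea under stated assumptions. The ring names, version strings, and the `is_healthy` check are hypothetical, and a real pipeline would replace them with its own deployment and monitoring hooks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Rollout:
    """Tracks which release each group ("ring") of machines is running."""
    stable: str          # last known-good release (hypothetical version label)
    candidate: str       # new release being rolled out
    rings: List[str]     # groups, ordered from smallest to largest audience
    deployed: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every ring starts on the stable release.
        for ring in self.rings:
            self.deployed[ring] = self.stable


def staged_rollout(rollout: Rollout,
                   is_healthy: Callable[[str, str], bool]) -> bool:
    """Deploy the candidate ring by ring; revert everything on the first failure.

    `is_healthy(ring, version)` is an assumed callback standing in for
    whatever monitoring the organization actually uses.
    """
    updated: List[str] = []
    for ring in rollout.rings:
        rollout.deployed[ring] = rollout.candidate
        updated.append(ring)
        if not is_healthy(ring, rollout.candidate):
            # Fallback position: put every updated ring back on stable.
            for done in updated:
                rollout.deployed[done] = rollout.stable
            return False
    return True


def fails_in_early_ring(ring: str, version: str) -> bool:
    """Hypothetical health check that reports a problem in the 'early' ring."""
    return ring != "early"


rollout = Rollout(stable="1.0", candidate="1.1",
                  rings=["canary", "early", "everyone"])
succeeded = staged_rollout(rollout, fails_in_early_ring)
```

In this example the failure is caught in the second ring, so the largest group never receives the faulty update and the smaller groups are returned to the stable release. Organizations that rely on software rather than release it can ask their providers whether updates are delivered in a comparable way.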
Have Someone Play Devil’s Advocate
The dangers of group-think are well-documented. In any decision-making process, including software investment decisions, make sure there’s someone playing devil’s advocate. Why do we need this software? What are the alternatives? What due diligence have we done on the provider?
Prioritize Transparency
The CrowdStrike outage showed very clearly that organizations rely on systems that rely on other systems that rely on other systems. Entire infrastructures depend on modems in data centers that people rarely – if ever – visit.
We need to demand a new level of openness and transparency from our providers, one that allows us to look ‘under the hood’ rather than simply trusting them to take care of everything.
As part of this, we should remember that cheaper rarely means better. We must be confident that a lower price doesn’t mean lower standards.
Introduce Quality Resilience Engineering
Finally, there’s scope for an entirely new role, one that’s tasked with engineering quality into our systems and developing the back-up plan for when things go wrong.
On a day-to-day basis, engineers focusing on quality resilience use digital twins and tools such as Eggplant Monitoring and Eggplant Test to stay on top of their testing.
At a strategic level, they’re looking beyond the software development lifecycle and IT operations management to the bigger picture of an organization and its systems.
Their role is to put organizations back in control.
Not Old Days or New Days, Just Different Days
Our modern tech-enabled world depends on huge numbers of software systems. We can’t go back to the ‘old days’ where we had complete control – and nor would we want to. But we do have to think about how we can build resilience into our systems and minimize the risks at both a systemic level and an organizational level. I hope our four suggestions provide food for thought as we pursue this quest.