How to Mitigate the Risk of Systemic Software Failure
September 19, 2024 Bylines

Written by Bla Sweeney, Keysight Technologies

 

The CrowdStrike outage was a vivid reminder of how interconnected the world’s systems are – and how dramatically every organization can be affected. So in the aftermath of the outage, every C-suite needs to be asking: how do we mitigate the risk of systemic software failure in our organization?

Before answering, let’s look at how and why we became so reliant on third-party software.

The Original Software Development Lifecycle

Not too many years ago, the software development lifecycle (SDLC) took months, if not years.

Software was installed on-premises, and we would only deploy it after exhaustive testing.

Each time we wanted to upgrade the software to a newer, stable release, we’d go through the same process again. Some organizations did this each year. Many more waited several years because the investment in time and money was too much to consider on an annual basis.

The process was incredibly costly and inefficient. On the other hand, we were in control of our own destinies when something went wrong.

What changed?

Every Company is a Tech Company

Well, everything changed.

In the past, an organization might have had a single ERP system.

Today, multiple software tools underpin every business. We have the core IT we depend on to ‘keep the lights on’. We also have the IT that each individual department or business unit uses – manufacturing apps, product design tools, customer support portals and so on, each one likely talking to the others.

As the saying goes, every company is a tech company now.

The CIO is still expected to have oversight of all the tools in use, but the size of that task has changed beyond recognition. At the same time, the CIO is asked each year to deliver all three of better, faster, and cheaper.

In today’s world, time- and resource-intensive ways of working are no longer viable. It simply isn’t feasible to keep the same monolithic update processes we used when we relied on only a handful of systems.

SaaS Helps Us Achieve Better, Faster, and Cheaper

The software as a service (SaaS) model provided the answer that was needed, allowing us to outsource software maintenance and updates to third parties.

It has helped CIOs achieve the seemingly impossible and deliver the holy grail of better, faster, and cheaper. By 2022, a typical organization used 130 SaaS applications.

And because updates are rolled out monthly, weekly, or even daily, organizations are always harnessing the best the technology has to offer.

On the other hand, when something goes wrong, we can no longer fix it by going down to the basement and ‘turning it off and on again’.

When there’s a bug, the IT team is dependent on the third party to find it, fix it, and roll out a revised update.

When an upgrade causes conflicts with other systems, IT teams are forced to be reactive, developing a solution that will resolve the clash.

When a tool used by a large percentage of the world goes wrong, chaos ensues, as we saw recently.

The SaaS model undoubtedly brings huge benefits. The modern world of business couldn’t and wouldn’t exist without it. But organizations are no longer as proactive or in control as they would like to be.

So what’s the answer?

Resilience is the Answer

We need to focus on resilience in IT and put in place the processes that allow us to take back control. Here are four ways to think about it.

Fail Forward

Firstly, it’s a fact of life that in a world where we’re all dependent on software there will always be bugs. The only variable is how serious they are.

The task, therefore, is to understand what our options are if something fails.

If you’re a company that releases software, what’s your testing strategy? What’s your rollout strategy? How do you revert to a previous stable release if something goes wrong?
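
For teams that release software, one way to make the rollout and revert questions concrete is a staged rollout with an automatic fallback. The sketch below is illustrative only and assumes a simplified model: the names `staged_rollout`, `RolloutResult`, and `fails_above_ten_percent` are hypothetical, the stage fractions are arbitrary, and the health check is a stand-in for whatever monitoring an organization really uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class RolloutResult:
    version: str                 # version left running when the rollout ends
    completed: bool              # True only if the new version passed every stage
    stopped_at: Optional[float]  # fraction of hosts at which the rollout was aborted


def staged_rollout(
    new_version: str,
    stable_version: str,
    stages: Sequence[float],
    is_healthy: Callable[[str, float], bool],
) -> RolloutResult:
    """Widen exposure stage by stage; revert to the stable release on the first failed check."""
    for fraction in stages:
        # A real system would deploy to this fraction of hosts here, then observe them.
        if not is_healthy(new_version, fraction):
            # Fall back to the last known stable release instead of pressing on.
            return RolloutResult(stable_version, False, fraction)
    return RolloutResult(new_version, True, None)


# Hypothetical health check: the fault only appears once more than 10% of hosts are updated.
def fails_above_ten_percent(version: str, fraction: float) -> bool:
    return fraction <= 0.10


result = staged_rollout("2.0", "1.9", [0.01, 0.10, 0.50, 1.00], fails_above_ten_percent)
print(result)
```

The point of the design is that a faulty update is caught while it affects a small share of systems, and the fallback to the previous stable release is decided in advance rather than improvised during an outage.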

If you’re a company that relies on software, what are your options when something fails? What’s your fallback position?

Have Someone Play Devil’s Advocate

The dangers of group-think are well-documented. In any decision-making process, including software investment decisions, make sure there’s someone playing devil’s advocate. Why do we need this software? What are the alternatives? What due diligence have we done on the provider?

Prioritize Transparency

The CrowdStrike outage showed very clearly that organizations rely on systems that rely on other systems that rely on other systems. Entire infrastructures depend on modems in data centers that people rarely – if ever – visit.

We need to demand a new level of openness and transparency that allows us to look ‘under the hood’ rather than trusting our providers to look after it.

As part of this, we should remember that cheaper rarely means better. We must be confident that a lower price doesn’t mean lower standards.

Introduce Quality Resilience Engineering

Finally, there’s scope for an entirely new role, one that’s tasked with engineering quality into our systems and developing the back-up plan for when things go wrong.

On a day-to-day basis, engineers focusing on quality resilience are using digital twins and tools such as Eggplant Monitoring and Eggplant Test to stay on top of their testing.

At a strategic level, they’re looking beyond the software development lifecycle and IT operations management to the bigger picture of an organization and its systems.

Their role is to put organizations back in control.

Not Old Days or New Days, Just Different Days

Our modern tech-enabled world depends on huge numbers of software systems. We can’t go back to the ‘old days’ where we had complete control – and nor would we want to. But we do have to think about how we can build resilience into our systems and minimize the risks at both a systemic level and an organizational level. I hope our four suggestions provide food for thought as we pursue this quest.
