AI disaster recovery planning is years behind AI adoption


Before the current wave of AI adoption, disaster recovery focused on backing up and restoring enterprise applications, databases and all the components of traditional IT infrastructure.

That remains the case today, but enterprises must now also think about AI models, prompts and agents. Can those resources be restored, and how can enterprises verify they remain trustworthy once they are?

“The honest summary is that most organizations’ DR plans in this space are years behind AI adoption,” said Greg Sarich, CIO of Quest Software.

If CIOs and CISOs are going to help their enterprises catch up, they must figure out how to update their disaster recovery plans and test them in advance of real-world incidents.

Disaster in the AI era

When an enterprise is hit with a security incident or an outage in today’s AI-infused environment, the disaster recovery team has a lot to consider, including:

Was the data used to train AI systems compromised?
Was it exfiltrated?
Were AI models poisoned?
Were prompts compromised?

Having the necessary visibility to answer these questions is a challenge, given how much AI touches across enterprise tech stacks.

“If you’re using Claude, it might be touching your Salesforce system and your SharePoint … your Outlook system and other data that you might have in, let’s say, a Snowflake or something else where you have business-critical data,” Sarich said, illustrating how AI creates a web of interconnected dependencies.

“It’s not only the protection of those systems, but it’s all these little intersections that it touches along the way to be able to pull data and then create an outcome,” he said.

As AI becomes more embedded in business processes, enterprises risk operations grinding to a halt, particularly if their teams can no longer revert to manual alternatives.

“If we take an AI assistant copilot or chatbot that goes down, we lose access to the institutional knowledge that our employees are counting on,” said Mehdi Houdaigui, principal, cyber AI leader at Deloitte.

Risk still exists once AI resources are back up and running after an incident. Enterprises must verify the integrity of these resources, but compromises involving underlying data, prompts or models can be difficult to detect.

“The challenge we see there is that the AI might still work. It may still look like, to the untrained eye, that it’s producing confident answers, but those answers may be wrong, incomplete or manipulated,” Houdaigui said.

An enterprise may be able to restore a chatbot, for example, but the disaster continues if people are acting on compromised information it provides.

The blast radius can be considerably larger with AI agents in the picture. “Depending on how sophisticated the agent is, it’s no longer just one system for it to be able to do what it’s intended to do. It has the ability to touch and potentially take action on multiple systems,” Houdaigui added.

The damage can linger long after disaster recovery teams clean up compromised AI agents operating across multiple systems.

“If our employees, our organizations lose confidence in the tools themselves, you’ve got a big gap in just getting further adoption going forward,” Sarich said.


Building an AI disaster recovery plan

As CIOs and CISOs consider how their DR plans need to evolve in response to AI, there are some fundamental steps to help them get started:

Catalog your AI assets. With AI proliferating across different business units — and shadow AI adding another layer of complexity — it can be difficult to have a full understanding of what tools are being used where.

“Start with an AI asset inventory. If you don’t have one, you’ve got to build one quick,” Sarich said. “You can’t recover what you haven’t cataloged.”

Determine each asset’s business criticality. “Anything that’s related to or has AI as part of its foundation in the operation of the business should be priority-one or red-level,” said Chris Millington, global solutions lead, data and cyber resilience at Hitachi Vantara. Customer-facing tools and those that affect revenue have a higher priority, according to Sarich.

Map dependencies. With AI deeply integrated into enterprises’ workflows, it is essential to understand its dependencies. “What data does it use? What model does it rely on? What vendor or vendors are involved? What are the systems that it can access? And most importantly, what credentials does it use?” Houdaigui asked.
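The inventory, criticality and dependency steps can be captured in a single record per asset. What follows is a minimal sketch, not a prescribed schema: the `AIAsset` fields mirror the questions Houdaigui lists, and every asset, system and credential name is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AIAsset:
    """One entry in a hypothetical AI asset inventory."""
    name: str
    owner: str
    criticality: int  # assumption: 1 is the highest priority
    data_sources: list[str] = field(default_factory=list)
    models: list[str] = field(default_factory=list)
    vendors: list[str] = field(default_factory=list)
    accessible_systems: list[str] = field(default_factory=list)
    credentials: list[str] = field(default_factory=list)


def assets_touching(inventory: list[AIAsset], system: str) -> list[AIAsset]:
    """Return assets that read from or can act on `system`, highest priority first."""
    hits = [
        a for a in inventory
        if system in a.data_sources or system in a.accessible_systems
    ]
    return sorted(hits, key=lambda a: (a.criticality, a.name))


# Illustrative entries only; none of these names come from the article.
inventory = [
    AIAsset("sales-forecast-agent", "finance", 2,
            data_sources=["snowflake", "salesforce"],
            models=["hosted-llm"], vendors=["example-vendor"],
            credentials=["svc-forecast"]),
    AIAsset("support-chatbot", "customer-care", 1,
            data_sources=["sharepoint"], models=["hosted-llm"],
            vendors=["example-vendor"],
            accessible_systems=["salesforce"],
            credentials=["svc-chatbot"]),
    AIAsset("hr-faq-bot", "hr", 3, data_sources=["sharepoint"]),
]

affected = assets_touching(inventory, "salesforce")
print([a.name for a in affected])
```

If Salesforce were compromised, a lookup like this would tell the recovery team which AI assets to examine first and which credentials to revoke.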

Evaluate permissions. To effectively recover, IT and security leaders need to know the permissions AI agents and tools have and be able to revoke credentials and kill specific tasks. Then, those AI assets need to be evaluated before they are restored and given permissions again.

“[Verifying] that that agent is operating within what we call these approved boundaries before it goes back online is critical from a disaster recovery perspective,” Houdaigui said.
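One simple way to express an "approved boundaries" check is as a set comparison between what an agent has been granted and what it is approved to hold. The sketch below assumes permissions can be written as plain strings such as `salesforce:read`; that format and the function names are illustrative, not taken from any specific product.

```python
def out_of_bounds(granted: set[str], approved: set[str]) -> set[str]:
    """Return permissions the agent holds that are not on its approved list."""
    return granted - approved


def safe_to_restore(granted: set[str], approved: set[str]) -> bool:
    """An agent goes back online only if it holds no unapproved permissions."""
    return not out_of_bounds(granted, approved)


# Hypothetical permission strings for a single agent.
approved = {"salesforce:read", "sharepoint:read"}
granted = {"salesforce:read", "salesforce:write"}

print(out_of_bounds(granted, approved))    # the permissions to revoke
print(safe_to_restore(granted, approved))  # False until they are revoked
```

A check like this covers only the permission side of the evaluation; it says nothing about whether the agent's model, prompts or data are still trustworthy.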

Define recovery objectives. Organizations need to define their recovery time objective and recovery point objective, Houdaigui noted. How much AI-related downtime can an enterprise tolerate, and how much data can it afford to lose? What are the last known trusted versions of its models, prompts and data?
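Both objectives reduce to simple time arithmetic once snapshots of models, prompts and data are versioned and marked as trusted or not. The following is a rough sketch under that assumption. The `Snapshot` type, the `trusted` flag and all timestamps are hypothetical, and the data-loss window is measured from the snapshot to the start of the incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    version: str
    taken_at: datetime
    trusted: bool  # assumption: set by whatever validation the team runs


def last_trusted(snapshots: list[Snapshot],
                 incident_start: datetime) -> Optional[Snapshot]:
    """Most recent trusted snapshot taken before the incident began."""
    candidates = [s for s in snapshots
                  if s.trusted and s.taken_at < incident_start]
    return max(candidates, key=lambda s: s.taken_at, default=None)


def meets_rpo(snapshot: Snapshot, incident_start: datetime,
              rpo: timedelta) -> bool:
    """RPO: the window of lost data must not exceed the objective."""
    return incident_start - snapshot.taken_at <= rpo


def meets_rto(incident_start: datetime, restored_at: datetime,
              rto: timedelta) -> bool:
    """RTO: time from incident to restored service must not exceed the objective."""
    return restored_at - incident_start <= rto


incident = datetime(2025, 1, 10, 12, 0)
snapshots = [
    Snapshot("v1", datetime(2025, 1, 9, 0, 0), trusted=True),
    Snapshot("v2", datetime(2025, 1, 10, 6, 0), trusted=True),
    Snapshot("v3", datetime(2025, 1, 10, 11, 0), trusted=False),
]

target = last_trusted(snapshots, incident)
print(target.version)                                        # v2
print(meets_rpo(target, incident, timedelta(hours=8)))       # True: 6-hour gap
print(meets_rto(incident, datetime(2025, 1, 10, 17, 30),
                timedelta(hours=4)))                         # False: 5.5 hours
```

Note that the newest snapshot is skipped here because it is not trusted, which is the case the article warns about: the most recent copy is not necessarily the right recovery point.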

DR plans also need to define the necessary testing and validation steps before recovering and bringing AI infrastructure back online.

“There are significantly more steps involved with AI systems because of the complexity that the systems have inherently just by being probabilistic in nature,” Houdaigui explained.
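Because outputs are probabilistic, one possible validation step is to replay a fixed set of known-good prompts against the restored system and require a minimum pass rate rather than exact matches. This is a sketch only: the stub model, the golden cases and the 0.95 threshold are all assumptions for illustration, and a real check would call the restored system instead of a dictionary lookup.

```python
from typing import Callable


def pass_rate(model: Callable[[str], str],
              golden_cases: list[tuple[str, str]]) -> float:
    """Fraction of prompts whose answer contains the expected phrase."""
    passed = sum(
        1 for prompt, must_contain in golden_cases
        if must_contain.lower() in model(prompt).lower()
    )
    return passed / len(golden_cases)


def ready_for_traffic(model: Callable[[str], str],
                      golden_cases: list[tuple[str, str]],
                      threshold: float = 0.95) -> bool:
    """Gate the restored system on a minimum pass rate."""
    return pass_rate(model, golden_cases) >= threshold


# Stand-in for a restored chatbot; one answer has been tampered with.
canned_answers = {
    "What is the refund window?": "Refunds are accepted within 30 days.",
    "Who approves travel?": "Your direct manager approves travel.",
    "Where do I reset my password?": "Use the self-service portal.",
    "What is the support email?": "Write to attacker@example.com.",
}


def stub_model(prompt: str) -> str:
    return canned_answers.get(prompt, "")


golden_cases = [
    ("What is the refund window?", "30 days"),
    ("Who approves travel?", "direct manager"),
    ("Where do I reset my password?", "self-service portal"),
    ("What is the support email?", "support@example.com"),
]

print(pass_rate(stub_model, golden_cases))          # 0.75
print(ready_for_traffic(stub_model, golden_cases))  # False
```

Substring matching is deliberately crude. It catches the kind of confident but manipulated answer described earlier only when a golden case happens to cover it, so the case set needs to grow as the system changes.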

Test and validate. A disaster recovery plan is of little use to anyone if it sits on a shelf collecting dust until the panic of an incident. Testing is key, and annual or quarterly tests are inadequate, given the pace of AI change. New tools, new dependencies and new risks are part and parcel of the AI era.

As enterprises test, they need to consider all the potential gaps in their DR plans and fill them.

“Ask what happens if the knowledge base is corrupted or if we lose access to one of the LLM models; APIs are unavailable for whatever reason. What happens if an agent behaves unexpectedly, or if we have any instances of potential compromise where we don’t believe the logs can be trusted?” Houdaigui said. “Those exercises will … help to reveal gaps fairly quickly.”

When disaster strikes

As much as AI is changing operations, the old cybersecurity adage, “It’s not if, it’s when,” remains the same. If AI deployment continues to outstrip governance, incidents that stem from and affect agents and tools are going to happen.

Recent research from Proofpoint found that 42% of 1,400 security professionals surveyed have experienced AI-related incidents, either suspicious or confirmed. Additionally, 52% of the surveyed security professionals said they do not have full confidence that their organizations’ security controls could detect compromised AI.

Enterprises are already contending with incidents that impact their AI resources, and Sarich anticipates that sooner or later there will be a significant event that thrusts AI disaster recovery into the spotlight.

“We’re going to see something major happening, I’m sure, in the not-too-distant future,” he said.

Whether it is a large-scale public event or not, enterprises will have to turn to their disaster recovery plans, work through them and then conduct a postmortem to make that plan stronger for the next incident. Enterprise teams will have to ask key questions like, “What point did we recover back to and was that acceptable, or can I optimize that even further?” Millington said.

The missing metric in AI resilience

As disaster recovery strategies mature in response to the complexity of enterprise AI, a big question remains unanswered:

Can enterprises quantify the losses associated with an outage, breach or other incident that affects their AI resources?

Houdaigui argued that the industry has yet to align on how to quantify cyber risk, let alone on losses associated with AI. “There is an opportunity for the industry as a whole to really look at: What is the quantifiable loss exposure or risk impact of these systems?” he added.

As enterprises gain a clearer understanding of the operational and financial consequences of AI-related incidents, the cost of disaster recovery and resilience may finally begin to catch up with AI deployment.
