
The Resilience Gap Holding Back AI Performance

Chris Girard

Disaster Recovery Journal – Global spending on AI is growing at an enormous rate, yet the underlying narrative is shifting away from the early hype cycle and towards the realities of delivering ROI. According to Gartner, for example, through 2026, 60% of AI projects will be abandoned if they are not supported by AI-ready data, while 63% of organizations either lack the right data management practices or are unsure whether they have them.

So, what’s going on here? In many cases, the problem is not the AI algorithms or the compute power per se; it lies in the storage and data management infrastructure that supports them. Fundamentally, AI models rely on fast, uninterrupted access to reliable data; however, weaknesses in durability, availability, and recovery design often lead to costly interruptions.

Even brief failures can stall training, delay model delivery, and drain budgets. Lost compute leaves GPUs idle while recovery teams assess damage. Retraining requires lost or corrupted data to be rebuilt and reintroduced, adding days or weeks to project timelines. Operations teams spend unplanned time restoring systems instead of advancing new initiatives, while every delay reduces productivity, erodes revenue, and impacts model ROI. Collectively, these challenges create a resilience gap that limits the success and scalability of AI projects.

Where AI infrastructure falls short

AI workloads are extremely demanding in terms of data consumption. Training large models requires uninterrupted access to enormous datasets and the ability to process them repeatedly without delay or corruption. Yet, many enterprise storage environments were never designed to sustain this intensity. Even short interruptions, such as device/server failures or data corruption, can disrupt learning cycles, distort outputs, and create a backlog that slows deployment.

The operational and financial impact can be significant. In broader terms, poor data quality already costs enterprises between $12.9 million and $15 million per year, while data pipeline failures cost around $300,000 per hour in lost insight and missed SLAs. These issues can be costly to put right. Even brief disruptions, such as metadata server failures or network timeouts, and more serious problems such as outright data loss, can have a disastrous impact on overall performance and reliability. In this context, storage reliability depends on ensuring that data is durable and accessible when needed, and that recovery processes are in place to restore service quickly.

A big part of the challenge is that legacy storage architectures built around static tiering and high-availability pairs cannot deliver the durability or resilience modern AI demands. Not only do they introduce latency and restrict scalability, but they also expose workloads to unnecessary risk. For any organization expecting fast ROI from its AI investment (i.e. most of them), this fragility represents a hidden bottleneck where performance losses and reliability failures converge in a perfect storm.

Closing the resilience gap

Addressing this resilience gap means rethinking how data infrastructure is designed, monitored, and, when necessary, recovered. In practical terms, that means moving away from legacy architectures and towards systems purpose-built for data-intensive AI workloads. For example, hybrid storage, which combines SSD performance with HDD capacity, suits organizations that need to balance speed, scale, and cost. Alternatively, for latency-sensitive training or real-time inference use cases, all-flash architectures offer higher throughput at a higher price.
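The hybrid-versus-flash trade-off above can be sketched as a simple placement policy. This is a minimal illustration, not a vendor feature: the access-frequency threshold, the `Dataset` type, and the tier names are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Illustrative threshold (an assumption, not a vendor default): datasets read
# heavily during active training stay on flash; cold data lands on HDD capacity.
HOT_READS_PER_DAY = 100

@dataclass
class Dataset:
    name: str
    reads_per_day: int

def place_tier(ds: Dataset) -> str:
    """Return the storage tier for a dataset under a simple frequency policy."""
    return "ssd" if ds.reads_per_day >= HOT_READS_PER_DAY else "hdd"

print(place_tier(Dataset("training-shards", 5_000)))  # active training data -> ssd
print(place_tier(Dataset("2021-archive", 2)))         # cold archive -> hdd
```

Real tiering engines weigh more signals (latency SLOs, object size, cost per GB), but the core decision, hot data on flash and cold data on disk, has this shape.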

In addition, the resilience of any AI environment also depends on how effectively it can tolerate faults and recover from them. At a technological level, multi-level erasure coding (MLEC) offers greater fault tolerance than traditional RAID by protecting against multiple simultaneous device failures. For organizations handling petabyte-scale datasets (not uncommon in the context of AI), combining MLEC with a hybrid architecture can provide a strong performance-protection balance.
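To make the fault-tolerance idea concrete, the sketch below shows the simplest form of erasure coding: a single XOR parity shard (RAID-5-style) that lets one lost device be rebuilt from the survivors. Production MLEC generalizes this with stronger codes (for example Reed-Solomon) applied at multiple levels, such as within a device group and across nodes, so that several simultaneous failures are survivable; the code here is only the single-parity base case.

```python
def xor_parity(shards):
    """Compute a parity shard as the byte-wise XOR of equal-length data shards."""
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, b in enumerate(shard):
            parity[i] ^= b
    return bytes(parity)

def recover_missing(surviving, parity):
    """Rebuild the single missing shard: XOR of all survivors plus parity."""
    return xor_parity(surviving + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)

# Simulate losing one device: shard 1 is gone, but it can be reconstructed.
rebuilt = recover_missing([data[0], data[2]], parity)
assert rebuilt == data[1]
```

Single parity tolerates exactly one failure; the point of multi-level schemes is that adding parity at each layer multiplies the failure combinations the system can absorb without data loss.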

Operational processes are just as important. For example, automated data integrity checks can detect and isolate corruption before it enters the training pipeline, while regularly scheduled recovery drills, designed to simulate realistic fault conditions, can ensure restoration processes can be executed within the tight timeframes AI production demands.
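An automated integrity check of the kind described above can be as simple as verifying content digests against a trusted manifest before data enters the training pipeline. This is a minimal sketch using SHA-256 from the standard library; the shard names and in-memory blobs are illustrative stand-ins for files or objects.

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    """Hex digest of a data blob."""
    return hashlib.sha256(data).hexdigest()

def find_corrupted(manifest: dict, blobs: dict) -> list:
    """Return names of blobs whose current digest no longer matches the manifest."""
    return [name for name, digest in manifest.items()
            if sha256_digest(blobs[name]) != digest]

# Build the manifest at ingest time, when the data is known-good.
blobs = {"shard-0": b"clean training data", "shard-1": b"more training data"}
manifest = {name: sha256_digest(data) for name, data in blobs.items()}

blobs["shard-1"] = b"silently corrupted"   # simulate bit rot or a bad write
print(find_corrupted(manifest, blobs))     # corrupted shards are isolated, not trained on
```

Running a check like this on a schedule, and quarantining anything it flags, is what keeps corruption from silently distorting model outputs downstream.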

Collectively, these measures should be embedded as ongoing operational practices rather than one-time efforts. As part of this process, continuous monitoring, regular fault simulation, and alignment with governance frameworks can all help ensure resilience becomes a measurable attribute of AI infrastructure, rather than an afterthought in the event of system failure.

Planning ahead

Given the increasing emphasis on AI systems globally, these issues will only become more significant over time. AI infrastructure must evolve in line with the workloads it supports and, as data volumes increase and models grow more complex, scalability and flexibility will continue to shape whether projects succeed. For many, the best approach will be to adopt modular, vendor-neutral architectures that can expand capacity or performance without full system replacement.

Ideally, performance and resilience planning will anticipate future data growth and changing fault-tolerance requirements, balancing these against other priorities. By treating resilience as an ongoing discipline rather than a static goal or a reactive process, organizations can build AI environments that adapt and consistently deliver on organizational objectives.