Rethinking Storage Reliability in the Age of Exascale AI


Ken Claffey, Aug 27, 2025 (AIJ Guest Post) – According to a recent study by Gartner, “through 2026, organizations will abandon 60% of AI projects unsupported by AI-ready data.” They go on to state that “Sixty-three percent . . . either do not have or are unsure if they have the right data management practices for AI”, a situation that “will endanger the success of their AI efforts.”

For any organization investing in AI, this should make for concerning reading. Central to the challenge is that large-scale AI workloads rely heavily on reliable distributed storage systems to provide uninterrupted access to training data. Even brief disruptions, such as metadata server failures or network timeouts, and more serious problems such as outright data loss, can have a disastrous impact on overall performance and reliability.

These issues can be extremely expensive to put right, and the costs are easily compounded by reputational damage and regulatory action when the fallout is at the extreme end of the scale. Looking more closely at some of the numbers brings the issues into sharp focus. For instance, poor data quality already drains $12.9 to $15 million per enterprise annually, while data pipeline failures cost enterprises around $300,000 per hour in lost insight and missed SLAs.

In this context, storage reliability is dependent on ensuring data is durable, accessible when needed, and that there are recovery processes in place that can restore service quickly. Taken together, how a system is architected to address reliability, availability, and durability determines how resilient the system is. 
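To make “accessible when needed” concrete, availability is often quoted in “nines.” The short sketch below is plain arithmetic (not a vendor figure) showing how much annual downtime each level implies:

```python
# Back-of-envelope: annual downtime implied by each availability level.
# Standard arithmetic only; no vendor-specific figures are assumed.

HOURS_PER_YEAR = 24 * 365  # ignoring leap years

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of downtime per year at the given availability fraction."""
    return (1 - availability) * HOURS_PER_YEAR * 60

for a in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{a:.5f} availability -> ~{downtime_minutes_per_year(a):,.0f} min/yr")
```

The jump from two nines (roughly 5,256 minutes per year) to four nines (roughly 53 minutes) illustrates why recovery processes, not just hardware quality, determine whether an AI pipeline meets its SLAs.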

Consider, for example, a university research team running simulations on large datasets. If their systems aren’t durable, a single outage could distort results and delay development. But with durable infrastructure in place, their work continues without interruption, and long-term goals stay on track. These properties are interdependent: durability is required for a system to achieve availability, but a system can be durable without being available. The same scenario now applies across AI development and integration projects, but on a much broader scale.

Given that under a third of enterprises have integrated data silos well enough to support Gen AI, it’s hardly surprising that only 48% of AI projects ever make it into production and that 65% of Chief Data Officers say this year’s AI goals are unachievable, with almost all (98%) reporting major data-quality incidents. 

Building a solid foundation 

So where does this leave organizations that have staked so much on the success of their AI strategies? At a foundational level, meeting the storage performance and reliability requirements to deliver AI tools that work as intended depends on using technologies that go beyond traditional performance levels. 

To illustrate, from a hardware infrastructure perspective, organizations typically face a choice between hybrid storage technologies, which combine SSD performance with HDD capacity, and all-flash systems. The approach they eventually take will depend on considerations such as performance demands and budget, but in either case, durability must be actively addressed through continuous monitoring and regular recovery testing.

With that issue covered, success also depends on embedding resilience into the core of AI operations. At a technology level, Multi-Level Erasure Coding (MLEC) provides greater fault tolerance than traditional RAID by offering protection against multiple simultaneous failures. For those handling petabyte-scale datasets, combining MLEC with a hybrid architecture can provide an optimal balance. Where real-time access is critical, all-flash systems deliver the lowest latency, albeit with the associated higher cost. 
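To illustrate the layering idea behind multi-level erasure coding, the toy sketch below uses simple XOR parity at two levels: one local parity per shard group, plus one global parity across all shards. Production MLEC systems use Reed-Solomon or similar codes and far larger stripe widths; every function and variable name here is hypothetical.

```python
# Toy sketch of two-level (local + global) erasure coding using XOR parity.
# XOR is used only to illustrate the layering; real MLEC uses stronger codes.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def encode(data_shards, group_size=2):
    """Group data shards; add one local parity per group plus a global parity."""
    groups = [data_shards[i:i + group_size]
              for i in range(0, len(data_shards), group_size)]
    local_parity = [xor_blocks(g) for g in groups]
    global_parity = xor_blocks(data_shards)
    return groups, local_parity, global_parity

def recover_local(group, lost_index, parity):
    """Rebuild a single lost shard in a group from its local parity alone."""
    survivors = [s for i, s in enumerate(group) if i != lost_index]
    return xor_blocks(survivors + [parity])

# Example: four 4-byte shards in two groups of two.
shards = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
groups, local, _global = encode(shards)
rebuilt = recover_local(groups[0], 1, local[0])  # simulate losing "BBBB"
assert rebuilt == b"BBBB"
```

The design point this sketches: a single failure is repaired from the small local group (fast, low network traffic), while the global layer exists to survive additional simultaneous failures, which is where MLEC gains its edge over single-level RAID.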

Operational measures are equally important. Automated data integrity checks can detect and isolate corruption before it enters the training pipeline. Regularly scheduled recovery drills, designed to simulate realistic fault conditions, ensure that restoration processes can be executed within the tight timeframes AI production demands. By aligning these measures with data governance and compliance frameworks, organizations can minimize both technical risk and regulatory exposure. 
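A minimal sketch of such an integrity check, assuming content digests are recorded at ingest and re-verified before training reads the data (function names are illustrative, not taken from any specific product):

```python
import hashlib

# Minimal integrity check: fingerprint data when it enters the pipeline,
# then verify the fingerprint before training consumes it. Any mismatch
# means silent corruption and the sample should be quarantined, not trained on.

def fingerprint(blob: bytes) -> str:
    """Return the SHA-256 hex digest recorded at ingest time."""
    return hashlib.sha256(blob).hexdigest()

def verify(blob: bytes, expected: str) -> bool:
    """True if the blob still matches its recorded digest."""
    return fingerprint(blob) == expected

record = b"training-sample-0001"
digest = fingerprint(record)

assert verify(record, digest)                 # clean data passes
assert not verify(record + b"\x00", digest)   # a single flipped/added byte is caught
```

In practice this runs as a scheduled scrub over stored objects, so corruption is isolated long before a recovery drill or a training run would surface it.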

A long-term approach 

Looking ahead, AI workloads will continue to scale, and so too must the storage systems that support them. Ideally, architectures should be modular, enabling capacity or performance components to be added without wholesale replacement. Here, vendor-neutral solutions help avoid lock-in, ensuring that infrastructure can adapt to new technologies such as higher-density storage media and to evolving fault-tolerance requirements.

To minimize risk, scalability should always be planned with an eye on both data growth and workload evolution. This includes anticipating the arrival of more complex AI models and use cases, both of which may change performance priorities. By doing so, organizations can ensure that storage reliability keeps pace with their growing reliance on AI.