How AI is forcing a complete rethink of global data storage infrastructure

Nov 5, 2025 (Intelligentdatacentres) AI is reshaping the future of data centres. With workloads scaling at unprecedented levels, traditional storage is no longer enough. In this article, VDURA CEO Ken Claffey explains why durability, lifecycle management and resilience are critical to the next wave of Digital Transformation – and how organisations can prepare for AI’s growing demands.

Surging demand for AI-ready infrastructure is driving another massive wave of investment into the global data centre sector. According to McKinsey, AI has become ‘the key driver of growth in demand for data centre capacity’ with overall requirements predicted to ‘almost triple by 2030, with about 70% of that demand coming from AI workloads.’

It’s not just the likes of McKinsey making these big predictions. A study from the World Economic Forum (WEF) currently values the global data centre industry at US$242.7 billion – a figure it expects to more than double by 2032, to over US$584 billion.

Wherever you look, the underlying theme is the same: AI is transforming data centres into one of the world’s fastest-growing infrastructure markets.

The levels of investment required to meet global demand are eye-watering. McKinsey claims that in a scenario where ‘accelerated growth’ dictates demand, the sector will require almost US$8 trillion in global data centre capex by 2030. The largest share of this figure (US$3.1 trillion) is expected to be channelled into semiconductors, servers and storage – a truly remarkable prospect for the technology developers and hardware suppliers at the core of the compute ecosystem.

Out with the old

Take storage as an example. To meet the performance demands of AI systems, organisations everywhere are re-examining their approach to storage architecture. This is not without its challenges – traditional solutions were geared towards conventional enterprise use cases, where databases and business applications created, in general terms, predictable sequential workloads.

As a result, and in contrast to the demands AI now places on storage, organisations have been able to plan their storage provision with a reasonable degree of predictability in both scale and timing. In the enterprise IT context, for example, a payroll system might process transactions in overnight batches, placing relatively uniform demands on storage. Reliability still matters in these pre-AI use cases, but access patterns are steady and don’t involve thousands of concurrent GPU processes hammering storage at once.

The arrival of advanced AI systems on the scale currently being seen is, however, game-changing. Training AI models is dependent on systems being able to read from massive, unstructured datasets (such as text, images, video and sensor logs, among many others) that are distributed and accessed in random, parallel bursts.

Instead of a few apps making steady or relatively predictable requests, a business might be running tens of thousands of GPU threads, all of which need storage that can deliver extremely high throughput, sustain low latency under pressure and handle concurrent access without becoming a bottleneck. When these requirements are not met, the impact on performance and cost can be immediate and severe.

For example, training a large language model can involve petabytes of text and image data being streamed to thousands of GPUs in parallel. If storage cannot feed that data at the required speed, the GPUs sit idle — burning through compute budgets that can run into millions of dollars for a single training run.
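To give a sense of how quickly idle accelerators translate into wasted spend, the back-of-the-envelope sketch below estimates the cost of storage-induced GPU stalls. The cluster size, hourly rate, run length and stall percentage are purely illustrative assumptions, not figures from this article.

```python
# Illustrative only: estimate the cost of GPUs idling while they wait on storage.
# All figures below are assumptions chosen for the example, not real benchmarks.

gpu_count = 4096            # assumed size of the training cluster
hourly_rate_usd = 2.50      # assumed cost per GPU-hour
run_length_hours = 30 * 24  # assumed month-long training run
stall_fraction = 0.15       # assumed share of time GPUs wait on storage I/O

idle_gpu_hours = gpu_count * run_length_hours * stall_fraction
wasted_spend = idle_gpu_hours * hourly_rate_usd

print(f"Idle GPU-hours: {idle_gpu_hours:,.0f}")
print(f"Wasted spend:   ${wasted_spend:,.0f}")
# Under these assumptions, a 15% stall rate burns roughly $1.1M on a single run.
```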

In contrast, inference workloads create a different kind of stress. In financial services, for instance, AI-driven fraud detection must analyse billions of transactions in real time, often within milliseconds. That requires storage systems capable of providing ultra-low latency access to massive historical datasets so that the model can compare live activity against years of prior patterns without delay.

The underlying point is that the legacy approach to storage is fundamentally unsuited to these performance extremes.

In with the new

For organisations reliant on HPC architectures, this is a well-trodden path. In the life sciences sector, for example, research organisations need uninterrupted access to genomic datasets measured in petabytes. A great example is the UK Biobank, which claims to be the world’s most comprehensive dataset of biological, health and lifestyle information. It currently holds about 30 petabytes of biological and medical data on half a million people and is one of many similar projects around the world gathering information at a tremendous rate.

In government, federal agencies face their own challenges in supporting programmes that cannot afford downtime or data loss. Mission-critical applications, such as intelligence analysis and defence simulations, demand 99.999% uptime, and even brief interruptions in availability can compromise security or operational readiness. The list goes on: geophysics use cases face similar storage requirements, with a modern 3D seismic marine survey generating up to 10 petabytes of data for analysis, while in financial services, fraud detection systems must process billions of transactions in real time to prevent losses that can run into millions of dollars per incident.

Suffice it to say that in each case, storage capacity and resilience have become central to determining the success or failure of cutting-edge technology initiatives.

Addressing these challenges requires more than just adding raw capacity. Instead, businesses are rethinking the lifecycle of their data to ensure it is stored, moved and retained in ways that support the performance demands of AI.

Effective lifecycle management can determine whether training runs are completed on schedule, whether inference workloads deliver results in time and whether overall costs remain sustainable, among other variables.

As a result, storage tiering is taking on a greater role in ensuring data is placed in the most appropriate storage architecture, primarily according to its value and frequency of use. For instance, high-performance systems are reserved for the datasets that must be accessed often or at speed, while less critical data is moved to lower-cost environments. This avoids unnecessary expenditure on premium capacity and helps organisations maintain control over spiralling storage footprints, particularly when they need to balance storage assigned to AI use cases compared to other priorities.
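As a rough illustration of the kind of policy involved, the sketch below assigns datasets to tiers based on access frequency and recency. The tier names, thresholds and data fields are simplified assumptions for the example, not a description of any particular product.

```python
# Illustrative sketch of a simple tiering policy: place data according to how
# often and how recently it is accessed. Thresholds and tier names are assumptions.
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    accesses_per_day: float      # observed access frequency
    days_since_last_access: int

def choose_tier(ds: Dataset) -> str:
    if ds.accesses_per_day >= 100:          # hot: active training/inference data
        return "all-flash"
    if ds.days_since_last_access <= 30:     # warm: recently used, may be revisited
        return "hybrid (SSD cache + HDD)"
    return "archive (object/tape)"          # cold: retained for recall or compliance

for ds in [Dataset("training-shards", 5000, 0),
           Dataset("last-quarter-logs", 2, 14),
           Dataset("2019-archive", 0, 900)]:
    print(f"{ds.name:>18} -> {choose_tier(ds)}")
```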

Equally important is the ability to retrieve archived data when required. AI workloads often need to revisit older datasets to refine models or support new lines of inquiry. Without efficient systems for recall, projects risk delay and duplication of effort. More effective data lifecycle management can help ensure that data is not only captured but also kept accessible.

Built to last

In this context, storage reliability depends on ensuring data is durable, accessible when needed and backed by recovery processes that can restore service quickly. Indeed, durability is now fundamental to AI performance. According to a recent study by Gartner, ‘through 2026, organisations will abandon 60% of AI projects unsupported by AI-ready data.’ The firm goes on to state that ‘Sixty-three percent … either do not have or are unsure if they have the right data management practices for AI,’ a situation that ‘will endanger the success of their AI efforts.’

For any organisation investing heavily in AI, that should make for interesting reading. Large-scale AI workloads rely on reliable distributed storage systems to provide uninterrupted access to training data. Even brief disruptions, such as metadata server failures or network timeouts, as well as more serious problems involving data loss, can have a disastrous impact on overall performance and reliability. When failures occur, the cost is not just technical; they can also bring reputational damage, regulatory exposure and direct financial loss.

Poor data quality already drains between US$12.9 million and US$15 million per enterprise annually, while data pipeline failures cost enterprises around US$300,000 per hour (US$5,000 per minute) in lost insight and missed SLAs. In AI environments, that translates directly into stalled model training, wasted GPU resources and delayed time-to-value.

To illustrate, consider the storage issues facing a university research team running simulations on large datasets. If their systems aren’t durable, a single outage could distort results and delay development.

But with durable infrastructure in place, their work continues without interruption, and long-term goals stay on track. The same scenario now applies across AI development and integration projects, but on a much broader scale.

Durability also helps explain why so few AI projects reach production. Given that under a third of enterprises have integrated data silos well enough to support Gen AI, it’s hardly surprising that only 48% of AI projects ever make it into production and that 65% of Chief Data Officers say this year’s AI goals are unachievable, with almost all (98%) reporting major data-quality incidents. Failures of storage reliability, data quality and resilience are at the heart of this shortfall.

So where does this leave organisations that have staked so much on the success of their AI strategies?

At a foundational level, meeting the storage performance and reliability requirements to deliver AI tools that work as intended depends on using technologies that go beyond traditional performance levels. Hybrid systems that integrate SSD speed with HDD capacity, or all-flash systems for latency-sensitive workloads, both have a role to play. What matters is not the choice of medium alone, but the measures taken to ensure durability — continuous monitoring, automated integrity checks and regular recovery testing.

Success also depends on embedding resilience into the core of AI operations. At a technology level, Multi-Level Erasure Coding (MLEC) provides greater fault tolerance than traditional RAID by offering protection against multiple simultaneous failures. For those handling petabyte-scale datasets, combining MLEC with a hybrid architecture can provide an optimal balance. Where real-time access is critical, all-flash systems deliver the lowest latency, albeit with the associated higher cost.
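To make the fault-tolerance comparison concrete, the short sketch below counts the drive failures survivable under a single-level scheme versus a simplified two-level (node plus drive) layout. The node counts and parity widths are illustrative assumptions, not the parameters of any specific MLEC implementation.

```python
# Illustrative comparison: single-level erasure coding vs a simplified
# two-level (node + drive) layout. All parameters are assumptions.

# Single level: one 8+2 erasure-coded group across ten drives.
single_data, single_parity = 8, 2
print(f"Single-level {single_data}+{single_parity}: survives any "
      f"{single_parity} simultaneous drive failures")

# Two levels: 8+2 across nodes (network level), plus 8+2 across the drives
# inside each node (local level). Whole-node losses are absorbed at the
# network level, while each surviving node can also lose drives up to its
# local parity before the network level is even needed.
nodes, node_parity = 10, 2
drives_per_node, drive_parity = 10, 2

max_tolerable = node_parity * drives_per_node + (nodes - node_parity) * drive_parity
print(f"Two-level layout: tolerates {node_parity} full node losses "
      f"({node_parity * drives_per_node} drives) plus {drive_parity} drive failures "
      f"in each remaining node, i.e. up to {max_tolerable} drive failures "
      f"in a favourable pattern")
```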

Operational measures are equally important. Automated data integrity checks can detect and isolate corruption before it enters the training pipeline. Regularly scheduled recovery drills, designed to simulate realistic fault conditions, ensure that restoration processes can be executed within the tight timeframes AI production demands. By aligning these measures with data governance and compliance frameworks, organisations can minimise both technical risk and regulatory exposure.
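As one simple example of automated integrity checking, the sketch below verifies dataset files against previously recorded SHA-256 digests before they are admitted to a training pipeline. The manifest format and file paths are hypothetical, chosen only to illustrate the idea.

```python
# Illustrative integrity check: verify dataset files against recorded SHA-256
# digests before they enter a training pipeline. Manifest format is assumed.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest_path: Path) -> list[str]:
    """Return the files whose current hash no longer matches the manifest."""
    manifest = json.loads(manifest_path.read_text())  # {"file.bin": "<hex digest>", ...}
    corrupted = []
    for name, expected in manifest.items():
        if sha256_of(manifest_path.parent / name) != expected:
            corrupted.append(name)
    return corrupted

if __name__ == "__main__":
    bad = verify_manifest(Path("dataset/manifest.json"))  # hypothetical path
    if bad:
        raise SystemExit(f"Quarantine before training: corrupted files {bad}")
    print("All dataset files passed integrity checks.")
```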

Looking ahead, AI workloads will continue to scale, and so will the storage systems that support them. Ideally, architectures should be modular, enabling capacity or performance components to be added without wholesale replacement. Here, vendor-neutral solutions help to avoid lock-in, ensuring that infrastructure can adapt to new technologies, such as higher-density storage media, and to more demanding fault-tolerance requirements.

To minimise risk, scalability should always be planned with an eye on both data growth and workload evolution. This includes anticipating the arrival of more complex AI models and use cases, both of which may change performance priorities.

Without the right technologies in place, however, we’ll undoubtedly see more headlines around the failure of AI investments to deliver.

Get it right and organisations can look forward to a win-win scenario where storage performance and reliability support our increasing reliance on AI.