AI driving data center – and storage – transformation

Source: Chris Mellor (Blocks & Files)

Research house Silicon Angle says the Gen AI surge is driving an evolution away from traditional data centers towards ones that are accelerated, far more scalable, and equipped with an agentic control plane.

These are, its study says, AI factories, and the data center technology stack is “flipping away from general CPU-centric systems to GPU-centric accelerated compute, optimized for parallel operations and purpose built for artificial intelligence.”

This switchover is happening because traditional data centers can’t support Gen AI workloads and the automated, governed data planes built for AI throughput at massive scale. These workloads need AI factories that “transform raw data into versatile AI outputs – for example, text, images, code, video and tokens – through automated, end-to-end processes. Those processes integrate data pipelines, model training, inference, deployment, monitoring and continuous improvement so intelligence is produced at massive scale,” Silicon Angle writes.
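
To make that loop concrete, here is a minimal Python sketch of such an automated cycle. Every stage, function name and path below is an illustrative placeholder of ours, not anything prescribed by the Silicon Angle study.

    import logging

    # Illustrative stages of an automated, end-to-end AI factory loop:
    # ingest -> train -> deploy -> monitor -> feed results back into the cycle.
    # All names here are hypothetical placeholders.

    def ingest(raw_source: str) -> list[str]:
        """Pull raw data into the pipeline (placeholder)."""
        return [f"record-{i}-from-{raw_source}" for i in range(3)]

    def train(dataset: list[str]) -> str:
        """Train or fine-tune a model on the dataset (placeholder)."""
        return f"model-v1-trained-on-{len(dataset)}-records"

    def deploy(model: str) -> str:
        """Publish the model behind an inference endpoint (placeholder)."""
        return f"https://inference.example.internal/{model}"

    def monitor(endpoint: str) -> dict:
        """Collect quality metrics that drive continuous improvement (placeholder)."""
        return {"endpoint": endpoint, "drift": 0.02}

    if __name__ == "__main__":
        logging.basicConfig(level=logging.INFO)
        data = ingest("s3://datalake/raw")           # data pipeline
        model = train(data)                          # model training
        endpoint = deploy(model)                     # deployment / inference
        metrics = monitor(endpoint)                  # monitoring
        logging.info("cycle complete: %s", metrics)  # feeds the next iteration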

In these types of systems, the analysts write, storage is disaggregated by default. “High-performance I/O uses NVMe and parallel file systems for checkpointing and shard reads. Less active tiers use cheaper object stores for datasets, models and artifacts; archive tiers retain lineage and snapshot versions. Ultra-high-performance data movers prefetch and stage data to keep GPUs busy with small files and metadata prioritized to keep data flowing.”
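
As a rough illustration of that staging pattern, the Python sketch below prefetches shards from a cheaper capacity tier onto local NVMe scratch space ahead of a consuming training loop. The mount paths, queue depth and plain file copy are assumptions standing in for a real high-performance data mover.

    import os
    import queue
    import shutil
    import threading

    # Hypothetical paths: a capacity/object tier mounted at SLOW_TIER and a local
    # NVMe scratch area at FAST_TIER used to stage shards ahead of the GPUs.
    SLOW_TIER = "/mnt/object-store/datasets/train"
    FAST_TIER = "/mnt/nvme-scratch/staged"

    def prefetch_worker(shard_names: list[str], staged: "queue.Queue[str]") -> None:
        """Copy shards from the capacity tier to NVMe ahead of consumption."""
        os.makedirs(FAST_TIER, exist_ok=True)
        for name in shard_names:
            src = os.path.join(SLOW_TIER, name)
            dst = os.path.join(FAST_TIER, name)
            shutil.copyfile(src, dst)   # stand-in for an RDMA/NVMe-oF data mover
            staged.put(dst)             # signal the consumer that data is local
        staged.put("")                  # empty string marks the end of the list

    def training_loop(staged: "queue.Queue[str]") -> None:
        """Consume shards from fast local storage, keeping accelerators busy."""
        while True:
            shard = staged.get()
            if not shard:
                break
            # ... feed the shard to the GPU data loader here ...
            os.remove(shard)            # evict once consumed to free NVMe space

    if __name__ == "__main__":
        shards = sorted(os.listdir(SLOW_TIER))
        q: "queue.Queue[str]" = queue.Queue(maxsize=8)   # bounded lookahead
        t = threading.Thread(target=prefetch_worker, args=(shards, q), daemon=True)
        t.start()
        training_loop(q)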

Most enterprise-type organizations won’t build their own AI factories. Instead “they will consume application programming interfaces and software built on top by firms such as OpenAI, Anthropic PBC, other AI labs and cloud players.” Silicon Angle suggests “enterprise AI will be adopted through access to mega AI factories via APIs and connectors with a software layer that hides underlying primitives and tools complexity that live under the hood.”
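
A hedged sketch of what that consumption model could look like from the enterprise side: a thin client calling a hosted model endpoint over HTTPS. The URL, header and payload fields are hypothetical placeholders, not any particular vendor’s API.

    import os
    import requests

    # Hypothetical hosted-model endpoint; the URL, headers and payload shape stand
    # in for whatever connector layer an enterprise actually uses.
    API_URL = "https://api.example-ai-factory.com/v1/generate"
    API_KEY = os.environ.get("AI_FACTORY_API_KEY", "")

    def generate(prompt: str) -> str:
        """Send a prompt to the remote AI factory and return the generated text."""
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"prompt": prompt, "max_tokens": 256},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json().get("text", "")

    if __name__ == "__main__":
        print(generate("Summarize last quarter's incident reports."))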

This is a devastating outlook for on-prem AI factories, implying that, basically, they won’t exist. Hmm. We think that concerns over exposing proprietary, private data may be downplayed here and that, if you take generalized AI training out of the equation, mini AI factories could well exist in enterprise-level organizations.

Looking at this from the storage system point of view, which suppliers are well-positioned for such an AI factory future?

A quick run-down:

  • DDN – Deeply in bed with Nvidia and AI compute service suppliers and flush with $300 million of Blackstone private equity cash for AI-related development.
  • Dell – Has all the hardware and AI Data Platform needed but lacks disaggregated storage. Project Lightning should fix that.
  • NetApp – Has responded to this Silicon Angle picture with its AFX disaggregated array and AIDE software.
  • HPE – Has adopted disaggregated storage with its Alletra Storage MP line – and has the compute and networking needed. But we can’t see an AI data stack yet.
  • Hitachi Vantara – There is AI-related product news coming as this enterprise storage supplier rediscovers its mojo and regains lost ground fast. No disaggregated storage yet.
  • IBM – This supplier, with its locked-in mainframe customer base, is riding the AI train as hard as it can. No disaggregated storage hardware yet.
  • Pure Storage – Has adopted disaggregated storage with FlashBlade//EXA and has AI-focussed data set management ideas.
  • VAST Data – The pioneer of the AI factory storage approach, with its AI OS and DASE storage.
  • VDURA – It says it’s in the AI data factory game and has hired parallel file system developer Garth Gibson as its first chief technology and AI officer (CTAIO) to reinvent the storage stack for AI.
  • WEKA – It has the fast data delivery needed with its Neural Mesh and Augmented Memory Grid. But we don’t see any upper-level AI data stack activity.

Cloud file services suppliers such as CTERA and Nasuni are racing towards using their stored data to feed customers’ AI data pipelines. Object storage suppliers such as Cloudian, MinIO and Scality are doing the same, with S3-over-RDMA as a key base feature and AI data pipeline additions appearing in their roadmaps.

The main data protection vendors are adding AI data pipeline features along with a strong dose of AI data cyber-resilience. Think Cohesity, Commvault, Rubrik and Veeam as examples.

Software-defined storage suppliers – ones with no hardware capability – will, in our view, have a difficult time adopting a disaggregated storage hardware/software stack, as they have no ready access to disaggregated commodity hardware. To be precise, they can easily access the compute servers needed, but the storage nodes, the commodity-based fast internal networks connecting them to the compute nodes, and the metadata control software required are different things entirely.

Sure, they can source an NVMe flash JBOD, stick an RDMA-capable NIC in it, and get themselves a fast network switch, but the metadata software is key, and it is most decidedly not simple. Datacore, for example, has bought a parallel file system company, Arcastream, and an AI edge-focussed HCI supplier, StarWind. It’s getting its AI Lego block pieces together but hasn’t yet got a fully formed offering.
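
To illustrate just one small corner of that metadata problem, here is a toy Python sketch that deterministically maps logical shards to disaggregated storage nodes by hashing their IDs. The node addresses are made up, and a real metadata service must also handle placement changes, replication, failures and consistency, which is where the difficulty really lies.

    import hashlib
    from dataclasses import dataclass

    # Hypothetical NVMe-oF storage node addresses in a disaggregated cluster.
    STORAGE_NODES = ["nvmeof-node-01:4420", "nvmeof-node-02:4420", "nvmeof-node-03:4420"]

    @dataclass
    class ShardLocation:
        shard_id: str
        node: str

    def locate(shard_id: str) -> ShardLocation:
        """Deterministically map a shard to a storage node by hashing its ID."""
        digest = hashlib.sha256(shard_id.encode()).digest()
        node = STORAGE_NODES[int.from_bytes(digest[:4], "big") % len(STORAGE_NODES)]
        return ShardLocation(shard_id, node)

    if __name__ == "__main__":
        for sid in ("dataset-A/shard-0001", "dataset-A/shard-0002"):
            print(locate(sid))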