AI Performance Myths: Do IOPS Actually Matter?

By Petros Koutoupis, Product Manager, VDURA, Oct 2, 2025 (LinkedIn Blog Post)

With all the buzz around Artificial Intelligence (AI) and Machine Learning (ML), it’s easy to lose sight of which High-Performance Computing (HPC) storage requirements are essential to deliver real, transformative value for your organization.

When evaluating a data storage solution, one of the most common performance metrics is Input/Output Operations Per Second (IOPS). It has long been the standard for measuring storage performance, and depending on the workload, a system’s IOPS can be critical.

In practice, when a vendor advertises IOPS, they are really showcasing how many discontiguous 4 KiB reads or writes the system can handle under the worst-case scenario of fully random I/O. Measuring storage performance by IOPS is only meaningful if the workloads are IOPS-intensive (e.g., databases, virtualized environments, or web servers). But as we move into the era of AI, the question remains: do IOPS still matter?
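
To put an advertised IOPS figure into bandwidth terms, a quick back-of-envelope calculation helps; the 4 KiB transfer size and the one-million-IOPS figure below are illustrative assumptions, not measurements of any particular system.

# Back-of-envelope: what an advertised IOPS number means in bandwidth terms.
# The 4 KiB transfer size and IOPS figure are illustrative assumptions.
IO_SIZE_BYTES = 4 * 1024            # 4 KiB per random operation
advertised_iops = 1_000_000         # hypothetical vendor headline number

bandwidth_gb_s = advertised_iops * IO_SIZE_BYTES / 1e9
print(f"{advertised_iops:,} x 4 KiB IOPS moves roughly {bandwidth_gb_s:.1f} GB/s")
# prints: 1,000,000 x 4 KiB IOPS moves roughly 4.1 GB/s

Even a headline-grabbing IOPS number translates to only a few GB/s of actual data movement, which is the first hint that IOPS alone says little about keeping GPUs busy.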

A Breakdown of Your Standard AI Workload

AI workloads run across the entire data lifecycle, and each stage places its own demands on GPU compute (with CPUs supporting orchestration and preprocessing), storage, and data management resources. Here are some of the most common stages you’ll come across when building and rolling out AI solutions.

Figure 1: AI Workflows

Data Ingestion & Preprocessing

During this stage, raw data is collected from sources such as databases, social media platforms, IoT devices, and APIs, then fed into AI pipelines to prepare it for analysis. Before that analysis can happen, however, the data must be cleaned: inconsistencies and corrupt or irrelevant entries are removed, missing values are filled in, and formats (such as timestamps or units of measurement) are aligned, among other tasks.
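
As a minimal sketch of what this cleanup can look like in practice, the snippet below uses pandas; the file names, column names, and cleaning rules are illustrative assumptions rather than a prescribed pipeline.

# Minimal preprocessing sketch (pandas). File names, column names, and rules
# are illustrative assumptions, not a prescribed pipeline.
import pandas as pd

df = pd.read_csv("raw_events.csv")                            # hypothetical raw export
df = df.drop_duplicates()                                     # remove duplicate entries
df = df.dropna(subset=["device_id"])                          # drop rows missing a key field
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)   # align timestamp formats
df["temperature_c"] = (df["temperature_f"] - 32) * 5 / 9      # align units of measurement
df["reading"] = df["reading"].fillna(df["reading"].median())  # fill missing values
df.to_parquet("clean_events.parquet")                         # hand off to the next stage

The pattern matters more than the specifics: large batches of raw records are read, transformed, and written back to storage for the training stage, which is throughput-heavy rather than IOPS-heavy work.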

Model Training

After the data is prepped, it’s time for the most demanding phase: training. Here, large language models (LLMs) are built by processing data to spot patterns and relationships that drive accurate predictions. This stage leans heavily on high-performance GPUs, with frequent checkpoints to storage so training can quickly recover from hardware or job failures. In many cases, some degree of fine-tuning or similar adjustments may also be part of the process.
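
The sketch below shows the basic shape of a training loop with periodic checkpointing in PyTorch; the tiny model, synthetic data, and checkpoint interval are placeholders, not a recommended configuration.

# Training-loop sketch with periodic checkpointing (PyTorch). The model, data,
# and checkpoint interval are placeholders for illustration only.
import torch
from torch import nn

model = nn.Linear(128, 10)                                # stand-in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
CHECKPOINT_EVERY = 100                                    # steps between checkpoints

for step in range(1_000):
    inputs = torch.randn(32, 128)                         # synthetic batch
    targets = torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

    if step % CHECKPOINT_EVERY == 0:
        # Model and optimizer state stream out as large sequential writes so the
        # job can resume from the latest checkpoint after a failure.
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            f"checkpoint_{step:08d}.pt",
        )

Each checkpoint is a burst of large sequential writes; the faster storage can absorb it, the sooner the GPUs get back to training.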

Fine-Tuning

Model training typically involves building a foundation model from scratch on large datasets to capture broad, general knowledge. Fine-tuning then refines this pre-trained model for a specific task or domain using smaller, specialized datasets, enhancing its performance.
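
One simple way to picture fine-tuning is freezing the pre-trained layers and training only a small task-specific head, as in the sketch below; the backbone is a stand-in for a loaded foundation-model checkpoint and the data is synthetic.

# Fine-tuning sketch (PyTorch): freeze a pre-trained backbone and train only a
# new task-specific head. Backbone and data are illustrative stand-ins.
import torch
from torch import nn

backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())    # pretend pre-trained layers
head = nn.Linear(64, 3)                                    # new head for a 3-class task

for param in backbone.parameters():
    param.requires_grad = False                            # keep general knowledge frozen

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)   # update only the head
loss_fn = nn.CrossEntropyLoss()

for step in range(200):                                    # smaller, specialized dataset
    inputs = torch.randn(16, 128)                          # synthetic stand-in batch
    targets = torch.randint(0, 3, (16,))
    optimizer.zero_grad()
    loss = loss_fn(head(backbone(inputs)), targets)
    loss.backward()
    optimizer.step()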

Model Inference

Once trained, the AI model can make predictions on new, rather than historical, data by applying the patterns it has learned to generate actionable outputs. For example, if you show the model a picture of a dog it has never seen before, it will predict: “That is a dog.”
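
In code, inference is just a forward pass with gradients disabled; the sketch below uses a placeholder classifier and label set rather than a real trained image model.

# Inference sketch (PyTorch): apply a trained classifier to new, unseen data.
# The model and label set are placeholders, not a real trained image model.
import torch
from torch import nn

labels = ["cat", "dog", "bird"]
model = nn.Linear(128, len(labels))        # stand-in for a trained classifier
model.eval()                               # inference mode: no weight updates

new_sample = torch.randn(1, 128)           # features of an input the model has never seen
with torch.no_grad():                      # no gradients needed for prediction
    probs = torch.softmax(model(new_sample), dim=-1)
    prediction = labels[probs.argmax(dim=-1).item()]
print(f"That is a {prediction}.")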

How High-Performance File Storage is Affected

An HPC parallel file system breaks data into chunks and distributes them across multiple networked storage servers. This allows many compute nodes to access the data simultaneously at high speeds. As a result, this architecture has become essential for data-intensive workloads, including AI.
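
The toy sketch below illustrates the striping idea: a file is split into fixed-size chunks placed round-robin across storage servers, so many clients can read different parts of it in parallel. Real parallel file systems do this at the block or object layer with far more sophistication; the chunk size and server count here are purely illustrative.

# Toy illustration of striping: fixed-size chunks are distributed round-robin
# across storage servers. Chunk size and server count are illustrative.
CHUNK_SIZE = 4                                    # bytes per chunk, tiny for readability
NUM_SERVERS = 3

data = b"parallel file systems stripe data across many servers"
chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

placement = {server: [] for server in range(NUM_SERVERS)}
for index, chunk in enumerate(chunks):
    placement[index % NUM_SERVERS].append(chunk)  # round-robin chunk placement

for server, stored in placement.items():
    print(f"server {server}: {stored}")           # each server holds part of the file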

During the data ingestion phase, raw data comes from many sources, and parallel file systems may play a limited role. Their importance increases during preprocessing and model training, where high-throughput systems are needed to quickly load and transform large datasets. This reduces the time required to prepare datasets for both training and inference.

Checkpointing during model training periodically saves the current state of the model to protect against progress loss from interruptions. This process requires all nodes to save the model’s state simultaneously, demanding high peak storage throughput to keep checkpointing time minimal. Insufficient storage performance during checkpointing can extend training times and increase the risk of data loss.
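
A quick back-of-envelope calculation shows why checkpointing stresses peak write bandwidth; the checkpoint size and time budget below are illustrative assumptions.

# Back-of-envelope: aggregate write bandwidth needed to keep checkpoint pauses
# short. Checkpoint size and time budget are illustrative assumptions.
checkpoint_size_tb = 1.0      # total model + optimizer state across all nodes
time_budget_s = 60.0          # how long the GPUs may sit idle while saving

required_gb_s = checkpoint_size_tb * 1000 / time_budget_s
print(f"Writing {checkpoint_size_tb:.1f} TB in {time_budget_s:.0f} s needs "
      f"about {required_gb_s:.1f} GB/s of aggregate write bandwidth")
# prints: Writing 1.0 TB in 60 s needs about 16.7 GB/s of aggregate write bandwidth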

Figure 2: AI workflows with access to a Parallel File System

It is evident that AI workloads are driven by throughput, not IOPS. Training large models requires streaming massive sequential datasets, often gigabytes to terabytes in size, into GPUs. The real bottleneck is aggregate bandwidth (GB/s or TB/s), rather than handling millions of small, random I/O operations per second. Inefficient storage can create bottlenecks, leaving GPUs and other processors idle, slowing training, and driving up costs.
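
The same arithmetic applies to keeping GPUs fed with training data; the per-GPU read rate and cluster size below are illustrative assumptions, but they show how quickly the aggregate requirement climbs toward the TB/s range.

# Back-of-envelope: aggregate sequential read bandwidth to keep a GPU cluster
# fed with training data. Per-GPU rate and GPU count are illustrative.
num_gpus = 512
per_gpu_read_gb_s = 2.0        # sustained read rate each GPU's data loader needs

aggregate_gb_s = num_gpus * per_gpu_read_gb_s
print(f"{num_gpus} GPUs x {per_gpu_read_gb_s} GB/s is roughly {aggregate_gb_s:,.0f} GB/s "
      f"(about {aggregate_gb_s / 1000:.0f} TB/s) of aggregate read bandwidth")
# prints: 512 GPUs x 2.0 GB/s is roughly 1,024 GB/s (about 1 TB/s) of aggregate read bandwidth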

Requirements based solely on IOPS can significantly inflate the storage budget or rule out the most suitable architectures. Parallel file systems, on the other hand, excel in throughput and scalability. When production file systems are engineered to hit specific IOPS targets, they often end up over-built, adding cost and unnecessary capabilities rather than being designed for optimal throughput.

Conclusion

In conclusion, AI workloads demand high-throughput storage rather than high IOPS. While IOPS has long been a standard metric, modern AI, particularly during data preprocessing, model training, and checkpointing, relies on moving massive sequential datasets efficiently to keep GPUs and compute nodes fully utilized. Parallel file systems provide the necessary scalability and bandwidth to handle these workloads effectively, whereas focusing solely on IOPS can lead to over-engineered, costly solutions that do not optimize training performance. For AI at scale, throughput and aggregate bandwidth are the true drivers of productivity and cost efficiency.

About the Author

Petros Koutoupis has spent more than two decades in the data storage industry, working for companies including Xyratex, Cleversafe/IBM, Seagate, Cray/HPE, and now VDURA. In addition to his engineering work, he is a technical writer and reviewer of books and articles on data storage and open-source technologies and has previously served on the editorial board of Linux Journal magazine.

About VDURA

VDURA builds the world’s most powerful data platform for AI and high-performance computing, blending flash-first speed with hyperscale capacity and 12-nines durability, all delivered with breakthrough simplicity. Visit vdura.com for more information.

The future of HPC and AI is still being written, and VDURA is at the forefront, driving innovation.