Velocity • Durability

White Paper

Data Storage Infrastructure for AI & HPC

Built for AI Factory Production Needs
VDURA’s Hyperscale-Inspired HYDRA Architecture

Runtime: 5-minute read

Audience: AI & HPC leaders, architects, DevOps

Primary themes: Performance · Economics · Simplicity

01

Executive Summary

In AI factories where every hour of GPU idle time costs millions and production pipelines must run 24/7, you need infrastructure that delivers consistent performance, infinite scale, and unbreakable reliability without constant tuning or fragility.
VDURA’s HYDRA architecture (High-Performance, Yield-Optimized, Distributed, Resilient Architecture) is a hyperscale-inspired, software-defined platform that evolves the proven PanFS parallel file system into a unified data platform, combining high-velocity parallel file access with object-grade resilience and intelligent mixed-fleet tiering.
Think of it as a reliable fleet of high-performance vehicles powering your AI production line: dependable, scalable from compact to heavy-duty models, self-managing failures, and optimized for long-term efficiency, rather than temperamental race cars that demand constant attention.
HYDRA powers the VDURA Data Platform (VDP) V11, delivering terabytes-per-second throughput potential, linear scalability to exabytes and thousands of nodes, and up to 12 nines of durability, all through pure software on commodity hardware. It addresses the full AI pipeline, from ingest and training to checkpointing and inference, while minimizing TCO amid SSD price pressures.
This White Paper explores architecture, components, and real-world benefits for Neoclouds and AI Factories.

TB/s+ · Throughput Potential

Exabyte · Linear Scalability

12 Nines · Data Durability

1,000+ · Production Deployments

02

VDURA Scales Because Innovation Scales

The platform layers scalable, software-defined components to support diverse AI and HPC workloads seamlessly:
HYDRA Architecture
Built upon the true parallel file system technology of PanFS for linear scaling, native mixed-fleet (NVMe flash + SATA HDD) support, intelligent tiering, and automated resilience.
Distributed Microservices
Enables modular deployment across a broad ecosystem of on-premises hardware or cloud environments.
API-Driven Observability
Monitors and optimizes performance and capacity balancing with proactive healing.
Our software-defined model dynamically provisions resources, ensures high availability of data and metadata, and orchestrates data resiliency across commodity servers, delivering hyperscaler efficiency to every organization.
03

The AI Pipeline Challenge

AI workloads redefine storage demands. Each pipeline stage has a unique I/O profile, and legacy systems create bottlenecks: GPU stalls, metadata overload, flash waste, and manual tuning overhead.

Key Challenges

HYDRA addresses these with direct parallel access, VeLO metadata acceleration, automated tiering, and self-healing, ensuring sustained GPU saturation and uninterrupted production.

AI Pipeline Stages & Optimizations

Stage | Key Requirements | HYDRA Solution
Data Ingest | High-volume writes, capacity | Parallel ingestion + HDD tiering
Model Load / Training | High-throughput reads/writes | NVMe flash prioritization + massive parallel I/O
Checkpointing | Very high burst writes | Fast parallel writes with linear scaling, MLEC recovery
Fine-Tuning / Inference | Low-latency reads, small files | VeLO for billions of ops/sec, flash acceleration
Archive / Retention | Cost-efficient capacity | HDD expansion + intelligent data placement
04

The HYDRA Architecture

HYDRA is our hyperscale-inspired architecture powering VDP V11, with PanFS as the foundational parallel file system significantly enhanced for AI-era demands. The architecture consists of three core planes:
Client Library (DirectFlow)
POSIX-compliant parallel driver with cache coherence for seamless integration with AI/HPC workloads. It employs client-side erasure coding on a per-file basis as each file is striped across N VPOD storage nodes. The erasure coding is dynamic, based on parameters such as file size (small files are triplicated for performance; larger files use RAID6+ schemes), and the driver communicates with the Control Plane Directors to determine the optimal layout across VPODs.
Control Plane
Scalable distributed metadata service supporting direct client communication for low-latency operations.
Data Plane
Direct parallel client-to-node communication flows from the Client to the VPODs (Virtualized Protected Object Devices) on the storage nodes, which house commodity flash and HDD media.
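As a sketch, the per-file protection choice described for the DirectFlow client might look like the following. The size threshold and shard counts here are hypothetical illustrations; the actual PanFS/HYDRA policy parameters are not specified in this paper.

```python
# Illustrative sketch of dynamic, per-file erasure-coding selection.
# SMALL_FILE_LIMIT and the 8+2 shard layout are hypothetical values,
# not VDURA's published defaults.

SMALL_FILE_LIMIT = 64 * 1024  # hypothetical threshold for triplication

def choose_layout(file_size: int, available_vpods: int) -> dict:
    """Pick a protection scheme for one file before striping it to VPODs."""
    if file_size <= SMALL_FILE_LIMIT:
        # Small files: three full replicas favor latency over space efficiency.
        return {"scheme": "triplication", "copies": 3}
    # Larger files: RAID6-style erasure coding across N VPODs,
    # capped here at 8 data + 2 parity shards as an illustration.
    data_shards = min(8, max(2, available_vpods - 2))
    return {"scheme": "raid6+", "data": data_shards, "parity": 2}
```

In a real client, the Directors would supply the layout parameters; this sketch only shows how file size can drive the replication-versus-erasure-coding decision.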
05

Rock-Solid, Economical, Scalable Storage

A single cluster scales to exabytes and thousands of nodes, with shared pools enabling linear performance and expandability. Disaggregation allows provisioning flash for peak performance and HDD for capacity, maximizing GPU utilization and efficiency.

Shared-Nothing Architecture

HYDRA operates on a true shared-nothing principle, placing all availability and fault tolerance in software rather than relying on any intrinsic high-availability features of the underlying hardware. This eliminates the need for specialized HA-pair servers, firmware-based RAID controllers, dual-ported SSDs or HDDs, or other specialized hardware designs.
By managing resilience entirely through software (Network File-Level Erasure Coding, automated self-healing, and Director-led orchestration), HYDRA enables the use of standard, off-the-shelf commodity servers and storage devices.
Complete Hardware Freedom
Run on any compatible commodity servers and drives, avoiding vendor lock-in and enabling multi-vendor sourcing for the best price/performance.
Hyperscaler-Grade Media Support
Single-port NVMe SSDs (TLC/QLC) and single-ported SATA HDDs (CMR/SMR/HAMR/HBHDD), unlocking the lowest unit costs and maximum supply chain flexibility.
Software-Defined SLAs
Performance, capacity, durability, availability, and additional QoS attributes are governed by policy-driven tiers and software resilience via APIs, not complicated HA hardware configurations, making the system simpler to deploy, expand, and operate at scale.
For AI factories, this means infrastructure that scales economically and predictably: add commodity nodes as needed, let HYDRA deliver consistent performance and handle failures transparently, and focus your resources on training models, not managing storage hardware.
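To make the policy-driven model concrete, a tier definition submitted through such an API might resemble the sketch below. Every field name and value here is hypothetical, since VDURA’s actual API schema is not described in this paper.

```python
# Hypothetical sketch of policy-driven tier definitions. Field names,
# media labels, and layout values are illustrative only.

training_tier = {
    "name": "ai-training-hot",
    "media": "nvme-tlc",            # hot tier on flash
    "durability_nines": 12,
    "availability_nines": 6,
    "placement": {"stripe_width": 16, "parity": 2},
}

archive_tier = {
    "name": "retention-cold",
    "media": "sata-smr",            # cold tier on high-capacity HDD
    "durability_nines": 11,
    "availability_nines": 4,
    "placement": {"stripe_width": 20, "parity": 3},
}

def validate_tier(tier: dict) -> bool:
    """Minimal sanity check a control plane might run before applying a tier."""
    required = {"name", "media", "durability_nines",
                "availability_nines", "placement"}
    return required <= tier.keys() and tier["placement"]["parity"] >= 1
```

The point of the sketch is the operational model: SLAs live in declarative policy that software enforces, rather than in hardware configuration.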
06

Maximizing Storage Efficiency

Managing resources without overprovisioning is critical in AI factories, where AI demand has driven SSD cost per terabyte to as much as 16x that of HDDs. HYDRA exploits access patterns through native mixed-media support to deliver optimal economics without performance trade-offs.
Hot AI data lands on flash for low-latency access, with just enough capacity to meet peak I/O demands. Colder data automatically shifts to high-capacity disks via Dynamic Data Acceleration (DDA) and API-aware placement based on file size, temperature, and access patterns. Intelligent, continuous balancing keeps drives full and busy, eliminating hotspots and waste.
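A minimal sketch of this kind of placement decision follows, assuming a simplified heat metric (hours since last access) and hypothetical thresholds; the real DDA policy also weighs access patterns and API-supplied hints.

```python
# Illustrative flash-versus-HDD placement. The thresholds are
# hypothetical; they are not VDURA's actual DDA parameters.

HOT_WINDOW_HOURS = 24            # hypothetical recency threshold
SMALL_FILE_BYTES = 1 * 1024**2   # hypothetical: small files stay on flash

def place(file_size: int, hours_since_access: float) -> str:
    """Return the media class a file should currently live on."""
    if file_size < SMALL_FILE_BYTES:
        return "flash"   # small, metadata-heavy files need low latency
    if hours_since_access <= HOT_WINDOW_HOURS:
        return "flash"   # hot data serves training reads from NVMe
    return "hdd"         # cold data migrates to high-capacity disks
```

Continuous rebalancing would re-evaluate this decision as files cool down or heat up, which is what keeps the flash tier sized for peaks rather than for total capacity.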

The VDURA Advantage

VDURA is the only high-performance, software-defined solution that seamlessly combines NVMe flash and SATA HDDs in a single global namespace, control plane, and data plane, with full parallel performance and resilience across tiers. This is enabled by the shared-nothing architecture, the same model used by hyperscalers like Google with Colossus.
In contrast, legacy HA-pair architectures or shared-everything systems impose hardware dependencies that lock customers out of the most cost-effective and innovative single-ported devices. HYDRA’s design eliminates those constraints, delivering hyperscaler-level economics and supply chain resilience, democratized for every AI factory.
07

Mixed Fleet + Intelligent Tiering

HYDRA uniquely combines parallel file system speed with object-grade resilience on an evolved PanFS foundation:

Platform Capabilities at a Glance

Capability | Specification
Throughput | Terabytes/sec potential with linear scaling
Scalability | Exabytes of capacity, thousands of nodes per cluster
Durability | Up to 12 nines (all-flash), 11 nines (hybrid)
Availability | Six nines+ in production environments
Media Support | NVMe SSD (TLC/QLC) + SATA HDD (CMR/SMR/HAMR)
Protocols | POSIX/DirectFlow, NFS, SMB, S3, CSI
Encryption | AES-256 end-to-end, KMIP key management
Metadata | VeLO: billions of inode operations/sec
Self-Healing | Automated recovery, scrubbing, balancing
Deployment | 1,000+ production deployments worldwide
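The durability and availability figures above follow the usual "nines" convention, commonly interpreted as an annual loss (or downtime) probability of 10 to the power of minus n for n nines. A small helper makes the conversion explicit:

```python
# Converting between "nines" and annual loss probability, under the
# common interpretation that n nines means a 10**-n annual chance of loss.
import math

def nines(annual_loss_probability: float) -> float:
    """Number of nines implied by an annual loss probability."""
    return -math.log10(annual_loss_probability)

def loss_probability(n: int) -> float:
    """Annual loss probability implied by n nines."""
    return 10.0 ** -n

# Twelve nines: roughly one object lost per trillion per year.
p12 = loss_probability(12)
```

Under this reading, the jump from 11 nines (hybrid) to 12 nines (all-flash) is a tenfold reduction in expected annual loss.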
08

Battle-Tested Reliability at Scale

True maturity in storage architecture isn’t born in a lab or overnight; it’s forged in the real world over years, where unpredictable corner cases, rare exceptions, and perfect-storm events can expose even the most carefully designed systems. No simulation can fully replicate the chaos of production at scale: correlated failures from batch defects, firmware quirks, environmental interactions, or cascading issues during peak AI training loads.
Industry consensus holds that distributed systems such as parallel file systems typically require a decade or more of real-world iteration to reach operational maturity and stability at scale in business-critical environments, often evolving through major epochs every ten years or so. PanFS’s 25+ years of continuous refinement exemplify this hard-won resilience, delivering proven reliability for AI factories.
VDURA’s HYDRA architecture carries decades of production-proven maturity from PanFS’s 25+ year legacy, trusted across thousands of deployments in the world’s most demanding environments including leading research labs, service providers, government institutions, and Fortune 500 enterprises. These systems have accumulated tens of millions of cumulative runtime hours in live production clusters.
Every unforeseen failure mode encountered, every edge case that could have led to unavailability or data loss, has been diagnosed, hardened, and turned into automated safeguards. This hard-earned resilience is why HYDRA delivers six nines+ availability in practice: not from assumptions, but from real deployments that have survived the unpredictable.
09

Conclusion

HYDRA evolves PanFS to power AI factories with dependable high performance, infinite scale, and unbreakable reliability proven where it matters most: in the field.
With automated resilience ensuring uninterrupted operations, build confidently on VDURA. We’ll keep advancing the platform, so you don’t have to.
Ready to Transform Your AI Infrastructure?

Download the full VDURA Data Platform V11 White Paper or visit vdura.com for a tailored AI factory assessment.

Continue the conversation

Translate the white paper into your roadmap

Need a deeper dive into architecture, proof-of-concept planning, or sizing? The VDURA team will tailor the V11 guidance to your AI pipeline and adoption timeline.

Get Whitepaper
