HPC and AI environments, by their nature, tend to be large and quite complex. Whether it’s pushing the boundaries in life and physical sciences or fueling the convergence of engineering innovation and artificial intelligence, it takes many computers operating together to simulate, analyze, or visualize the problems at hand. The massive quantity of data and the access performance needed to keep all those computers busy can only be achieved by a true parallel file system, one that is easy to use, durable, and simply a fast, total-performance storage system.
The VDURA® PanFS® parallel file system delivers the highest performance among competitive HPC and AI storage systems at any capacity while eliminating the complexity and unreliability associated with traditional data storage solutions. PanFS accomplishes all this using commodity hardware at competitive price points.
PanFS orchestrates multiple computer systems into a single entity that serves your data to your compute cluster.
Through sophisticated software, multiple systems work together to support terabytes per second (TB/s) of data
being read and written by your applications. PanFS manages this orchestration without manual intervention, continuously balancing the load across those systems, automatically ensuring resilience, scrubbing the stored data for the highest levels of data protection, and encrypting the stored data to safeguard it from unwanted exposure.
In this document, we’re going to take a “breadth-first” tour of the architecture of PanFS, looking at its key components, then dive deep into the main benefits.
Three components work together to power the PanFS file system: director nodes, storage nodes, and the DirectFlow Client driver. Director nodes and storage nodes are computer systems dedicated to running PanFS software, and together they comprise the VDURA Data Platform. The DirectFlow Client driver is a loadable software module that runs on Linux compute servers (clients) and interacts with the director nodes and storage nodes to read and write the files stored by PanFS. Any required administration happens via the GUI or CLI running on a director node. There’s no need to interact with storage nodes or the DirectFlow Client driver – the director nodes take care of everything.
All director nodes and storage nodes work together to provide a single file system namespace that we call a
“realm”. The Linux compute servers running DirectFlow Clients that access a realm are not considered part of that realm; instead, they are considered clients.
PanFS explicitly separates the control plane from the data plane. Director nodes in PanFS are the core of the control plane. They are commodity compute servers with a high-speed networking connection, significant DRAM capacity, and NVMe Flash for transaction logs.
Storage nodes in PanFS are the core of the data plane. They are the only part of the overall architecture that stores user data. Storage nodes are commodity systems, but they are models we've certified for their carefully balanced hardware architecture in terms of HDD, SSD, NVMe, and DRAM capacities, CPU strength, networking bandwidth, and so on.
The DirectFlow Client driver is a loadable file system implementation for Linux systems that is installed on compute servers and used by your application programs like any other file system. It works with the director
nodes and storage nodes to deliver fully POSIX-compliant and cache-coherent file system behavior, from a single namespace, across all the servers in the compute cluster. All popular Linux distributions and versions are supported.
PanFS scales out both director nodes and storage nodes. For more metadata processing performance, add more
director nodes. For more capacity or more storage performance, add more storage nodes.
In scale-out storage systems like PanFS, there simply is no maximum performance or maximum capacity. To achieve more performance or more capacity, simply add a few more nodes. PanFS has been designed to provide linear scale-out, e.g., adding 50% more storage nodes will deliver 50% more performance in addition to 50% more
capacity.
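The linear scale-out claim is simple arithmetic. The sketch below uses hypothetical per-node figures (10 GB/s and 280 TB per storage node are illustrative inputs, not measured PanFS numbers):

```python
def scaled_realm(nodes, bw_per_node_gbs, cap_per_node_tb, growth):
    """Linear scale-out: aggregate performance and capacity grow in
    direct proportion to the number of storage nodes."""
    new_nodes = int(nodes * (1 + growth))
    return {
        "nodes": new_nodes,
        "bandwidth_gbs": new_nodes * bw_per_node_gbs,
        "capacity_tb": new_nodes * cap_per_node_tb,
    }

base = scaled_realm(20, 10, 280, 0.0)    # 20 nodes: 200 GB/s, 5,600 TB
bigger = scaled_realm(20, 10, 280, 0.5)  # +50% nodes: +50% of both
```

Adding 50% more nodes in this model yields 300 GB/s and 8,400 TB, i.e. 50% more of each.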
PanFS is a parallel file system that can consistently deliver orders of magnitude more bandwidth than the standard NFS or SMB/CIFS protocols. Each file stored by PanFS is individually striped across many storage nodes, allowing each component piece of a file to be read and written in parallel, increasing the performance of accessing each file.
PanFS is also a direct file system that allows each compute server to talk over the network directly to all the
storage nodes. Conventional enterprise-class products will funnel file accesses through “head nodes” running NFS
or SMB/CIFS and then across a backend network to other nodes that contain the storage media (HDDs or SSDs)
that hold the file. This can create bottlenecks when data traffic piles up on the head nodes, plus it brings
additional costs for a separate backend network. In contrast, for each file that the application wants to access, the
DirectFlow Client on the compute server will talk over the network directly to all the storage nodes that hold that
file’s data. The director nodes are out-of-band, which makes for a much more efficient architecture and one that
is much less prone to the hotspots, bottlenecks, and erratic performance common to traditional scale-out NAS
systems.
PanFS makes use of its multiple storage nodes by assigning each file a map that records which storage node holds each of that file's striped component parts. The DirectFlow Client uses that map to know which storage nodes to access, directly and in parallel.
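A minimal sketch of such a per-file map follows; the node names (`sn00`…) and the helper functions are illustrative inventions, not PanFS internals:

```python
import random

def build_file_map(file_id, stripe_width, storage_nodes):
    """Assign each striped Component Object of a file to a distinct,
    randomly chosen storage node (a simplified per-file map)."""
    nodes = random.sample(storage_nodes, stripe_width)  # no node repeats
    return {f"{file_id}.co{i}": node for i, node in enumerate(nodes)}

def nodes_to_read(file_map):
    """The client contacts all of these nodes directly and in parallel."""
    return sorted(set(file_map.values()))

fmap = build_file_map("fileA", 4, [f"sn{i:02d}" for i in range(12)])
```

Because `random.sample` draws without replacement, no two component parts of a file land on the same node.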
PanFS also uses Network Erasure Coding as part of that striping to ensure the highest levels of data integrity and
reliability. More detail on erasure coding and Component Objects in PanFS can be found later in this document.
The POSIX standard defines the semantics of modern file systems. It defines what a file is and what a directory is,
what attributes each has, and the open, close, read, write, and lseek operations used to access files. Billions of
lines of software have been written that leverage the POSIX standard for access to storage.
The DirectFlow Client provides the same semantics as a locally-mounted, POSIX-compliant file system. It ensures
that if some other process (possibly on another compute server) is writing to a file at the same time this process is
reading from it, this process will not read stale data. In file system terminology, PanFS has been engineered to
provide cache coherency across all the nodes running the DirectFlow Client.
PanFS supports access timestamps as well as modification timestamps on files.
While there is no maximum, any realm needs a minimum of three director nodes. The administrator should
designate three or five director nodes – the odd number makes it easier to break ties in voting – out of the total set of director nodes in the realm to be the rulers of the realm. This group of rulers is called the “repset” because
they each have an up-to-date, fully replicated copy of the configuration database for the realm. Those rulers elect
one of themselves to be the realm president who coordinates all the configuration, status, and failure recovery
operations for the realm. If the director node currently designated as the president were to fail, another member
of the repset would be immediately and automatically elected the new president. The configuration database is
kept up to date on all members of the repset via a distributed transaction mechanism.
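Majority voting over an odd-sized repset can be sketched as below; the function names and vote representation are hypothetical stand-ins, not the actual PanFS election protocol:

```python
def elect_president(repset, votes):
    """Majority election among the repset: a candidate needs more than
    half of the repset's votes. An odd-sized repset cannot tie."""
    quorum = len(repset) // 2 + 1
    tally = {}
    for voter, candidate in votes.items():
        if voter in repset and candidate in repset:
            tally[candidate] = tally.get(candidate, 0) + 1
    for candidate, count in tally.items():
        if count >= quorum:
            return candidate
    return None  # no majority yet; a new round is needed

repset = ["dir1", "dir2", "dir3"]
president = elect_president(repset, {"dir1": "dir2", "dir2": "dir2", "dir3": "dir2"})
```

If the sitting president fails, the surviving repset members simply hold another round; with three or five members, a majority is always well defined.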
Director nodes make themselves available to do any of a large set of operational and maintenance tasks the
president may need them to do. Those include managing volumes, being a gateway into PanFS for the NFS and/or
SMB/CIFS protocols, helping perform background scrubbing of user data, helping to recover from a failure of a
storage node, and helping to perform Active Capacity Balancing across the storage nodes in a storage set, among
others. The president’s decisions can change over time as circumstances change, e.g., moving a gateway or a
volume to a different director node. However, the president will only do that if the operation is fully transparent
to client systems.
The PanFS architecture delivers performance that is not only exceptionally high but also consistent, for workloads that include a wide range of file sizes and access patterns, as well as workloads that change significantly over time.
The effect is a dramatic broadening of the use cases that PanFS can support in HPC and AI environments
compared to other parallel file systems. All other parallel file systems require time-consuming and laborious
manual tuning and retuning as workloads change.
The random assignment of Component Objects to storage nodes spreads the load from file accesses across all
those nodes. In most PanFS installations, the number of storage nodes is much larger than the typical stripe width
of a file, so each file is likely to only share a few storage nodes with any other files. This greatly reduces the odds
of any one storage node becoming overloaded and impacting the performance of the whole realm. The result is
much more consistent system performance, no matter what workload is being requested by the compute servers
or how it changes over time. PanFS does this without any tuning or manual intervention.
Since scalable performance depends upon spreading all the files relatively evenly across the pool of storage
nodes, PanFS includes Active Capacity Balancing. If the balance is off by more than a set threshold value, the
realm president will ask the pool of directors to examine the utilization of all the storage nodes and transparently
move Component Objects from over-full storage nodes to underutilized ones while the realm is online.
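A toy model of threshold-triggered balancing is sketched below; the 10% spread threshold and the 0.01 per-move quantum are arbitrary illustrations, not PanFS defaults:

```python
def rebalance_step(utilization, threshold=0.10):
    """If the spread between the most- and least-full storage nodes
    exceeds the threshold, move one small unit of capacity (sketch)."""
    fullest = max(utilization, key=utilization.get)
    emptiest = min(utilization, key=utilization.get)
    if utilization[fullest] - utilization[emptiest] <= threshold:
        return None  # balanced within tolerance; nothing to do
    move = 0.01  # one Component Object's worth of data, illustratively
    utilization[fullest] -= move
    utilization[emptiest] += move
    return (fullest, emptiest)

util = {"sn1": 0.80, "sn2": 0.55, "sn3": 0.60}
while rebalance_step(util):
    pass  # repeat until the realm is within tolerance
```

Each step moves data from the fullest node to the emptiest, so total stored capacity is conserved while the spread shrinks below the threshold.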
Active Capacity Balancing is also used when new storage nodes are incorporated into a realm. Immediately after
they are added, the realm will begin using them to store parts of any newly created files. Since the utilization of
those new storage nodes is so much lower than the utilization of the existing storage nodes, Active Capacity
Balancing will begin moving Component Objects from existing storage nodes to the new storage nodes as a
background task. The new storage nodes will immediately start contributing to the performance of the realm and
will gradually pick up more and more of the realm’s workload until all storage nodes are contributing equally to
the overall performance of the realm again.
PanFS can be configured to create storage sets for different capacity classes of storage nodes to help evenly
spread the workload across the pool of storage nodes and avoid hotspots in your realm. For example, storage
nodes with a capacity of 280 TB each should not be combined with storage nodes with a capacity of 132 TB each
in the same storage set. This does not interfere with realm expansion, as PanFS can support multiple storage sets in a
realm and in the same namespace at the same time. Adding storage nodes, whether in the same storage set or a
new one, is a non-disruptive operation performed while the file system is online.
In addition to the performance and reliability advantages that come from the overall PanFS architecture, there
are significant performance optimizations in the PanFS storage node software that enable the most efficient use of available storage media inside each storage node. PanFS is designed to handle combinations of up to four
different performance tiers, including Storage Class Memory such as CXL 2.0's pmem, latency-optimized NVMe
SSDs, capacity-optimized SSDs, and HDDs.
Metadata is orchestrated and stored on the director nodes. Metadata is usually composed of very small records
(e.g., inodes) that are accessed in unpredictable patterns and are always latency sensitive. Directories are also
metadata and are latency sensitive, but they are often accessed sequentially (e.g., “/bin/ls”). As a result of being
small, typically having unpredictable access patterns, and being latency sensitive, metadata deserves a different
storage mechanism than files full of user data, which are typically much larger and accessed sequentially.
The storage node software stack stores hot data in a database in one of the higher tiers of storage drives, typically
an NVMe SSD, and stores bulk user file data in one of the lower tiers of drives, typically capacity-optimized SSDs
or HDDs. It uses the highest available tier of drives as a transaction log, committing the incoming data or
operations to stable storage, therefore allowing the application to continue its processing as quickly as possible.
The storage node software stack can also support the new generations of dual-actuator HDDs that offer roughly
double the bandwidth and IOPS per TB of capacity of single-actuator HDDs, resulting in performance levels that
are a step function above traditional HDDs.
The price/performance and mixed-workload performance of a storage system depends heavily upon how
effectively its architecture makes use of the performance of the underlying commodity storage devices. VDURA
uniquely excels at getting the most performance from all those devices.
PanFS takes advantage of the DRAM in each storage node as an extremely low-latency cache of the most recently
read or written data. Small Component Objects are stored on capacity-optimized SSDs that provide cost-effective
and high-bandwidth storage without any seek times. Any POSIX file of less than approximately 1.5 MB in size will
be fully stored on SSDs, with that size automatically adjusting itself up and down over time to ensure the SSDs are
fully utilized.
HDDs provide high-bandwidth data storage if they are never asked to store anything small and only asked to do
large sequential transfers. Therefore, only large Component Objects are stored on low-cost-per-TB HDDs.
To gain the most benefit from an SSD’s performance, we try to keep each SSD about 80% full. If an SSD falls below
that level, PanFS will, transparently and in the background, pick the smallest Component Objects from the next
slowest set of drives and move them to the SSD until it is about 80% full. If the SSD is too full, PanFS will move the
largest Component Objects on the SSD to the next slower tier of drives. Every storage node performs this
optimization independently and continuously. It’s easy for a storage node to pick which Component Objects to
move; it just needs to look in its local metadata database.
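The 80%-fill heuristic can be sketched as a simple decision routine. Here, object sizes are expressed as fractions of SSD capacity, and the helper is an illustration rather than the actual storage node logic:

```python
TARGET_FILL = 0.80  # keep each SSD roughly 80% full

def tier_moves(ssd_fill, ssd_objects, slower_tier_objects):
    """Decide which Component Objects to shuttle between the SSD and
    the next-slower tier to converge on the target fill level."""
    moves = []
    fill = ssd_fill
    if fill < TARGET_FILL:
        # Pull the smallest objects up from the slower tier.
        for name, size in sorted(slower_tier_objects.items(), key=lambda kv: kv[1]):
            if fill + size > TARGET_FILL:
                break
            moves.append(("promote", name))
            fill += size
    elif fill > TARGET_FILL:
        # Push the largest objects down to the slower tier.
        for name, size in sorted(ssd_objects.items(), key=lambda kv: -kv[1]):
            if fill <= TARGET_FILL:
                break
            moves.append(("demote", name))
            fill -= size
    return moves, fill

moves, fill = tier_moves(0.60, {}, {"a": 0.05, "b": 0.30, "c": 0.10})
```

An under-full SSD promotes the smallest objects first; an over-full one demotes the largest first, matching the behavior described above.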
Storage nodes in PanFS are typically designed using an all-hot design principle. We first decide how many of each
type of storage device (NVMe SSDs, SSDs, HDDs) will be in a new storage node. We then ensure that there is
enough network bandwidth in and out of that storage node to keep every storage device in that node completely
busy all the time. And finally, we include appropriate CPU power and DRAM in each node to drive both the
network and the devices to full performance.
In contrast, other storage solutions may segregate their flash-based devices into a hot tier and the more cost-
effective HDDs into a cold tier to reduce the overall average cost per TB. Typically, the cold tier takes the form of
a heavyweight, separately administered cold archive (e.g., with an S3 interface), plus an additional, separately
administered product that will move data between the hot tier and the cold tier. Such tiered storage solutions
move files between their hot and cold tiers based on how recently a file has been accessed. This simplistic theory
assumes that if you haven’t accessed a file in a while, you won’t access it for a while longer. For some important
workloads such as AI, that simply isn’t true.
PanFS support for multiple storage media types forms an elegant storage platform that can automatically adapt
to changing file sizes and workloads, with all the underlying devices directly contributing to the application
performance you need. With heavyweight tiering, the hardware in the cold tier cannot contribute to the
performance the application needs, since an application can typically only directly access the hot tier. This
deficiency results in three types of costs: the stranded performance costs of the drives in the cold tier not being
able to contribute to application performance; the monetary costs of the additional networking and hot tier
storage performance required to move data between the hot and cold tiers without impacting application I/O
performance; and the direct costs of administering two separate tiers plus the policies required to move data
back and forth. Additionally, the cost of skilled employees to manage all of this is a significant added piece of the
overall TCO of HPC-class storage systems.
Training AI/ML models is an I/O intensive process at which PanFS excels. It is a read-dominated workload across a
large set of input files, with only an infrequent write out of a large “checkpoint” file. AI training requires
randomness in the order that the data files are accessed, and that randomness defeats the typical LRU caching of
metadata. As a result, storage architectures that are heavily dependent upon caching for their performance are at
a disadvantage. The focus of PanFS on separating metadata and storing it on low-latency drives offers good
performance, despite that randomness.
There is also a more direct use of AI in HPC, sometimes called "AI-powered", where the application is using AI
rather than the application being AI. For instance, the Navier-Stokes equation that is core to computational fluid
dynamics is extremely costly to calculate. A neural network can provide close approximations of the values that
Navier-Stokes would generate, but in a much less costly way. The approximations are used to narrow in toward
an optimal result, then the real Navier-Stokes can be used for a fine-tuned final result.
PanFS was designed at both the macro level of the overall architecture and the micro level of the storage node
software stack architecture to support mixed workloads well. It offers a wide performance profile rather than a
high but narrow peak.
As a result of its parallel and direct nature, the DirectFlow protocol will always be the highest performance path
to PanFS storage, but NFS and SMB/CIFS are still useful for laptops and/or workstations to access the results of
HPC jobs or for casual access to the data files stored in PanFS.
One of the roles of the director nodes in PanFS is to act as gateways that translate NFS, SMB/CIFS, and S3
operations into DirectFlow operations. PanFS provides high performance NFSv3, SMB/CIFS v3.1, and S3 access to
the same namespace and data files as the DirectFlow Client provides to the HPC compute cluster.
S3 protocol support in PanFS enables cloud-native applications that use S3 as their primary data access protocol
to store and retrieve data in PanFS. Objects stored in PanFS via the S3 protocol are visible as POSIX files in PanFS
via the other data access protocols (DirectFlow, NFS, and SMB/CIFS), and POSIX files stored in PanFS via any other
data access protocol are accessible as Objects via the S3 protocol.
PanFS supports two features that prevent unauthorized data access while the realm is online – Access Control
Lists (ACLs) and Security-Enhanced Linux (SELinux), and one that prevents unauthorized data access while the
realm is offline – encryption at rest.
PanFS supports ACLs on each file and directory in addition to the traditional Linux user ID, group ID, and mode
bits such as “joe dev -rwxr-xr-x”. PanFS ACLs are fully compatible with Windows ACLs and Active Directory-based and LDAP-based account management; they provide fine-grained control over which user accounts can execute
which operations on each file or directory.
Running SELinux on the client systems that access PanFS via DirectFlow provides a kernel-implemented set of
mandatory access controls that confine user programs and system services to the minimum level of file and data
access that they require. PanFS integrates the “security.selinux” security label that enables this protection into
the PanFS inode structure, resulting in near zero added access latency when SELinux is used. This is the
foundation for true Multi-Level Security policy implementations that allow users and data in different security
levels and compartments to share the same compute and storage resources.
All VDURA storage nodes use industry-standard self-encrypting drives (SEDs) which implement NIST-approved
AES-256 hardware-based encryption algorithms built into each drive. These drives are specifically designed so
that encryption does not reduce performance. PanFS allows encryption-at-rest to be non-disruptively enabled
and disabled, drive keys to be rotated upon command, and cryptographic erasure for securely repurposing the
storage without risking exposure of the prior data.
Key management is outsourced via the Key Management Interoperability Protocol (KMIP) to well-established and
proven cyber security key management solutions that provide centralized and simplified key management. During
normal runtime, PanFS periodically verifies that the KMIP server is alive and well and contains all the right keys
and will raise an alert if there is any type of problem.
PanFS enables a hybrid model of using an on-prem compute cluster that leverages compute resources in the
public clouds, whether that is the latest GPU AI/ML accelerators or simply on-demand, scalable compute
resources.
For applications that are best run in the cloud, the PanMove feature of PanFS can easily, quickly, and efficiently
move any number of data files between on-prem HPC deployments and AWS, Azure, or Google, plus second-tier
cloud providers that support the S3 protocol. Locally acquired data files can be moved up for processing and
results files can be moved back for further processing or storage.
For cloud-native applications that use S3 as their primary data access protocol and that can also be run in a local
HPC cluster, the S3 data access protocol support in PanFS enables them to use PanFS for their data storage needs.
VDURA does not yet support deployment of PanFS in the cloud, but we’re researching our options on how to best
implement it.
In addition to being responsible for the POSIX semantics and cache coherency of the files in a volume, director
nodes also need to manage the status and health of each of the storage and director nodes that are part of the
realm. VDURA has analyzed the failure modes of each of the commodity platforms on which PanFS runs, and we
have included recovery logic for each of those cases into the director node software stack. That additional
engineering work is a significant contributor to the overall reliability of a PanFS realm and is one of the keys to its
low-touch administration. PanFS automatically reacts to failures and recovers from them, taking care of itself.
Storage nodes in PanFS are highly sophisticated Object Storage Devices (OSDs), and we gain the same scale-out
and shared-nothing architectural benefits from our OSDs as any Object Store would. The definition of an Object
used in our OSDs comes from the Small Computer System Interface (SCSI) standard definition of Objects rather
than the Amazon S3 Object definition.
PanFS uses SCSI Objects to store POSIX files, but it does so differently than how S3 Objects are typically used to
store files. Instead of storing each file in an Object like S3 does, PanFS stripes a large POSIX file across a set of
Component Objects and adds additional Component Objects into that stripe that store the P and Q data
protection values of an N+2 erasure coding scheme. Using multiple Objects per POSIX file enables the striping that
is one of the sources of a parallel file system’s performance.
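PanFS does not publish its exact P/Q code here; as a stand-in, the sketch below implements a generic two-erasure Reed-Solomon-style code over a prime field via Lagrange interpolation. It shows the essential property: N data words plus two parities survive the loss of any two positions.

```python
P_MOD = 257  # a prime field large enough to hold byte values

def _lagrange_eval(points, x):
    """Evaluate the unique polynomial through `points` at x, mod P_MOD."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P_MOD
                den = den * (xi - xj) % P_MOD
        # Multiply by the modular inverse of the denominator.
        total = (total + yi * num * pow(den, P_MOD - 2, P_MOD)) % P_MOD
    return total

def encode(data):
    """N data words -> N+2 words: append the P and Q parity values."""
    pts = list(enumerate(data))
    n = len(data)
    return data + [_lagrange_eval(pts, n), _lagrange_eval(pts, n + 1)]

def decode(stripe, lost):
    """Recover up to two lost positions from any N surviving words."""
    known = [(i, v) for i, v in enumerate(stripe) if i not in lost]
    pts = known[: len(stripe) - 2]  # any N survivors determine the polynomial
    return [_lagrange_eval(pts, i) for i in sorted(lost)]
```

This is a mathematical illustration, not the PanFS implementation; production systems use optimized Galois-field arithmetic, but the recovery guarantee is the same.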
While large POSIX files are stored using erasure coding across multiple Component Objects, small POSIX files use
triple-replication across three Component Objects. This approach delivers higher performance than can be
achieved by using erasure coding on such small files while being more space efficient. Unless the first write to a
file is a large one, it will start as a small file. If a small file grows into a large file, the director node will
transparently transition the file to the erasure coded format at the point that the erasure coded format becomes
more efficient.
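The crossover follows from raw-space overhead: triple replication always writes 3 bytes per user byte, while N+2 erasure coding writes (N+2)/N. A tiny illustrative helper (not PanFS's actual decision logic):

```python
def storage_overhead(stripe_width):
    """Raw bytes stored per user byte under each protection scheme.
    Triple replication costs 3x regardless of size; N+2 erasure coding
    costs (N+2)/N, which wins once a file stripes across several nodes."""
    replication = 3.0
    erasure = (stripe_width + 2) / stripe_width
    return {"replication": replication, "erasure": erasure,
            "prefer": "erasure" if erasure < replication else "replication"}
```

At a stripe width of 8, erasure coding stores only 1.25 bytes per user byte versus 3.0 for replication; at a width of 1 the two schemes cost the same, which is why small files stay replicated.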
When a file is created, and as it grows into a large file, the director node that manages those operations will
randomly assign each of the individual Component Objects that make up that file to different storage nodes. No
two Component Objects for any file will be in the same failure domain.
Any system can experience failures, and as systems grow larger, their increasing complexity typically leads to
lower overall reliability. For example, in an old-school RAID system, since the odds of any given HDD failing are
roughly the same during the current hour as they were during the prior hour, more time in degraded mode equals
higher odds of another HDD failing while the RAID system is still degraded. If enough HDDs were to be in a failed
state at the same time, there would be data loss, so recovering back to full data protection levels as quickly as
possible becomes the key aspect of any resiliency plan.
If a VDURA storage node were to fail, PanFS must reconstruct only those Component Objects that were on the
failed storage node, not the entire raw capacity of the storage node like a RAID array would. PanFS would read
the Component Objects for each affected file from all the other storage nodes and use each file’s erasure code to
reconstruct the Component Objects that were on the failed node.
When a storage set in PanFS is first set up, it sets aside a configurable amount of spare space on all the storage
nodes in that storage set to hold the output from file reconstructions. When PanFS reconstructs a missing
Component Object, it writes it to the spare space on a randomly chosen storage node in the same storage set. As
a result, during a reconstruction, PanFS uses the combined write bandwidth of all the storage nodes in that
storage set. The increased reconstruction bandwidth results in reducing the total time to reconstruct affected
files, which reduces the odds of an additional failure during that time and increases the overall reliability of the
realm.
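The declustered rebuild can be sketched as follows; the map structure and node names are illustrative, not the PanFS on-disk format:

```python
import random

def reconstruct(failed_node, file_maps, storage_nodes):
    """Rebuild every Component Object that lived on the failed node and
    write it to spare space on a randomly chosen surviving node, so the
    rebuild spreads over the combined write bandwidth of all survivors."""
    survivors = [n for n in storage_nodes if n != failed_node]
    writes = {n: 0 for n in survivors}
    for fmap in file_maps:
        for co, node in fmap.items():
            if node == failed_node:
                target = random.choice(survivors)
                fmap[co] = target      # rebuilt object lands in spare space
                writes[target] += 1
    return writes

nodes = [f"sn{i}" for i in range(8)]
maps = [{f"f{k}.co{i}": nodes[(k + i) % 8] for i in range(4)} for k in range(16)]
load = reconstruct("sn0", maps, nodes)
```

Only the objects that were on the failed node are rewritten, and each one can land on a different survivor, which is why rebuild time shrinks as the storage set grows.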
PanFS also continuously scrubs the data integrity of the system in the background by slowly reading through all
files in the system, validating that the erasure codes for each file match the data in that file.
The N+2 erasure coding that PanFS implements protects against two simultaneous failures within any given
storage set without any data loss. More than two failures in a realm can be automatically and transparently
recovered from, as long as there are no more than two failed storage nodes at any one time in a storage set.
If, in extreme circumstances, three storage nodes in a single storage set were to fail at the same time, PanFS has
one additional line of defense that would limit the effects of that failure. All directories in PanFS are stored quad-
replicated – four complete copies of each directory, no two copies on the same storage node – rather than the
triple-replicated or erasure coded formats used for regular files.
If a third storage node were to fail in a storage set while two others were being reconstructed, that storage set
would immediately transition to read-only state. Only the files in the storage set that had Component Objects on
all three of the failed storage nodes would have lost data. All other files in the storage set would be unaffected or
recoverable using their erasure coding. The number of affected files in these situations becomes smaller as the
size of the storage set increases.
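Under a simple model where each file's stripe is a uniformly random subset of the storage set, the fraction of files touching all three failed nodes is a hypergeometric quantity that shrinks as the set grows (an illustrative model, not a PanFS-published formula):

```python
from math import comb

def triple_loss_fraction(set_size, stripe_width):
    """Expected fraction of files with Component Objects on all three of
    three simultaneously failed storage nodes, assuming each file's
    stripe is a random subset of `stripe_width` nodes out of `set_size`."""
    if stripe_width < 3:
        return 0.0
    # P(a random stripe covers 3 specific nodes)
    #   = C(set_size - 3, stripe_width - 3) / C(set_size, stripe_width)
    return comb(set_size - 3, stripe_width - 3) / comb(set_size, stripe_width)
```

With a stripe width of 10, roughly 10.5% of files are affected in a 20-node storage set, but only about 1.2% in a 40-node set, illustrating the sentence above.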
Since PanFS would have one complete directory tree still available to it, it can identify the full pathnames of
precisely which files need to be restored from a backup or reacquired from their original source and can therefore
also recognize which files were either unaffected or recovered using their erasure coding.
PanFS is unique in the way it provides clear knowledge of the impact of a given event, as opposed to other
architectures which leave you with significant uncertainty about the extent of the data loss.