Runtime
5 minute read
Audience
AI & HPC leaders, architects, DevOps
Primary themes
Performance · Economics · Simplicity
When specifying and prioritizing requirements for new on-premises HPC-AI systems, users almost universally place a performance-related metric at the top of their list. These metrics are typically defined as some combination of the following factors: raw bandwidth, throughput, and/or latency; price/performance; a performance improvement target measured against the user’s existing domain-specific application benchmarks; and time to results or time to science. These factors span all elements of a system’s architecture, including computing, networking, and the storage or data platform.
Users, however, should be concerned with more than performance alone, especially for the data platform, as the amount of data required to deliver value from AI and modeling and simulation workloads is growing exponentially. Secure, reliable, and performant storage doesn’t just happen. Best-in-class storage solutions require a deep understanding of items beyond performance. Reliability, availability, serviceability, usability, and installability (colloquially referred to as RASUI or “the -abilities”) are as critical as performance. More recently, durability, or the confidence that data remains unchanged and free from corruption or loss, has emerged as an additional -ability to consider in determining leadership criteria for HPC-AI storage systems.
TABLE 1
HPC-AI Broader Market On-premises Revenue Forecast 2022-2028
| ($M) | 2022 | 2023 | 2024 | 2025 | 2026 | 2027 | 2028 | CAGR 23-28 |
|---|---|---|---|---|---|---|---|---|
| Server | $18,805 | $20,735 | $25,390 | $29,559 | $33,699 | $37,797 | $41,777 | 15.0% |
| Storage | $6,380 | $6,282 | $7,692 | $8,745 | $9,771 | $10,738 | $11,846 | 13.5% |
| Middleware | $1,781 | $1,711 | $2,026 | $2,241 | $2,468 | $2,691 | $2,968 | 11.6% |
| Applications | $5,069 | $4,830 | $5,684 | $6,267 | $6,878 | $7,468 | $8,240 | 11.3% |
| Service | $2,214 | $2,014 | $2,262 | $2,411 | $2,498 | $2,696 | $2,973 | 8.1% |
| Total Revenue | $34,250 | $35,573 | $43,054 | $49,223 | $55,315 | $61,390 | $67,805 | 13.8% |
Source: Hyperion Research, 2024
The “-abilities” examined in this white paper are reliability, availability, and durability. The terms are related, but not interchangeable.
Reliability
Reliability refers to how often a system experiences failure. Storage systems can be designed with varying degrees of redundancy such that they are still functional while parts of the system have failed. Reliability is typically measured as Mean Time Between Failure (MTBF) in hours.
Availability
Availability refers to the ability to get to the data. Factors that impact a system’s availability include redundancy and how long it takes to recover from a failure (Mean Time to Recover, or MTTR). For a system to remain available through a failure, some form of redundancy must be designed into it. There is typically some performance degradation and/or loss of redundancy while a system operates under a failure condition (e.g., after a disk drive fails). Availability is typically expressed as the percentage of time a system is operational in a year, in terms of “9s”. Table 2 provides a model that describes availability.
TABLE 2
Representation of Availability
| # of Nines | Annual % Uptime | Minutes Online per Year | Minutes Offline per Year |
|---|---|---|---|
| 1 | 90% | 473,040 | 52,560 (36.5 days) |
| 2 | 99% | 520,344 | 5,256 (3.7 days) |
| 3 | 99.9% | 525,074 | 525.6 (8.8 hours) |
| 4 | 99.99% | 525,547 | 52.6 (< 1 hour) |
| 5 | 99.999% | 525,595 | 5.3 |
| 6 | 99.9999% | 525,599 | 0.5 |
Source: Hyperion Research, 2024
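The arithmetic behind Table 2 can be sketched in a few lines of Python. This is a minimal illustration, not part of the Hyperion Research model: the function names are hypothetical, a 365-day year is assumed (as the table's figures imply), and the MTBF/MTTR relationship shown is the commonly used steady-state approximation rather than a formula stated in this paper.

```python
import math

# 365-day year, matching the 525,600-minute basis implied by Table 2.
MINUTES_PER_YEAR = 365 * 24 * 60


def minutes_offline_per_year(nines: int) -> float:
    """Downtime budget per year, in minutes, for a given number of nines."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


def minutes_online_per_year(nines: int) -> float:
    """Minutes per year the system is expected to be operational."""
    return MINUTES_PER_YEAR - minutes_offline_per_year(nines)


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability, assuming A = MTBF / (MTBF + MTTR).

    Assumption: this standard approximation is not given in the source text.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


def nines_from_availability(availability: float) -> float:
    """Convert an availability fraction (e.g., 0.999) to a count of nines."""
    return -math.log10(1.0 - availability)


if __name__ == "__main__":
    for n in range(1, 7):
        print(f"{n} nines: {minutes_online_per_year(n):,.0f} min online, "
              f"{minutes_offline_per_year(n):,.1f} min offline")

    # Hypothetical system: fails every 10,000 hours, recovers in 1 hour.
    a = availability_from_mtbf(mtbf_hours=10_000, mttr_hours=1)
    print(f"availability = {a:.6f}, about {nines_from_availability(a):.2f} nines")
```

Under these assumptions, shortening recovery time raises availability just as lengthening the time between failures does, which is why MTTR appears alongside redundancy as an availability factor.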
Durability
Durability means that when an application reads data, the data exists on the storage media exactly as it was written. In other words, there has been no data corruption or data loss. Durability is typically measured as a probability of data loss (the data loss incurred by a solution across a configuration of systems within the solution) or as mean time to data loss (MTTDL). Table 3 describes a model for understanding durability.
TABLE 3
Representation of Durability
| # of Nines | Percentage | Data Loss within a Defined System |
|---|---|---|
| 1 | 90% | 1 object* with data loss within a configuration of 10 objects |
| 2 | 99% | 1 object with data loss within a configuration of 100 objects |
| 3 | 99.9% | 1 object with data loss within a configuration of 1,000 objects |
| 4 | 99.99% | 1 object with data loss within a configuration of 10,000 objects |
| 5 | 99.999% | 1 object with data loss within a configuration of 100,000 objects |
| 6 | 99.9999% | 1 object with data loss within a configuration of 1,000,000 objects |
| 7 | 99.99999% | 1 object with data loss within a configuration of 10,000,000 objects |
| 8 | 99.999999% | 1 object with data loss within a configuration of 100,000,000 objects |
| 9 | 99.9999999% | 1 object with data loss within a configuration of 1,000,000,000 objects |
| 10 | 99.99999999% | 1 object with data loss within a configuration of 10,000,000,000 objects |
| 11 | 99.999999999% | 1 object with data loss within a configuration of 100,000,000,000 objects |
Note: *An object is defined as the smallest unit whose loss contributes to data loss, which depends on the level and type of redundancy and protection. Within a storage system, for example, an object could be a file or a complete storage system.
Source: Hyperion Research, 2024
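To make Table 3 concrete, the following Python sketch converts a number of nines into a per-object loss probability and scales it to a population of objects. It is an illustration under stated assumptions: the function names are hypothetical, object losses are assumed to be independent, and the time window over which the probability applies is left unspecified because Table 3 does not define one.

```python
import math


def loss_probability(nines: int) -> float:
    """Per-object probability of data loss implied by a durability of N nines."""
    return 10.0 ** (-nines)


def expected_objects_lost(nines: int, object_count: int) -> float:
    """Expected number of objects with data loss in a population of objects."""
    return object_count * loss_probability(nines)


def probability_of_any_loss(nines: int, object_count: int) -> float:
    """Probability that at least one object suffers data loss.

    Assumption: losses are independent, so this is 1 - (1 - p)^N.
    log1p/expm1 keep the result accurate when p is very small.
    """
    p = loss_probability(nines)
    return -math.expm1(object_count * math.log1p(-p))


if __name__ == "__main__":
    objects = 1_000_000_000  # hypothetical population of one billion objects
    for n in (5, 9, 11):
        print(f"{n} nines: expected losses = "
              f"{expected_objects_lost(n, objects):,.2f}, "
              f"P(any loss) = {probability_of_any_loss(n, objects):.4f}")
```

The sketch shows why durability targets have so many more nines than availability targets: at large object counts, even a very small per-object loss probability produces a meaningful chance of losing something.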
TABLE 4
“-ability” Definitions and Metrics
| Term | Definition | Typical Metrics |
|---|---|---|
| Reliability | The probability that a storage system will function correctly without failure during a specific period. | Mean Time Between Failure (MTBF) |
| Availability | The ability to get to the data. The percentage of time a storage system is operational and accessible for use, often expressed as a percentage of uptime such as 99.999% (five nines). | % of uptime, often expressed as "9s" (e.g., 99.999%, or five 9s) |
| Durability | The ability of the data to last. The ability of a storage system to preserve data without loss or corruption over time. | Data durability percentage (e.g., 99.99995%); data loss probability (e.g., 0.00005% chance of loss); Mean Time to Data Loss (MTTDL) |
Source: Hyperion Research, 2024
TABLE 5
Comparison of RAID and Erasure Coding Redundancy Methods
| Criteria | RAID | Erasure Coding |
|---|---|---|
| Flexibility | Moderate (limited to predefined RAID levels) | High (customizable protection schemes) |
| Scalability | Limited (typically bound to a single array or node) | High (can scale across multiple nodes or locations) |
| Performance | Good (for small-scale systems; can be a bottleneck in large systems) | Moderate to high (depends on implementation) |
| Cost | Moderate (may require specialty hardware for optimal performance) | Low to moderate (can be implemented on commodity hardware) |
| Risk of data loss | Low to moderate (depending on RAID level) | Very low (especially in distributed systems) |
| Durability | Good (protects against drive failures) | Excellent (can protect against media, node, or site failures) |
| Resource Utilization | Moderate (fixed overhead based on RAID levels) | Efficient (customizable overhead) |
Source: Hyperion Research, 2024
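The "Resource Utilization" and "Flexibility" rows of Table 5 can be illustrated with a short Python sketch that compares capacity overhead and failure tolerance for a few layouts. The specific stripe widths below are hypothetical examples, not recommendations from this paper, and the erasure-coded layouts are assumed to use a code (such as Reed-Solomon) in which any `m` of the `k + m` units can be lost without losing data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectionScheme:
    """A stripe of k data units protected by m redundancy units."""
    name: str
    data_units: int        # k
    redundancy_units: int  # m

    @property
    def usable_fraction(self) -> float:
        """Share of raw capacity available for user data: k / (k + m)."""
        return self.data_units / (self.data_units + self.redundancy_units)

    @property
    def overhead(self) -> float:
        """Extra raw capacity required per unit of user data: m / k."""
        return self.redundancy_units / self.data_units

    @property
    def failures_tolerated(self) -> int:
        """Simultaneous unit failures survived (assumes any m units may fail)."""
        return self.redundancy_units


# Hypothetical layouts chosen only for illustration.
SCHEMES = [
    ProtectionScheme("RAID 5 (4+1)", 4, 1),
    ProtectionScheme("RAID 6 (8+2)", 8, 2),
    ProtectionScheme("Erasure coding (8+3)", 8, 3),
    ProtectionScheme("Erasure coding (16+4)", 16, 4),
]

if __name__ == "__main__":
    for s in SCHEMES:
        print(f"{s.name:<24} usable {s.usable_fraction:6.1%}  "
              f"overhead {s.overhead:6.1%}  tolerates {s.failures_tolerated}")
```

RAID levels fix the redundancy at one or two units per stripe, whereas an erasure-coded layout lets `k` and `m` be chosen independently, which is the customizable overhead the table refers to.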