Ken Claffey, Oct 17, 2025 (LinkedIn Blog Post) – As the CEO of VDURA, I wrote in September about “The General-Purpose Era Is Ending – So Is General-Purpose Storage,” building on Jensen Huang’s point about the demise of general-purpose computing. I made the case that AI’s demands for extreme parallelism, at-scale performance, real-time access, and exabyte-scale operations outstrip what general-purpose storage was designed for. Legacy systems, built for mixed IT workloads like virtualization and databases at the TB or small-PB scale, would attempt adaptations, such as adding disaggregation or AI certifications, but their core architectures would hold them back. NetApp’s AFX announcement at INSIGHT 2025 this week aligns closely with that prediction: it’s an extension of ONTAP, introducing disaggregation to target AI, but it carries forward the limitations of a platform not originally engineered for these high-performance storage workloads.
NetApp positions AFX as highly capable for AI, with claims of up to 4 TB/s throughput and strong enterprise integration. As one would expect, it brings ONTAP’s mature features, like unified data management across public/private clouds, comprehensive snapshots, and NVIDIA DGX SuperPOD certification, which can appeal in hybrid environments at small scale. However, as I noted in my original post, evolving a general-purpose NAS foundation into a high-performance scale-out system reveals big trade-offs. AFX inherits ONTAP’s strengths in enterprise controls but also its constraints in areas like scaling mechanics, data protection, performance, and hardware efficiency. Below, I’ll ground this in specifics on performance, reliability, and efficiency (security follows similar patterns of inherited vs. native design, but I’ll focus where the contrasts are clearest). This isn’t about dismissing their progress; credit to them for at least acknowledging that the market is moving in a new direction (away from general-purpose). It’s about highlighting why AI requires purpose-built, natively high-performance approaches.
Performance: Solid Aggregate Hero Numbers, but Per-Node Efficiency and Density Fall Short
My September blog emphasized that AI shatters balanced I/O models, demanding sustained high throughput, on both reads and writes, for massive datasets without bottlenecks. AFX achieves up to 4 TB/s cluster throughput (reads) and scales to over 1 EB, which is respectable for feeding GPU clusters via standard protocols like NFS/pNFS.
Breaking it down: this peak requires up to 128 controller nodes, equating to about 31 GB/s of reads per controller node (4 TB/s ÷ 128). NetApp hasn’t published write specs, but based on ONTAP’s typical RAID overhead (e.g., parity calculations in aggregates), writes often run at roughly 2/3 of reads, or around 20 GB/s per node (being generous here). The building block is a 2U AFX 1K controller (minimum of two required), plus a proprietary 2U NX224 NVMe HA enclosure and two 100GbE switches (roughly 8U total), using proprietary hardware for HA interoperability.
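For readers who want to check the arithmetic, here’s a quick back-of-envelope sketch. The 2/3 write-to-read ratio is my assumption based on typical parity overhead, not a published NetApp figure:

```python
# Back-of-envelope per-controller throughput for NetApp AFX at its peak spec.
# The write-to-read ratio is an assumption (typical RAID/parity overhead),
# not a published NetApp number.

CLUSTER_READ_TBPS = 4.0      # claimed peak cluster read throughput (TB/s)
MAX_CONTROLLERS = 128        # controller nodes required to reach that peak
WRITE_TO_READ_RATIO = 2 / 3  # assumed parity penalty on writes

read_per_node = CLUSTER_READ_TBPS * 1000 / MAX_CONTROLLERS   # GB/s
write_per_node = read_per_node * WRITE_TO_READ_RATIO          # GB/s

print(f"Reads per controller:  ~{read_per_node:.1f} GB/s")   # ~31.2 GB/s
print(f"Writes per controller: ~{write_per_node:.1f} GB/s")  # ~20.8 GB/s
```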
In contrast, market leaders like VDURA deliver over 60 GB/s reads and 40 GB/s writes per 1U standard server node in comparable configs, thanks to our shared-nothing design that avoids legacy controller bottlenecks. We achieve this with off-the-shelf hardware, scaling throughput linearly without specialized HA enclosures. Benchmarks show VDURA, Weka, and DDN outperforming ONTAP-based systems by 2x or more in per-node/rack density and sustained AI ops. AFX’s numbers might work for 10s–100s of GPUs in early-stage/preproduction enterprise AI but lag in raw efficiency for at-scale AI deployments/AI factories.
To illustrate the performance gap in a flash-only setup targeting 4 TB/s throughput:
- NetApp AFX: To reach 4 TB/s requires the full 128 controllers (2U proprietary each) + associated enclosures, totaling ~256U just for controllers (plus enclosures for capacity but focusing on throughput scaling here).
- VDURA: Requires ~67 nodes (1U standard servers with integrated NVMe SSDs, at 60 GB/s reads per node) = ~67U total for equivalent throughput.
This means VDURA needs ~48% fewer nodes and ~74% less rack space for the compute layer alone, highlighting the density advantages of shared-nothing architecture in AI environments.
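A minimal sketch of that density math, using the per-node figures cited above (switches and capacity enclosures are ignored on both sides, so this covers the compute layer only):

```python
import math

# Density comparison for a flash-only build targeting 4 TB/s of reads.
# AFX figures come from the 128-controller peak above; the VDURA per-node
# figure is from our published configs. Switches and capacity enclosures
# are excluded on both sides.

TARGET_GBPS = 4000

AFX_CONTROLLERS = 128         # controllers needed to hit 4 TB/s
AFX_U_PER_CONTROLLER = 2      # 2U proprietary controller

VDURA_READ_GBPS_PER_NODE = 60
VDURA_U_PER_NODE = 1

vdura_nodes = math.ceil(TARGET_GBPS / VDURA_READ_GBPS_PER_NODE)   # ~67
afx_u = AFX_CONTROLLERS * AFX_U_PER_CONTROLLER                     # 256U
vdura_u = vdura_nodes * VDURA_U_PER_NODE                           # 67U

print(f"VDURA: {vdura_nodes} nodes / {vdura_u}U  vs  AFX: {AFX_CONTROLLERS} controllers / {afx_u}U")
print(f"Fewer nodes:     ~{1 - vdura_nodes / AFX_CONTROLLERS:.0%}")  # ~48%
print(f"Less rack space: ~{1 - vdura_u / afx_u:.0%}")                # ~74%
```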
Reliability: Local Protections Limit Scale-Out Resilience
I argued that general-purpose resilience, focused on HA pairs, doesn’t hold up under AI’s failure probabilities at scale. AFX uses ONTAP’s aggregate-level protection with local RAID via FlexVol volumes (tolerating limited drive failures per aggregate, as confirmed in the AFX datasheet DS-3466) and HA pairs for controller failover, claiming 99.9999% availability.
That metric applies per HA pair only, not to the full cluster. At 128 nodes, the architecture relies on clustered networking and local-only data protection, with no cluster-wide network erasure coding. Rebuilds and failovers/failbacks introduce significant downtime risk and performance degradation, and probability models show effective availability dropping below 99% in large deployments once concurrent failures are factored in (per-pair failure probabilities compound across 64 HA pairs, and rebuild and failover windows widen the exposure). There’s no published cluster-level durability rating in nines; data remains vulnerable if a pair fails during a rebuild.
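To make the compounding effect concrete, here is a simplified availability model. The inputs are illustrative: the six-nines figure is NetApp’s headline per-pair claim, while the degraded figure is a hypothetical stand-in for real-world rebuild and failover windows, not a measured value.

```python
# Simplified model: a cluster of independent HA pairs is only available when
# every pair is up, so cluster availability is the product across pairs.
# Input figures are illustrative, not measurements.

def cluster_availability(per_pair_availability: float, n_pairs: int) -> float:
    """Probability that no HA pair is down, assuming independent failures."""
    return per_pair_availability ** n_pairs

N_PAIRS = 64  # 128 controllers = 64 HA pairs

headline = cluster_availability(0.999999, N_PAIRS)  # six nines per pair
degraded = cluster_availability(0.9998, N_PAIRS)    # hypothetical effective value once
                                                     # rebuild/failover windows bite

print(f"Six-nines pairs -> cluster availability:  {headline:.4%}")  # ~99.9936%
print(f"Hypothetical degraded pairs -> cluster:   {degraded:.4%}")  # ~98.7%
```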
VDURA employs file-level network erasure coding across the cluster, enabling resilience to 3+ node failures with rapid distributed rebuilds. We offer multi-level EC options for up to 12 nines durability, verified by Hyperion for HPC and AI environments. This shared-nothing approach running on standard servers avoids HA-pair dependencies, maintaining uptime even at thousands of nodes. It’s a ground-up design for AI’s fault-tolerant needs, not an evolution from NAS-era pairing.
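For intuition on how network erasure coding reaches very high durability, here’s a minimal sketch. The stripe geometry (16+3) and the per-node failure probability are hypothetical placeholders, not VDURA’s actual configuration; the point is the shape of the math, not the exact nines.

```python
from math import comb, log10

# Data in an erasure-coded stripe survives as long as no more than P of its
# N members fail within one rebuild window. Geometry and failure probability
# below are hypothetical placeholders, not VDURA's actual parameters.

def stripe_loss_probability(n: int, p: int, node_fail_prob: float) -> float:
    """Probability that more than p of the n stripe members fail concurrently."""
    return sum(
        comb(n, k) * node_fail_prob**k * (1 - node_fail_prob) ** (n - k)
        for k in range(p + 1, n + 1)
    )

# Example: a 16+3 stripe (tolerates 3 node failures), with an assumed 0.1%
# chance that any given node fails within a single rebuild window.
loss = stripe_loss_probability(n=19, p=3, node_fail_prob=0.001)
print(f"Per-stripe loss probability per rebuild window: {loss:.2e}")
print(f"Roughly {log10(1 / loss):.0f} nines at this granularity")
```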
Efficiency: Higher Hardware Footprint and Costs Add Up
General-purpose platforms often carry extra overhead for “enterprise readiness,” which, as I pointed out, leads to bulkier, costlier setups in AI contexts. AFX’s 4U building block and proprietary controllers enable features like QoS but consume more power (e.g., dual 2U units per logical node) and rack space. NetApp claims 5-10x data reduction, but mileage will vary: many AI datasets are already reduced at the application level or simply aren’t compressible. And the all-flash mandate aligns with performance needs but locks in higher costs without flexible media blending.
VDURA uses 1U standard nodes (with integrated NVMe SSDs for flash-only setups) to deliver 2-4x the performance per RU, reducing acquisition cost and TCO by over 50% per benchmarks. Avoiding proprietary hardware cuts capex, and our file-level network erasure coding minimizes rebuild overhead. In flash configurations, this translates to fewer racks, lower power (up to 40% savings), and simpler ops compared to AFX’s setup.
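As a rough illustration of where the power savings come from, the sketch below applies hypothetical per-node wattages to the node counts from the density comparison above. The wattages are placeholders, since actual draw depends on CPU, drive count, and load:

```python
# Rough power comparison for the 4 TB/s configurations above. Per-node wattages
# are hypothetical placeholders; actual draw depends on CPU, drive count, and
# load. Capacity enclosures and switches are again excluded.

AFX_CONTROLLERS, AFX_WATTS = 128, 900    # assumed draw per 2U controller
VDURA_NODES, VDURA_WATTS = 67, 1050      # assumed draw per 1U node with NVMe

afx_kw = AFX_CONTROLLERS * AFX_WATTS / 1000
vdura_kw = VDURA_NODES * VDURA_WATTS / 1000

print(f"AFX controller layer: ~{afx_kw:.0f} kW")              # ~115 kW
print(f"VDURA node layer:     ~{vdura_kw:.0f} kW")            # ~70 kW
print(f"Power reduction:      ~{1 - vdura_kw / afx_kw:.0%}")  # ~39% with these inputs
```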
Closing Thoughts: Validation of the Shift, and What’s Next
AFX shows legacy vendors adapting as I forecasted, but the inherited foundation, from HA-pair architecture to low per-node efficiency, limits it for pure AI scale. It might be a viable option for ONTAP shops looking to dip a toe in the AI waters, yet for greenfield/production AI infrastructures demanding top efficiency and resilience, purpose-built systems like VDURA pull ahead. On that note, we’re focused on real advancements that widen the gap, with innovations coming at SC25 next month. For added background, you can read my full September post here: https://www.vdura.com/2025/09/29/the-general-purpose-era-is-ending-so-is-general-purpose-storage/