Aurora's DAOS and Lustre excel in MLPerf Storage Benchmark for large-scale AI


Aurora's DAOS and Lustre file systems achieved strong results in the MLPerf Storage Benchmark, demonstrating the capabilities needed for large-scale AI training in the exascale era.

MLCommons has announced the results of MLPerf Storage v2.0, a benchmark suite designed to measure how storage systems perform under the demands of large-scale machine learning workloads in an architecture-neutral, representative, and reproducible manner. Among the submissions, the U.S. Department of Energy’s (DOE) Argonne National Laboratory contributed results from its Aurora supercomputer, demonstrating the strengths of its DAOS and Lustre file systems in supporting the intense I/O requirements of modern AI training.


With more than 60,000 GPUs and a large on-node memory capacity, Argonne’s Aurora supercomputer provides researchers with a powerful platform for advancing AI-driven science. (Image: Argonne National Laboratory)

Training massive AI models like LLaMA3-405B and LLaMA3-1T requires fast, reliable checkpointing, a critical process that periodically saves the model’s state to protect against hardware or software failures. These operations involve writing enormous amounts of data in short timeframes, and any delay translates directly into idle compute time and reduced efficiency.
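
To make the mechanics concrete, the sketch below shows the basic checkpointing pattern in PyTorch-style Python. The interval, file layout, and single-file `torch.save` call are illustrative assumptions, not Aurora's actual training stack; exascale jobs typically shard checkpoints across many parallel writers.

```python
# Minimal sketch of periodic checkpointing during training (PyTorch-style).
# The interval, names, and single-file layout are illustrative assumptions.
import torch

CHECKPOINT_EVERY = 500  # steps between checkpoints (assumed value)

def maybe_checkpoint(step, model, optimizer, path="checkpoint.pt"):
    """Save model and optimizer state so training can resume after a failure."""
    if step % CHECKPOINT_EVERY == 0:
        torch.save(
            {
                "step": step,
                "model": model.state_dict(),         # learned weights
                "optimizer": optimizer.state_dict(), # e.g. Adam moments, often
                                                     # larger than the weights
            },
            path,
        )
```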

The MLPerf Storage benchmark uses the Deep Learning I/O (DLIO) benchmark to simulate the I/O operations that occur during AI training. The AI checkpointing workloads introduced in the v2.0 release were developed by Huihuo Zheng, a computer scientist at the Argonne Leadership Computing Facility (ALCF), who is a core developer of DLIO and co-chair of the MLPerf Storage working group.
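
DLIO itself is configuration-driven and run at scale with MPI; as a simplified stand-in for the kind of measurement its checkpoint workloads perform, the sketch below times a sequential write and reports throughput. The sizes and path are arbitrary assumptions, and this is not DLIO's actual API.

```python
# Simplified stand-in for a checkpoint-write measurement: write a fixed
# amount of data in blocks and report throughput. Sizes are assumptions;
# DLIO drives far larger, parallel, configurable I/O patterns.
import os
import time

def measure_write_throughput(path="ckpt.bin", total_bytes=4 * 2**30,
                             block_bytes=64 * 2**20):
    """Write `total_bytes` in `block_bytes` chunks and return GB/s."""
    block = os.urandom(block_bytes)
    start = time.perf_counter()
    with open(path, "wb") as f:
        written = 0
        while written < total_bytes:
            f.write(block)
            written += len(block)
        f.flush()
        os.fsync(f.fileno())  # ensure data reaches storage, not just page cache
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e9

if __name__ == "__main__":
    print(f"{measure_write_throughput():.2f} GB/s")
```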

To demonstrate Aurora’s capability for such challenges, the ALCF, a DOE Office of Science user facility, submitted performance data for two storage systems:

  • DAOS, a high-performance, open-source object store, reached nearly 1 TB/s write throughput and 600 GB/s read throughput using a subset of 128 of the 1,024 DAOS servers on Aurora, enabling a full checkpoint of the LLaMA3-405B model in under 10 seconds (see the rough arithmetic after this list).
  • Lustre, which powers the ALCF’s 100 PB Flare system, delivered sustained read and write throughput of 400–600 GB/s, highlighting its capability to operate effectively at near-peak capacity for demanding workloads.
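
As a rough consistency check on the DAOS figure (not part of the official submission), the sketch below estimates the checkpoint footprint of a 405-billion-parameter model, assuming bf16 weights plus fp32 master weights and Adam optimizer moments. Actual checkpoint contents vary by training setup.

```python
# Back-of-envelope check of the "under 10 seconds" figure, under an assumed
# checkpoint layout: bf16 weights + fp32 master weights + two fp32 Adam moments.
params = 405e9                    # LLaMA3-405B parameter count
bytes_per_param = 2 + 4 + 4 + 4   # bf16 weight + fp32 weight + two Adam moments
checkpoint_bytes = params * bytes_per_param
write_gbps = 1000                 # ~1 TB/s DAOS write throughput (GB/s)

print(f"checkpoint size ≈ {checkpoint_bytes / 1e12:.1f} TB")
print(f"write time at {write_gbps} GB/s ≈ "
      f"{checkpoint_bytes / 1e9 / write_gbps:.1f} s")
```

Under these assumptions the checkpoint is roughly 5.7 TB, which a sustained ~1 TB/s write rate completes in about 6 seconds, consistent with the reported result.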

“Checkpointing plays a crucial role in protecting weeks of training progress when working with today’s largest models,” Zheng said. “The MLPerf benchmark provides a rigorous and realistic way to measure how storage systems handle these demands. Our results show that Aurora’s architecture delivers the performance and throughput required for large-scale AI training in the exascale era.”

By participating in MLPerf Storage v2.0, the ALCF continues to support the development of benchmarks and technologies that make scalable, fault-tolerant AI training possible on advanced high-performance computing systems.