
Aurora's DAOS and Lustre file systems achieved strong results in the MLPerf Storage v2.0 benchmark, demonstrating the capabilities needed for large-scale AI training in the exascale era.
MLCommons has announced the results of MLPerf Storage v2.0, a benchmark suite designed to measure the performance of storage systems for machine learning workloads in an architecture-neutral, representative, and reproducible manner. Among the submissions, the U.S. Department of Energy's (DOE) Argonne National Laboratory contributed results from its Aurora supercomputer, demonstrating the strengths of its DAOS and Lustre file systems in supporting the intense I/O requirements of modern AI training.
Training massive AI models like LLaMA3-405B and LLaMA3-1T requires fast, reliable checkpointing, a critical process that periodically saves the model's state to protect against hardware or software failures. These operations involve writing enormous volumes of data in short timeframes, and any delays translate directly into idle compute time and reduced training efficiency.
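To see why checkpoint writes stress storage, it helps to estimate the data volumes involved. The sketch below is a back-of-envelope calculation, not a published figure for Aurora or any specific system: the bytes-per-parameter value assumes a mixed-precision training setup (16-bit weights plus 32-bit master weights and Adam optimizer state), which is a common but not universal configuration.

```python
# Back-of-envelope: checkpoint size and the aggregate write bandwidth needed
# to keep compute stalls short. The 14 bytes/parameter figure is an
# assumption: bf16 weights (2) + fp32 master weights (4) + Adam first and
# second moments (4 + 4). Real checkpoints vary with precision and sharding.

def checkpoint_size_bytes(params, bytes_per_param=14):
    # Total bytes written for one full checkpoint of the model state.
    return params * bytes_per_param

def required_bandwidth_gbps(params, stall_budget_s, bytes_per_param=14):
    # Aggregate write bandwidth (GB/s) needed to finish one checkpoint
    # within the given compute-stall budget.
    return checkpoint_size_bytes(params, bytes_per_param) / stall_budget_s / 1e9

size_tb = checkpoint_size_bytes(405e9) / 1e12   # ~5.67 TB for a 405B-parameter model
bw_gbps = required_bandwidth_gbps(405e9, 60)    # ~94.5 GB/s to finish within 60 s
```

Under these assumptions, a single checkpoint of a 405B-parameter model runs to several terabytes, and keeping the write inside a one-minute stall budget demands on the order of 100 GB/s of sustained aggregate bandwidth.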
The MLPerf Storage benchmark suite uses the Deep Learning I/O (DLIO) benchmark to simulate the I/O operations performed during AI training. The new AI checkpointing workloads introduced in the v2.0 release were developed by Huihuo Zheng, a computer scientist at the Argonne Leadership Computing Facility (ALCF), who is a core developer of DLIO and co-chair of the MLPerf Storage working group.
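DLIO's central idea is to reproduce a training job's I/O pattern without running the model itself: compute phases become timed waits, while checkpointing becomes real writes of realistically sized files. The miniature sketch below illustrates that emulation idea only; the function name and parameters are hypothetical and are not DLIO's actual interface or configuration.

```python
import os
import time

def emulate_training_io(out_dir, steps=4, compute_s=0.01,
                        checkpoint_every=2, ckpt_bytes=1 << 20):
    """Replay a training loop's I/O pattern: sleep in place of compute,
    then write a dummy checkpoint of the target size at the target cadence.
    Returns the total seconds spent inside checkpoint writes."""
    payload = b"\0" * ckpt_bytes
    write_seconds = 0.0
    for step in range(1, steps + 1):
        time.sleep(compute_s)  # stands in for the forward/backward passes
        if step % checkpoint_every == 0:
            path = os.path.join(out_dir, f"ckpt_{step}.bin")
            t0 = time.perf_counter()
            with open(path, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # force bytes to storage, like a real checkpoint
            write_seconds += time.perf_counter() - t0
    return write_seconds
```

Because the compute phases are just timed waits, a harness like this exercises only the storage system, which is what makes the resulting measurements attributable to storage rather than to the accelerator.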
To demonstrate Aurora’s capability for such challenges, the ALCF, a DOE Office of Science user facility, submitted performance data for two of Aurora's storage systems: DAOS and Lustre.
“Checkpointing plays a crucial role in protecting weeks of training progress when working with today’s largest models,” Zheng said. “The MLPerf benchmark provides a rigorous and realistic way to measure how storage systems handle these demands. Our results show that Aurora’s architecture delivers the performance and throughput required for large-scale AI training in the exascale era.”
By participating in MLPerf Storage v2.0, the ALCF continues to support the development of benchmarks and technologies that make scalable, fault-tolerant AI training possible on advanced high-performance computing systems.