Accelerating AI Training and Inference for Science on Aurora: Frameworks, Tools, and Best Practices

Webinar

Join us on May 28, 2025, for an overview of key AI frameworks, toolkits, and strategies to achieve high-performance training and inference on Aurora for scientific applications. Members of ALCF's AI/ML team will cover examples of using PyTorch and TensorFlow on Aurora, followed by distributed training at scale using PyTorch with Distributed Data Parallel (DDP) and TensorFlow with Horovod, all driven by the oneCCL communication library.

Additionally, the speakers will discuss using Python on Intel's GPUs with Data Parallel Extensions for Python (DPEP). To maximize GPU performance, the webinar will share best practices for profiling codes and identifying bottlenecks. 

GitHub resources for this session can be found here. Azpz content can be found here, and AI at Scale can be found here.