Speeding up Nek5000 with Autotuning and Specialization

Jaewook Shin
Seminar

Autotuning has recently emerged as a systematic process for evaluating alternative implementations of a computation and selecting the best-performing one for a particular architecture. At a LANS seminar in May, I introduced compiler-based empirical performance tuning and presented results from applying it to a dense matrix-multiply kernel for small, rectangular matrices.
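The kind of kernel in question resembles the untuned C reference below; the routine name, argument order, and column-major layout are illustrative assumptions rather than the actual Nek5000 source, and the tuned variants specialize on the small, fixed dimensions that arise in practice.

    /* Untuned reference for C = A * B with small, rectangular matrices.
     * A is n1 x n2, B is n2 x n3, C is n1 x n3, stored column-major
     * (Fortran layout).  Illustrative sketch only. */
    void mxm_ref(const double *a, int n1,
                 const double *b, int n2,
                 double *c, int n3)
    {
        for (int j = 0; j < n3; j++)
            for (int i = 0; i < n1; i++) {
                double s = 0.0;
                for (int k = 0; k < n2; k++)
                    s += a[i + k*n1] * b[k + j*n2];
                c[i + j*n1] = s;
            }
    }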

In this talk, I will begin with a summary of that earlier work and then present my progress since then. A major result is that the same technique can be used to tune a higher-level kernel: a loop around calls to a dense matrix-multiply routine for small matrices. The tuned kernel reaches up to 82% of peak performance on an AMD Phenom processor. With this tuned higher-level kernel and the library of tuned matrix-multiply routines produced earlier, the whole Nek5000 program achieves a 21% speedup on 256 nodes of the Cray XT5 at Oak Ridge National Laboratory. Finally, I will show the overheads and fluctuations that arise in performance measurements and how I addressed them in these experiments.
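A minimal sketch of the shape of such a higher-level kernel follows, assuming the mxm_ref signature above and a hypothetical loop over spectral elements in which one operand is fixed and the others vary per element; it is not the actual Nek5000 routine or its tuned form.

    /* Sketch of a higher-level kernel: one small matrix multiply per
     * element.  'nelt' elements; the fixed operand a and the per-element
     * layout of b and c are assumptions for illustration. */
    void apply_per_element(const double *a, const double *b, double *c,
                           int n1, int n2, int n3, int nelt)
    {
        for (int e = 0; e < nelt; e++)
            mxm_ref(a,           n1,
                    b + e*n2*n3, n2,
                    c + e*n1*n3, n3);
    }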