Pushing the boundaries of HPC and AI: A Q&A with Varuni Katti Sastry


Varuni Sastry. Image: Argonne National Laboratory

Sastry discusses her involvement in the AuroraGPT project, supporting the ALCF AI Testbed, and contributing to Gordon Bell-finalist project MProt-DPO.

As a key member of the AI Testbed team at the Argonne Leadership Computing Facility (ALCF), Varuni Katti Sastry is advancing the use of novel AI accelerators to drive scientific discovery. Her work spans performance benchmarking, infrastructure development, and hands-on user support—helping researchers deploy complex AI workloads on cutting-edge architectures like the ALCF’s Cerebras CS-2, SambaNova, and Groq systems. 

In addition to leading technical efforts at the ALCF AI Testbed, Sastry has played a central role in several high-impact initiatives, including the AuroraGPT project, where she helped train a large-scale, science-focused foundation model on the ALCF’s exascale supercomputer. She also contributed to the Gordon Bell-finalist project MProt-DPO, which applies large language models and molecular simulations to protein design, and helped develop CALMS, a retrieval-augmented generation (RAG) framework designed to support scientific facilities through AI-powered chat interfaces.

In this Q&A, Sastry shares insights into supporting AI at scale, collaborating across scientific domains, and building the infrastructure and community needed to accelerate innovation in HPC and AI. The ALCF is a DOE Office of Science user facility at the U.S. Department of Energy’s Argonne National Laboratory.

Q: You’ve been involved in supporting the ALCF AI Testbed. Can you tell us about the AI Testbed and your focus there?

The ALCF AI Testbed houses a number of next-generation accelerators with dataflow-based architectures specifically designed and optimized for AI workloads. Currently, the testbed offers allocations on Cerebras CS-2, SambaNova DataScale, Groq, and Graphcore systems. As a core member of the AI Testbed team, I am responsible for performance benchmarking a wide range of AI workloads and for supporting users in deploying scientific AI workloads on these systems. Our team’s responsibilities include building, maintaining, and expanding the infrastructure, as well as exploring future heterogeneous computing systems that integrate these resources with conventional GPU-based systems. We also organize yearly tutorials and workshops at prominent computing conferences like SC and ISC to share best practices and showcase the unique capabilities of these systems to the AI and HPC community.

Q: Your role involves helping users deploy scientific AI workloads on ALCF systems, including the novel accelerators in AI Testbed. What are some of the biggest challenges or interesting experiences you had supporting users?

One of the biggest challenges in supporting users has been navigating the limitations in kernel, software, and compiler support, especially when working with cutting-edge hardware or novel architectures. Many scientific applications rely on mature software stacks that are not always compatible with newer systems. This often requires customizing the vendor software, or even modifying user code to ensure compatibility and performance. Another recurring challenge is performance tuning. Scientific AI workloads can be highly diverse, and achieving optimal performance often requires a deep understanding of both the workload characteristics and the underlying hardware. 

Despite these challenges, these experiences have provided opportunities to work closely with users on real-world problems, often requiring creative debugging, collaborative problem-solving, and pushing the boundaries of what's possible on HPC and AI systems. Each success deepens our understanding of the evolving landscape, helps us expand our infrastructure to meet growing user needs, and enhances our support strategies for future users.

Q: In 2024, you joined the AuroraGPT pre-training team. What is AuroraGPT, and what has your role on the team involved?

AuroraGPT is a research project at Argonne aimed at developing large language foundation models for advancing science. The goal of AuroraGPT is to build the infrastructure necessary to train, evaluate, and deploy foundation models at scale for scientific research using DOE’s leadership computing resources. The pre-training team, which I am a part of, pre-trained a 7B-parameter Llama-style GPT model on up to 2 trillion tokens from the Dolma dataset on both Aurora and Polaris. Our data science team at the ALCF built the data preprocessing and training infrastructure needed to train AuroraGPT at scale.
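
For readers curious what a "7B-parameter Llama-style" model looks like on paper, the minimal sketch below defines such an architecture with the Hugging Face transformers config class. The article does not list AuroraGPT's actual hyperparameters, so the values here are assumptions taken from the publicly documented Llama-7B design.

```python
# Minimal sketch of a 7B-parameter Llama-style architecture definition.
# These are the standard public Llama-7B values, used here only as an
# illustration; the real AuroraGPT configuration may differ.
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=4096,               # model width
    intermediate_size=11008,        # SwiGLU feed-forward width
    num_hidden_layers=32,           # transformer blocks
    num_attention_heads=32,
    max_position_embeddings=4096,   # sequence length (assumed)
)

# Rough parameter count from the architecture: embeddings plus per-layer
# attention and MLP weights.
per_layer = 4 * config.hidden_size**2 + 3 * config.hidden_size * config.intermediate_size
total = 2 * config.vocab_size * config.hidden_size + config.num_hidden_layers * per_layer
print(f"~{total / 1e9:.1f}B parameters")  # ~6.7B, i.e. the "7B" class
```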

Q: You helped set up a scalable processing pipeline for AuroraGPT. What were some of the unique challenges in preparing data for such a large-scale model?

The effectiveness of large-scale AI model training critically depends on several key factors related to data preparation and processing. One of the foundational decisions is the choice of the pre-training dataset, which must align with the model's intended scientific downstream tasks. Data quality is equally vital, as noisy, inconsistent, or irrelevant content can significantly impair model performance. We used the open-source Dolma dataset, which met these criteria, and adjusted the weighting of its components to better represent scientific content. At scale, effective tokenization becomes essential for transforming raw text into trainable formats while preserving linguistic structure and semantic meaning. We explored various tokenization strategies and developed a scalable, high-performance tokenization pipeline. To further support the demands of large-scale data processing, we built a robust and scalable data processing engine based on the Megatron-DeepSpeed framework, enabling high-throughput data loading and efficient training.
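
To illustrate the shape of such a tokenization step, here is a small sketch that tokenizes text shards in parallel and writes a flat binary of token IDs for a data loader. It is not the AuroraGPT pipeline; the file paths, tokenizer choice, and worker count are placeholders.

```python
# Illustrative parallel tokenization step, not the actual AuroraGPT pipeline.
# Assumes a directory of plain-text shards; names below are hypothetical.
import glob
import multiprocessing as mp

import numpy as np
from transformers import AutoTokenizer

TOKENIZER_NAME = "gpt2"          # placeholder; AuroraGPT used a Llama-style tokenizer
INPUT_GLOB = "corpus/*.txt"      # hypothetical raw-text shards
OUTPUT_FILE = "corpus_tokens.bin"  # flat binary of token IDs for the data loader

def tokenize_file(path: str) -> np.ndarray:
    """Tokenize one shard and append an end-of-document marker."""
    tok = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
    with open(path, encoding="utf-8") as f:
        ids = tok(f.read(), add_special_tokens=False)["input_ids"]
    ids.append(tok.eos_token_id)  # document boundary for sequence packing
    return np.asarray(ids, dtype=np.uint32)

if __name__ == "__main__":
    shards = sorted(glob.glob(INPUT_GLOB))
    # Each worker tokenizes shards independently; results are concatenated in order.
    with mp.Pool(processes=8) as pool:
        arrays = pool.map(tokenize_file, shards)
    np.concatenate(arrays).tofile(OUTPUT_FILE)
```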

Q: You were a co-author on the MProt-DPO study, which was named a 2024 Gordon Bell finalist. What is MProt-DPO, and why is this work significant?

MProt-DPO is a scalable, multimodal computational framework for protein design in which protein sequences and structures are augmented with natural language descriptions of their biochemical properties to train generative models. It adopts Direct Preference Optimization (DPO) so the framework can learn from experimental feedback and simulations, and the workflow integrates molecular dynamics (MD) simulations, protein language models, and experimental observations. We tested the framework on the yeast protein HIS7, using experimental data on the performance of various mutations, and used simulation data to optimize the design of malate dehydrogenase (MDH) for improved catalytic efficiency. The framework demonstrated its ability to handle generalized, complex protein design challenges, with broader implications for protein engineering and the development of biological therapeutics. As part of the ALCF team, we built and trained a Megatron-DeepSpeed-based LLM framework on multiple supercomputers and scaled the training to 3,200 nodes on Aurora, achieving a maximum sustained performance of 4.11 exaflops and a peak performance of 5.57 exaflops.
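
For a sense of what the DPO objective does, the sketch below computes the standard DPO loss from precomputed sequence log-probabilities. It illustrates the general technique only, not the MProt-DPO implementation; the function and batch here are made up.

```python
# Minimal sketch of the Direct Preference Optimization (DPO) loss on precomputed
# sequence log-probabilities; illustrative only, not the MProt-DPO code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """'Chosen' sequences are those that experiments or simulations scored
    favorably; 'rejected' sequences scored poorly."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer chosen over rejected, relative to the reference model.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Example with dummy log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```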

Q: You’ve been working on a large language model-based chat framework that uses retrieval-augmented generation (RAG) to support scientific facilities. Can you tell us more about that project?

This is an inter-facility collaboration at Argonne, where we set up an LLM-based, chat-style RAG framework called the Context-Aware Language Model for Science (CALMS) to assist scientists with instrument operations and complex experimentation. By retrieving relevant information from the documentation of various facilities, CALMS can answer questions about scientific capabilities and operational procedures, and we prototyped how interfacing with software tools and experimental hardware could let users conversationally operate scientific instruments. This work was published in npj Computational Materials.
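
The retrieval step at the heart of a RAG workflow like this can be sketched in a few lines: embed the documentation, find the snippets most similar to the question, and prepend them to the prompt. The documents, embedding model, and prompt format below are illustrative assumptions, not taken from CALMS.

```python
# Minimal sketch of the retrieval step in a RAG workflow; illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical facility documentation snippets that would normally come from
# user guides, instrument manuals, and operational procedures.
docs = [
    "The beamline requires a safety interlock check before each experiment.",
    "Detector calibration is performed daily at 8 a.m. by the instrument scientist.",
    "Users request compute allocations through the facility's proposal system.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k snippets most similar to the question (cosine similarity)."""
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

question = "When is the detector calibrated?"
context = "\n".join(retrieve(question))
# The retrieved context grounds the LLM's answer in facility documentation
# rather than in its training data alone.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```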

Q: Beyond your research, you’ve organized workshops, tutorials, and hackathons, and served as a reviewer for programs like the Argonne Training Program on Extreme-Scale Computing (ATPESC) and Innovative and Novel Computational Impact on Theory and Experiment (INCITE). Why is community involvement important to you, and how has it influenced your perspective on the AI and HPC communities?

Community involvement has always been an integral part of how ALCF staff engage with the broader research user community. Organizing workshops, tutorials, and hackathons has given me valuable insights into the complexity of scientific workloads, along with the unique HPC challenges they present. Serving as a reviewer for programs like ATPESC and INCITE has deepened my appreciation of how swiftly the community adapts to emerging technologies. Student and user outreach programs, like the AI-for-science student training series, have helped me see the evolving landscape of scientific computing more holistically: not just from the perspective of technical challenges, but also through the lens of mentorship, accessibility, and collaboration. Staying engaged with the community keeps me grounded, future-focused, and continuously learning. I believe it’s vital for advancing both AI and HPC.