Throughput Machine Learning with Heterogeneous Resources (NAIRR240335)
Abstract
Advancing domain science through machine learning requires an ensemble of models, not a single model. To understand how ML can be applied to a scientific problem and to rigorously test the sensitivity of results, researchers must iterate through multiple model designs. This creates a throughput problem: the more models that can be trained, the better the AI-enabled science. However, running large-scale ML training and evaluation workloads on distributed, heterogeneous infrastructure like the Open Science Pool (OSPool) raises open questions. How does training across different hardware generations (e.g., NVIDIA V100, A100, and H100 GPUs) affect final model performance? What tools and automation can minimize the friction of managing large ML training ensembles? And how can the training process cooperate with workload managers like HTCondor to improve responsiveness and fairness in multi-tenant cyberinfrastructure? To address these challenges, the Partnership to Advance Throughput Computing (PATh) is launching a new collaboration to profile the effects of throughput training and inference on distributed, heterogeneous capacity. Using protein AI as an exemplar scientific domain, the project will characterize the impact of training ensembles across heterogeneous resources, improve capabilities and services that reduce barriers for new ML researchers, and demonstrate how single workloads can run effectively across NAIRR Pilot resources.
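To illustrate the kind of ensemble workload the abstract describes, the sketch below uses the HTCondor Python bindings to queue one training job per model, requesting a GPU and constraining GPU memory so jobs can match across heterogeneous cards. This is a minimal sketch of the general pattern, not the project's actual workflow: the train.sh wrapper, per-model config files, ensemble size, and resource figures are all hypothetical placeholders.

    import htcondor  # HTCondor Python bindings

    # Describe one training job; $(model_id) is expanded separately for each queued job.
    job = htcondor.Submit({
        "executable": "train.sh",              # hypothetical wrapper that runs one training
        "arguments": "$(model_id)",
        "request_cpus": "4",
        "request_memory": "16GB",
        "request_disk": "20GB",
        "request_gpus": "1",
        # Match any GPU with at least 16 GB of device memory, so the ensemble can
        # spread across heterogeneous cards (V100, A100, H100, ...).
        "require_gpus": "GlobalMemoryMb >= 16000",
        "transfer_input_files": "config_$(model_id).yaml",  # hypothetical per-model config
        "output": "logs/$(model_id).out",      # logs/ must exist on the access point
        "error": "logs/$(model_id).err",
        "log": "ensemble.log",
    })

    schedd = htcondor.Schedd()
    # Queue one job per model in the ensemble; itemdata supplies the $(model_id) values.
    models = [{"model_id": str(i)} for i in range(100)]
    result = schedd.submit(job, itemdata=iter(models))
    print(f"Submitted cluster {result.cluster()} with {len(models)} training jobs")

Constraining GPU properties rather than naming a specific device model is what lets a single ensemble draw on whatever heterogeneous capacity the pool offers, which is precisely the setting whose effect on final model performance the project aims to characterize.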