ECE faculty Zhao Zhang received an NSF Medium grant on Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters

ECE Assistant Professor Zhao Zhang has successfully transferred an NSF Medium award to Rutgers. The project is funded by the NSF CNS program and is titled “Collaborative Research: CSR: Medium: Fortuna: Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters”. The total award is $1,000,000 over three years, of which $333,105 goes to Rutgers; Wisconsin is the lead institution.

Large computing clusters, including data centers and supercomputers, are used for a variety of applications, including scientific computing and machine learning. Modern compute clusters typically use specialized accelerator hardware to speed up computations. Operators of accelerator-rich clusters aim for high resource utilization across all users of the cluster. However, these systems are often underutilized because of performance variability across accelerators; that is, application performance varies even when the same application is run on the same type of accelerator. This project will develop Fortuna, a set of tools that cluster operators and researchers can use to characterize and harness variability across accelerators. First, Fortuna will use new methodologies to characterize how much performance variability exists across a wide range of accelerator hardware. Second, Fortuna will identify which applications are more likely to suffer from performance variability. Finally, Fortuna will include new scheduling mechanisms that use variability measurements and knowledge about applications to improve utilization.
 
Congratulations to Zhao!