OAC Core: AIMCI: Artificial Intelligence for Managing Cyberinfrastructure
Figure: Overview of AIMCI with three research thrusts (T1-3)
Researchers: Michael E. Papka, Yash Kurkure, Zhiling Lan
Funding: NSF 2402901

Overview
Advanced cyberinfrastructure (CI) is undergoing disruptive changes in system architectures and application workloads. Each facility typically hosts a blend of high-performance systems with various configurations and capabilities. These systems are composed of a multitude of components, embodying heterogeneity in various aspects, ranging from computing to memory and storage. Concurrently, the landscape of cyberinfrastructure workloads is rapidly expanding beyond traditional computational simulations to encompass a hybrid mix of applications including deep learning applications, data-centric applications, as well as coupled workflows involving distributed computing systems. The emerging applications come with diverse resource requirements and exhibit distinct computing characteristics. Current resource management approaches are designed for the traditional paradigm, namely numerical simulations on homogeneous systems. They heavily rely on heuristics and manual processing, making them inadequate for addressing the substantial challenges introduced by the evolving workload and system architecture. We advocate the necessity of an artificial intelligence (AI)-guided approach to tackle the complex challenges of CI resource management. We envision that state-of-the-art AI technologies will play a pivotal role in delivering predictive insights and optimizing decision-making in the constantly evolving CI landscape, while humans can focus on critical tasks of oversight and validation.

Keywords: advanced cyberinfrastructure; resource management; artificial intelligence; heterogeneous systems; science applications.

Intellectual Merit
In this project, we propose designing and evaluating AIMCI (Artificial Intelligence for Managing Cyber Infrastructure), an AI-guided framework for improving CI resource management in a computing facility where a diverse array of science applications are scheduled and executed across multiple computing systems.
AIMCI not only accommodates various resource demands from different workloads but also enables us to design efficient resource management methods for improving the productivity of the facility. The project will (1) develop new AI models for predictive analysis of resource usage patterns and user behavior; (2) design intelligent strategies to optimize resource management in a complex and dynamically changing computing environment; and (3) build a discrete event-driven simulator for exploratory simulation of CI resource management with human-in-the-loop interaction. Together, these AI-guided components will enable us to address the pressing demands of diverse workloads, optimize resource allocation for heterogeneous computing, and enhance CI productivity.

Broader Impacts
This project will advance resource management research from a traditional, manual, static, and single-cluster approach to an automated, dynamic, and facility-wide one. It will introduce new technologies to effectively utilize CI for various applications in science and engineering, AI, and data analytics. The resulting AIMCI framework will have a significant impact on the field of high-performance computing (HPC) and benefit a broad spectrum of scientific domains that rely on HPC CI. The software developed in this project will be made public through an open-source license, along with any datasets. UIC is a federally designated Minority-Serving Institution (MSI). An integrated education plan will be carried out to educate and train the future workforce from K-12 to graduate level. The PIs are committed to promoting underrepresented participants in computing through student recruitment and outreach programs.

Date: October 1, 2024 - September 30, 2027