MFNetSim: A Multi-Fidelity Network Simulation Framework for Multi-Traffic Modeling of Dragonfly Systems

An illustration of the workload replay module.

Authors: Wang, X., Brown, K. A., Ross, R. B., Carothers, C. D., Lan, Z.

Publication: Proceedings of the 39th ACM SIGSIM Conference on Principles of Advanced Discrete Simulation

URL: https://dl.acm.org/doi/10.1145/3729424

In high-performance computing (HPC), modern supercomputers typically provide exclusive computing resources to user applications. The interconnect network, however, is shared among co-running workloads and carries both inter-node communication and I/O access across nodes, which leads to inevitable network interference. In this study, we develop MFNetSim, a multi-fidelity modeling framework that simultaneously simulates multiple types of traffic over the interconnect network, including inter-process communication and I/O traffic. By combining different levels of abstraction, MFNetSim can efficiently co-model the communication and I/O traffic that occur on HPC systems equipped with flash-based storage. We conduct simulation studies of hybrid workloads, composed of traditional HPC applications and emerging ML applications, on a 1,056-node Dragonfly system under various configurations. Our analysis yields several observations about how network interference affects communication and I/O traffic.
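For readers unfamiliar with Dragonfly sizing, the 1,056-node figure happens to match a "balanced" Dragonfly, in which routers per group a, terminals per router p, and global channels per router h satisfy a = 2p = 2h, and the group count is the maximum a·h + 1. The abstract does not state the topology parameters used in the study, so the values below (p = 4, a = 8, h = 4) are an assumption chosen only to show the arithmetic; the class and function names are hypothetical and are not part of MFNetSim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DragonflyConfig:
    """Hypothetical sizing helper for a canonical Dragonfly topology (assumed parameters)."""
    p: int  # terminals (compute nodes) attached to each router
    a: int  # routers per group
    h: int  # global channels per router

    @property
    def groups(self) -> int:
        # Maximum group count: each group has a*h global links, one to every other group
        return self.a * self.h + 1

    @property
    def routers(self) -> int:
        return self.groups * self.a

    @property
    def nodes(self) -> int:
        return self.routers * self.p

    def is_balanced(self) -> bool:
        # Balance condition for Dragonfly channel load: a = 2p = 2h
        return self.a == 2 * self.p == 2 * self.h


def balanced(h: int) -> DragonflyConfig:
    """Build a balanced Dragonfly from h (assumption: p = h, a = 2h)."""
    return DragonflyConfig(p=h, a=2 * h, h=h)


if __name__ == "__main__":
    cfg = balanced(4)
    print(cfg.groups, cfg.routers, cfg.nodes, cfg.is_balanced())  # 33 264 1056 True
```

With h = 4 this gives 33 groups of 8 routers, each router serving 4 nodes, for 33 × 8 × 4 = 1,056 nodes. Other parameter choices can reach similar sizes, which is why this should be read as one plausible configuration rather than the one used in the paper.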

Date: June 23, 2025
