In the last decade, deep learning (DL) training has emerged as an HPC-scale workload running on large clusters. The dominant communication pattern in distributed data-parallel DL training is allreduce, which is used to sum the model gradients across processes during the backpropagation phase. Various allreduce algorithms have been developed to optimize communication time in DL training. Given the scale of DL workloads, it is crucial to evaluate the scaling efficiency of these algorithms on a variety of system architectures. We have extended the Structural Simulation Toolkit (SST) to simulate allreduce and barrier algorithms, namely the Rabenseifner, ring, and dissemination algorithms. We performed a design space exploration (DSE) study with three allreduce algorithms and two barrier algorithms running on six system network topologies for various message sizes. We quantified the performance benefits of allreduce algorithms that preserve locality between communicating processes. In addition, we evaluated the scaling efficiency of centralized and decentralized barrier algorithms.
SC24 IXPUG Workshop
HPC, HPC and AI, Modeling, Simulation, Collective Algorithms, HPC Network Topologies, Structural Simulation Toolkit
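To make the role of allreduce in gradient summation concrete, the sketch below serially simulates a ring allreduce over the gradients of p ranks (reduce-scatter followed by allgather). This is an illustration under our own assumptions, not code from the paper or from SST; names such as ring_allreduce and grads are hypothetical.

    # Minimal serial sketch (illustrative only) of the ring allreduce
    # commonly used to sum gradients in data-parallel DL training.
    import numpy as np

    def ring_allreduce(grads):
        """Simulate a ring allreduce over the gradient vectors of p ranks.

        grads: list of p equal-length 1-D numpy arrays (one per rank).
        Returns a list of p arrays, each holding the element-wise global sum.
        """
        p = len(grads)
        # Each rank splits its gradient into p chunks.
        chunks = [np.array_split(g.astype(float), p) for g in grads]

        # Reduce-scatter: in p-1 steps each rank forwards one chunk to its
        # right neighbour and accumulates the chunk received from its left
        # neighbour; afterwards rank r owns the fully summed chunk (r+1) mod p.
        for s in range(p - 1):
            for r in range(p):
                src = (r - 1) % p            # left neighbour in the ring
                c = (r - 1 - s) % p          # chunk accumulated at this step
                chunks[r][c] = chunks[r][c] + chunks[src][c]

        # Allgather: in p-1 more steps the completed chunks circulate around
        # the ring until every rank holds the full summed gradient.
        for s in range(p - 1):
            for r in range(p):
                src = (r - 1) % p
                c = (r - s) % p              # completed chunk received here
                chunks[r][c] = chunks[src][c].copy()

        return [np.concatenate(c) for c in chunks]

    # Example: four ranks each contribute a gradient of ones, so every rank
    # should end up with a vector of fours.
    if __name__ == "__main__":
        result = ring_allreduce([np.ones(8) for _ in range(4)])
        assert all(np.allclose(r, 4.0) for r in result)

Each of the 2(p-1) steps moves only 1/p of the gradient per rank, which is why the ring algorithm is bandwidth-efficient but sensitive to how neighbouring ranks are placed on the network topology.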