We have collected presentations from IXPUG workshops, annual meetings, and BOF sessions, and made them accessible here to view or download. You may search by event, keyword, science domain, or author name. The database will be updated as new talks become available.
As machine learning models rapidly grow in size and complexity, the cost of checkpointing during ML training has become a bottleneck in both storage and performance (time). For example, the GPT-4 model is reported to have roughly 1.76 trillion parameters, and frequently writing checkpoints containing more than a trillion floating-point values to storage is extremely time- and storage-consuming. This work aims to understand and mitigate this problem. First, we characterize the checkpointing interface in a collection of representative large machine learning/language models with respect to storage consumption and performance overhead. Second, we propose two optimizations: i) a periodic cleaning strategy that removes outdated checkpoints to reduce the storage burden; ii) a data staging optimization that coordinates checkpoints between local and shared file systems to improve performance.
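As a concrete illustration of the periodic-cleaning idea, the sketch below keeps only the most recent checkpoints and deletes the rest after each save. The directory layout and `step_<N>.pt` naming are illustrative assumptions, not the authors' implementation.

```python
import os
import re

def clean_old_checkpoints(ckpt_dir: str, keep_last: int = 2) -> None:
    """Delete all but the `keep_last` most recent checkpoints.

    Assumes checkpoints are named like `step_<N>.pt`; adapt the
    pattern to the framework's naming scheme. `keep_last` must be >= 1.
    """
    pattern = re.compile(r"step_(\d+)\.pt$")
    ckpts = []
    for name in os.listdir(ckpt_dir):
        m = pattern.match(name)
        if m:
            ckpts.append((int(m.group(1)), name))
    # Sort by training step so the newest checkpoints come last,
    # then remove everything except the final `keep_last` entries.
    ckpts.sort()
    for _, name in ckpts[:-keep_last]:
        os.remove(os.path.join(ckpt_dir, name))
```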
Keyword(s): checkpointing,large machine learning (ML) models,LLM,LLMs,staging,periodic cleaning,large ML model training,checkpointing optimization strategies,GPT-2 variants
Author(s): Kento Sato

In this paper we evaluate multiple parallel programming models with respect to both ease of expression and resulting performance. We do this by implementing the mathematical algorithm known as the 'power method' in a variety of ways, using modern C++ techniques.
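The paper's implementations are in modern C++; purely as a language-neutral sketch of the algorithm being expressed, here is the power method in NumPy (parameter defaults are illustrative):

```python
import numpy as np

def power_method(A: np.ndarray, iters: int = 1000, tol: float = 1e-10):
    """Estimate the dominant eigenvalue/eigenvector of A by repeated
    matrix-vector multiplication and normalization."""
    x = np.random.default_rng(0).standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    lam = 0.0
    for _ in range(iters):
        y = A @ x
        lam_new = x @ y              # Rayleigh-quotient eigenvalue estimate
        x = y / np.linalg.norm(y)    # renormalize to avoid overflow
        if abs(lam_new - lam) < tol:
            break
        lam = lam_new
    return lam, x
```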
Keyword(s): stencil code,unstructured data,C++,stencil operations,array parallelism,stencil computation,OpenMP,Kokkos,SYCL
Author(s): Victor Eijkhout

In the microservice paradigm, monolithic applications are decomposed into finer-grained modules invoked independently in a data-flow fashion. The different modules communicate through remote procedure calls (RPCs), which constitute a critical component of the infrastructure. To ensure portable passage of RPC metadata, arguments, and return values between different microservices, RPCs involve serialization/deserialization activities, part of the RPC data center tax. We demonstrate how RPC server logic, including serialization/deserialization, can be offloaded to Data Processing Units (DPUs). This effectively reduces the RPC data center tax on the host, where applications' business logic runs. While we focus on offloading Protocol Buffers deserialization used by the popular gRPC framework, our findings can be applied to other RPC infrastructures. Our experimental results demonstrate that RPC offloading performs similarly to traditional methods while significantly reducing CPU usage.
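To make the serialization part of the "data center tax" concrete: Protocol Buffers encodes integers as base-128 varints, so every integer field requires a byte-wise decode loop like the simplified sketch below (an illustration of the wire format, not gRPC's actual code path):

```python
def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one base-128 varint from `buf` starting at `pos`.

    Returns (value, next_position). Each byte contributes 7 payload
    bits; the high bit signals that more bytes follow.
    """
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7

# Example: 300 is encoded on the wire as b'\xac\x02'.
assert decode_varint(b'\xac\x02') == (300, 2)
```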
Keyword(s): Microservice Architecture,Remote Procedure Calls,Deserialization,DMA-based RPC protocol,RDMA
Author(s): Raphael Frantz

Predicting the structure of proteins has been a grand challenge for over 60 years. Google's DeepMind team leveraged artificial intelligence in 2020 to develop AlphaFold, achieving an accuracy above 90 for two-thirds of the proteins in the CASP competition. AlphaFold has been very successful in biology and medicine. However, the lack of released training code and its expansive computational requirements motivated an open-source implementation, OpenFold. OpenFold is fast and memory-efficient, and it provides the OpenProtein dataset with five million MSAs. MLCommons added OpenFold to its HPC benchmark suite in 2023, where it was evaluated by four institutions on NVIDIA GPU architectures. This work presents our endeavours to port, run, and tune OpenFold on Intel's Ponte Vecchio (PVC) GPUs. To the best of our knowledge, this is the first large-scale study of the distributed OpenFold application on Intel PVC GPUs, presenting the challenges, opportunities, and performance of the application on Intel's Max series architecture.
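A hedged sketch of the typical first step of such a port: moving a PyTorch model onto Intel GPUs via the `xpu` device exposed by Intel Extension for PyTorch. Exact APIs vary across IPEX releases, and this is a generic pattern rather than the OpenFold port itself; the tiny stand-in model below is purely illustrative.

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex  # registers the "xpu" device backend

# Stand-in model; a real port applies this pattern to OpenFold's modules.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 3)).to("xpu")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# ipex.optimize applies Intel-specific layout and fusion optimizations
# (lower precision such as bfloat16 is a further, optional step).
model, optimizer = ipex.optimize(model, optimizer=optimizer)

x = torch.randn(8, 256, device="xpu")
loss = model(x).sum()
loss.backward()
optimizer.step()
```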
Keyword(s): protein folding,GPU,PVC,AlphaFold,OpenFold,protein structure detection,floating point precision
Author(s): Dhani Ruhela

Interconnects have always played a cornerstone role in HPC. Since the inception of the Top500 ranking, interconnect statistics have been dominated by two competing technologies: InfiniBand and Ethernet. However, even as Ethernet has gained popularity thanks to its versatility and cost-effectiveness, InfiniBand has historically provided higher bandwidth and continues to feature lower latency. Industry seeks a further evolution of the Ethernet standards to enable a fast, low-latency interconnect for emerging AI workloads by offering competitive, open-standard solutions. This paper analyzes early results obtained from two systems relying on an HPC Ethernet interconnect, one using 100G and the other 200G Ethernet. Preliminary findings indicate that the Ethernet-based networks exhibit competitive performance, closely aligning with InfiniBand, especially for large message exchanges.
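Comparisons like this typically rest on microbenchmarks; the following mpi4py ping-pong sketch measures effective large-message bandwidth between two ranks (illustrative only, not the paper's benchmark suite):

```python
# Run with: mpirun -n 2 python pingpong.py
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size_bytes = 64 * 1024 * 1024          # 64 MiB: a "large message"
buf = np.zeros(size_bytes, dtype=np.uint8)
reps = 20

comm.Barrier()
t0 = time.perf_counter()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    else:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
t1 = time.perf_counter()
if rank == 0:
    # Each repetition moves the message twice (there and back).
    gb = 2 * reps * size_bytes / 1e9
    print(f"effective bandwidth: {gb / (t1 - t0):.2f} GB/s")
```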
Keyword(s): Ethernet Interconnect,data-centric clusters,remote direct memory access,RDMA,RoCE,HAICGU system,Nanjing cluster
Author(s): Lorenzo Pichetti

In the last decade, DL training has emerged as an HPC-scale workload running on large clusters. The dominant communication pattern in distributed data-parallel DL training is allreduce, which is used to sum the model gradients across processes during the backpropagation phase. Various allreduce algorithms have been developed to optimize communication time in DL training. Given the scale of DL workloads, it is crucial to evaluate the scaling efficiency of these algorithms on a variety of system architectures. We have extended the Structural Simulation Toolkit (SST) to simulate allreduce and barrier algorithms: the Rabenseifner, ring, and dissemination algorithms. We performed a design space exploration (DSE) study with three allreduce algorithms and two barrier algorithms running on six system network topologies for various message sizes. We quantified the performance benefits of allreduce algorithms that preserve locality between communicating processes. In addition, we evaluated the scaling efficiency of centralized and decentralized barrier algorithms.
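A back-of-the-envelope view of why algorithm choice matters: under the textbook alpha-beta cost model (per-message latency alpha, per-byte time beta), ring allreduce pays a latency term linear in the process count, while Rabenseifner's algorithm pays only a logarithmic one. The numbers below are illustrative assumptions, not results from the paper's SST simulations.

```python
import math

def ring_allreduce(p, n, alpha, beta):
    # 2(p-1) communication steps, each moving n/p bytes.
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * n * beta

def rabenseifner_allreduce(p, n, alpha, beta):
    # Reduce-scatter + allgather via recursive halving/doubling
    # (p assumed to be a power of two).
    return 2 * math.log2(p) * alpha + 2 * (p - 1) / p * n * beta

# 1024 ranks, 100 MB of gradients, 2 us latency, 1e-10 s/byte (~10 GB/s).
p, n, alpha, beta = 1024, 100e6, 2e-6, 1e-10
print(f"ring:         {ring_allreduce(p, n, alpha, beta) * 1e3:.1f} ms")
print(f"rabenseifner: {rabenseifner_allreduce(p, n, alpha, beta) * 1e3:.1f} ms")
```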
Keyword(s): HPC,HPC and AI,Modeling,Simulation,Collective Algorithms,HPC Network Topologies,Structural Simulation Toolkit
Author(s): Sai P. Chenna

Modern supercomputers host numerous jobs that compete for shared storage resources, causing I/O interference and performance degradation. Solutions based on software-defined storage (SDS) have emerged to address this issue by coordinating the storage environment through the enforcement of QoS policies. However, these often fail to consider the scale of modern HPC infrastructures. In this work, we explore the advantages and shortcomings of state-of-the-art SDS solutions and highlight the scale of current production clusters and their rising trends. Furthermore, we conduct the first experimental study that offers new insights into the performance and scalability of flat and hierarchical SDS control plane designs. Our results, using the Frontera supercomputer, show that a flat design with a single controller can scale up to 2,500 nodes with an average control cycle latency of 41 ms, while hierarchical designs can handle up to 10,000 nodes with an average latency ranging between 69 and 103 ms.
Keyword(s): SDS Controllers,Modern HPC Infrastructures,storage,software-defined storage solutions,control plane designs,centralized design,hierarchical design,flat design
Author(s): Mariana Martins Miranda

During the past decade, Deep Learning (DL) algorithms, programming systems, and hardware have converged with their High Performance Computing (HPC) counterparts. Nevertheless, the programming methodology of DL and HPC systems is stagnant, relying on highly optimized, yet platform-specific and inflexible, vendor libraries. Such libraries provide close-to-peak performance on the specific platforms, kernels, and shapes to which vendors have dedicated optimization efforts, while they underperform in the remaining use cases, yielding non-portable code with performance glass jaws. This talk will shed light on abstraction efforts, mainly targeting CPUs and widening to GPUs as the approaches get closer to DSLs/compilers. We will introduce the Tensor Processing Primitives (TPP) as a virtual, software-defined ISA abstraction in the form of ukernels. Subsequently, we will cover programming abstractions on top of TPP, which are carried out in two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs), a compact, versatile set of 2D-tensor operators; 2) expressing the logical loops around TPPs in a high-level, declarative fashion, where the exact instantiation (ordering, tiling, parallelization) is determined via simple knobs. We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms. We will close the talk by demonstrating how TPP can be the architectural target of a tensor compiler, which in turn is able to generate code matching hand-tuned performance.
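The two-step structure can be pictured with a toy blocked matrix multiply: a small 2D microkernel stands in for a TPP, and the surrounding loops expose ordering and tiling as the knobs a declarative layer would tune. This is a conceptual NumPy sketch, not the TPP library's API.

```python
import numpy as np

def gemm_ukernel(C_blk, A_blk, B_blk):
    """Stand-in for a TPP: a small 2D-tensor operator updating C in place."""
    C_blk += A_blk @ B_blk

def blocked_matmul(A, B, bm=64, bn=64, bk=64):
    """Logical loops around the ukernel; bm/bn/bk are the tiling knobs
    a declarative layer would tune instead of hand-writing loops."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % bm == 0 and N % bn == 0 and K % bk == 0
    C = np.zeros((M, N))
    for i in range(0, M, bm):        # the loop ordering is another knob
        for j in range(0, N, bn):
            for k in range(0, K, bk):
                gemm_ukernel(C[i:i+bm, j:j+bn],
                             A[i:i+bm, k:k+bk],
                             B[k:k+bk, j:j+bn])
    return C
```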
Keyword(s): Parallel Computing,HPC,TPP,TPP-MLIR,CPU,GPU,Triton-CPU
Author(s): Alexander Heinecke

This presentation introduces an innovative approach that combines Large Language Models (LLMs) and differentiable rendering techniques to automate the construction of digital twins. In our approach, we employ LLMs to guide and optimize the placement of objects in digital twin scenarios. This is achieved by integrating LLMs with differentiable rendering, a method traditionally used in computer graphics to optimize object positions based on image pixel loss. Our technique enhances this process by incorporating a second modality, namely Lidar data, resulting in faster convergence and improved accuracy. This fusion of sensor inputs proves invaluable, especially for applications like autonomous vehicles, where establishing the precise location of multiple actors in a scene is crucial. Our methodology involves several key steps: (1) generating a point cloud of the scene via ray casting, (2) extracting lightweight geometry from the point cloud using PlaneSLAM, (3) creating potential camera paths through the scene, (4) selecting the most suitable camera path by leveraging the LLM in conjunction with image segmentation and classification, and (5) rendering the camera flight path from its origin to the final destination. The technical backbone of this system includes the use of Mitsuba for ray tracing, powered by Intel's Embree ray tracing library. This setup encompasses Lidar simulation, image rendering, and a final differentiable rendering step for precise camera positioning. Future iterations may incorporate Intel OSPRay for enhanced Lidar-like ray casting and image rendering, with a possible integration of Mitsuba for differentiable-rendering camera positioning. The machine learning inference chain utilizes a pre-trained LLM from OpenAI accessed via LangChain, coupled with GroundingDINO for zero-shot image segmentation and classification within PyTorch. This entire workflow is optimized for performance on the latest generation of Intel CPUs. This presentation will delve into the technical details of this approach, demonstrating its efficacy in automating digital twin construction and its potential applications in various industries, particularly in the realm of autonomous vehicle navigation and scene understanding.
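The core of the differentiable-rendering step, which optimizes a position against a pixel loss, can be shown in one dimension. The toy below renders a Gaussian blob and recovers the target position by gradient descent; note that the gradient vanishes when the initial guess does not overlap the target, which is exactly where a second modality such as Lidar aids convergence. All values here are illustrative.

```python
import numpy as np

# Toy 1-D "differentiable rendering": render a Gaussian blob at position p
# and recover the target position by gradient descent on the pixel loss.
xs = np.linspace(0.0, 1.0, 256)
W = 0.05                                  # blob width

def render(p):
    return np.exp(-((xs - p) ** 2) / (2 * W**2))

target = render(0.7)                      # the "observed" image
p, lr = 0.6, 1e-4                         # the guess must overlap the target
for _ in range(200):
    img = render(p)
    dimg_dp = img * (xs - p) / W**2       # analytic derivative of the renderer
    grad = 2.0 * np.sum((img - target) * dimg_dp)
    p -= lr * grad                        # descend on the L2 pixel loss
print(f"recovered position: {p:.3f}")     # converges toward 0.700
```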
Keyword(s): LLMs,differentiable rendering,digital twins,ray tracing,Lidar simulation,autonomous vehicles,in situ visualization
Author(s): Krishna Kumar
Video(s): IXPUG Webinar: Leveraging LLMs and Differentiable Rendering for Automating Digital Twin Construction

The Aurora exascale system is currently being deployed at Argonne National Lab. The system, utilizing Intel’s new Data Center Max Series GPUs (a.k.a. PVC) and Xeon Max Series CPU with HBM, will provide a uniquely powerful platform for leading-edge HPC, AI, and data-intensive computing applications. Scientists at Argonne National Laboratory, in collaboration with the Exascale Computing Project, Intel, and several other institutions, are preparing several dozen applications and workflows to run at scale on the Aurora system. This talk will present an overview of the Aurora system and highlights from the experience of preparing applications for the system. In addition, promising early performance results on the Aurora hardware will be shown.
Keyword(s): Exascale,Aurora,PVC,Ponte Vecchio,GPU Max,CPU Max,HBM,Data Center Max Series GPUs,Xeon Max Series CPU with HBM
Author(s): Scott Parker
Video(s): IXPUG Webinar: Preparing for Exascale on Aurora