Third workshop on Communication, I/O, and Storage at Scale on Next-Generation Platforms – Scalable Infrastructures

ISC 2024 IXPUG Workshop

Workshop Date/Time: May 16, 2024 9:00 AM to 1:00 PM

Location: Hall Y7 - 2nd Floor, in-person at ISC 2024, Hamburg, Germany

Agenda:

All times are shown in CEST / Hamburg Time, UTC+2. Event details are subject to change.

Register at: https://www.isc-hpc.com/registration-2024.html

The workshop is held in conjunction with ISC 2024, Hamburg, Germany. To attend the IXPUG Workshop, you must register for the ISC 2024 Workshop Pass.

09:00–09:10 a.m. Welcome and Introduction to IXPUG
Amit Ruhela (Texas Advanced Computing Center (TACC))

Session 1 | Chair: Amit Ruhela (Texas Advanced Computing Center (TACC))

09:10-09:40 a.m. Optimizing Communications and I/O on Aurora for Application Performance (Slides)
Authors: Kalyan Kumaran, Kevin Harms (Argonne Leadership Computing Facility, Argonne National Laboratory)
Abstract: The Aurora supercomputer at the Argonne Leadership Computing Facility consists of 10,624 nodes, each with two Intel Data Center CPU Max Series CPUs and four Intel Data Center GPU Max Series GPUs. Within a node, Xe Link and PCIe support coherent memory access to all processors. HPE's Slingshot 11 provides an adaptive, high-bandwidth interconnect between nodes and to the DAOS-based storage subsystem. This talk will discuss experiences in optimizing applications using Xe Link and eight NICs per compute node, as well as efficiently utilizing DAOS storage. We will examine a few example cases taken from work done during the Aurora Early Science Program, the Exascale Computing Project, and Non-Recurring Engineering work.

09:40-10:05 a.m. Performance Evaluation and Optimization of Seismic Imaging Applications on HBM-Enabled CPUs (Slides)
Authors: Huda Ibeid (Intel Corporation), Pavel Plotnitskii, Kadir Akbudak, Hatem Ltaief, and David Keyes (King Abdullah University of Science & Technology (KAUST))
Abstract: High-bandwidth memory (HBM) is designed to provide both high bandwidth and low power consumption. In this talk, we will evaluate the improvements brought by the adoption of HBM and explore strategies to maximize the advantages of HBM-enabled CPUs for modeling the 3D acoustic wave equation in the context of seismic modeling. Given the inherent memory-bound nature of the stencil operator in the wave equation, there is significant data movement across the memory subsystem, which could negatively impact the throughput. We exploit HBM's high bandwidth through spatial and temporal data reuse, thus harnessing the performance potential provided by HBM.

10:05-10:35 a.m. Using SYCL for the Next Generation Heterogeneous Systems (Slides)
Presenter: Mehdi Goli (Codeplay Software)
Abstract: To empower software developers to leverage the full potential of modern, complex heterogeneous HPC systems, a suitable parallel heterogeneous programming model is essential. SYCL is one of the solutions to this challenge, reducing boilerplate by using the expressive power of modern C++, yet allowing users close-to-metal access to their platform. As an open standard maintained by the Khronos Group, SYCL provides portability across many different architectures and vendors from a single standard C++ code base. This versatility makes SYCL ideally suited to be used across a wide range of different applications and performance libraries. In this talk, we are going to explain how to leverage SYCL to develop performance-portable libraries targeting a variety of hardware targets. Using the oneMKL interface library as an example, we are going to demonstrate how these libraries can be used to make HPC applications using discrete Fourier transforms (DFTs) such as GROMACS, or AI applications such as llama.cpp, portable across different hardware targets. SYCL can be used not only to program CPUs, FPGAs, and GPUs, but also to integrate new hardware building blocks found in next-generation HPC platforms into a user-friendly programming model. Using the example of Processing in Memory, we are going to demonstrate how SYCL can be extended to integrate such specialized hardware.

10:35-11:00 a.m. High Performance Fabric Support in DAOS (Slides)
Authors: Michael Hennecke, Alexander Oganezov, Jerome Soumagne, John Carrier, and Joseph Moore (Intel Corporation)
Abstract: The Distributed Asynchronous Object Storage (DAOS) is an open source scale-out storage system that is designed from the ground up to support Storage Class Memory (SCM) and NVMe storage in user space. DAOS can run over any TCP network, but can also take advantage of high performance fabrics like InfiniBand, Slingshot, or Omni-Path. This paper describes the networking architecture of DAOS, and discusses scaling and performance aspects of running DAOS over those high performance fabrics.

(11:00-11:30 a.m. Coffee Break)

Session 2 | Chair: David Martin (Argonne Leadership Computing Facility, Argonne National Laboratory)

11:30 a.m.-12:15 p.m. Intel Keynote: Pitfalls and Key Learnings for Performance Modeling (Slides)
Presenter: Philippe (Phil) Thierry (Intel Corporation)
Authors: Philippe (Phil) Thierry, Cedric Andreolli, Sai Chenna, Fabrice Dupros, Sunny Gogar, Sylvain Jubertie, Nalini Kumar, Amine Mrabet, and Mariam Umar (Intel Corporation)
Abstract: In this presentation, we'll look at the various aspects of application performance prediction at different scales and depending on the desired objective. All too often, performance prediction is seen as a simplistic step, whether in the design of processors and large systems, or in calls for tenders. In practice, it remains extremely difficult to predict the behavior of a complete application on a very large scale. In recent years, this complexity has increased still further, with highly heterogeneous machines in terms of computing and communications, and the arrival of new AI applications. As ever, precision and performance remain two orthogonal concepts, and many approximations are needed to make predictions simply feasible.

12:15-12:35 p.m. HPC Experiences with Intel GPU Max for Deep Learning at Scale (Slides)
Authors: Nicholas Charron (Zuse Institute Berlin (ZIB)), Steffen Christgau (Zuse Institute Berlin (ZIB))
Abstract: With the increasing GPU-vendor diversity in HPC centers and with the growing importance of ML/AI in applications in HPC, the question arises how frameworks from the ML/AI domain support GPUs from the different vendors. In addition, the growing size of the employed models requires distributed tasks. Within this talk, we put the focus on using the Intel GPU Max and the Nvidia A100 and compare their training and inference performance for different models as well as their usage with AI/ML frameworks. Besides the performance comparison we shed light on the peculiarities of the Intel GPU Max usage in practice.

12:35-12:55 p.m. Investigating the Performance of LLVM-based Intel Fortran Compiler (ifx) (Slides)
Presenter: Dhani Ruhela (Westwood High School)
Abstract: LLVM is a free, open-source compiler framework for programmatically generating machine-native code. Developers nowadays are increasingly embracing LLVM to develop new languages or modify existing ones. Building on LLVM shortens the development of compilers that are portable across various platforms, easy to maintain, and extensively optimized for the target systems. Intel oneAPI moved to an LLVM infrastructure with C (icx) and C++ (icpx) compilers in the 2021.3 release and a Fortran compiler (ifx) in the 2023.0 release. According to Intel, the LLVM-based compilers are packed with advanced language features and deliver the absolute best performance for various applications on Intel architectures. The LLVM-based Intel compilers have been extensively tuned for the 4th Gen Intel® Xeon Scalable processors (code-named Sapphire Rapids), Intel® Xeon CPU Max Series (code-named Sapphire Rapids HBM) and the Intel® Data Center GPU Max Series (code-named Ponte Vecchio). In this work, I aim to explore the features and performance of LLVM-based compilers compared with legacy compilers on three machine architectures, i.e., Sapphire Rapids with DDR5, Sapphire Rapids with HBM, and Intel Cascade Lake. To the best of my knowledge, this is the first extensive study that uncovers the potential of LLVM-based Intel compilers with eight representative scientific codes and demonstrates up to 17% performance improvements with the Intel Fortran compiler (ifx) on Intel architectures.

12:55-1:00 p.m. Workshop Closing Remarks
David Martin (Argonne Leadership Computing Facility, Argonne National Laboratory)

Event Description:

Next-generation HPC platforms have to deal with increasing heterogeneity in their subsystems. These subsystems include internal high-speed fabrics for inter-node communication; storage systems integrated with programmable data processing units (DPUs) and infrastructure processing units (IPUs) to support software-defined networks; traditional storage infrastructures with global parallel POSIX-based filesystems complemented with scalable object stores; and heterogeneous compute nodes configured with a diverse spectrum of CPUs and accelerators (e.g., GPU, FPGA, AI processors) having complex intra-node communication.

The workshop intends to attract system architects, code developers, research scientists, system providers, and industry luminaries who are interested in learning about the interplay of next-generation hardware and software solutions for communication, I/O, and storage subsystems tied together to support HPC and data analytics at the systems level, and how to use them effectively. The workshop will provide the opportunity to assess technology roadmaps to support AI and HPC at scale, sharing users’ experiences with early-product releases and providing feedback to technology experts. The overall goal is to make the ISC community aware of the emerging complexity and heterogeneity of upcoming communication, I/O, and storage subsystems as part of next-generation system architectures and inspect how these components contribute to scalability in both AI and HPC workloads.

Workshop Format:

The workshop will have a keynote, full (30 min) talks and lightning talks (10-15 min). While in-person presentations are preferred, pre-recorded videos will be allowed as presentations in exceptional cases.

Call for Submissions:

The submission process will close on March 15, 2024 AoE (updated!). All submitters should provide an Extended Abstract of 6-12 pages in LNCS format via the IXPUG EasyChair https://easychair.org/cfp/ISC-2024-IXPUG-Workshop. Notifications will be sent to submitters by March 22, 2024 AoE. The page limit is 12 pages for each paper, with 2 possible extra pages after the review to address the reviewers' comments. The page limit includes bibliography and appendices.

Topics of interest include (but are not limited to):

Holistic view on performance of next-generation platforms (with emphasis on communication, I/O, and storage at scale)
Application-driven performance analysis with various HPC fabrics
Software-defined networks in HPC environments
Experiences with emerging scalable storage concepts, e.g., object stores using next-generation HPC fabrics
Performance tuning on heterogeneous platforms from multiple vendors including impact of I/O and storage
Performance and portability using network programmable devices (DPU, IPU)
Best practice solutions for application programming with complex communication, I/O, and storage at scale

Keywords:
high-performance fabrics, data and infrastructure processing units, scalable object stores as HPC storage subsystems, heterogeneous data processing, holistic system view on scalable HPC infrastructures

Review Process:
All submissions within the scope of the workshop will be peer-reviewed and will need to demonstrate the high quality of the results, originality and new insights, technical strength, and correctness. We apply a standard single-blind review process, i.e., the authors will be known to reviewers. The assignment of reviewers from the Program Committee will avoid conflicts of interest.

Important Dates:

Deadline for submissions: March 15, 2024 (updated!)
Acceptance notification: March 22, 2024
Camera-ready presentation: May 10, 2024
Workshop date: May 16, 2024

Organizers:

Hatem Ltaief, King Abdullah University of Science & Technology
David Martin, Argonne Leadership Computing Facility
Amit Ruhela, Texas Advanced Computing Center (TACC)

Program Committee:

Aksel Alpay, Heidelberg University
Glenn Brook, Cornelis Networks
Steffen Christgau, Zuse Institute Berlin
Toshihiro Hanawa, The University of Tokyo
Clayton Hughes, Sandia National Laboratories
Nalini Kumar, Intel Corporation
James Lin, Shanghai Jiao Tong University
Hatem Ltaief, King Abdullah University of Science & Technology
David Martin, Argonne National Laboratory
Christopher Mauney, Los Alamos National Laboratory
Amit Ruhela, Texas Advanced Computing Center (TACC)

Contact:

Please contact the workshop organizers with any general questions.