We have collected presentations from IXPUG workshops, annual meetings, and BOF sessions, and made them accessible here to view or download. You may search by event, keyword, science domain or author’s name. The database will be updated as new talks are made available.
In April 2025, the Hokkaido University Information Initiative Center will introduce a new supercomputer system as part of the Interdisciplinary Large-Scale Computing System. The new supercomputer system features the computing subsystem Grand Chariot 2, delivering a theoretical peak performance of 9 PFLOPS, together with a 19.95 PB all-flash storage system. Grand Chariot 2 consists of 504 compute nodes powered by 5th-generation Intel Xeon CPUs, 24 of which each have four NVIDIA H100 GPUs. This talk will provide an overview of the new supercomputer system, along with its design concept based on a review of the current supercomputer system.
Keyword(s): IXPUG Workshop at HPC Asia 2025
Author(s): Takeshi Fukaya

In this talk, I will survey the set of scientific applications that are just beginning their exascale run campaigns on the Aurora supercomputer. I will discuss the work of many project teams in targeting Aurora’s Intel GPU accelerators, their portability approaches, and performance considerations. These applications sample a broad spectrum of scientific domains, numerical methods, and AI training and inference components.
Keyword(s): IXPUG Workshop at HPC Asia 2025
Author(s): Tim Williams Video(s): Read more | |NVIDIA GPUs have dominated GPU-accelerated supercomputers for over a decade, but AMD and Intel GPUs have recently boosted cutting-edge supercomputers. Increased competition among GPU vendors has driven performance improvement; however, platform and programming environments diverge simultaneously. In this study, we have developed and optimized $N$-body codes written in CUDA C++, HIP C++, and SYCL for NVIDIA, AMD, and Intel GPUs, respectively, to find a promising environment for developing scientific codes. The fastest code on NVIDIA H100 SXM, written in SYCL and compiled by Intel oneAPI, processed 2.16e+12 interactions per second. On AMD Instinct MI210, SYCL code compiled by AdaptiveCpp and HIP C++ code achieved almost identical performance, with SYCL code achieving a slightly higher performance of 9.06e+11 interactions per second. Only the SYCL code compiled by Intel oneAPI was tested on Intel Data Center GPU Max 1100, and the resultant processing rate was 8.87e+11 interactions per second.
Keyword(s): IXPUG Workshop at HPC Asia 2025
Author(s): Yohei Miki Video(s): Read more | |Site Update from TACC: Benchmarking for HPC Systems presented by Amit Ruhela (Texas Advanced Computing Center (TACC), The University of Texas at Austin, US)
Keyword(s): IXPUG Workshop at HPC Asia 2025
Author(s): Amit Ruhela

The National Center for High-performance Computing (NCHC) serves as Taiwan's primary high-performance computing (HPC) facility, providing critical services for computational science, AI, visualization, data storage, networking, and HPC training. Operating the 100 Gbps Taiwan Advanced Research and Education Network (TWAREN), NCHC supports academia and industry through advanced research platforms, HPC technologies, and professional development initiatives. The organization aims to enhance Taiwan's computing ecosystem with world-class AI-HPC infrastructure and achieve ambitious milestones, including a projected 480 petaflops (PF) of computing power by 2029 under strategic national programs. Key pillars include AI and big data platforms, secure cloud services, and advanced network infrastructure to foster innovation in fields such as life sciences, smart cities, and quantum computing. Initiatives like the TAIDE Project focus on trustworthy AI development, while NCHC's advanced facilities provide a robust environment for sensitive data storage, cybersecurity, and 3D visualization. With its emphasis on sustainability and talent cultivation, NCHC plays a central role in Taiwan's technological advancement, positioning itself as a leader in HPC and AI innovation.
Keyword(s): IXPUG Workshop at HPC Asia 2025
Author(s): Steven Shiao

The Joint Center for Advanced High Performance Computing (JCAHPC), jointly operated by the Center for Computational Sciences, University of Tsukuba, and the Information Technology Center, the University of Tokyo, began operating the new supercomputer system “Miyabi” in January 2025. The 80.1 PFLOPS “Miyabi” system consists of 1,120 compute nodes equipped with NVIDIA GH200 Grace Hopper Superchips connected via dedicated high-speed NVLink-C2C, and 190 compute nodes employing dual Intel Xeon Max 9480 processors. Miyabi also includes an 11.3 PB parallel file system using NVMe SSDs across all drives. JCAHPC promotes AI for Science on Miyabi, based on the integration of “Simulation, Data, and Learning.”
Keyword(s): IXPUG Workshop at HPC Asia 2025
Author(s): Toshihiro Hanawa

As machine learning models rapidly increase in size and complexity, the cost of checkpointing during ML training has become a bottleneck in both storage and time. For example, the latest GPT-4 model has massive parameters at the scale of 1.76 trillion, and frequently writing a model with more than 1 trillion floating-point values to checkpoints in storage is highly time- and storage-consuming. This work aims to understand and mitigate this problem. First, we characterize the checkpointing interfaces of a collection of representative large machine learning/language models with respect to storage consumption and performance overhead. Second, we propose two optimizations: i) a periodic cleaning strategy that removes outdated checkpoints to reduce the storage burden; and ii) a data staging optimization that coordinates checkpoints between local and shared file systems to improve performance.
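The periodic-cleaning idea can be sketched in a few lines. This is an illustrative interpretation, not the paper’s code: the `ckpt-<step>.pt` naming scheme and the retention count are assumptions made for the example.

```python
import os
import re

def clean_checkpoints(ckpt_dir, keep_last=3):
    """Periodic-cleaning sketch: delete all but the newest checkpoints.

    Assumes checkpoint files are named 'ckpt-<step>.pt'; both the naming
    scheme and the keep_last policy are illustrative, not from the paper.
    Returns the retained file names, oldest first.
    """
    pattern = re.compile(r'ckpt-(\d+)\.pt$')
    found = []
    for name in os.listdir(ckpt_dir):
        m = pattern.match(name)
        if m:
            found.append((int(m.group(1)), name))
    found.sort()  # order by training step
    # Remove every checkpoint except the most recent keep_last
    for _, name in found[:-keep_last]:
        os.remove(os.path.join(ckpt_dir, name))
    return [name for _, name in found[-keep_last:]]
```

In training, such a routine would run after each checkpoint write, bounding storage use at `keep_last` model copies regardless of run length.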
Keyword(s): checkpointing,large machine learning (ML) models,LLM,LLMs,staging,periodic cleaning,large ML model training,checkpointing optimization strategies,GPT-2 variants
Author(s): Kento Sato

In this paper we evaluate multiple parallel programming models with respect to both ease of expression and resulting performance. We do this by implementing the mathematical algorithm known as the “power method” in a variety of ways, using modern C++ techniques.
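The power method itself is simple to state: repeatedly multiply a vector by the matrix and renormalize, until the vector converges to the dominant eigenvector. A minimal sketch of the algorithm (in NumPy for brevity; the paper’s implementations are in modern C++):

```python
import numpy as np

def power_method(A, iters=200, seed=0):
    """Estimate the dominant eigenpair of a square matrix A.

    Repeated multiply-and-normalize drives a random start vector toward
    the eigenvector of largest-magnitude eigenvalue; the Rayleigh
    quotient x^T A x then recovers the eigenvalue itself.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = A @ x
        x = y / np.linalg.norm(y)   # renormalize each step
    return x @ A @ x, x             # (eigenvalue estimate, eigenvector)
```

The algorithm’s core is a single matrix-vector product per iteration, which is what makes it a convenient, compact target for comparing parallel programming models.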
Keyword(s): stencil code,unstructured data,C++,stencil operations,array parallelism,stencil computation,OpenMP,Kokkos,SYCL
Author(s): Victor Eijkhout

In the microservice paradigm, monolithic applications are decomposed into finer-grained modules invoked independently in a data-flow fashion. The different modules communicate through remote procedure calls (RPCs), which constitute a critical component of the infrastructure. To ensure portable passage of RPC metadata, arguments, and return values between different microservices, RPCs involve serialization/deserialization activities, part of the RPC data center tax. We demonstrate how RPC server logic, including serialization/deserialization, can be offloaded to Data Processing Units (DPUs). This effectively reduces the RPC data center tax on the host, where applications’ business logic runs. While we focus on offloading the Protocol Buffers deserialization used by the popular gRPC framework, our findings can be applied to other RPC infrastructures. Our experimental results demonstrate that RPC offloading performs similarly to traditional methods while significantly reducing CPU usage.
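The “tax” in question is host-CPU time spent converting structured arguments to and from wire format rather than running business logic. A self-contained sketch of measuring that per-call cost (using the standard-library `json` as a stand-in for Protocol Buffers, so the example needs no third-party dependency; the principle is the same):

```python
import json
import time

def rpc_roundtrip_cost(payload, repeats=1000):
    """Measure average host-CPU time spent purely on serializing and
    deserializing an RPC payload -- the per-call 'data center tax'
    that the talk proposes to offload to a DPU.

    json stands in for Protocol Buffers here; in a real gRPC stack the
    encode/decode steps below are protobuf message operations.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        wire = json.dumps(payload).encode('utf-8')    # serialize on the caller
        decoded = json.loads(wire.decode('utf-8'))    # deserialize on the callee
    elapsed = time.perf_counter() - start
    assert decoded == payload                         # lossless round trip
    return elapsed / repeats
```

Offloading moves exactly this encode/decode work off the host cores, which is why the reported win is reduced CPU usage at comparable end-to-end performance.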
Keyword(s): Microservice Architecture,Remote Procedure Calls,Deserialization,DMA-based RPC protocol,RDMA
Author(s): Raphael Frantz

Predicting the structure of proteins has been a grand challenge for over 60 years. Google’s DeepMind team leveraged artificial intelligence in 2020 to develop AlphaFold, achieving an accuracy score above 90 for two-thirds of the proteins in the CASP competition. AlphaFold has been very successful in biology and medicine. However, its lack of training code and expansive computational requirements motivated an open-source implementation, OpenFold. OpenFold is fast, memory-efficient, and provides the OpenProteinSet dataset with five million MSAs. MLCommons added OpenFold to its HPC benchmark suite in 2023, where it was evaluated by four institutions on NVIDIA GPU architectures. This work presents our endeavours to port, run and tune OpenFold on Intel’s Ponte Vecchio (PVC) GPUs. To the best of our knowledge, this is the first large-scale study of a distributed OpenFold implementation on Intel PVC GPUs, presenting the challenges, opportunities and performance of the application on Intel’s Max series architecture.
Keyword(s): protein folding,GPU,PVC,AlphaFold,OpenFold,protein structure detection,floating point precision
Author(s): Dhani Ruhela