CSI HPC Seminar Series 2022

Description

Biweekly seminar presentations hosted by the High Performance Computing group at the Computational Science Initiative of Brookhaven National Laboratory, covering a wide range of topics in programming models, compilers, and application optimization on high performance computing systems and architectures.

Regular time: 2pm US Eastern, on Wednesdays 

    • 12:00 13:00
      Performance portability with the SYCL programming model 1h

      Advancements in high performance computing (HPC) have provided unprecedented potential for scientific research and discovery. To help address the “many platforms problem” that stems from major semiconductor vendors all staking their claim in the market, numerous programming models aiming for performance portability are under development. This talk will discuss such programming models and present recent studies on performance portability, with a focus on SYCL, a single-source heterogeneous programming paradigm from the Khronos Group.
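
      To make the “single-source” idea concrete, here is a minimal sketch of a SYCL vector addition (illustrative only, not code from the talk), assuming a SYCL 2020 compiler such as DPC++ or hipSYCL; the same C++ source is compiled for whichever CPU or GPU backend the runtime selects:

        #include <sycl/sycl.hpp>
        #include <vector>

        int main() {
          constexpr size_t n = 1 << 20;
          std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

          sycl::queue q;  // selects a default device: CPU, GPU, or other accelerator
          {
            // Buffers manage host<->device data movement.
            sycl::buffer<float> A(a.data(), sycl::range<1>(n));
            sycl::buffer<float> B(b.data(), sycl::range<1>(n));
            sycl::buffer<float> C(c.data(), sycl::range<1>(n));

            q.submit([&](sycl::handler& h) {
              sycl::accessor ra(A, h, sycl::read_only);
              sycl::accessor rb(B, h, sycl::read_only);
              sycl::accessor wc(C, h, sycl::write_only);
              // The kernel body is ordinary C++, compiled for the chosen device.
              h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                wc[i] = ra[i] + rb[i];
              });
            });
          }  // buffer destruction copies the result back into c
          return (c[0] == 3.0f) ? 0 : 1;
        }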

      Speaker: Vincent Pascuzzi (Brookhaven National Laboratory)
    • 14:00 15:00
      Preparing PIConGPU for next-generation computing systems 1h

      This talk will highlight the journey thus far in preparing the high performance computing software stack for large, complex scientific applications such as PIConGPU, an OLCF CAAR application targeting Frontier. The talk will cover recent results, the tools used, and the programming models used to prepare PIConGPU on pre-exascale systems. OLCF’s Center for Accelerated Application Readiness (CAAR) was created to ready applications for the facility’s next-generation supercomputers; PIConGPU is one of the 8 CAAR projects chosen for Frontier.

      Speaker: Sunita Chandrasekaran (University of Delaware and Brookhaven National Laboratory)
    • 14:00 15:00
      High-Performance Tensor Algebra for Chemistry and Materials 1h

      Abstract: Tensor algebra is the foundation of quantum simulation in all contexts, including predictive chemistry and materials simulation. Unlike linear algebra (of vectors and matrices), tensor algebra is significantly richer, less formally understood, has a less mature software ecosystem, and, most importantly, puts more emphasis on exploiting data sparsity. In this talk I will review the key computational challenges of tensor algebra, especially on modern large-scale heterogeneous HPC platforms, and highlight some of our recent work on the open-source TiledArray tensor framework for data-sparse tensor algebra on distributed-memory and heterogeneous platforms and its applications in the context of computational chemistry.
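
      For readers less familiar with tensor algebra, a typical kernel is a contraction such as C(i,j) = sum over k,l of A(i,k,l)*B(k,l,j). The plain dense C++ sketch below is illustrative only (it is not TiledArray code); tile-based frameworks such as TiledArray decompose each index range into tiles, distribute tiles across nodes and devices, and can skip tiles whose contribution is negligible, which is where the data-sparsity savings come from:

        #include <cstddef>
        #include <vector>

        // Dense reference contraction: C(i,j) = sum_{k,l} A(i,k,l) * B(k,l,j).
        void contract(const std::vector<double>& A,   // ni x nk x nl
                      const std::vector<double>& B,   // nk x nl x nj
                      std::vector<double>& C,         // ni x nj
                      std::size_t ni, std::size_t nj, std::size_t nk, std::size_t nl) {
          for (std::size_t i = 0; i < ni; ++i)
            for (std::size_t j = 0; j < nj; ++j) {
              double acc = 0.0;
              for (std::size_t k = 0; k < nk; ++k)
                for (std::size_t l = 0; l < nl; ++l)
                  acc += A[(i * nk + k) * nl + l] * B[(k * nl + l) * nj + j];
              C[i * nj + j] = acc;
            }
        }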

      Speaker: Eduard Valeyev (Virginia Tech)
    • 14:00 15:00
      Covariant programme: a programming approach to target both SIMD and SIMT execution 1h

      Abstract: Discussion of the cross-platform programming approach to modern HPC systems taken by “Grid”, a high performance QCD C++ library. It targets both SIMD intrinsics vectorisation on modern CPUs and SIMT offload models with HIP, SYCL and CUDA back ends. It allows single-source high performance kernels to be developed that support all of these targets. I discuss the software approaches and (to the extent allowed) the performance of the code on a number of current or planned platforms, including Perlmutter, Frontier and Aurora.
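
      As a generic illustration of the single-source SIMD/SIMT pattern (a sketch of the general technique, not Grid's actual API), the same kernel body can be dispatched either to a CUDA-style SIMT launch or to a vectorised host loop, selected at compile time; the PORTABLE_LAMBDA and portable_for names below are hypothetical placeholders:

        #include <cstddef>

        // Backend selected at compile time. Building with nvcc, -DUSE_CUDA and
        // --extended-lambda takes the SIMT path; otherwise the same lambda runs
        // in an OpenMP SIMD loop on the host CPU.
        #ifdef USE_CUDA
        #define PORTABLE_LAMBDA __host__ __device__
        template <class Body>
        __global__ void simt_kernel(std::size_t n, Body body) {
          std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
          if (i < n) body(i);
        }
        template <class Body>
        void portable_for(std::size_t n, Body body) {
          simt_kernel<<<(n + 255) / 256, 256>>>(n, body);
          cudaDeviceSynchronize();
        }
        #else
        #define PORTABLE_LAMBDA
        template <class Body>
        void portable_for(std::size_t n, Body body) {
          #pragma omp parallel for simd
          for (std::size_t i = 0; i < n; ++i) body(i);
        }
        #endif

        // One kernel body, written once, runs on either backend. In the CUDA
        // build, x and y must be device-accessible (e.g. unified memory).
        void axpy(double a, const double* x, double* y, std::size_t n) {
          portable_for(n, [=] PORTABLE_LAMBDA(std::size_t i) { y[i] = a * x[i] + y[i]; });
        }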

      Speaker: Prof. Peter Boyle (BNL)
    • 14:00 15:00
      Dynamic Loop Scheduling across Multi-xPUs Heterogeneous Processors in Nodes of DoE's Exascale Supercomputers 1h

      Abstract: The performance of science and engineering simulations on supercomputers depends on communication across nodes and on computation within a node. With Moore's Law slowing and data movement across the interconnect network becoming increasingly costly, next-generation supercomputers - particularly those in the DoE - will have roughly the same number of nodes, but the nodes themselves will become more powerful and extremely heterogeneous, with a set of CPUs (multi-cores) and a set of GPUs (multiple devices) on them. Managing the computational resources on these nodes is challenging, particularly because of application load imbalance, load imbalance due to system noise, and the complexity of the node's hardware. In this talk, I will discuss support in the DoE Exascale Computing Project (ECP) software stack to parallelize MPI+OpenMP offload ECP applications across heterogeneous processors/accelerators through user-defined, custom-tuned, locality-sensitive loop scheduling with LLVM’s OpenMP, along with interoperability of the MPI and OpenMP runtime systems.
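
      As a hedged, single-rank sketch of the kind of CPU+GPU loop partitioning the talk addresses (a generic illustration only; MPI and the locality-aware LLVM OpenMP scheduling extensions are omitted), the hypothetical gpu_fraction knob below stands in for a value such a scheduler would tune:

        #include <cstddef>

        // Split a daxpy-like loop between the default accelerator and the host
        // CPU. 'gpu_fraction' is assumed to lie in [0,1] and would be chosen by
        // a locality-sensitive scheduler rather than hard-coded.
        void hybrid_daxpy(double a, const double* x, double* y, std::size_t n,
                          double gpu_fraction) {
          std::size_t split = static_cast<std::size_t>(gpu_fraction * n);

          #pragma omp parallel
          #pragma omp single
          {
            // Device portion: a deferred target task that overlaps with host work.
            #pragma omp target teams distribute parallel for nowait \
                    map(to: x[0:split]) map(tofrom: y[0:split])
            for (std::size_t i = 0; i < split; ++i)
              y[i] += a * x[i];

            // Host portion: remaining iterations spread over CPU threads as tasks.
            #pragma omp taskloop grainsize(4096)
            for (std::size_t i = split; i < n; ++i)
              y[i] += a * x[i];

            #pragma omp taskwait  // wait for the deferred device task
          }
        }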

      Speaker: Vivek Kale (BNL)
    • 14:00 15:00
      Designing Efficient Graph Algorithms Through Proxy-Driven Codesign and Analysis 1h

      Abstract: Developing scalable graph algorithms is challenging due to the inherent irregularities in graph structure and the memory-access-intensive computational patterns. Proxy-application-driven software-hardware codesign plays a vital role in driving innovation across the development of applications, software infrastructure, and hardware architecture. Proxy applications are self-contained, simplified codes that are intended to model the performance-critical computations within applications.

      In this talk, I will discuss facilitating software-hardware codesign through proxy applications, with the goal of improving the performance of graph analytics workflows on heterogeneous systems. However, even representative proxy applications may be insufficient to diagnose performance bottlenecks of common graph computational patterns at scale. Therefore, we also discuss the role of derivative benchmarks in enhancing graph applications on HPC systems. We will drive the discussion using three case studies: graph matching, clustering/community detection, and triangle counting, which have applications in proteomics, computational biology, cybersecurity, numerical analysis, and other data science scenarios.
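
      As a minimal, single-node illustration of one of the case studies (triangle counting; this toy code is not one of the proxy applications from the talk), note how the computation is dominated by irregular, memory-access-intensive neighbor-list intersections:

        #include <algorithm>
        #include <cstdint>
        #include <vector>

        // Count triangles in an undirected graph given as sorted adjacency lists.
        // Each triangle {u < v < w} is counted once: for every edge (u,v) with
        // u < v, count common neighbors w > v of u and v.
        std::uint64_t count_triangles(const std::vector<std::vector<int>>& adj) {
          std::uint64_t count = 0;
          const int n = static_cast<int>(adj.size());
          for (int u = 0; u < n; ++u) {
            for (int v : adj[u]) {
              if (v <= u) continue;  // consider each undirected edge once
              auto it_u = std::lower_bound(adj[u].begin(), adj[u].end(), v + 1);
              auto it_v = std::lower_bound(adj[v].begin(), adj[v].end(), v + 1);
              while (it_u != adj[u].end() && it_v != adj[v].end()) {
                if (*it_u < *it_v)      ++it_u;
                else if (*it_v < *it_u) ++it_v;
                else { ++count; ++it_u; ++it_v; }  // common neighbor found
              }
            }
          }
          return count;
        }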

      Speaker: Sayan Ghosh (Pacific Northwest National Laboratory)
    • 14:00 15:00
      Experiences with Ookami – a Fujitsu A64FX testbed 1h

      Abstract: Stony Brook’s computing technology testbed, Ookami, provides researchers worldwide with access to Fujitsu A64FX processors. This processor was developed by Riken and Fujitsu for the Japanese path to exascale computing and is currently deployed in the fastest computer in the world, Fugaku. Ookami is the first open deployment of this technology outside of Japan. This Cray Apollo 80 system has entered its second year of operations. In this presentation we will share the experiences gained during this exciting first project period. This includes a project overview and details of processes such as onboarding users, account administration, user support and training, and outreach. The talk will also give technical details such as an overview of the compilers, which play a crucial role in achieving good performance. To help users use the system efficiently, we offer various opportunities such as webinars and hands-on sessions, and we also try to sustain an active user community that enables exchange between the different research groups. In February 2022 the first Ookami user group meeting took place. We will present the key findings and give an outlook on the next project year, in which Ookami will become an XSEDE service provider.

      Speaker: Eva Siegmann (Stony Brook University)
    • 14:00 15:00
      Particle Accelerator Modeling at Exascale 1h

      Abstract: Particle accelerators, among the largest, most complex devices, demand increasingly sophisticated computational tools for the design and optimization of the next generation of accelerators that will meet the challenges of increasing energy, intensity, accuracy, compactness, complexity and efficiency. It is key that contemporary software take advantage of the latest advances in computer hardware and scientific software engineering practices, delivering speed, reproducibility and feature composability for the aforementioned challenges. We will describe the software stack that is being developed at the heart of the Beam pLasma Accelerator Simulation Toolkit (BLAST) by LBNL and collaborators. We first describe how the US DOE Exascale Computing Project (ECP) application WarpX [1-3] will exploit the power of GPUs and its performance on Exascale supercomputers for the modeling of laser-plasma acceleration. We then describe how we are leveraging the ECP experience to develop a new generation ecosystem of codes that, combined with machine learning, will deliver from ultrafast to ultraprecise modeling for future accelerator design and operations, towards enabling virtual twins of accelerators.

      Speaker: Axel Huebl (Lawrence Berkeley National Laboratory)
    • 14:00 15:00
      NWQSim: Scalable Simulation of Quantum Systems on Classical Heterogeneous HPC Clusters 1h

      Abstract: Despite recent fascinating developments in NISQ-based quantum computing, simulations of quantum programs on classical HPC systems are still essential for validating quantum algorithms, understanding noise effects, and designing robust quantum algorithms. To allow efficient large-scale noise-enabled simulation on state-of-the-art heterogeneous supercomputers, we developed NWQSim, a quantum circuit simulation environment that provides support for frontends such as Q#, Qiskit, and OpenQASM, and backends such as X86/Power CPUs, NVIDIA/AMD GPUs, and Xeon Phis, through state-vector and density-matrix simulation. NWQSim can scale out to more than a thousand GPUs/CPUs on ORNL Summit and has been tested on ORNL Spock, ANL Theta and NERSC Cori, achieving significant speedups over existing approaches. In this talk, I will describe the various techniques that enable high performance, scalability and portability, presenting a handy HPC tool for noisy quantum system simulation. NWQSim is supported by the U.S. DOE Quantum Science Center (QSC).
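
      For context, the core update inside a state-vector simulator looks roughly like the sketch below (illustrative only, not NWQSim code): applying a single-qubit gate touches pairs of amplitudes that differ only in the target qubit's bit, and it is this loop over 2^n amplitudes that gets parallelized and distributed across CPUs and GPUs; a density-matrix simulator applies gates to a 2^n x 2^n matrix instead.

        #include <complex>
        #include <cstddef>
        #include <vector>

        using cplx = std::complex<double>;

        // Apply a single-qubit gate, given as a 2x2 matrix g, to qubit 'q' of an
        // n-qubit state vector holding 2^n amplitudes.
        void apply_1q_gate(std::vector<cplx>& state, int q, const cplx g[2][2]) {
          const std::size_t stride = std::size_t{1} << q;
          for (std::size_t base = 0; base < state.size(); base += 2 * stride) {
            for (std::size_t i = base; i < base + stride; ++i) {
              const cplx a0 = state[i];            // amplitude with qubit q = 0
              const cplx a1 = state[i + stride];   // amplitude with qubit q = 1
              state[i]          = g[0][0] * a0 + g[0][1] * a1;
              state[i + stride] = g[1][0] * a0 + g[1][1] * a1;
            }
          }
        }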

      Speaker: Ang Li (Pacific Northwest National Laboratory)
    • 14:00 15:00
      The TAU Performance System 1h

      Abstract: The TAU Performance System® is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, UPC, Java, and Python. TAU (Tuning and Analysis Utilities) is capable of gathering performance information through instrumentation of functions, methods, basic blocks, and statements as well as event-based sampling. The API also provides selection of profiling groups for organizing and controlling instrumentation. The instrumentation can be inserted in the source code using an automatic instrumentor tool based on the Program Database Toolkit (PDT), with compiler plugins, dynamically using DyninstAPI, at runtime in the Java Virtual Machine, or manually using the instrumentation API. TAU's profile visualization tool, ParaProf, provides graphical displays of the performance measurements, in aggregate and single node/context/thread forms. The user can quickly identify sources of performance bottlenecks in the application using the graphical interface. In addition, TAU can generate event traces that can be displayed with the Vampir or JumpShot trace visualization tools. In this talk, we will introduce the basic concepts of performance measurement and analysis for HPC, discuss TAU's support for many different programming models, and show how these capabilities can be utilized to identify performance bottlenecks in applications.
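
      As a small taste of the manual option mentioned above (a minimal sketch, assuming a working TAU installation and compilation via TAU's compiler wrappers), a region can be timed with the TAU_START/TAU_STOP macros:

        #include <TAU.h>

        void solver_step() {
          TAU_START("solver_step");   // begin a named timer for this region
          // ... computation to be measured ...
          TAU_STOP("solver_step");    // end the timer; results appear in the profile
        }

        int main() {
          TAU_PROFILE("main", "", TAU_DEFAULT);  // profile main with a static timer
          TAU_PROFILE_SET_NODE(0);               // needed for non-MPI executables
          solver_step();
          return 0;
        }

      The resulting profiles can then be inspected with ParaProf; alternatively, TAU can sample or instrument an unmodified binary at run time, so source changes like these are optional.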

      Speaker: Kevin Huck (University of Oregon)
    • 14:00 15:00
      COMPOFF: A Compiler Cost model using Machine Learning to predict the Cost of OpenMP Offloading 1h

      Abstract: The HPC industry is inexorably moving towards an era of extremely heterogeneous architectures, with more devices configured on any given HPC platform and potentially more kinds of devices, some of them highly specialized. Writing a separate code suitable for each target system for a given HPC application is not practical. A better solution is to use directive-based parallel programming models such as OpenMP. OpenMP provides a number of options for offloading a piece of code to devices like GPUs. To select the best among these options during compilation, most modern compilers use analytical models to estimate the cost of executing the original code and the different offloading code variants. Building such an analytical model for compilers is a difficult task that necessitates a lot of effort on the part of a compiler engineer. Recently, machine learning techniques have been successfully applied to build cost models for a variety of compiler optimization problems. In this work, we present COMPOFF, a cost model that statically estimates the Cost of OpenMP OFFloading using a neural network model. We used six different transformations on a parallel code of the Wilson Dslash operator to support GPU offloading, and we predicted their cost of execution on different GPUs using COMPOFF at compile time. Our results show that this model can predict offloading costs with a root mean squared error in prediction of less than 0.5 seconds. Our preliminary findings indicate that this work will make it much easier and faster for scientists and compiler developers to port legacy HPC applications that use OpenMP to new heterogeneous computing environments.
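
      To illustrate the kind of choice such a cost model informs (a toy loop, not the talk's Wilson Dslash transformations), two OpenMP offload variants of the same computation can have very different costs on a given GPU:

        // Variant A: offload with a single team of device threads.
        void variant_a(const float* a, float* b, int n) {
          #pragma omp target parallel for map(to: a[0:n]) map(tofrom: b[0:n])
          for (int i = 0; i < n; ++i) b[i] += a[i];
        }

        // Variant B: a league of teams, which typically maps better onto GPU
        // hardware for large iteration spaces. The best choice is input- and
        // device-dependent, which is what the cost model must predict.
        void variant_b(const float* a, float* b, int n) {
          #pragma omp target teams distribute parallel for map(to: a[0:n]) map(tofrom: b[0:n])
          for (int i = 0; i < n; ++i) b[i] += a[i];
        }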

      Speaker: Alok Mishra (Stony Brook University)
    • 14:00 15:00
      Building a performant data infrastructure using Apache Arrow 1h

      Abstract: As data volumes have increased over the past few years, it has become challenging for data scientists and researchers to analyze and extract insights from large datasets. One of the main challenges with existing software is the ability to achieve high performance, portability, and programmability across different platforms. At Voltron Data, we are building a unified computing infrastructure on top of the Apache Arrow ecosystem that will allow developers in any domain to write code in multiple programming languages used in data science and scale their applications from personal computers to large compute clusters. In this presentation, we will talk about how we are integrating data analytics components and tools with existing open-source offerings in an effort to connect data producers to data consumers.
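
      A minimal sketch of what Arrow-based code looks like from C++ (illustrative only, not Voltron Data's stack), assuming the Apache Arrow C++ library is installed; the resulting columnar table uses the same in-memory format as Arrow implementations in Python, R, and other languages, so it can be shared across tools without conversion:

        #include <arrow/api.h>
        #include <iostream>
        #include <memory>

        // Build an Arrow array and wrap it in a one-column table.
        arrow::Status build_table() {
          arrow::Int64Builder builder;
          ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 3, 4}));
          std::shared_ptr<arrow::Array> ids;
          ARROW_RETURN_NOT_OK(builder.Finish(&ids));

          auto schema = arrow::schema({arrow::field("id", arrow::int64())});
          auto table = arrow::Table::Make(schema, {ids});
          std::cout << table->ToString() << std::endl;
          return arrow::Status::OK();
        }

        int main() {
          arrow::Status st = build_table();
          return st.ok() ? 0 : 1;
        }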

      Biography: Fernanda is a Director of Developer Relations at Voltron Data, specializing in data science and high-performance computing (HPC). She began her HPC career in graduate school, developing molecular dynamics applications and serving as an administrator of the Florida Laboratory for Materials Engineering Simulation (FLAMES) at the University of Florida. Following graduate school, she was HPC manager at an agricultural genomics company, Genus Plc., and then moved to Oak Ridge National Laboratory’s Leadership Computing Facility (OLCF). While there she was the training coordinator for systems in the facility including Jaguar, Titan, and Summit (the latter two became #1 GPU-based supercomputers in the Top500), and she created the GPU Hackathon series. She was also part of the CORAL project, in the programming environment team that selected Summit as the center's next supercomputer. She became an HPC Data Scientist within the Biomedical Sciences, Engineering, and Computing (BSEC) group, working on Pilot 3 of the CANDLE (Cancer Moonshot) project, and was co-PI of the Kokkos C++ library funded by the Exascale Computing Project (ECP). After ORNL, Fernanda was a Developer Advocate and Alliance Manager for HPC + AI at NVIDIA, where she helped build an ecosystem to support GPU developers and users in the life sciences. More recently she helped support healthcare-related efforts during COVID as a Sr. Scientific Consultant at BioTeam, and now at Voltron Data she helps build better tools for data scientists.

      Speakers: Fernanda Foertter (Voltron Data Inc.), Zahra Ronaghi (Voltron Data Inc.)
    • 14:00 15:00
      Chimbuko - A workflow-level performance analysis tool 1h

      Abstract: Many modern scientific analyses on HPC machines utilize workflows comprising multiple components coexisting on the same hardware resources. In such complex systems there is a significant potential for performance issues that arise only when the workflow is run together and at-scale, and as such will not be captured by traditional benchmarking and profiling of the individual components. Identifying the root cause of these issues using detailed application traces is typically impractical due to the sheer volume of data. Chimbuko circumvents this issue by performing an in situ real-time streaming analysis of trace data, focusing on identifying and recording only performance abnormalities using machine learning techniques. I will discuss the design of the tool and provide examples and instructions on how to deploy and run.
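
      To give a flavor of the streaming approach (a simplified sketch, not Chimbuko's actual algorithm), per-function execution times can be screened online with constant memory, so the full trace never needs to be stored:

        #include <cmath>
        #include <cstdint>

        // Streaming outlier screen for one function's execution times.
        // Statistics are updated online (Welford's algorithm), so no trace
        // history is kept; only flagged anomalies would be recorded.
        class StreamingAnomalyDetector {
         public:
          explicit StreamingAnomalyDetector(double n_sigma = 6.0) : n_sigma_(n_sigma) {}

          // Returns true if this execution time is anomalous relative to history.
          bool observe(double exec_time) {
            ++count_;
            const double delta = exec_time - mean_;
            mean_ += delta / static_cast<double>(count_);
            m2_   += delta * (exec_time - mean_);
            if (count_ < 30) return false;  // warm-up period
            const double stddev = std::sqrt(m2_ / static_cast<double>(count_ - 1));
            return std::abs(exec_time - mean_) > n_sigma_ * stddev;
          }

         private:
          double n_sigma_;
          std::uint64_t count_ = 0;
          double mean_ = 0.0;
          double m2_ = 0.0;
        };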

      Biography: Christopher joined the Chimbuko project in 2020 and is the lead developer for the tool's backend. He has a background in lattice QCD, a computational branch of theoretical particle physics. He obtained his PhD at Edinburgh University in 2010 and completed postdoctoral positions at Columbia University and BNL's Physics Department. During this time he gained much practical experience with running, developing, and optimizing scientific software for HPC systems. Between 2016 and 2020 he held an associate staff scientist position at Columbia, funded by Intel, where he worked to evaluate lattice QCD algorithms on prototype exascale computer architectures. He joined the Computational Science Initiative at BNL in early 2020, where he divides his time between the Chimbuko project and developing and optimizing lattice QCD software for exascale machines.

      Speaker: Christopher Kelly (Brookhaven National Laboratory)
    • 13:00 14:00
      SciServer: a Science Platform for Astronomy and Beyond 1h

      Abstract: SciServer is a so-called "Science Platform" developed at the Institute for Data Intensive Engineering and Science (IDIES), with the primary goal of "bringing analysis to the data" in a collaborative manner. Originally developed to simplify access to the Sloan Digital Sky Survey (SDSS) by enabling users to write custom SQL against large relational databases, it has grown to provide access to a wide range of resources and data using a variety of software tools, all accessible via the web without the need for data downloads. Users can focus on their core competencies without worrying about software installs or compatibility issues, and can easily share results for another user to pick up and leverage with their own expertise. Research groups with specific computational needs can benefit from private accelerated computing resources such as GPUs, and parallel hyper-local computing with Spark and Dask. While SciServer is a platform hosted at JHU and freely available to the public, it is also a vended software system installed at sites around the world, both private and public. In this talk I will discuss SciServer from both a user perspective and as a cyberinfrastructure system, and cover recent developments and future directions for the platform.

      Biography: Arik obtained his undergraduate degree in physics from Humboldt State University in Northern California. He then went on to work in the data systems division at the Harvard-Smithsonian Center for Astrophysics in Cambridge, MA, helping develop the Chandra X-ray Observatory. After some research work there, he went on to get his PhD in Astronomy at Macquarie University in Sydney, Australia, and then moved to Tokyo, Japan to work as a software engineer in the Search Engine division of Amazon. Finally, he joined IDIES/JHU in 2019, where he works on development of the SciServer platform and is also a heavy user of the system for data science tasks within the institute.

      Speaker: Arik Mitschang (Johns Hopkins University)
    • 14:00 15:00
      ExaWorks: Building Blocks for High-Performance Workflows 1h

      Abstract: We motivate workflows as a powerful computational methodology and discuss their importance for scientific discovery. We argue that workflows present the highest level of execution and programming parallelism, and outline challenges in realizing this potential. We discuss coupling learning methods to traditional HPC simulations (“AI-coupled HPC” workflows), and the unique performance advantages and challenges thereof. We then introduce RADICAL-Cybertools, the first realization of the building blocks approach to workflow middleware, and outline how RADICAL middleware building blocks address the performance and scalability challenges of AI-coupled HPC workflows. We conclude with a discussion of ExaWorks, an ECP software technology project that realizes many of the conceptual advances and the vision of community building blocks for workflow middleware.

      Speaker: Shantenu Jha (BNL)