Abstract

High-Performance Computing (HPC) workloads are widely used to solve complex problems in scientific applications from diverse domains, such as weather forecasting, medical diagnostics, and fluid dynamics simulation. These workloads are traditionally executed on bare-metal HPC systems, in containers, as functions, or as workflows and ensembles. They consume large amounts of data and have memory and storage requirements that typically exceed the limited main memory and storage available on an HPC system. Deep learning (DL) workloads run on platforms such as TensorFlow and PyTorch that are oblivious to the availability and performance profiles of the underlying HPC systems and do not incorporate the resource requirements of the given workloads for distributed training. Function-as-a-Service (FaaS) platforms running HPC functions impose resource-level constraints, specifically fixed memory allocations and short task timeouts, that lead to job failures, making these otherwise desirable platforms unreliable for guaranteeing function execution and meeting the performance requirements of stateful applications such as DL workloads. Containerized workflow execution of HPC jobs can require several terabytes of memory, exceeding node capacity and resulting in excessive data swapping to slower storage, degraded job performance, and failures. Similarly, co-located bandwidth-intensive, latency-sensitive, or short-lived workflows suffer degraded performance due to contention, memory exhaustion, and higher access latency caused by suboptimal memory allocation. Recently, tiered memory systems comprising persistent memory and Compute Express Link (CXL) memory have been explored to provide additional memory capacity and bandwidth to memory-constrained systems and applications. However, current memory allocation and management techniques for tiered memory subsystems are inadequate to meet the diverse needs of co-located containerized jobs in HPC systems that run workflows and ensembles concurrently at scale.

In this research, we propose a framework that makes HPC platforms, workflow management systems (WMS), and HPC schedulers aware of the availability and capabilities of the underlying heterogeneous datacenter resources and optimizes the performance of HPC workloads. We propose architectural improvements and new software modules that leverage the latest advancements in the memory subsystem, specifically CXL, to provide additional memory and fast scratch space for HPC workloads, reducing overall model training time while enabling HPC jobs to efficiently train models on data much larger than the installed system memory. The proposed framework manages the allocation of additional CXL-based memory, introduces a fast intermediate storage tier, and provides intelligent prefetching and caching mechanisms for HPC workloads. We leverage tiered memory systems for HPC execution and propose efficient memory management policies, including intelligent page placement and eviction, to improve memory access performance. Our page allocation and replacement policies incorporate task characteristics and enable efficient memory sharing between workflows. We integrate our policies with the popular HPC scheduler SLURM and the container runtime Singularity to show that our approach improves tiered memory utilization and application performance.
Similarly, we integrate our framework with the popular DL platform TensorFlow and with Apache OpenWhisk to introduce infrastructure-aware scheduling, optimize the performance of DL workloads, and add resilience and fault tolerance to FaaS platforms. The evaluation of our proposed framework shows improved system utilization, throughput, and performance, as well as reduced training time, failure rate, recovery time, latency, and cold-start time for large-scale deployments.
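As a point of reference for the tiered-memory allocation described above, CXL memory expanders are typically exposed to Linux as CPU-less NUMA nodes, so placing pages on the CXL tier reduces to targeting a far NUMA node. The short C sketch below illustrates only that basic step using libnuma; the node id (1) and buffer size are illustrative assumptions and not part of the dissertation's framework.

/* Minimal sketch: placing a scratch buffer on a CXL-attached memory tier,
 * assuming the CXL expander appears as CPU-less NUMA node 1 (an assumption;
 * check the real topology with `numactl --hardware`).
 * Build with: gcc cxl_alloc.c -o cxl_alloc -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CXL_NODE 1   /* assumed NUMA node backing CXL memory */

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma is not available on this system\n");
        return EXIT_FAILURE;
    }
    if (CXL_NODE > numa_max_node()) {
        fprintf(stderr, "NUMA node %d does not exist on this system\n", CXL_NODE);
        return EXIT_FAILURE;
    }

    size_t size = 64UL << 20;   /* 64 MiB scratch buffer */

    /* Place this buffer on the far (CXL) tier, keeping local DRAM free
     * for latency-sensitive, co-located jobs. */
    void *buf = numa_alloc_onnode(size, CXL_NODE);
    if (buf == NULL) {
        perror("numa_alloc_onnode");
        return EXIT_FAILURE;
    }

    memset(buf, 0, size);       /* touch the pages so they are actually placed */
    printf("Placed %zu MiB on NUMA node %d\n", size >> 20, CXL_NODE);

    numa_free(buf, size);
    return EXIT_SUCCESS;
}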

Library of Congress Subject Headings

High performance computing--Technological innovations; Heterogeneous distributed computing systems

Publication Date

4-2024

Document Type

Dissertation

Student Type

Graduate

Degree Name

Computing and Information Sciences (Ph.D.)

Department, Program, or Center

Computing and Information Sciences Ph.D., Department of

College

Golisano College of Computing and Information Sciences

Advisor

M. Mustafa Rafique

Advisor/Committee Member

Sudharshan Vazhkudai

Advisor/Committee Member

Fawad Ahmad

Campus

RIT – Main Campus

Plan Codes

COMPIS-PHD
