Abstract

The exponential growth of data-intensive scientific simulations and deep learning workloads presents significant challenges for high-performance computing (HPC) systems. These workloads generate massive data volumes at unprecedented velocities, straining the capabilities of existing memory hierarchies, I/O subsystems, and scheduling mechanisms. This dissertation addresses critical challenges in data management and workload scheduling to enhance performance, scalability, and resource efficiency in next-generation HPC and Artificial Intelligence (AI) applications.

Firstly, we develop advanced data and memory management strategies to alleviate I/O bottlenecks in deep memory hierarchies of modern HPC environments. We significantly mitigate data movement overhead by exploiting application-specific data access patterns and optimizing data placement and movement across multiple memory tiers, e.g., GPU high-bandwidth memory, DRAM, and persistent storage. Our approaches encompass intelligent caching, prefetching, asynchronous transfers, lazy memory allocation, harvesting idle memory of peer GPUs, and co-optimizing data compression with transfer schedules. These strategies enhance efficiency, yielding up to 22× faster I/O performance compared to state-of-the-art data management runtimes.

Secondly, we tackle substantial overheads in checkpointing and memory management during large-scale training of transformer-based large language models (LLMs). We introduce near-zero-overhead checkpointing mechanisms that minimize training stalls, enabling frequent state captures essential for fault tolerance and analysis. This approach reduces checkpoint stalls by up to 48× in distributed training setups. Additionally, we propose hybrid CPU-GPU execution strategies that optimize resource utilization and overlap data transfers with computation to hide I/O bottlenecks, accelerating LLM pre-training by up to 2.5× in memory-constrained environments.

Lastly, we propose data-aware workload scheduling methodologies at the data center level to mitigate I/O bottlenecks. By utilizing a mix of application-level and system-level checkpoints, we accommodate high-priority, deadline-constrained jobs within the HPC batch queuing model, preserving the progress made by low-priority batch jobs during preemption and improving overall system throughput. Furthermore, we address the challenge of data-intensive distributed deep learning workload scheduling across hybrid multi-cloud environments with diverse compliance requirements and varying data volumes. Leveraging metadata catalogs and dynamic discovery in modern data fabric architectures, we propose data-density and compliance-aware workload scheduling strategies, achieving up to 12× faster execution.

Collectively, these contributions advance the state of the art in data management and workload scheduling for HPC systems. The proposed solutions are generalizable and modular, facilitating integration into existing HPC applications and data management ecosystems while being extensible to future HPC testbeds. Extensive evaluations demonstrate the effectiveness of our approaches in enhancing performance and scalability, providing a foundational framework for accelerating next-generation scientific and AI workloads in modern HPC environments.
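To make the prefetching and asynchronous-transfer idea concrete, the following minimal sketch (not the dissertation's runtime; tier names, timings, and function names are hypothetical) overlaps reads from a slow memory tier with computation on the block that is already resident, which is the general pattern behind hiding data movement behind useful work.

```python
# Illustrative sketch: prefetch the next data block from a slow tier on a
# background thread while computing on the current block. All names and
# latencies here are hypothetical placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

def read_from_slow_tier(block_id: int) -> bytes:
    """Simulate a read from a slow tier (e.g., parallel file system)."""
    time.sleep(0.05)                      # stand-in for I/O latency
    return bytes(1024)                    # dummy payload

def compute(block: bytes) -> int:
    """Simulate work on a block that is already resident in fast memory."""
    time.sleep(0.05)
    return len(block)

def run(num_blocks: int = 8) -> None:
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(read_from_slow_tier, 0)   # warm the pipeline
        for i in range(num_blocks):
            block = pending.result()                       # waits only if I/O lags
            if i + 1 < num_blocks:
                pending = io_pool.submit(read_from_slow_tier, i + 1)  # prefetch
            compute(block)                                 # overlaps with prefetch

if __name__ == "__main__":
    start = time.perf_counter()
    run()
    print(f"elapsed: {time.perf_counter() - start:.2f}s")  # ~0.45s vs ~0.80s serial
```

With reads and compute fully overlapped, the pipeline pays the I/O latency only once up front rather than on every block, which is the effect the caching and prefetching strategies above aim for at much larger scale.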
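The near-zero-overhead checkpointing contribution can be read as a two-phase scheme: a short in-memory snapshot on the training path, followed by an asynchronous flush to persistent storage that overlaps with subsequent training steps. The sketch below illustrates that general pattern under assumed names; it is not the dissertation's actual mechanism.

```python
# Illustrative sketch: snapshot quickly in memory, then flush the snapshot to
# storage on a background thread so the slow write overlaps with training.
# State layout, step counts, and file names are hypothetical.
import copy
import pickle
import time
from concurrent.futures import ThreadPoolExecutor

def train_step(state: dict) -> None:
    state["step"] += 1
    time.sleep(0.01)                               # stand-in for one training step

def flush_to_storage(snapshot: dict, path: str) -> None:
    with open(path, "wb") as f:                    # slow persistent-tier write
        pickle.dump(snapshot, f)

def train(num_steps: int = 100, ckpt_every: int = 20) -> None:
    state = {"step": 0, "weights": [0.0] * 10_000}
    with ThreadPoolExecutor(max_workers=1) as flusher:
        pending = None
        for _ in range(num_steps):
            train_step(state)
            if state["step"] % ckpt_every == 0:
                if pending is not None:
                    pending.result()               # keep at most one flush in flight
                snapshot = copy.deepcopy(state)    # brief stall: memory copy only
                pending = flusher.submit(
                    flush_to_storage, snapshot, f"ckpt_{state['step']}.pkl")
        if pending is not None:
            pending.result()                       # drain the final flush

if __name__ == "__main__":
    train()
```

The training loop stalls only for the in-memory copy; the expensive serialization and storage write proceed in the background, which is why frequent checkpoints become affordable.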
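For the data-density and compliance-aware scheduling direction, a simple way to picture the placement decision is: filter candidate sites by the job's compliance requirements, then prefer the site where the largest share of the input data already resides. The sketch below is a toy scoring function under assumed metadata fields (site names, resident data volumes, and compliance tags are all hypothetical), not the scheduler proposed in the dissertation.

```python
# Illustrative sketch: choose an execution site that satisfies compliance
# constraints and maximizes the fraction of input data already resident,
# minimizing cross-site data movement. All fields are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    resident_gb: float                 # job input data already at this site
    compliance: set[str] = field(default_factory=set)

def choose_site(sites: list[Site], total_gb: float, required: set[str]) -> Site:
    eligible = [s for s in sites if required <= s.compliance]
    if not eligible:
        raise ValueError("no site satisfies the compliance requirements")
    # Highest data density wins; only (total_gb - resident_gb) must be moved.
    return max(eligible, key=lambda s: s.resident_gb / total_gb)

if __name__ == "__main__":
    sites = [
        Site("on-prem-hpc", resident_gb=800, compliance={"gdpr", "hipaa"}),
        Site("cloud-a", resident_gb=950, compliance={"gdpr"}),
        Site("cloud-b", resident_gb=200, compliance={"gdpr", "hipaa"}),
    ]
    best = choose_site(sites, total_gb=1000, required={"hipaa"})
    print(best.name)                   # -> on-prem-hpc
```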

Publication Date

12-2024

Document Type

Dissertation

Student Type

Graduate

Degree Name

Computing and Information Sciences (Ph.D.)

Department, Program, or Center

Department of Computing and Information Sciences Ph.D.

College

Golisano College of Computing and Information Sciences

Advisor

M. Mustafa Rafique

Advisor/Committee Member

Bogdan Nicolae

Advisor/Committee Member

Mohan Kumar

Campus

RIT – Main Campus
