<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Ray Andrew Sinurat | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/index.xml" rel="self" type="application/rss+xml"/><description>Ray Andrew Sinurat</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/avatar_hu27b2d71dcd98ed071fb262ba8d61c8fa_1101814_270x270_fill_q75_lanczos_center.jpg</url><title>Ray Andrew Sinurat</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/</link></image><item><title>LAST: Let’s Adapt to System Drift</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last/</link><pubDate>Wed, 07 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Computer systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Data Science and Machine Learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The performance of computer systems evolves constantly, a natural outcome of hardware updates, software improvements, and hardware quirks that surface over time. At the same time, machine learning (ML) models are increasingly used to address challenges in computer systems, notably to speed up decision-making; this speed is vital for the quick, flexible responses needed to meet service-level agreements (SLAs). Yet an interesting twist has emerged: like the computer systems they aid, ML models also experience a kind of &amp;ldquo;aging,&amp;rdquo; a gradual decline in effectiveness caused by changes in their operating environment.&lt;/p>
&lt;p>Model &amp;ldquo;aging&amp;rdquo; is ubiquitous across domains, not limited to computer systems. Because aging can significantly degrade a model&amp;rsquo;s performance, early detection mechanisms are critical to maintaining optimal functionality, and numerous strategies have been formulated to mitigate it. However, how well these strategies generalize across domains, particularly to computer systems, remains largely unexplored. This research aims to bridge that gap by designing and implementing a comprehensive data analysis pipeline whose primary objective is to compare these strategies on how well they detect and address model aging. To build a better understanding of the issue, the research will address the following pivotal questions:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Data-Induced Model Aging&lt;/strong>: What specific variations within the data can precipitate the aging of a model? Understanding the nature and characteristics of data changes that lead to model deterioration is crucial for developing effective prevention and mitigation strategies.&lt;/li>
&lt;li>&lt;strong>Efficacy of Aging Detection Algorithms&lt;/strong>: How proficient are the current algorithms in identifying the signs of model aging? Assessing the accuracy and reliability of these algorithms will provide insights into their practical utility in real-world scenarios.&lt;/li>
&lt;li>&lt;strong>Failure Points in Detection&lt;/strong>: In what scenarios or under what data conditions do the aging detection mechanisms fail? Identifying the limitations and vulnerabilities of these algorithms is vital for refining their robustness and ensuring comprehensive coverage.&lt;/li>
&lt;li>&lt;strong>Scalability and Responsiveness&lt;/strong>: How do these algorithms perform in terms of robustness and speed, particularly when subjected to larger datasets? Evaluating the scalability and responsiveness of the algorithms will determine their feasibility and effectiveness in handling extensive and complex datasets, a common characteristic in computer systems.&lt;/li>
&lt;/ul>
&lt;p>To better understand and prevent these performance issues, our approach is to analyze various datasets, both system and non-system, that have shown notable changes over time, and to apply ML models to them to assess how those changes affect model performance. We then aim to leverage more advanced ML techniques to design new algorithms that address these challenges effectively. This effort is expected to be a significant contribution to the community, improving both the detection of model aging and model performance in computer systems.&lt;/p>
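&lt;p>As a concrete illustration of the measurement at the heart of such a pipeline, the sketch below trains a model on an early time window and then scores the frozen model on progressively later windows. Everything here is an illustrative assumption: the data is synthetic, and a hand-rolled least-squares linear classifier stands in for the PyTorch models the project would actually train.&lt;/p>

```python
# Sketch: quantify model "aging" by training on an early time window and
# evaluating the frozen model on later windows. Synthetic data only; the
# real pipeline would use PyTorch models on actual system traces.
import numpy as np

rng = np.random.default_rng(0)

def make_window(t, n=500):
    """Synthetic binary-classification data whose decision boundary drifts with time t."""
    X = rng.normal(size=(n, 2))
    y = (X @ np.array([1.0, t]) > 0).astype(int)  # boundary rotates as t grows
    return X, y

def score(X, y, w):
    """Accuracy of the linear rule sign(X @ w)."""
    return float(np.mean((X @ w > 0) == (y == 1)))

# Fit once on the earliest window (t = 0); never retrain.
X0, y0 = make_window(0.0)
w_hat, *_ = np.linalg.lstsq(X0, 2.0 * y0 - 1.0, rcond=None)

# Accuracy decays as the environment drifts away from the training window.
accuracies = [score(*make_window(t), w_hat) for t in (0.0, 0.5, 1.0, 2.0)]
```

&lt;p>The widening gap between the first and last accuracies is the &amp;ldquo;aging&amp;rdquo; signal that the detection strategies under study are meant to catch early.&lt;/p>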
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Pipeline runs on several computer-system and non-computer-system datasets&lt;/li>
&lt;li>A Trovi artifact for data preprocessing and model training shared on Chameleon Cloud&lt;/li>
&lt;li>A GitHub repository containing the pipeline source code&lt;/li>
&lt;/ul></description></item><item><title>Automatic Cluster Performance Shifts Detection Toolkit</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/anl/perfdrift/</link><pubDate>Wed, 01 Feb 2023 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/anl/perfdrift/</guid><description>&lt;p>High-performance computing (HPC) clusters typically suffer from performance degradation over time. The heterogeneous nature of clusters and the inevitable defects in various infrastructure layers will result in a harder performance prediction inside. On the other hand, when software upgrades or any such events happen, we might also observe performance improvement or degradation even though nothing changes in the hardware. Due to these uncertainties, it is necessary to send early notification to administrators of changes in cluster performance in a specific time window to inform scheduling decisions and increase cluster utilization.&lt;/p>
&lt;p>We target HPC clusters that serve heterogeneous, compute- and I/O-intensive workloads, ranging from scientific simulation to AI model training, with a high degree of parallelism. In this setting, we plan to use the open-source Darshan toolkit (&lt;a href="https://github.com/darshan-hpc/darshan" target="_blank" rel="noopener">https://github.com/darshan-hpc/darshan&lt;/a>) for data collection and profiling while designing our performance-drift algorithms. We may also incorporate distribution shift detection into Darshan itself, so that it can notify HPC system administrators directly.&lt;/p>
&lt;p>Our goal is to demonstrate the efficacy of our algorithm by plotting the profiling data, after processing by the algorithm, to highlight the specific time windows in which performance shifts occurred. Finally, we will package all our profiling data and experiment scripts in a Jupyter notebook, shared via Chameleon Trovi, to help others reproduce our experiments.&lt;/p>
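&lt;p>As a rough sketch of what such a detector might look like, the example below runs a simple two-window mean-shift test over a synthetic throughput series and reports the time indices where a shift is flagged. The series, window size, and threshold are all illustrative assumptions; in the project, the series would come from Darshan profiling logs.&lt;/p>

```python
# Sketch: flag the time window where cluster I/O throughput shifts.
# Synthetic data; in practice the series would come from Darshan logs.
import numpy as np

rng = np.random.default_rng(1)
# Simulated per-interval throughput (MB/s): a drop after index 300
# mimics a regression introduced by, say, a software upgrade.
throughput = np.concatenate([
    rng.normal(1000, 25, 300),
    rng.normal(850, 25, 200),
])

def detect_shift(series, window=50, threshold=8.0):
    """Return indices where the mean of the trailing window deviates from
    the mean of the preceding window by more than `threshold` standard
    errors (a simple two-window mean-shift test)."""
    hits = []
    for i in range(window, len(series) - window):
        before = series[i - window:i]
        after = series[i:i + window]
        stderr = np.sqrt((before.var() + after.var()) / 2) / np.sqrt(window)
        if abs(after.mean() - before.mean()) / stderr > threshold:
            hits.append(i)
    return hits

hits = detect_shift(throughput)  # indices clustered around the true change point
```

&lt;p>Plotting &lt;code>throughput&lt;/code> with the flagged indices highlighted gives exactly the kind of time-window visualization described above.&lt;/p>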
&lt;p>Through this research, we seek to contribute the following:&lt;/p>
&lt;ul>
&lt;li>An algorithm that detects performance shifts in HPC clusters and can be adapted to heterogeneous workloads&lt;/li>
&lt;li>Real-time detection of performance shifts without introducing significant overhead into the system&lt;/li>
&lt;li>Contributions to Darshan so that it can automatically detect performance changes while profiling clusters&lt;/li>
&lt;/ul>
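&lt;p>For the purely statistical side of shift detection, one minimal building block is a two-sample Kolmogorov&amp;ndash;Smirnov comparison between a reference window and a current window of some profiled metric. The sketch below hand-rolls the KS statistic on synthetic samples (scipy&amp;rsquo;s &lt;code>ks_2samp&lt;/code> provides the same statistic plus a p-value); the samples and the 0.1 alarm threshold are illustrative assumptions, not part of the project plan.&lt;/p>

```python
# Sketch: a purely statistical distribution-shift check via the two-sample
# Kolmogorov-Smirnov statistic. Samples are synthetic stand-ins for two
# windows of a profiled metric.
import numpy as np

def ks_statistic(a, b):
    """Maximum distance between the empirical CDFs of samples a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(2)
reference = rng.normal(0.0, 1.0, 1000)  # metric during a "healthy" window
same      = rng.normal(0.0, 1.0, 1000)  # no shift
shifted   = rng.normal(0.7, 1.0, 1000)  # mean shift, e.g. after an upgrade

# A simple alarm rule: flag a shift when the statistic exceeds a threshold.
alarm = ks_statistic(reference, shifted) > 0.1  # raised for the shifted window
```

&lt;p>A real detector would replace the fixed threshold with the KS p-value and run this comparison over sliding windows of Darshan metrics.&lt;/p>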
&lt;h3 id="automatic-and-adaptive-performance-shifts-detection">Automatic and Adaptive Performance Shifts Detection&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Statistical Machine Learning, Deep Learning, and High-Performance Computing (HPC)&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C++, Python, Statistics; good to have: Machine Learning, Deep Learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> Sandeep Madireddy (&lt;a href="https://www.anl.gov/profile/sandeep-r-madireddy" target="_blank" rel="noopener">https://www.anl.gov/profile/sandeep-r-madireddy&lt;/a>, &lt;a href="http://www.mcs.anl.gov/~smadireddy/" target="_blank" rel="noopener">http://www.mcs.anl.gov/~smadireddy/&lt;/a> ), Ray Andrew Sinurat (&lt;a href="https://rayandrew.me" target="_blank" rel="noopener">https://rayandrew.me&lt;/a>)&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kangrui-wang/">Kangrui Wang&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In summary, the student&amp;rsquo;s specific tasks are to:&lt;/p>
&lt;ul>
&lt;li>Collaborate with the mentors to understand the goals of the project&lt;/li>
&lt;li>Implement distribution shift detection using purely statistical or machine/deep-learning methods&lt;/li>
&lt;li>Deploy the algorithm and evaluate its efficacy on real clusters&lt;/li>
&lt;li>Package the experiment so that others can easily reproduce it&lt;/li>
&lt;/ul></description></item></channel></rss>