LAST: Let’s Adapt to System Drift

Wed, 07 Feb 2024 00:00:00 +0000

Project Idea Description

Topics: Computer systems, machine learning
Skills: Python, PyTorch, Bash scripting, Linux, Data Science and Machine Learning
Difficulty: Hard
Size: Large (350 hours)
Mentors: Ray Andrew Sinurat (primary contact), Sandeep Madireddy

The performance of computer systems is constantly evolving, a natural outcome of updating hardware, improving software, and encountering hardware quirks over time. At the same time, machine learning (ML) models are becoming increasingly popular. They are being used widely to address various challenges in computer systems, notably in speeding up decision-making. This speed is vital for a quick and flexible response, essential for meeting service-level agreements (SLAs). Yet, an interesting twist has emerged: like the computer systems they aid, ML models also experience a kind of “aging.” This results in a gradual decline in their effectiveness, a consequence of changes in their operating environment.

Model “aging” occurs across many domains, not just computer systems. It can significantly degrade a model's performance, which makes early detection critical. Numerous strategies have been proposed to mitigate model aging, but how well they generalize across domains, particularly to computer systems, remains largely unexplored. This research aims to bridge that gap by designing and implementing a comprehensive data analysis pipeline. The primary objective is to compare these strategies on how well they detect and address model aging. To that end, the research will address the following questions:

Data-Induced Model Aging: What specific variations within the data can precipitate the aging of a model? Understanding the nature and characteristics of data changes that lead to model deterioration is crucial for developing effective prevention and mitigation strategies.
Efficacy of Aging Detection Algorithms: How proficient are the current algorithms in identifying the signs of model aging? Assessing the accuracy and reliability of these algorithms will provide insights into their practical utility in real-world scenarios.
Failure Points in Detection: In what scenarios or under what data conditions do the aging detection mechanisms fail? Identifying the limitations and vulnerabilities of these algorithms is vital for refining their robustness and ensuring comprehensive coverage.
Scalability and Responsiveness: How do these algorithms perform in terms of robustness and speed, particularly when subjected to larger datasets? Evaluating the scalability and responsiveness of the algorithms will determine their feasibility and effectiveness in handling extensive and complex datasets, a common characteristic in computer systems.
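
One common way to look for the kind of data change these questions refer to is to compare the distribution of a feature in a reference window against a recent window. The sketch below computes the two-sample Kolmogorov-Smirnov statistic with only the standard library. It is an illustration under assumptions, not the project's chosen detector: the function names and the 0.3 threshold are hypothetical, and a real pipeline would calibrate the threshold or use a proper significance test.

```python
from bisect import bisect_right


def ks_statistic(reference, recent):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the empirical CDFs of the two windows
    (0.0 means identical distributions, 1.0 means no overlap).
    """
    if not reference or not recent:
        raise ValueError("both windows must be non-empty")
    ref_sorted = sorted(reference)
    rec_sorted = sorted(recent)
    max_gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for value in ref_sorted + rec_sorted:
        ref_cdf = bisect_right(ref_sorted, value) / len(ref_sorted)
        rec_cdf = bisect_right(rec_sorted, value) / len(rec_sorted)
        max_gap = max(max_gap, abs(ref_cdf - rec_cdf))
    return max_gap


def drift_detected(reference, recent, threshold=0.3):
    """Flag drift when the KS statistic exceeds a threshold.

    The default threshold is an illustrative assumption, not a calibrated value.
    """
    return ks_statistic(reference, recent) > threshold
```

For example, `drift_detected(old_latencies, new_latencies)` would compare two lists of I/O latencies collected at different times (both variable names are placeholders). A distribution test like this only sees changes in the inputs, so it can miss cases where the relationship between inputs and labels changes while the inputs look the same.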

To better understand and prevent issues related to model performance, our approach involves analyzing various datasets, both system and non-system, that have shown notable changes over time. We aim to apply machine learning (ML) models to these datasets to assess the effects of these changes on model performance. Our goal is to leverage more advanced ML techniques to create new algorithms that address these challenges effectively. This effort is expected to contribute significantly to the community, enhancing the detection of model aging and improving model performance in computer systems.
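
The effect of such changes on a model can be made visible by training once on an early time window and then scoring the same model on every later window. The sketch below does this with a one-variable least-squares line as a stand-in model. The helper names and the choice of mean absolute error are assumptions made for illustration; the project itself lists PyTorch among its skills, so its actual models would likely be different.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    var_x = sum((x - x_bar) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("xs must not be constant")
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / var_x
    return slope, y_bar - slope * x_bar


def mean_abs_error(model, xs, ys):
    """Mean absolute error of a (slope, intercept) model on one window."""
    slope, intercept = model
    return mean(abs(slope * x + intercept - y) for x, y in zip(xs, ys))


def error_per_window(windows):
    """Train on the first (xs, ys) window, then report the error on every window.

    A rising error over later windows is one visible symptom of model aging.
    """
    model = fit_line(*windows[0])
    return [mean_abs_error(model, xs, ys) for xs, ys in windows]
```

Plotting the returned list against time gives a simple aging curve, which can then be lined up against the points where a drift detector fires.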

Project Deliverables

Run the pipeline on several computer-systems and non-computer-systems datasets
A Trovi artifact for data preprocessing and model training shared on Chameleon Cloud
A GitHub repository containing the pipeline source code

Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time

Thu, 02 Feb 2023 00:00:00 +0000

A high-throughput workflow execution system is needed to continuously gain insights from the increasingly abundant genomics data. However, genomics workflows often have long execution times (e.g., hours to days) due to their large input files. This characteristic presents many complexities when managing systems for genomics workflow execution. Furthermore, based on our observation of a large-scale genomics data processing platform, ~2% of genomics workflows exhibit tail behavior that multiplies their execution time by up to 15x the median, resulting in weeks of execution.
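
Tail behavior of this kind can be located in a trace by comparing each run against the median execution time. The sketch below is a minimal illustration: the function name and the cutoff of 2x the median are assumptions, since the text above does not state how the tail was defined.

```python
from statistics import median


def tail_workflows(exec_times, factor=2.0):
    """Return (indices of tail runs, tail fraction).

    A run counts as a tail run when its execution time exceeds
    factor * median. The default factor is an illustrative assumption.
    """
    if not exec_times:
        raise ValueError("exec_times must be non-empty")
    cutoff = factor * median(exec_times)
    tail = [i for i, t in enumerate(exec_times) if t > cutoff]
    return tail, len(tail) / len(exec_times)
```

Using the median rather than the mean as the baseline keeps the cutoff stable, because a handful of very long runs can pull the mean up considerably.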

On the other hand, input files for genomic workflows often vary in quality due to differences in how they are collected. Prior work has suggested that these quality differences can affect genomics workflow execution time. Yet, to the best of our knowledge, input quality has never been accounted for in the design of a high-throughput workflow execution system. Even worse, there does not appear to be a consensus on what constitutes ‘input quality,’ at least from a computer systems perspective.

In this project, we seek to analyze a large dataset from a large-scale genomics processing platform in order to gain insights into how ‘input quality’ affects genomic workflows’ execution times. Following that, we will build machine learning (ML) models for predicting workflow execution time, in particular for workflows that exhibit tail behavior. We believe these insights and models can become the foundation for designing a novel tail-resilient genomics workflow execution system. Along the way, we will ensure that each step of our analysis is reproducible (e.g., in the form of Jupyter notebooks) and make all our ML models open-source (e.g., in the form of pre-trained models). We sincerely hope our work can offload some burdens commonly faced by operators of systems for genomics and, at the same time, benefit future researchers who work on the intersection of computer systems and genomics.

Analyze genomics data quality & build exec. time prediction models

Topics: genomics, data analysis, machine learning
Skills: Linux, Python, Matplotlib, Pandas/Numpy, any ML library
Difficulty: Medium
Size: 350 hours
Mentor(s): In Kee Kim
Contributor(s): Charis Christopher Hulu

Analyze a large-scale trace of genomics workflow execution along with metrics from various genomics alignment tools (e.g., FastQC, Picard, and GATK metrics) and find features that correlate the most with workflow execution time and its tail behavior. Then, based on the results, we will build ML models that accurately predict genomic workflows’ execution times.
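
A first pass at finding the features that correlate most with execution time could rank each candidate metric by the absolute value of its Pearson correlation with the observed times. The sketch below uses only the standard library. The function names are hypothetical, and Pearson correlation is an assumption made for illustration: it captures only linear relationships, so a rank-based measure may suit heavy-tailed execution times better.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def rank_features(features, exec_times):
    """Sort features by |correlation| with execution time, strongest first.

    `features` maps a metric name to its per-workflow values.
    """
    scores = {name: pearson(values, exec_times) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
```

Here `features` would hold one column per metric extracted from the FastQC, Picard, or GATK reports, with one value per workflow run in the same order as `exec_times`.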

Specific tasks:

Acquire a basic understanding of genomics data processing & workflow execution (will be guided by the mentor)
Reproduce past analysis & models built by prior members of the project
Propose features from FastQC/Picard/GATK metrics that can be used as predictors of execution time and tail behavior
Write a brief analysis as to why those features might work
Build ML models for predicting execution time
Package the analysis in the form of Jupyter notebooks
Package the models in a reloadable format (e.g., pickle)
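
The last task, packaging a model in a reloadable format, can be sketched with Python's `pickle` module. The example stores a model's parameters as a plain dictionary so that the file reloads without needing the original class definition. The dictionary layout and the function names are assumptions for illustration, not the project's actual format.

```python
import pickle


def save_model(params, path):
    """Serialize model parameters (a plain dict) to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(params, handle)


def load_model(path):
    """Reload parameters previously written by save_model."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def predict_exec_time(params, input_size_gb):
    """Hypothetical linear model: time = slope * input size + intercept."""
    return params["slope"] * input_size_gb + params["intercept"]
```

Pickle files can execute arbitrary code when loaded, so they should only be loaded from trusted sources. Recording the library versions used for training next to the file also helps others reload the model later.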
