In Kee Kim | UCSC OSPO

GeFARe: Discovering Reproducible Failure Scenarios and Developing Failure-Aware Scheduling for Genomic Workflows

Sun, 09 Feb 2025 00:00:00 +0000

Topics: genomic processing (e.g., DNA and RNA alignment), workflow scheduling, resource/cluster management, container orchestration
Skills: Linux, cloud computing (e.g., OpenStack), cluster manager (e.g., Kubernetes), systems automation (e.g., Bash/Python/Puppet), genomic workflows and applications (e.g., BWA, FastQC, Picard, GATK, STAR)
Difficulty: Hard
Size: Large (350 hours)
Mentor(s): In Kee Kim

Project Idea description

Large-scale genomic workflow executions require large-scale computing infrastructure, as well as high utilization of that infrastructure, to maximize throughput. Systems researchers have developed various techniques to achieve this goal, including scheduling, resource harvesting, tail mitigation, and failure recovery. However, many of these large-scale efforts have been carried out by separate groups/institutions that operate such large-scale infrastructure (e.g., major tech companies and national research labs). Reproducing and building upon these works at a similar scale in an academic environment is challenging – even labs with strong ties to these institutions often have to rely on trace-based research, which does not fully capture the complexities of real-world deployments.

We observe two fundamental reasons for this difficulty: 1) a lack of computational infrastructure at a comparable scale and 2) a lack of representative workloads and software stacks. Although the academic community has sought to broaden access to large-scale infrastructure through testbeds like ChameleonCloud and CloudLab, the representative workloads and software stacks to reproduce aforementioned works remain limited.

We aim to address this challenge by providing a robust, easy-to-use, and open-source environment for large-scale genomics workflow scheduling. Specifically, this environment will include: a) a suite of tools to set up infrastructure on academic cloud testbeds, b) a scheduling research platform for genomic workflows, and c) software stacks to reproduce large-scale failure scenarios.

We limit the scope of this project to only one or two major failure scenarios. For example, out-of-memory (OOM) failures occur when genomics applications run with insufficient available memory. However, we aim to make the software stack extendable for other scenarios whenever possible.

Throughout this project, students will learn to use cloud testbeds (e.g., ChameleonCloud) for workflow scheduling research. They will gain hands-on experience in open-source cluster management and container orchestration tools (e.g., Kubernetes) and will also learn about various aspects of high-performance computing when genomic workflows.

Finally, we will open-source all the code, software stacks, and datasets created during this project. Using these artifacts, we will also ensure the reproducibility of failure scenarios.

Project Deliverable

Acquire a basic understanding of genomic data processing (will mentor guidance)
Build tools to set up a multi-node cluster on ChameleonCloud
Create automation code/tools to set up genomics workflows’ input and containerized applications
Discovering failure scenarios for genomics workflow execution (will mentor guidance)
Develop a Kubernetes-based platform to implement scheduling policies (Students may use or build upon existing open-source works)
Document the steps needed to reproduce the proposed failure scenarios

Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster

Fri, 02 Feb 2024 00:00:00 +0000

Project Idea description

We aim to characterize the performance of genomic workflows on HPC clusters by conducting two research activities using a broad set of state-of-the-art genomic applications and open-source datasets.

Performance Benchmarking and Characterizing Genomic Workflows:

Topics: High Performance Computing (HPC), Data Analysis, Scientific Workflows
Skills: Linux, Python, Bash Scripting, Data Science Toolkit, Kubernetes, Container Orchestration, Genomics Applications (e.g. BWA, FastQC, Picard, GATK, STAR)
Difficulty: Medium
Size: Large (350 hours)
Mentor(s): In Kee Kim

In this activity, students will perform comprehensive performance measurements of genomic data processing on HPC clusters using state-of-the-art applications, workflows, and real-world datasets. They will collect and package datasets for I/O, memory, and compute utilization using industry-standard tools and best practices. Measurement will be done using Kubernetes container orchestration on a multi-node cluster to achieve scalability, with either custom-made metrics collection system or integration of existing industry standard tools. (e.g. Prometheus).

Quantifying Performance Interference and Assessing Their Impact on Workflow Execution Time:

Topics: Machine Learning, Data Analysis, and Scientific Workflows and Computations
Skills: Linux, Python, Bash Scripting, Data Science Toolkit, Kubernetes, Container Orchestration
Difficulty: Difficult
Size: Medium (175 hours)
Mentor(s): In Kee Kim

In this activity, students will measure the slowdown of various applications due to resource contention (e.g. CPU and I/O). Students will analyze whether an application is compute-bound, I/O bound, or both, then analyze the correlation between resource utilization and execution time. Following that, students will assess the impact of per-application slowdown to the slowdown of a whole workflow. To the best of our knowledge, this will be the first study which systematically quantifies per-application interference when running genomics workflow on an HPC cluster.

For both subprojects, all experiments will also be conducted in a reproducible manner (e.g., as a Trovi package or Chameleon VM images), and all code will be open-sourced (e.g., shared on a public Github repo).

Project Deliverable:

A Github repository and/or Chameleon VM image containing source code for application executions & metrics collection. Jupyter notebooks and/or Trovi artifacts containing analysis and mathematical models for application resource utilization & the effects of data quality.

Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time

Thu, 02 Feb 2023 00:00:00 +0000

A high-throughput workflow execution system is needed to continuously gain insights from th e increasingly abundant genomics data. However, genomics workflows often have long execution times (e.g., hours to days) due to their large input files. This characteristic presents many complexities when managing systems for genomics workflow execution. Furthermore, based on our observation of a large-scale genomics data processing platform, ~2% of genomics workflows exhibit a tail behavior which multiplied their execution time up to 15x of the median, resulting in weeks of execution.

On the other hand, input files for genomic workflows often vary in quality due to differences in how they are collected. Prior works suggested that these quality differences can affect genomics workflow execution time. Yet, to the best of our knowledge, input quality has never been accounted for in the design of a high-throughput workflow execution system. Even worse, there does not appear to be a consensus on what constitutes ‘input quality,’ at least from a computer systems perspective.

In this project, we seek to analyze a huge dataset from a large-scale genomics processing platform in order to gain insights on how ‘input quality’ affects genomic workflows’ execution times. Following that, we will build machine learning (ML) models for predicting workflow execution time, in particular those which exhibit tail behavior. We believe these insights and models can become the foundation for designing a novel tail-resilient genomics workflow execution system. Along the way, we will ensure that each step of our analysis is reproducible (e.g., in the form of Jupyter notebooks) and make all our ML models open-source (e.g., in the form of pre-trained models). We sincerely hope our work can offload some burdens commonly faced by operators of systems for genomics and, at the same time, benefit future researchers who work on the intersection of computer systems and genomics.

Analyze genomics data quality & build exec. time prediction models

Topics: genomics, data analysis, machine learning
Skills: Linux, Python, Matplotlib, Pandas/Numpy, any ML library
Difficulty: Medium
Size: 350 hours
Mentor(s): In Kee Kim
Contributor(s): Charis Christopher Hulu

Analyze a large-scale trace of genomics workflow execution along with metrics from various genomics alignment tools (e.g., FastQC, Picard, and GATK metrics) and find features that correlate the most with workflow execution time and its tail behavior. Then, based on the results, we will build ML models that accurately predict genomic workflows’ execution times.

Specific tasks:

Acquire basic understanding of genomics data processing & workflow execution (will be guided by the mentor)
Reproduce past analysis & models built by prior members of the project
Propose features from FastQC/Picard/GATK metrics that can be used as a predictor for execution time and tail behavior
Write a brief analysis as to why those features might work
Build ML models for predicting execution time
Package the analysis in the form of Jupyter notebooks
Package the models in a reloadable format (e.g., pickle)