<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>High Performance Computing (HPC) | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/high-performance-computing-hpc/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/high-performance-computing-hpc/index.xml" rel="self" type="application/rss+xml"/><description>High Performance Computing (HPC)</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Sun, 09 Feb 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>High Performance Computing (HPC)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/high-performance-computing-hpc/</link></image><item><title>GeFARe: Discovering Reproducible Failure Scenarios and Developing Failure-Aware Scheduling for Genomic Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uga/gefare/</link><pubDate>Sun, 09 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uga/gefare/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: genomic processing (e.g., DNA and RNA alignment), workflow scheduling, resource/cluster management, container orchestration&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, cloud computing (e.g., OpenStack), cluster manager (e.g., Kubernetes), systems automation (e.g., Bash/Python/Puppet), genomic workflows and applications (e.g., BWA, FastQC, Picard, GATK, STAR)&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea description&lt;/strong>&lt;/h3>
&lt;p>Large-scale genomic workflow executions require large-scale computing infrastructure, as well as high utilization of that infrastructure, to maximize throughput. Systems researchers have developed various techniques to achieve this goal, including scheduling, resource harvesting, tail mitigation, and failure recovery. However, many of these large-scale efforts have been carried out by separate groups/institutions that operate such large-scale infrastructure (e.g., major tech companies and national research labs). Reproducing and building upon these works at a similar scale in an academic environment is challenging – even labs with strong ties to these institutions often have to rely on trace-based research, which does not fully capture the complexities of real-world deployments.&lt;/p>
&lt;p>We observe two fundamental reasons for this difficulty: 1) a lack of computational infrastructure at a comparable scale and 2) a lack of representative workloads and software stacks. Although the academic community has sought to broaden access to large-scale infrastructure through testbeds like ChameleonCloud and CloudLab, the representative workloads and software stacks needed to reproduce the aforementioned works remain limited.&lt;/p>
&lt;p>We aim to address this challenge by providing a robust, easy-to-use, and open-source environment for large-scale genomics workflow scheduling. Specifically, this environment will include:
a) a suite of tools to set up infrastructure on academic cloud testbeds,
b) a scheduling research platform for genomic workflows, and
c) software stacks to reproduce large-scale failure scenarios.&lt;/p>
&lt;p>We limit the scope of this project to only one or two major failure scenarios. For example, out-of-memory (OOM) failures occur when genomics applications run with insufficient available memory. However, we aim to make the software stack extendable for other scenarios whenever possible.&lt;/p>
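&lt;p>As a hedged illustration of the failure-aware scheduling idea above (a sketch, not GeFARe&amp;rsquo;s actual design), a minimal placement check might refuse to schedule a task whose expected peak memory would exceed a node&amp;rsquo;s free memory:&lt;/p>

```python
# Hedged sketch (not GeFARe's design): predict whether placing a task on a
# node risks an out-of-memory (OOM) kill, given the task's expected peak
# resident set size and the node's free memory, both in MB. The safety
# margin is an illustrative parameter, not a measured value.
def predicts_oom(task_peak_rss_mb, node_free_mb, safety_margin=0.1):
    """True if the task's padded peak RSS exceeds the node's free memory."""
    return task_peak_rss_mb * (1.0 + safety_margin) > node_free_mb
```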
&lt;p>Throughout this project, students will learn to use cloud testbeds (e.g., ChameleonCloud) for workflow scheduling research. They will gain hands-on experience in open-source cluster management and container orchestration tools (e.g., Kubernetes) and will also learn about various aspects of high-performance computing when running genomic workflows.&lt;/p>
&lt;p>Finally, we will open-source all the code, software stacks, and datasets created during this project. Using these artifacts, we will also ensure the reproducibility of failure scenarios.&lt;/p>
&lt;h3 id="project-deliverable">&lt;strong>Project Deliverable&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Acquire a basic understanding of genomic data processing (with mentor guidance)&lt;/li>
&lt;li>Build tools to set up a multi-node cluster on ChameleonCloud&lt;/li>
&lt;li>Create automation code/tools to set up genomics workflows’ input and containerized applications&lt;/li>
&lt;li>Discover failure scenarios for genomics workflow execution (with mentor guidance)&lt;/li>
&lt;li>Develop a Kubernetes-based platform to implement scheduling policies (Students may use or build upon existing open-source works)&lt;/li>
&lt;li>Document the steps needed to reproduce the proposed failure scenarios&lt;/li>
&lt;/ul></description></item><item><title>Towards Scalable Performance Benchmarking of Genomics Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240919-martinputra/</link><pubDate>Thu, 19 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240919-martinputra/</guid><description>&lt;h2 id="project-background">Project Background&lt;/h2>
&lt;p>Optimizing genomics workflow execution on a large-scale &amp;amp; heterogeneous cluster requires an in-depth understanding of the resource requirements and utilization patterns of each application in the workflow. Such information can be obtained using a benchmarking tool. However, the performance data generated by such a tool should represent the scale of its target system, lest the design decisions made from it be misguided. My project aims to build &lt;em>GenScale&lt;/em>, the first benchmarking tool that can rapidly generate genomics workload performance data at a scale representative of production systems.&lt;/p>
&lt;p>As Summer of Reproducibility (SoR) 2024 comes to an end, I took the time to reflect on my work on &lt;em>GenScale&lt;/em>, the challenges I faced, and the future work &amp;amp; impact I hope &lt;em>GenScale&lt;/em> creates for our community.&lt;/p>
&lt;h2 id="milestones--challenges">Milestones &amp;amp; Challenges&lt;/h2>
&lt;p>The time I spent working on &lt;em>GenScale&lt;/em> during SoR can be classified into three phases:&lt;/p>
&lt;p>&lt;strong>1. Per-Application Container &amp;amp; Input Creation.&lt;/strong>&lt;/p>
&lt;p>Containerization is the current de facto standard for genomics workflow execution, so I designed &lt;em>GenScale&lt;/em> to execute applications as containers. This required me to package each application included in the benchmark as a container. I used state-of-the-art DNA-Seq &amp;amp; RNA-Seq alignment workflows as references for the list of applications &amp;amp; the workflow structure. The container images &amp;amp; source files I created are publicly available on GitHub &lt;a href="#deliverables">(Deliverables #1)&lt;/a>.&lt;/p>
&lt;p>I also prepared sample inputs for each application to ease the burden on users who are not sufficiently familiar with genomics applications. The effort is not trivial because, in a workflow, the inputs of a given step depend on the outputs of previous step(s). Simply put, to prepare inputs for the last application in a workflow, we need the outputs of the applications executed before it, which in turn require the outputs of another set of applications, and so on until we arrive at the beginning of the workflow. This translates into significant manual labor: carefully tracing &amp;amp; collecting intermediate files from each step of the reference workflows.&lt;/p>
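&lt;p>The input-tracing effort described above can be sketched as a reverse traversal of the workflow graph; the step names and dependency edges below are illustrative, not the actual reference workflows:&lt;/p>

```python
# Hypothetical sketch of the input tracing described above: to benchmark one
# application in isolation, every upstream step that produces its inputs must
# run first. Step names and edges are illustrative, not the real GDC workflow.
WORKFLOW = {
    "fastqc": [],
    "trimmomatic": ["fastqc"],
    "bwa": ["trimmomatic"],
    "picard_markdup": ["bwa"],
    "gatk_bqsr": ["picard_markdup"],
}

def upstream_steps(target, workflow):
    """Return all steps whose outputs are (transitively) needed by target."""
    needed = set()
    stack = list(workflow[target])
    while stack:
        step = stack.pop()
        if step not in needed:
            needed.add(step)
            stack.extend(workflow[step])
    return needed
```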
&lt;p>All inputs are hosted in a public Google Drive and ChameleonCloud object store &lt;a href="#deliverables">(Deliverables #2)&lt;/a>. In total, I prepared containers and inputs for 7 popular genomics applications: BWA, FastQC, Fastq Cleaner, GATK, Picard, STAR, and Trimmomatic.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre24/uga/genomicswf/20240919-martinputra/genscale-stack_hu2d5cfbf95523918b0bcbd89f95a37c1b_91166_5d15908a9f03f47b787a549dbd280a24.webp 400w,
/report/osre24/uga/genomicswf/20240919-martinputra/genscale-stack_hu2d5cfbf95523918b0bcbd89f95a37c1b_91166_b606b0529a38b68c5979566b35e267ed.webp 760w,
/report/osre24/uga/genomicswf/20240919-martinputra/genscale-stack_hu2d5cfbf95523918b0bcbd89f95a37c1b_91166_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240919-martinputra/genscale-stack_hu2d5cfbf95523918b0bcbd89f95a37c1b_91166_5d15908a9f03f47b787a549dbd280a24.webp"
width="760"
height="353"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">&lt;strong>Figure 1.&lt;/strong> Production-grade software used in GenScale: Kubernetes for task orchestration, and Prometheus + Grafana for real-time resource monitoring.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>2. Components Development.&lt;/strong>&lt;/p>
&lt;p>In this phase, &lt;em>GenScale&lt;/em>’s main components were developed. &lt;em>GenScale&lt;/em> consists of three components: (a) Workflow Manager, (b) Task Orchestrator, and (c) Resource Monitor. The Workflow Manager is built from scratch to allow a high degree of freedom when scheduling workflows. I used industry-grade solutions for the other components, namely Kubernetes for orchestrating tasks/containers, and Prometheus + Grafana for real-time resource monitoring. My deliverables include semi-automatic installation scripts &amp;amp; easy-to-follow instructions for setting up all three components &lt;a href="#deliverables">(Deliverables #3)&lt;/a>.&lt;/p>
&lt;p>&lt;strong>3. Performance Data Generation.&lt;/strong>&lt;/p>
&lt;p>The last phase was to use the &lt;em>GenScale&lt;/em> prototype to generate performance data for each application. I focused on collecting data for three types of resources: compute (CPU utilization), memory (resident set size), and I/O (read &amp;amp; write operations over time). &lt;em>GenScale&lt;/em> exports this information into a single CSV file to facilitate easy analysis. My deliverables include performance data for the DNA-Seq and RNA-Seq workflows, along with a sample Python notebook that analyzes the CPU utilization pattern of each application in the DNA-Seq workflow &lt;a href="#deliverables">(Deliverables #4)&lt;/a>.&lt;/p>
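&lt;p>A minimal sketch of the kind of analysis the sample notebook performs on such a CSV (the column names here are assumptions, not &lt;em>GenScale&lt;/em>&amp;rsquo;s actual schema):&lt;/p>

```python
# Sketch of analyzing GenScale-style per-application resource CSVs.
# Column names (app, ts, cpu_pct, rss_mb, io_ops) and values are assumptions.
import csv, io

SAMPLE = """app,ts,cpu_pct,rss_mb,io_ops
bwa,0,780,2100,15
bwa,1,810,2150,12
fastqc,0,95,400,40
"""

def cpu_summary(csv_text):
    """Per-application (mean, peak) CPU utilization, in (num. cores) x 100%."""
    samples = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        samples.setdefault(row["app"], []).append(float(row["cpu_pct"]))
    return {app: (sum(v) / len(v), max(v)) for app, v in samples.items()}
```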
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre24/uga/genomicswf/20240919-martinputra/dnaseq-cpu_util_hu80d53d27b8c7b822ba2a4a4a343ec503_499906_9d39d7375c21c3eae305d20af9a8b7ee.webp 400w,
/report/osre24/uga/genomicswf/20240919-martinputra/dnaseq-cpu_util_hu80d53d27b8c7b822ba2a4a4a343ec503_499906_b8d8ac52b9cb53496558934c8a2b441b.webp 760w,
/report/osre24/uga/genomicswf/20240919-martinputra/dnaseq-cpu_util_hu80d53d27b8c7b822ba2a4a4a343ec503_499906_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240919-martinputra/dnaseq-cpu_util_hu80d53d27b8c7b822ba2a4a4a343ec503_499906_9d39d7375c21c3eae305d20af9a8b7ee.webp"
width="760"
height="614"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">&lt;strong>Figure 2.&lt;/strong> CPU utilization pattern of 9 applications in DNA-Seq Alignment workflow collected by &lt;em>GenScale&lt;/em>. &lt;strong>y-axis&lt;/strong>: &lt;em>(num. cores) x 100%&lt;/em>, &lt;strong>x-axis&lt;/strong>: time elapsed in seconds.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="deliverables">Deliverables&lt;/h2>
&lt;p>This project&amp;rsquo;s deliverables can be found in the following GitHub repo: &lt;a href="https://github.com/martinluttap/sor24-genscale/tree/main" target="_blank" rel="noopener">https://github.com/martinluttap/sor24-genscale/tree/main&lt;/a>. In summary, the deliverables include:&lt;/p>
&lt;ol>
&lt;li>Container Images&lt;/li>
&lt;li>Input Dataset&lt;/li>
&lt;li>Source Code&lt;/li>
&lt;li>Performance Data &amp;amp; Sample Analysis Notebook&lt;/li>
&lt;/ol>
&lt;h2 id="future-works-broader-impacts">Future Works, Broader Impacts&lt;/h2>
&lt;p>Understanding workload characteristics is a crucial step in designing efficient scheduling policies &amp;amp; resource management techniques. &lt;em>GenScale&lt;/em> and the performance data it can generate may be a starting point for such efforts. Furthermore, I hope &lt;em>GenScale&lt;/em> will catalyze meaningful engagement between the computer systems and bioinformatics communities. I believe state-of-the-art systems techniques can greatly aid the computing efforts of the bioinformatics community. Similarly, domain-specific knowledge &amp;amp; problems within bioinformatics provide unique ground for the systems community to further advance its field.&lt;/p></description></item><item><title>Halfway Through SoR24: Building a Scalable Performance Benchmarking Tool for Genomics Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240721-martinputra/</link><pubDate>Sun, 21 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240721-martinputra/</guid><description>&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>Hi! I&amp;rsquo;m Martin Putra, and I&amp;rsquo;m working on the &amp;ldquo;Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster&amp;rdquo; project under the supervision of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>. We are building GenScale, a scalable benchmarking tool for genomics workloads that leverages an industrial-grade cluster manager and monitoring systems. GenScale will allow us to generate performance data under a setup that is representative of large-scale production settings. Ultimately, we hope GenScale and the datasets it produces will catalyze engagement between the computer systems and bioinformatics communities, thus accelerating the pace of discovery in both fields.&lt;/p>
&lt;h2 id="progress-and-challenges">Progress and Challenges&lt;/h2>
&lt;p>We have built a prototype using Kubernetes as the cluster manager and Prometheus as the monitoring system. In its current state, the prototype can support an arbitrary number of compute nodes, owing to Kubernetes’ notable scaling capability. This provides a suitable environment for small- to mid-scale experiments. We leverage ChameleonCloud to provide the necessary computational and reproducibility infrastructure. The monitoring system supports cluster-level, node-level, and container-level metrics collection and failure detection. We integrated Grafana dashboards for visualizations.&lt;/p>
&lt;p>The prototype also supports the execution of user-defined workflows. During the design process, we considered integrating one of the existing workflow execution systems, such as &lt;a href="https://github.com/common-workflow-language/cwltool" target="_blank" rel="noopener">cwltool&lt;/a>, &lt;a href="https://www.nextflow.io" target="_blank" rel="noopener">Nextflow&lt;/a>, or &lt;a href="https://github.com/broadinstitute/cromwell" target="_blank" rel="noopener">Cromwell&lt;/a>. Each system has its own pros and cons in the context of how we envision GenScale. However, we ultimately decided to build our own workflow execution system to provide maximum flexibility for the capabilities we plan to add in the future. For example, we believe it will be interesting to study how hardware heterogeneity affects the performance of each application in the workflow (a well-known workflow scheduling problem); studying this requires the capability to schedule execution on specific machines. In addition, if we want to study contention, we may need to execute on machines that are currently running specific workflows. While there are ways to do this with an existing workflow execution system + Kubernetes stack, we believe it will be hugely simplified by building our own workflow execution system.&lt;/p>
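&lt;p>As a sketch of one capability motivating the custom Workflow Manager, a task can be pinned to a specific machine via Kubernetes&amp;rsquo; standard &lt;code>nodeSelector&lt;/code> field; the task name, image, and hostname below are hypothetical:&lt;/p>

```python
# Illustrative sketch: building a pod manifest that pins a workflow task to a
# named node via Kubernetes' standard nodeSelector field. The task name,
# image, and node hostname are hypothetical, not GenScale's actual values.
def pinned_pod_spec(task_name, image, node_hostname):
    """Plain-dict pod manifest scheduling the task onto one specific node."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": task_name},
        "spec": {
            "nodeSelector": {"kubernetes.io/hostname": node_hostname},
            "restartPolicy": "Never",
            "containers": [{"name": task_name, "image": image}],
        },
    }
```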
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre24/uga/genomicswf/20240721-martinputra/dnaseq-exec_time_proportion_hu2a99ec14fc56f180a344028699f1df1c_255588_8dedba866f2dae2e3c155c6037bb3c4c.webp 400w,
/report/osre24/uga/genomicswf/20240721-martinputra/dnaseq-exec_time_proportion_hu2a99ec14fc56f180a344028699f1df1c_255588_0134d07c43c3857435ab5c59f410ed7f.webp 760w,
/report/osre24/uga/genomicswf/20240721-martinputra/dnaseq-exec_time_proportion_hu2a99ec14fc56f180a344028699f1df1c_255588_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240721-martinputra/dnaseq-exec_time_proportion_hu2a99ec14fc56f180a344028699f1df1c_255588_8dedba866f2dae2e3c155c6037bb3c4c.webp"
width="760"
height="497"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">&lt;strong>Figure 1.&lt;/strong> Proportion of execution time for DNA Alignment applications, executed on Chameleon&amp;rsquo;s &lt;em>cascadelake_r&lt;/em> node with 1500MB paired-end input. &lt;strong>y-axis:&lt;/strong> proportion of application&amp;rsquo;s exec. time out of the whole workflow&amp;rsquo;s exec. time, &lt;strong>x-axis:&lt;/strong> top 10 applications accounting for 97% exec. time, sorted by proportion. Other applications are aggregated.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>We confirmed GenScale’s capability to produce useful data by executing a DNA alignment workflow and capturing its runtime resource usage. We use &lt;a href="https://github.com/NCI-GDC/gdc-dnaseq-cwl" target="_blank" rel="noopener">Genomics Data Commons’ (GDC) DNA alignment workflow&lt;/a> as a reference; it comprises 27 applications spanning quality checking, read trimming, the actual alignment, indexing, and various metrics collection. We wrote our own simplified version of the workflow by first analyzing the execution time &amp;amp; resource usage of each application, then choosing the 10 applications that represent 97% of the workflow execution time. Since containerization is the de facto standard for workflow execution in the bioinformatics community, we packaged each application as its own separate container, then hosted the Dockerfiles &amp;amp; containers in a private GitHub Container Registry (GHCR). We plan to make them public in the future. Our monitoring system is able to show resource usage in real time. We also built sidecar containers that use pidstat to generate a CSV of core, memory, and storage utilization throughout each workflow’s execution. This will allow easier analysis and data sharing for GenScale’s users.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre24/uga/genomicswf/20240721-martinputra/bwa_picardwgs_picardvalidate-cpu_hu75193f344cb2afdf6e001b1bc5f51540_1054163_7dea08952ec6bc07cee0579c31500d17.webp 400w,
/report/osre24/uga/genomicswf/20240721-martinputra/bwa_picardwgs_picardvalidate-cpu_hu75193f344cb2afdf6e001b1bc5f51540_1054163_0a311fc327ad5f4a739e574c86795b70.webp 760w,
/report/osre24/uga/genomicswf/20240721-martinputra/bwa_picardwgs_picardvalidate-cpu_hu75193f344cb2afdf6e001b1bc5f51540_1054163_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240721-martinputra/bwa_picardwgs_picardvalidate-cpu_hu75193f344cb2afdf6e001b1bc5f51540_1054163_7dea08952ec6bc07cee0579c31500d17.webp"
width="760"
height="209"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">&lt;strong>Figure 2.&lt;/strong> CPU utilization pattern of &lt;a href="https://github.com/lh3/bwa" target="_blank" rel="noopener">BWA&lt;/a>, &lt;a href="https://gatk.broadinstitute.org/hc/en-us/articles/360037269351-CollectWgsMetrics-Picard" target="_blank" rel="noopener">Picard&amp;rsquo;s CollectWGSMetrics&lt;/a>, and &lt;a href="https://gatk.broadinstitute.org/hc/en-us/articles/360036854731-ValidateSamFile-Picard" target="_blank" rel="noopener">Picard&amp;rsquo;s ValidateSamFile&lt;/a> collected by &lt;em>GenScale&lt;/em>. &lt;strong>y-axis&lt;/strong>: &lt;em>(num. cores) x 100%&lt;/em>, &lt;strong>x-axis&lt;/strong>: time elapsed in seconds.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>One technical challenge is automating the creation of the Kubernetes cluster and keeping it alive. We believe GenScale’s users would be interested in the performance of workflows under dynamic cluster sizes, whether due to intentional scaling or machine failures. While the current prototype supports creating a cluster with an arbitrary number of nodes, some steps still require a reboot when adding nodes, so cluster creation and horizontal scaling are not fully automated yet. Keeping a cluster alive is also expensive. Since we use ChameleonCloud as our testbed, we can either keep the cluster alive at the cost of significant service unit (SU) usage, or save SUs by terminating our leases at the cost of rebuilding the cluster from scratch later. We chose a middle ground by keeping only Kubernetes’ control plane alive; the approach has worked well so far.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>For the remaining weeks, we plan to work on the second workflow, namely &lt;a href="https://github.com/NCI-GDC/gdc-rnaseq-cwl" target="_blank" rel="noopener">RNA Alignment&lt;/a>. We would also like to add simple user interfaces if time permits. Finally, we plan to package GenScale’s source code, container images, and sample benchmark results for the open-source community. We look forward to the second half of Summer of Reproducibility!&lt;/p></description></item><item><title>Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240612-martinputra/</link><pubDate>Wed, 12 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240612-martinputra/</guid><description>&lt;p>Hi! I&amp;rsquo;m Martin, and I will be working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uga/genomicswf/">Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>. Our work is driven by the scale of computing systems that host data commons &amp;ndash; we believe that performance characterization of genomics workloads should be done &lt;em>rapidly&lt;/em> and at a &lt;em>scale&lt;/em> similar to production settings. &lt;a href="https://drive.google.com/file/d/1LmOpCKv09ZGKlkG6VNleWBZ792nIuVOf/view?usp=sharing" target="_blank" rel="noopener">Feel free to check out our proposal&lt;/a> for more details!&lt;/p>
&lt;p>We propose &lt;em>GenScale&lt;/em>, a genomics workload benchmarking tool that can achieve both the scale and speed necessary for characterizing performance in large-scale settings. &lt;em>GenScale&lt;/em> will be built on top of an industrial-grade cluster manager (e.g., Kubernetes) and metrics collection &amp;amp; monitoring systems (e.g., Prometheus), and will support a comprehensive set of applications used in state-of-the-art genomics workflows. The initial version developed during this project will include DNA and RNA alignment workflows.&lt;/p>
&lt;p>Finally, we believe that open access and reproducible research will greatly accelerate the pace of scientific discovery. We aim to package our artifacts and generated datasets in ways that make them easy to replicate, analyze, and build upon. I personally look forward to learning from &amp;amp; contributing to the open-source community!&lt;/p></description></item><item><title>Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uga/genomicswf/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uga/genomicswf/</guid><description>&lt;p>&lt;strong>Project Idea description&lt;/strong>&lt;/p>
&lt;p>We aim to characterize the performance of genomic workflows on HPC clusters by conducting two research activities using a broad set of state-of-the-art genomic applications and open-source datasets.&lt;/p>
&lt;p>&lt;strong>Performance Benchmarking and Characterizing Genomic Workflows:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: High Performance Computing (HPC), Data Analysis, Scientific Workflows&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, Python, Bash Scripting, Data Science Toolkit, Kubernetes, Container Orchestration, Genomics Applications (e.g. BWA, FastQC, Picard, GATK, STAR)&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In this activity, students will perform comprehensive performance measurements of genomic data processing on HPC clusters using state-of-the-art applications, workflows, and real-world datasets. They will collect and package datasets for I/O, memory, and compute utilization using industry-standard tools and best practices. Measurements will be done using Kubernetes container orchestration on a multi-node cluster to achieve scalability, with either a custom-made metrics collection system or the integration of existing industry-standard tools (e.g., Prometheus).&lt;/p>
&lt;p>&lt;strong>Quantifying Performance Interference and Assessing Their Impact on Workflow Execution Time:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Machine Learning, Data Analysis, and Scientific Workflows and Computations&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, Python, Bash Scripting, Data Science Toolkit, Kubernetes, Container Orchestration&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In this activity, students will measure the slowdown of various applications due to resource contention (e.g., CPU and I/O). Students will analyze whether an application is compute-bound, I/O-bound, or both, then analyze the correlation between resource utilization and execution time. Following that, students will assess the impact of per-application slowdown on the slowdown of the whole workflow. To the best of our knowledge, this will be the first study that systematically quantifies per-application interference when running genomics workflows on an HPC cluster.&lt;/p>
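&lt;p>As a simple illustration of the quantities involved (under the simplifying assumption that applications run sequentially, which real workflows may violate), the per-application and whole-workflow slowdowns could be computed as:&lt;/p>

```python
# Illustrative slowdown computation, assuming sequential execution.
def slowdown(solo_s, contended_s):
    """Per-application slowdown factor due to resource contention."""
    return contended_s / solo_s

def workflow_slowdown(solo_times, contended_times):
    """Whole-workflow slowdown aggregated from per-application times."""
    return sum(contended_times) / sum(solo_times)
```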
&lt;p>For both subprojects, all experiments will also be conducted in a reproducible manner (e.g., as a Trovi package or Chameleon VM images), and all code will be open-sourced (e.g., shared on a public GitHub repo).&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>:&lt;/p>
&lt;p>A Github repository and/or Chameleon VM image containing source code for application executions &amp;amp; metrics collection.
Jupyter notebooks and/or Trovi artifacts containing analysis and mathematical models for application resource utilization &amp;amp; the effects of data quality.&lt;/p></description></item></channel></rss>