<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Martin L. Putra | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/martin-l.-putra/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/martin-l.-putra/index.xml" rel="self" type="application/rss+xml"/><description>Martin L. Putra</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/martin-l.-putra/avatar_hud5accbf5cbb22a94b05f76ad4e563ea9_6545693_270x270_fill_q75_lanczos_center.jpg</url><title>Martin L. Putra</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/martin-l.-putra/</link></image><item><title>Towards Scalable Performance Benchmarking of Genomics Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240919-martinputra/</link><pubDate>Thu, 19 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240919-martinputra/</guid><description>&lt;h2 id="project-background">Project Background&lt;/h2>
&lt;p>Optimizing genomics workflow execution on large-scale &amp;amp; heterogeneous clusters requires an in-depth understanding of the resource requirements and utilization patterns of each application in the workflow. Such information can be obtained with a benchmarking tool. However, the performance data generated by such a tool should represent the scale of its target system, lest the design decisions made from it be misguided. My project aims to build &lt;em>GenScale&lt;/em>, the first benchmarking tool that can rapidly generate genomics workload performance data at a scale representative of production systems.&lt;/p>
&lt;p>As Summer of Reproducibility (SoR) 2024 comes to an end, I took some time to reflect on my work on &lt;em>GenScale&lt;/em>, the challenges I faced, and the future work &amp;amp; impact I hope &lt;em>GenScale&lt;/em> creates for our community.&lt;/p>
&lt;h2 id="milestones--challenges">Milestones &amp;amp; Challenges&lt;/h2>
&lt;p>The time I spent working on &lt;em>GenScale&lt;/em> during SoR can be classified into three phases:&lt;/p>
&lt;p>&lt;strong>1. Per-Application Container &amp;amp; Input Creation.&lt;/strong>&lt;/p>
&lt;p>Containerization is the current de-facto standard for genomics workflow execution, so I designed &lt;em>GenScale&lt;/em> to execute applications as containers. This required me to package each application included in the benchmark as a container. I used state-of-the-art DNA-Seq &amp;amp; RNA-Seq alignment workflows as references for the list of applications &amp;amp; the workflow structure. The container images &amp;amp; source files I created are publicly available on GitHub &lt;a href="#deliverables">(Deliverables #1)&lt;/a>.&lt;/p>
&lt;p>I also prepared sample inputs for each application to ease the burden on users who are not sufficiently familiar with genomics applications. This effort is not trivial, because in a workflow the inputs of a given step depend on the outputs of previous step(s). Simply put, to prepare inputs for the last application in a workflow, we need the outputs of the applications executed before it, which in turn require the outputs of yet another set of applications, and so on until we arrive at the beginning of the workflow. This translates into significant manual labor of carefully tracing &amp;amp; collecting intermediate files from each step of the reference workflows.&lt;/p>
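&lt;p>The tracing can be viewed as a reverse traversal of the workflow graph: starting from a target application, walk backwards through its dependencies and collect every upstream step whose outputs must be materialized first. A minimal sketch of the idea in Python (the step names and graph below are illustrative, not the actual reference workflows):&lt;/p>

```python
# Sketch: given a workflow DAG mapping each step to the steps whose
# outputs it consumes, list every upstream step that must run (and
# therefore every intermediate file to collect) before a target step.
# The DAG below is illustrative, not the real DNA-Seq workflow.
def upstream_steps(dag, target):
    needed, stack, seen = [], list(dag.get(target, [])), set()
    while stack:
        step = stack.pop()
        if step in seen:
            continue
        seen.add(step)
        needed.append(step)
        stack.extend(dag.get(step, []))
    return needed

dag = {
    "align": ["trim", "qc"],
    "trim": ["qc"],
    "qc": [],
    "mark_duplicates": ["align"],
}
print(sorted(upstream_steps(dag, "mark_duplicates")))  # ['align', 'qc', 'trim']
```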
&lt;p>All inputs are hosted in a public Google Drive and ChameleonCloud object store &lt;a href="#deliverables">(Deliverables #2)&lt;/a>. In total, I prepared containers and inputs for 7 popular genomics applications: BWA, FastQC, Fastq Cleaner, GATK, Picard, STAR, and Trimmomatic.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre24/uga/genomicswf/20240919-martinputra/genscale-stack_hu2d5cfbf95523918b0bcbd89f95a37c1b_91166_5d15908a9f03f47b787a549dbd280a24.webp 400w,
/report/osre24/uga/genomicswf/20240919-martinputra/genscale-stack_hu2d5cfbf95523918b0bcbd89f95a37c1b_91166_b606b0529a38b68c5979566b35e267ed.webp 760w,
/report/osre24/uga/genomicswf/20240919-martinputra/genscale-stack_hu2d5cfbf95523918b0bcbd89f95a37c1b_91166_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240919-martinputra/genscale-stack_hu2d5cfbf95523918b0bcbd89f95a37c1b_91166_5d15908a9f03f47b787a549dbd280a24.webp"
width="760"
height="353"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">&lt;strong>Figure 1.&lt;/strong> Production-grade software used in GenScale: Kubernetes for task orchestration, and Prometheus + Grafana for real-time resource monitoring.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>2. Components Development.&lt;/strong>&lt;/p>
&lt;p>In this phase, &lt;em>GenScale&lt;/em>&amp;rsquo;s main components were developed. &lt;em>GenScale&lt;/em> consists of three components: (a) Workflow Manager, (b) Task Orchestrator, and (c) Resource Monitor. The Workflow Manager is built from scratch to allow a high degree of freedom when scheduling workflows. I use industry-grade solutions for the other components, namely Kubernetes for orchestrating tasks / containers, and Prometheus + Grafana for real-time resource monitoring. My deliverables include semi-automatic installation scripts &amp;amp; easy-to-follow instructions to set up all three components &lt;a href="#deliverables">(Deliverables #3)&lt;/a>.&lt;/p>
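&lt;p>At its core, a workflow manager's scheduling loop repeatedly launches every task whose dependencies have completed. A highly simplified sketch of that loop (in the real system each launch would create a Kubernetes pod; here the launch is a stub and the task names are illustrative):&lt;/p>

```python
# Sketch of a workflow manager's core scheduling loop: repeatedly
# launch every task whose dependencies have all completed. In GenScale
# each launch would become a Kubernetes pod; here `run_task` is a stub.
def run_workflow(deps, run_task):
    done, order = set(), []
    while len(done) != len(deps):
        ready = [t for t in deps
                 if t not in done and all(d in done for d in deps[t])]
        if not ready:
            raise ValueError("cycle or unsatisfiable dependency")
        for task in sorted(ready):  # deterministic order for the sketch
            run_task(task)
            done.add(task)
            order.append(task)
    return order

deps = {"qc": [], "trim": ["qc"], "align": ["trim"], "metrics": ["align"]}
print(run_workflow(deps, lambda t: None))  # ['qc', 'trim', 'align', 'metrics']
```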
&lt;p>&lt;strong>3. Performance Data Generation.&lt;/strong>&lt;/p>
&lt;p>The last phase was to use the &lt;em>GenScale&lt;/em> prototype to generate performance data for each application. I focused on collecting data for three types of resources: compute (CPU utilization), memory (resident set size), and I/O (read &amp;amp; write operations over time). &lt;em>GenScale&lt;/em> exports this information into a single CSV file to facilitate easy analysis. My deliverables include performance data for the DNA-Seq and RNA-Seq workflows. I also provide a sample Python notebook which analyzes the CPU utilization pattern of each application in the DNA-Seq workflow &lt;a href="#deliverables">(Deliverables #4)&lt;/a>.&lt;/p>
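&lt;p>As an illustration of the kind of analysis the sample notebook performs, the sketch below loads a CSV of time-series samples and summarizes per-application CPU utilization. The column names and values are assumptions for illustration, not GenScale's actual export schema:&lt;/p>

```python
import csv, io
from collections import defaultdict

# Sketch: summarize peak and mean CPU utilization per application from
# a GenScale-style CSV export. Column names are assumed for illustration.
sample = """app,t,cpu_pct
bwa,0,380.0
bwa,1,390.0
fastqc,0,95.0
fastqc,1,105.0
"""

by_app = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample)):
    by_app[row["app"]].append(float(row["cpu_pct"]))

# (peak, mean) CPU utilization per application
summary = {app: (max(v), sum(v) / len(v)) for app, v in by_app.items()}
print(summary["bwa"])  # (390.0, 385.0)
```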
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre24/uga/genomicswf/20240919-martinputra/dnaseq-cpu_util_hu80d53d27b8c7b822ba2a4a4a343ec503_499906_9d39d7375c21c3eae305d20af9a8b7ee.webp 400w,
/report/osre24/uga/genomicswf/20240919-martinputra/dnaseq-cpu_util_hu80d53d27b8c7b822ba2a4a4a343ec503_499906_b8d8ac52b9cb53496558934c8a2b441b.webp 760w,
/report/osre24/uga/genomicswf/20240919-martinputra/dnaseq-cpu_util_hu80d53d27b8c7b822ba2a4a4a343ec503_499906_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240919-martinputra/dnaseq-cpu_util_hu80d53d27b8c7b822ba2a4a4a343ec503_499906_9d39d7375c21c3eae305d20af9a8b7ee.webp"
width="760"
height="614"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">&lt;strong>Figure 2.&lt;/strong> CPU utilization pattern of 9 applications in the DNA-Seq Alignment workflow, collected by &lt;em>GenScale&lt;/em>. &lt;strong>y-axis&lt;/strong>: &lt;em>(num. cores) x 100%&lt;/em>, &lt;strong>x-axis&lt;/strong>: time elapsed in seconds.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="deliverables">Deliverables&lt;/h2>
&lt;p>This project&amp;rsquo;s deliverables can be found in the following GitHub repo: &lt;a href="https://github.com/martinluttap/sor24-genscale/tree/main" target="_blank" rel="noopener">https://github.com/martinluttap/sor24-genscale/tree/main&lt;/a>. In summary, the deliverables include:&lt;/p>
&lt;ol>
&lt;li>Container Images&lt;/li>
&lt;li>Input Dataset&lt;/li>
&lt;li>Source Code&lt;/li>
&lt;li>Performance Data &amp;amp; Sample Analysis Notebook&lt;/li>
&lt;/ol>
&lt;h2 id="future-works-broader-impacts">Future Works, Broader Impacts&lt;/h2>
&lt;p>Understanding workload characteristics is a crucial step toward designing efficient scheduling policies &amp;amp; resource management techniques. &lt;em>GenScale&lt;/em> and the performance data it can generate may serve as a starting point for such efforts. Furthermore, I hope &lt;em>GenScale&lt;/em> will catalyze meaningful engagement between the computer systems and bioinformatics communities. I believe state-of-the-art systems techniques can greatly aid the computing efforts of the bioinformatics community. Similarly, domain-specific knowledge &amp;amp; problems within bioinformatics provide unique grounds for the systems community to further advance their field.&lt;/p></description></item><item><title>Halfway Through SoR24: Building a Scalable Performance Benchmarking Tool for Genomics Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240721-martinputra/</link><pubDate>Sun, 21 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240721-martinputra/</guid><description>&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>Hi! I&amp;rsquo;m Martin Putra, and I&amp;rsquo;m working on the &amp;ldquo;Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster&amp;rdquo; project under the supervision of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>. We are building GenScale, a scalable benchmarking tool for genomics workflows which leverages an industrial-grade cluster manager and monitoring systems. GenScale will allow us to generate performance data under a setup that is representative of large-scale production settings. Ultimately, we hope GenScale and the datasets it produces will catalyze engagement between the computer systems and bioinformatics communities, thus accelerating the pace of discovery in both fields.&lt;/p>
&lt;h2 id="progress-and-challenges">Progress and Challenges&lt;/h2>
&lt;p>We have built a prototype using Kubernetes as the cluster manager and Prometheus as the monitoring system. In its current state, the prototype can support an arbitrary number of compute nodes, owing to Kubernetes’ notable scaling capability. This provides a suitable environment for small- to mid-scale experiments. We leverage ChameleonCloud to provide the necessary computational and reproducibility infrastructure. The monitoring system supports cluster-level, node-level, and container-level metrics collection and failure detection. We integrated Grafana dashboards for visualization.&lt;/p>
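&lt;p>As an example of container-level metrics collection, per-container CPU usage can be pulled from Prometheus over its HTTP API with a PromQL query like the one below. The server URL and container name are assumptions for illustration; &lt;em>container_cpu_usage_seconds_total&lt;/em> is the standard cAdvisor metric that Kubernetes clusters expose:&lt;/p>

```python
from urllib.parse import urlencode

# Sketch: build a Prometheus HTTP API query for container-level CPU
# usage. The server URL and container name are assumptions for
# illustration; container_cpu_usage_seconds_total is the standard
# cAdvisor metric exposed in Kubernetes clusters.
PROM = "http://prometheus.example:9090"  # hypothetical server address
promql = 'rate(container_cpu_usage_seconds_total{container="bwa"}[1m])'
url = PROM + "/api/v1/query?" + urlencode({"query": promql})
print(url)
# In practice: fetch `url` with urllib.request.urlopen and parse the
# JSON response's result vector.
```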
&lt;p>The prototype also supports the execution of user-defined workflows. During the design process, we considered integrating one of the existing workflow execution systems, such as &lt;a href="https://github.com/common-workflow-language/cwltool" target="_blank" rel="noopener">cwltool&lt;/a>, &lt;a href="https://www.nextflow.io" target="_blank" rel="noopener">Nextflow&lt;/a>, or &lt;a href="https://github.com/broadinstitute/cromwell" target="_blank" rel="noopener">Cromwell&lt;/a>. Each system has its own pros and cons when placed within the context of how we envision GenScale. However, we ultimately decided to build our own workflow execution system in order to provide maximum flexibility for the capabilities we plan to add in the future. For example, we believe it will be interesting to study how hardware heterogeneity affects the performance of each application in the workflow (a well-known workflow scheduling problem). Studying this problem requires the capability to schedule execution on specific machines. In addition, if we want to study contention, we may need to execute on machines which are currently running specific workflows, too. While there are ways to achieve this with an existing workflow execution system + Kubernetes stack, we believe these capabilities will be much simpler to support if we build our own workflow execution system.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre24/uga/genomicswf/20240721-martinputra/dnaseq-exec_time_proportion_hu2a99ec14fc56f180a344028699f1df1c_255588_8dedba866f2dae2e3c155c6037bb3c4c.webp 400w,
/report/osre24/uga/genomicswf/20240721-martinputra/dnaseq-exec_time_proportion_hu2a99ec14fc56f180a344028699f1df1c_255588_0134d07c43c3857435ab5c59f410ed7f.webp 760w,
/report/osre24/uga/genomicswf/20240721-martinputra/dnaseq-exec_time_proportion_hu2a99ec14fc56f180a344028699f1df1c_255588_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240721-martinputra/dnaseq-exec_time_proportion_hu2a99ec14fc56f180a344028699f1df1c_255588_8dedba866f2dae2e3c155c6037bb3c4c.webp"
width="760"
height="497"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">&lt;strong>Figure 1.&lt;/strong> Proportion of execution time for DNA Alignment applications, executed on Chameleon&amp;rsquo;s &lt;em>cascadelake_r&lt;/em> node with 1500MB paired-end input. &lt;strong>y-axis:&lt;/strong> proportion of each application&amp;rsquo;s exec. time out of the whole workflow&amp;rsquo;s exec. time, &lt;strong>x-axis:&lt;/strong> top 10 applications accounting for 97% of exec. time, sorted by proportion. Other applications are aggregated.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>We confirmed GenScale’s capability to produce useful data by executing a DNA alignment workflow and capturing its runtime resource usage. We use &lt;a href="https://github.com/NCI-GDC/gdc-dnaseq-cwl" target="_blank" rel="noopener">Genomics Data Commons’ (GDC) DNA alignment workflow&lt;/a> as a reference, which has a total of 27 applications spanning quality checking, read trimming, the actual alignment, indexing, and various metrics collection. We wrote our own simplified version of the workflow by first analyzing the execution time &amp;amp; resource usage of each application, then choosing the 10 applications which represent 97% of the workflow execution time. We took into account that containerization is the de-facto standard for workflow execution in the bioinformatics community. Thus, we packaged each application as its own separate container, then hosted their Dockerfiles &amp;amp; containers in a private GitHub Container Registry (GHCR). We plan to make them public in the future. Our monitoring system is able to show resource usage in real time. We also built sidecar containers which use Unix’s pidstat to generate a CSV of core, memory, and storage utilization throughout each workflow’s execution. This will allow easier analysis and data sharing for GenScale’s users.&lt;/p>
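&lt;p>The sidecar's conversion step can be sketched as parsing periodic pidstat samples into CSV rows. Note that pidstat's column layout varies across sysstat versions, so the field positions below are an assumption for illustration:&lt;/p>

```python
import csv, io

# Sketch of the sidecar's conversion step: parse pidstat-style samples
# into CSV rows. The column layout varies across sysstat versions, so
# the field positions below are an assumption for illustration.
sample_lines = [
    "10:00:01 1000 4242 310.0 12.0 0.0 322.0 3 bwa",
    "10:00:02 1000 4242 305.0 11.0 0.0 316.0 3 bwa",
]

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["time", "pid", "cpu_pct", "command"])
for line in sample_lines:
    f = line.split()
    # assumed positions: time, PID, total %CPU, command name
    writer.writerow([f[0], f[2], f[6], f[8]])

print(out.getvalue().splitlines()[1])  # 10:00:01,4242,322.0,bwa
```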
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre24/uga/genomicswf/20240721-martinputra/bwa_picardwgs_picardvalidate-cpu_hu75193f344cb2afdf6e001b1bc5f51540_1054163_7dea08952ec6bc07cee0579c31500d17.webp 400w,
/report/osre24/uga/genomicswf/20240721-martinputra/bwa_picardwgs_picardvalidate-cpu_hu75193f344cb2afdf6e001b1bc5f51540_1054163_0a311fc327ad5f4a739e574c86795b70.webp 760w,
/report/osre24/uga/genomicswf/20240721-martinputra/bwa_picardwgs_picardvalidate-cpu_hu75193f344cb2afdf6e001b1bc5f51540_1054163_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240721-martinputra/bwa_picardwgs_picardvalidate-cpu_hu75193f344cb2afdf6e001b1bc5f51540_1054163_7dea08952ec6bc07cee0579c31500d17.webp"
width="760"
height="209"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">&lt;strong>Figure 2.&lt;/strong> CPU utilization pattern of &lt;a href="https://github.com/lh3/bwa" target="_blank" rel="noopener">BWA&lt;/a>, &lt;a href="https://gatk.broadinstitute.org/hc/en-us/articles/360037269351-CollectWgsMetrics-Picard" target="_blank" rel="noopener">Picard&amp;rsquo;s CollectWGSMetrics&lt;/a>, and &lt;a href="https://gatk.broadinstitute.org/hc/en-us/articles/360036854731-ValidateSamFile-Picard" target="_blank" rel="noopener">Picard&amp;rsquo;s ValidateSamFile&lt;/a> collected by &lt;em>GenScale&lt;/em>. &lt;strong>y-axis&lt;/strong>: &lt;em>(num. cores) x 100%&lt;/em>, &lt;strong>x-axis&lt;/strong>: time elapsed in seconds.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>One technical challenge is automating the creation of the Kubernetes cluster and keeping it alive. We believe GenScale’s users will be interested in the performance of workflows under dynamic cluster sizes, whether due to intentional scaling or machine failures. While the current prototype supports creating a cluster with an arbitrary number of nodes, there are still steps which require a reboot when adding nodes, so cluster creation and horizontal scaling are not fully automated yet. Keeping a cluster alive is also expensive. Since we use ChameleonCloud as our testbed, we can either keep the cluster alive at the cost of significant service unit (SU) usage, or save SUs by terminating our leases at the cost of rebuilding the cluster from scratch later. We chose a middle ground: keeping only Kubernetes’ control plane alive. This approach has worked well so far.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>For the remaining weeks, we plan to work on the second workflow, namely &lt;a href="https://github.com/NCI-GDC/gdc-rnaseq-cwl" target="_blank" rel="noopener">RNA Alignment&lt;/a>. We would also like to add a simple user interface if time permits. Finally, we plan to package GenScale’s source code, container images, and sample benchmark results for the open-source community. We look forward to the second half of Summer of Reproducibility!&lt;/p></description></item><item><title>Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240612-martinputra/</link><pubDate>Wed, 12 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240612-martinputra/</guid><description>&lt;p>Hi! I&amp;rsquo;m Martin, and I will be working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uga/genomicswf/">Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>. Our work is driven by the scale of computing systems that host data commons &amp;ndash; we believe that performance characterization of genomics workloads should be done &lt;em>rapidly&lt;/em> and at a &lt;em>scale&lt;/em> similar to production settings. &lt;a href="https://drive.google.com/file/d/1LmOpCKv09ZGKlkG6VNleWBZ792nIuVOf/view?usp=sharing" target="_blank" rel="noopener">Feel free to check out our proposal&lt;/a> for more details!&lt;/p>
&lt;p>We propose &lt;em>GenScale&lt;/em>, a genomics workload benchmarking tool which can achieve both the scale and the speed necessary for characterizing performance in large-scale settings. &lt;em>GenScale&lt;/em> will be built on top of an industrial-grade cluster manager (e.g. Kubernetes) and metrics collection &amp;amp; monitoring systems (e.g. Prometheus), and will support a comprehensive set of applications used in state-of-the-art genomics workflows. The initial version developed during this project will include DNA and RNA alignment workflows.&lt;/p>
&lt;p>Finally, we believe that open access and reproducible research will greatly accelerate the pace of scientific discovery. We aim to package our artefacts and generated datasets in ways that make them easy to replicate, analyze, and build upon. I personally look forward to learning from &amp;amp; contributing to the open source community!&lt;/p></description></item></channel></rss>