<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Charis Christopher Hulu | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/index.xml" rel="self" type="application/rss+xml"/><description>Charis Christopher Hulu</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/avatar_hue3036b44a941eff207916dc584021e57_400656_270x270_fill_q75_lanczos_center.jpg</url><title>Charis Christopher Hulu</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/</link></image><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time (Midterm Blog Post)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230803-charishulu/</link><pubDate>Thu, 03 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230803-charishulu/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> project, our goal is to characterize the tools in genomics workflows in terms of system metrics and data quality, and to use those characterizations to build machine learning models that predict the elapsed time of genomics workflows. While Shayantan (another contributor) analyzed the data quality metrics, I contributed the system metrics analysis. We are getting closer to that goal: we have collected the datasets and completed an initial analysis.&lt;/p>
&lt;h2 id="steps">Steps&lt;/h2>
&lt;p>In this project, we selected the DNA-Seq pipeline as the workflow to analyze. This pipeline consists of four tools for processing single-end reads: BWA-mem, Samtool-view, Picard-SortSam, and Picard-MarkDuplicates. We executed each tool under various configurations and stored the system metrics of every execution. This required two steps:&lt;/p>
&lt;ul>
&lt;li>Step 1: Building the tools execution environment.&lt;/li>
&lt;li>Step 2: Developing a program that executes the tools under these configurations and automatically collects runtime metrics (e.g. CPU, RSS, VSZ, and I/O).&lt;/li>
&lt;/ul>
&lt;h2 id="execution-environment">Execution Environment&lt;/h2>
&lt;p>Tools are executed on Chameleon instances by submitting them as jobs via Slurm. The machine used to collect system metrics is a Haswell instance at the Chameleon Texas site, with an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz and the following specifications.&lt;/p>
&lt;table>
&lt;tr>
&lt;th>Number of CPUs&lt;/th>
&lt;td>48&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Number of threads per core&lt;/th>
&lt;td>2&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Number of cores per socket&lt;/th>
&lt;td>12&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Number of sockets&lt;/th>
&lt;td>2&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>In this experiment, we use n+1 instances: n compute nodes and 1 master node. Each execution is a job, namely a tool with a particular configuration, submitted from the master node and processed by one of the compute nodes. For the tools to run on any node, the master node exports a shared directory over NFS. This shared directory stores the input files and the commands for executing the tools, so all nodes can access them without having to download and install them locally.&lt;/p>
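&lt;p>A minimal sketch of that setup (the export path, subnet, and host name &lt;code>master&lt;/code> are illustrative, not our actual cluster configuration):&lt;/p>

```shell
# On the master node: export the shared directory over NFS.
# /etc/exports entry (illustrative path and subnet):
#   /srv/shared 10.0.0.0/24(rw,sync,no_subtree_check)
sudo exportfs -ra

# On each compute node: mount the export at the same path so job
# scripts can use identical input and command locations everywhere.
sudo mkdir -p /srv/shared
sudo mount -t nfs master:/srv/shared /srv/shared
```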
&lt;h2 id="executing-and-collecting-system-metrics">Executing and Collecting System Metrics&lt;/h2>
&lt;p>The tools are executed under various configurations by varying parameters such as input size, number of allocated CPUs, allocated memory, and number of threads. For example, BWA-mem has 5 values for allocated CPUs, 4 for allocated memory, and 5 for threads, applied to 10 different input files, giving 5 x 4 x 5 x 10 = 1000 configuration combinations. Each configuration is executed 8 times, yielding 8000 data points. The configuration details are shown in the following table.&lt;/p>
&lt;table>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>#repetitions&lt;/th>
&lt;th>#files&lt;/th>
&lt;th>#allocated CPU&lt;/th>
&lt;th>#allocated memory&lt;/th>
&lt;th>#threads&lt;/th>
&lt;th>total&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>BWA-mem&lt;/th>
&lt;td>8&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Samtool-view&lt;/th>
&lt;td>10&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>-&lt;/td>
&lt;td>2000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Picard-SortSam&lt;/th>
&lt;td>10&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>-&lt;/td>
&lt;td>2000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Picard-MarkDuplicates&lt;/th>
&lt;td>10&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>-&lt;/td>
&lt;td>2000&lt;/td>
&lt;/tr>
&lt;/table>
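&lt;p>As a sanity check on the combination count, the BWA-mem parameter grid in the table above can be enumerated with nested shell loops (illustrative: the real pipeline submits an &lt;code>sbatch&lt;/code> job where this sketch only increments a counter):&lt;/p>

```shell
#!/bin/sh
# Enumerate every BWA-mem configuration from the table above.
count=0
for cpus in 2 4 8 16 32; do
  for mem in 8 16 32 64; do
    for threads in 2 4 8 16 32; do
      for file_idx in 1 2 3 4 5 6 7 8 9 10; do
        # In the real pipeline, the job is submitted here, e.g.
        # sbatch --cpus-per-task="$cpus" --mem="${mem}G" ... (elided)
        count=$((count + 1))
      done
    done
  done
done
echo "$count configurations (x8 repetitions = $((count * 8)) runs)"
# prints: 1000 configurations (x8 repetitions = 8000 runs)
```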
&lt;p>Meanwhile, to run the tools, we use the following commands:&lt;/p>
&lt;ul>
&lt;li>BWA-mem&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">$BWA&lt;/span> mem -t &lt;span class="nv">$threads&lt;/span> &lt;span class="nv">$REF_DIR&lt;/span>/hg19.fa &lt;span class="si">${&lt;/span>&lt;span class="nv">INPUT_DIR&lt;/span>&lt;span class="si">}&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>*.fastq &amp;gt; &lt;span class="si">${&lt;/span>&lt;span class="nv">OUTPUT_DIR&lt;/span>&lt;span class="si">}&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.sam
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Samtool-view&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">$SAMTOOLS&lt;/span> view &lt;span class="nv">$INPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.sam -Shb -o &lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Picard-SortSam&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">java -jar &lt;span class="nv">$PICARD&lt;/span> SortSam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">CREATE_INDEX&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nb">true&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">INPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$INPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">OUTPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">SORT_ORDER&lt;/span>&lt;span class="o">=&lt;/span>coordinate &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">VALIDATION_STRINGENCY&lt;/span>&lt;span class="o">=&lt;/span>STRICT
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Picard-MarkDuplicates&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">java -jar &lt;span class="nv">$PICARD&lt;/span> MarkDuplicates &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">CREATE_INDEX&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nb">true&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">INPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$INPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">OUTPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">METRICS_FILE&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>_rmd.txt &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">VALIDATION_STRINGENCY&lt;/span>&lt;span class="o">=&lt;/span>STRICT
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In Slurm, each job has a job ID, and the &lt;code>scontrol listpids&lt;/code> command shows the mapping from job IDs to PIDs. Using this, we can obtain system metrics for a job by reading the &lt;code>/proc/$PID&lt;/code> entries of its processes, which expose CPU usage, physical memory, virtual memory, read bytes, and write bytes at a given moment. During each execution, we record these features, together with a timestamp, at 1-second intervals.&lt;/p>
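&lt;p>A minimal sketch of such a sampler (the CSV field layout is illustrative; our actual collector may differ): given a PID, e.g. one reported by &lt;code>scontrol listpids&lt;/code>, it appends one line of CPU, RSS, VSZ, and cumulative I/O counters per second until the process disappears from &lt;code>/proc&lt;/code>.&lt;/p>

```shell
#!/bin/sh
# Sample one process's metrics at 1-second intervals (Linux only).
sample_pid() {
  pid=$1
  while [ -d "/proc/$pid" ]; do
    ts=$(date +%s)
    # %CPU, resident set size (KB), and virtual size (KB) via ps
    set -- $(ps -o %cpu=,rss=,vsz= -p "$pid")
    cpu=$1; rss=$2; vsz=$3
    # Cumulative read/write bytes from /proc/<pid>/io (readable for
    # our own processes; other users' jobs need elevated privileges)
    rb=$(awk '/^read_bytes/ {print $2}' "/proc/$pid/io" 2>/dev/null)
    wb=$(awk '/^write_bytes/ {print $2}' "/proc/$pid/io" 2>/dev/null)
    echo "$ts,$cpu,$rss,$vsz,$rb,$wb"
    sleep 1
  done
}
```

Running `sample_pid "$PID" >> metrics.csv` alongside a job yields a per-second time series that can later be aggregated into average and peak values per execution.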
&lt;h2 id="results">Results&lt;/h2>
&lt;p>We have also calculated the correlation of each feature with the elapsed time. For BWA-mem, the features with an absolute correlation above 0.5 are input size, average CPU usage, and output file size (in SAM format). For Samtool-view, they are input size, average CPU usage, and output size (in BAM format); for Picard-SortSam, input size, write bytes, and BAM output size; and for Picard-MarkDuplicates, input size and BAM output size.&lt;/p>
&lt;table>
&lt;tr>
&lt;th>Features\Tools&lt;/th>
&lt;th>BWA-mem&lt;/th>
&lt;th>Samtool-view&lt;/th>
&lt;th>Picard-SortSam&lt;/th>
&lt;th>Picard-MarkDuplicates&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>Allocated CPU&lt;/th>
&lt;td>-0.145&lt;/td>
&lt;td>-0.095&lt;/td>
&lt;td>-0.179&lt;/td>
&lt;td>-0.156&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Allocated physical memory&lt;/th>
&lt;td>-0.010&lt;/td>
&lt;td>-0.038&lt;/td>
&lt;td>-0.069&lt;/td>
&lt;td>0.132&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Input size&lt;/th>
&lt;td>&lt;b>0.583&lt;/b>&lt;/td>
&lt;td>&lt;b>0.651&lt;/b>&lt;/td>
&lt;td>&lt;b>0.937&lt;/b>&lt;/td>
&lt;td>&lt;b>0.922&lt;/b>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Threads&lt;/th>
&lt;td>-0.072&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Average CPU&lt;/th>
&lt;td>&lt;b>-0.607&lt;/b>&lt;/td>
&lt;td>&lt;b>-0.567&lt;/b>&lt;/td>
&lt;td>-0.479&lt;/td>
&lt;td>-0.480&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Peak CPU&lt;/th>
&lt;td>-0.175&lt;/td>
&lt;td>0.174&lt;/td>
&lt;td>-0.170&lt;/td>
&lt;td>0.046&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Average RSS&lt;/th>
&lt;td>0.040&lt;/td>
&lt;td>0.034&lt;/td>
&lt;td>0.131&lt;/td>
&lt;td>0.182&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Peak RSS&lt;/th>
&lt;td>0.068&lt;/td>
&lt;td>0.046&lt;/td>
&lt;td>0.314&lt;/td>
&lt;td>0.175&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Average VSZ&lt;/th>
&lt;td>0.032&lt;/td>
&lt;td>-0.349&lt;/td>
&lt;td>-0.127&lt;/td>
&lt;td>0.090&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Peak VSZ&lt;/th>
&lt;td>0.048&lt;/td>
&lt;td>0.074&lt;/td>
&lt;td>-0.130&lt;/td>
&lt;td>0.088&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Write bytes&lt;/th>
&lt;td>0.037&lt;/td>
&lt;td>0.190&lt;/td>
&lt;td>&lt;b>0.735&lt;/b>&lt;/td>
&lt;td>0.244&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Read bytes&lt;/th>
&lt;td>-0.031&lt;/td>
&lt;td>0.109&lt;/td>
&lt;td>0.070&lt;/td>
&lt;td>0.110&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Output SAM size&lt;/th>
&lt;td>&lt;b>0.589&lt;/b>&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Output BAM size&lt;/th>
&lt;td>-&lt;/td>
&lt;td>&lt;b>0.763&lt;/b>&lt;/td>
&lt;td>&lt;b>0.934&lt;/b>&lt;/td>
&lt;td>&lt;b>0.923&lt;/b>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Output BAI size&lt;/th>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;td>0.400&lt;/td>
&lt;td>0.399&lt;/td>
&lt;/tr>
&lt;/table>
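&lt;p>The scores above are Pearson correlation coefficients. As a minimal, self-contained sketch (a stand-in for our actual analysis code, not part of it), Pearson&amp;rsquo;s r between two whitespace-separated columns, e.g. a feature and the elapsed time, can be computed with awk:&lt;/p>

```shell
#!/bin/sh
# Pearson correlation between two whitespace-separated columns on stdin.
pearson() {
  awk '{
    n++; sx += $1; sy += $2
    sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2
  } END {
    cov = sxy - sx * sy / n
    vx  = sxx - sx * sx / n
    vy  = syy - sy * sy / n
    printf "%.3f\n", cov / sqrt(vx * vy)
  }'
}

# Perfectly linear data gives r = 1.000:
printf '1 2\n2 4\n3 6\n4 8\n5 10\n' | pearson   # prints 1.000
```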
&lt;h2 id="future-works">Future Works&lt;/h2>
&lt;p>For further work, we will analyze the correlation between elapsed time and the features whose absolute scores are below 0.5. These features may in fact correlate with elapsed time but appear uncorrelated because the correlations are computed over the dataset as a whole; we therefore also need to compute the feature correlations with the data grouped by input file. Finally, we will build a machine learning model to predict elapsed time.&lt;/p></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230616-charishulu/</link><pubDate>Fri, 16 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230616-charishulu/</guid><description>&lt;p>Hi! I&amp;rsquo;m Charis, an undergraduate student in the IT and Big Data Analytics program at the Calvin Institute of Technology. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1dFkC2A0HUVaWd6NpCbTjRZVfYxQ7jRxJ/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a> and &lt;strong>Martin Putra&lt;/strong>, aims to gain insight into the features that are highly correlated with the execution times of genomics workflows, and to build machine learning models for predicting workflow execution time.&lt;/p>
&lt;p>Genomics workflows exhibit a long-tail pattern in their execution times. According to the previous project team&amp;rsquo;s findings, approximately 2% of genomics workflows ran up to 15 times longer than the median execution time, resulting in weeks of execution. Interestingly, input quality was observed to play a role in these execution time differences. We will therefore analyze features such as the quality of the input data, as well as the amount of resources allocated during execution, to find the features that correlate with execution time. Based on these features, we will build a machine learning model that can predict the execution time of genomics workflows.&lt;/p>
&lt;p>In collaboration with Shayantan Banerjee (another contributor), who will study data quality, I will study the system metrics of genomics workflows at both the workflow level and the tool level. Metrics will be collected by running genomics workflows with the Slurm workload manager under various resource allocation conditions, on Chameleon clusters of different sizes.&lt;/p></description></item></channel></rss>