Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time (Midterm Blog Post)

Tue, 01 Aug 2023 00:00:00 +0000

We are currently midway into the OSRE 2023 program and the following post lists the progress that I have made on the project so far. As part of the Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time our overall goal was to enumerate the effect of sequence data quality on execution times. Towards that end, we decided to first identify suitable datasets from the two commmonly available -omics data modalities - transcriptomics and genomics. Albrecht et al. [1] developed seqQscorer to automate the quality control step of NGS data analysis through predictive modeling. They have also published the list of ENCODE datasets used for training the models. Quality label has been assigned as 0 for released files or 1 for revoked files. Based on the guidelines set forth by ENCODE’s Data Coordination Centre (DCC) a comprehensive manual annotation of the data was done by scientists and the resulting quality variable “status” was published to serve as an indication of the quality of the data. The following steps outline the process of generating the data table for building the machine learning models.

Step 1: Programmatically accessed 86 (34 released ; 34 revoked) RNA-seq files from ENCODE database. All the fastq files were single ended.
Step 2: Programmatically accessed 288 (144 released ; 144 revoked) DNA-seq files from ENCODE database. All the fastq files were paired ended.
Step 3: Implemeted the STAR aligner for RNA-seq and the BWA aligner for DNA seq. The resulting outputs contained the alignment times for both the “revoked” and “released”.
Step 4: Ran statistical tests to determine whether there is any significant differences in the runtimes of both types of files.

Currently I am running the FASTQC tool to extract data quality metrics for the same set of files as discsussed above. Once I have collected those metrics, I can start building regression models to determine whether there is any significant impact of data quality on execution time. The first step toward the execution of a typical genomic analysis workflow is quality control of the raw data - a crucial step in removing low-quality data instances that may significantly impact the downstream analysis. Through our analysis we aim to develop a reproducible ML model that will give the user an estimate of the runtime based on the raw FATSQ file as input.

References

[1] Albrecht, S., Sprang, M., Andrade-Navarro, M.A. et al. seqQscorer: automated quality control of next-generation sequencing data using machine learning. Genome Biol 22, 75 (2021). https://doi.org/10.1186/s13059-021-02294-2

Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time

Wed, 12 Jul 2023 00:00:00 +0000

As part of the Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time my proposal under the mentorship of In Kee Kim, Martin Putra and collaborator Charis Christopher Hulu (another OSRE fellow) aims to analyze large-scale sequencing datasets in order to gain insights on how ‘input quality’ affects genomic workflows’ execution times.
Recent advancements in Next-Generation Sequencing (NGS) technologies have resulted in massive amounts of nucleotide sequence data and automated genomic workflows to streamline analysis and data interpretation. The success of NGS-driven research has also led to a sudden increase in data of varying size and complexity, making it more time-consuming for researchers to test hypotheses. Analyzing high-throughput genomic data requires a step-by-step execution of dedicated tools - also known as workflows. The first step toward the execution of a typical genomic analysis workflow is quality control of the raw data - a crucial step in removing low-quality data instances that may significantly impact the downstream analysis. Prior work in this area has suggested that the runtimes of genomic workflows get affected due to qualitative differences in the data. Additionally, there is very little consensus on what constitutes “input quality” regarding data from large genomic experiments. In this proposal, we hypothesize that genomic data quality significantly impacts the genomic workflows’ execution time. We aim to leverage machine learning techniques to extract predictive features from quality control tools that robustly predict workflow execution time.

Shayantan Banerjee | UCSC OSPO

Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time (Midterm Blog Post)

References

Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time