<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Charis Christopher Hulu | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/index.xml" rel="self" type="application/rss+xml"/><description>Charis Christopher Hulu</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/avatar_hue3036b44a941eff207916dc584021e57_400656_270x270_fill_q75_lanczos_center.jpg</url><title>Charis Christopher Hulu</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/</link></image><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time (Midterm Blog Post)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230803-charishulu/</link><pubDate>Thu, 03 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230803-charishulu/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> project, our goal is to characterize the tools in genomics workflows in terms of system metrics and data quality, and to use those characterizations to build machine learning models that predict the elapsed time of genomics workflows. While Shayantan (another contributor) analyzed the data quality metrics, I contributed the system metrics analysis. We are getting closer to that goal: we have collected the datasets and completed an initial analysis.&lt;/p>
&lt;h2 id="steps">Steps&lt;/h2>
&lt;p>In this project, we selected the DNA-Seq pipeline as the workflow to analyze. This pipeline consists of four tools for processing single-end reads: BWA-mem, Samtool-view, Picard-SortSam, and Picard-MarkDuplicates. We executed each tool under various configurations and stored the system metrics of every execution. This required two steps:&lt;/p>
&lt;ul>
&lt;li>Step 1: Building the tools execution environment.&lt;/li>
&lt;li>Step 2: Developing a program that executes the tools under these configurations and automatically collects runtime metrics (e.g. CPU, RSS, VSZ, and I/O).&lt;/li>
&lt;/ul>
&lt;h2 id="execution-environment">Execution Environment&lt;/h2>
&lt;p>Tools are executed on Chameleon instances by submitting them as jobs via Slurm. The machine used to collect system metrics is a Haswell instance at the Chameleon Texas site, with an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz and the following specifications.&lt;/p>
&lt;table>
&lt;tr>
&lt;th>Number of CPUs&lt;/th>
&lt;td>48&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Number of threads per core&lt;/th>
&lt;td>2&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Number of cores per socket&lt;/th>
&lt;td>12&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Number of sockets&lt;/th>
&lt;td>2&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>In this experiment, we use n+1 instances: n compute nodes and 1 master node. Each execution is a job, namely a tool with a particular configuration, submitted from the master node and processed by one of the compute nodes. For the tools to run on any node, the master node exports a shared directory over NFS. This shared directory stores the input files and the commands for executing the tools, so all nodes can access them without having to download and install them locally.&lt;/p>
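&lt;p>A minimal sketch of that setup (the export path, subnet, and host name &lt;code>master&lt;/code> are illustrative, not our actual cluster configuration):&lt;/p>

```shell
# On the master node: export the shared directory over NFS.
# /etc/exports entry (illustrative path and subnet):
#   /srv/shared 10.0.0.0/24(rw,sync,no_subtree_check)
sudo exportfs -ra

# On each compute node: mount the export at the same path so job
# scripts can use identical input and command locations everywhere.
sudo mkdir -p /srv/shared
sudo mount -t nfs master:/srv/shared /srv/shared
```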
&lt;h2 id="executing-and-collecting-system-metrics">Executing and Collecting System Metrics&lt;/h2>
&lt;p>The tools are executed under various configurations by varying parameters such as input size, number of allocated CPUs, allocated memory, and number of threads. For example, BWA-mem has 5 values for allocated CPUs, 4 for allocated memory, and 5 for threads, applied to 10 different input files, giving 5 x 4 x 5 x 10 = 1000 configuration combinations. Each configuration is executed 8 times, yielding 8000 data points. The configuration details are shown in the following table.&lt;/p>
&lt;table>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>#repetitions&lt;/th>
&lt;th>#files&lt;/th>
&lt;th>#allocated CPU&lt;/th>
&lt;th>#allocated memory&lt;/th>
&lt;th>#threads&lt;/th>
&lt;th>total&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>BWA-mem&lt;/th>
&lt;td>8&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Samtool-view&lt;/th>
&lt;td>10&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>-&lt;/td>
&lt;td>2000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Picard-SortSam&lt;/th>
&lt;td>10&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>-&lt;/td>
&lt;td>2000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Picard-MarkDuplicates&lt;/th>
&lt;td>10&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>-&lt;/td>
&lt;td>2000&lt;/td>
&lt;/tr>
&lt;/table>
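&lt;p>As a sanity check on the combination count, the BWA-mem parameter grid in the table above can be enumerated with nested shell loops (illustrative: the real pipeline submits an &lt;code>sbatch&lt;/code> job where this sketch only increments a counter):&lt;/p>

```shell
#!/bin/sh
# Enumerate every BWA-mem configuration from the table above.
count=0
for cpus in 2 4 8 16 32; do
  for mem in 8 16 32 64; do
    for threads in 2 4 8 16 32; do
      for file_idx in 1 2 3 4 5 6 7 8 9 10; do
        # In the real pipeline, the job is submitted here, e.g.
        # sbatch --cpus-per-task="$cpus" --mem="${mem}G" ... (elided)
        count=$((count + 1))
      done
    done
  done
done
echo "$count configurations (x8 repetitions = $((count * 8)) runs)"
# prints: 1000 configurations (x8 repetitions = 8000 runs)
```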
&lt;p>Meanwhile, to run the tools, we use the following commands:&lt;/p>
&lt;ul>
&lt;li>BWA-mem&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">$BWA&lt;/span> mem -t &lt;span class="nv">$threads&lt;/span> &lt;span class="nv">$REF_DIR&lt;/span>/hg19.fa &lt;span class="si">${&lt;/span>&lt;span class="nv">INPUT_DIR&lt;/span>&lt;span class="si">}&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>*.fastq &amp;gt; &lt;span class="si">${&lt;/span>&lt;span class="nv">OUTPUT_DIR&lt;/span>&lt;span class="si">}&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.sam
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Samtool-view&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">$SAMTOOLS&lt;/span> view &lt;span class="nv">$INPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.sam -Shb -o &lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Picard-SortSam&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">java -jar &lt;span class="nv">$PICARD&lt;/span> SortSam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">CREATE_INDEX&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nb">true&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">INPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$INPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">OUTPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">SORT_ORDER&lt;/span>&lt;span class="o">=&lt;/span>coordinate &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">VALIDATION_STRINGENCY&lt;/span>&lt;span class="o">=&lt;/span>STRICT
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Picard-MarkDuplicates&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">java -jar &lt;span class="nv">$PICARD&lt;/span> MarkDuplicates &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">CREATE_INDEX&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nb">true&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">INPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$INPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">OUTPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">METRICS_FILE&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>_rmd.txt &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">VALIDATION_STRINGENCY&lt;/span>&lt;span class="o">=&lt;/span>STRICT
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In Slurm, each job has a job ID, and the &lt;code>scontrol listpids&lt;/code> command shows the mapping from job IDs to PIDs. Using this, we can obtain system metrics for a job by reading the &lt;code>/proc/$PID&lt;/code> entries of its processes, which expose CPU usage, physical memory, virtual memory, read bytes, and write bytes at a given moment. During each execution, we record these features, together with a timestamp, at 1-second intervals.&lt;/p>
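&lt;p>A minimal sketch of such a sampler (the CSV field layout is illustrative; our actual collector may differ): given a PID, e.g. one reported by &lt;code>scontrol listpids&lt;/code>, it appends one line of CPU, RSS, VSZ, and cumulative I/O counters per second until the process disappears from &lt;code>/proc&lt;/code>.&lt;/p>

```shell
#!/bin/sh
# Sample one process's metrics at 1-second intervals (Linux only).
sample_pid() {
  pid=$1
  while [ -d "/proc/$pid" ]; do
    ts=$(date +%s)
    # %CPU, resident set size (KB), and virtual size (KB) via ps
    set -- $(ps -o %cpu=,rss=,vsz= -p "$pid")
    cpu=$1; rss=$2; vsz=$3
    # Cumulative read/write bytes from /proc/<pid>/io (readable for
    # our own processes; other users' jobs need elevated privileges)
    rb=$(awk '/^read_bytes/ {print $2}' "/proc/$pid/io" 2>/dev/null)
    wb=$(awk '/^write_bytes/ {print $2}' "/proc/$pid/io" 2>/dev/null)
    echo "$ts,$cpu,$rss,$vsz,$rb,$wb"
    sleep 1
  done
}
```

Running `sample_pid "$PID" >> metrics.csv` alongside a job yields a per-second time series that can later be aggregated into average and peak values per execution.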
&lt;h2 id="results">Results&lt;/h2>
&lt;p>We have also calculated the correlation of each feature with the elapsed time. For BWA-mem, the features with an absolute correlation above 0.5 are input size, average CPU usage, and output file size (in SAM format). For Samtool-view, they are input size, average CPU usage, and output size (in BAM format); for Picard-SortSam, input size, write bytes, and BAM output size; and for Picard-MarkDuplicates, input size and BAM output size.&lt;/p>
&lt;table>
&lt;tr>
&lt;th>Features\Tools&lt;/th>
&lt;th>BWA-mem&lt;/th>
&lt;th>Samtool-view&lt;/th>
&lt;th>Picard-SortSam&lt;/th>
&lt;th>Picard-MarkDuplicates&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>Allocated CPU&lt;/th>
&lt;td>-0.145&lt;/td>
&lt;td>-0.095&lt;/td>
&lt;td>-0.179&lt;/td>
&lt;td>-0.156&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Allocated physical memory&lt;/th>
&lt;td>-0.010&lt;/td>
&lt;td>-0.038&lt;/td>
&lt;td>-0.069&lt;/td>
&lt;td>0.132&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Input size&lt;/th>
&lt;td>&lt;b>0.583&lt;/b>&lt;/td>
&lt;td>&lt;b>0.651&lt;/b>&lt;/td>
&lt;td>&lt;b>0.937&lt;/b>&lt;/td>
&lt;td>&lt;b>0.922&lt;/b>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Threads&lt;/th>
&lt;td>-0.072&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Average CPU&lt;/th>
&lt;td>&lt;b>-0.607&lt;/b>&lt;/td>
&lt;td>&lt;b>-0.567&lt;/b>&lt;/td>
&lt;td>-0.479&lt;/td>
&lt;td>-0.480&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Peak CPU&lt;/th>
&lt;td>-0.175&lt;/td>
&lt;td>0.174&lt;/td>
&lt;td>-0.170&lt;/td>
&lt;td>0.046&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Average RSS&lt;/th>
&lt;td>0.040&lt;/td>
&lt;td>0.034&lt;/td>
&lt;td>0.131&lt;/td>
&lt;td>0.182&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Peak RSS&lt;/th>
&lt;td>0.068&lt;/td>
&lt;td>0.046&lt;/td>
&lt;td>0.314&lt;/td>
&lt;td>0.175&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Average VSZ&lt;/th>
&lt;td>0.032&lt;/td>
&lt;td>-0.349&lt;/td>
&lt;td>-0.127&lt;/td>
&lt;td>0.090&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Peak VSZ&lt;/th>
&lt;td>0.048&lt;/td>
&lt;td>0.074&lt;/td>
&lt;td>-0.130&lt;/td>
&lt;td>0.088&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Write bytes&lt;/th>
&lt;td>0.037&lt;/td>
&lt;td>0.190&lt;/td>
&lt;td>&lt;b>0.735&lt;/b>&lt;/td>
&lt;td>0.244&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Read bytes&lt;/th>
&lt;td>-0.031&lt;/td>
&lt;td>0.109&lt;/td>
&lt;td>0.070&lt;/td>
&lt;td>0.110&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Output SAM size&lt;/th>
&lt;td>&lt;b>0.589&lt;/b>&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Output BAM size&lt;/th>
&lt;td>-&lt;/td>
&lt;td>&lt;b>0.763&lt;/b>&lt;/td>
&lt;td>&lt;b>0.934&lt;/b>&lt;/td>
&lt;td>&lt;b>0.923&lt;/b>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Output BAI size&lt;/th>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;td>0.400&lt;/td>
&lt;td>0.399&lt;/td>
&lt;/tr>
&lt;/table>
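&lt;p>The scores above are Pearson correlation coefficients. As a minimal, self-contained sketch (a stand-in for our actual analysis code, not part of it), Pearson&amp;rsquo;s r between two whitespace-separated columns, e.g. a feature and the elapsed time, can be computed with awk:&lt;/p>

```shell
#!/bin/sh
# Pearson correlation between two whitespace-separated columns on stdin.
pearson() {
  awk '{
    n++; sx += $1; sy += $2
    sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2
  } END {
    cov = sxy - sx * sy / n
    vx  = sxx - sx * sx / n
    vy  = syy - sy * sy / n
    printf "%.3f\n", cov / sqrt(vx * vy)
  }'
}

# Perfectly linear data gives r = 1.000:
printf '1 2\n2 4\n3 6\n4 8\n5 10\n' | pearson   # prints 1.000
```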
&lt;h2 id="future-works">Future Works&lt;/h2>
&lt;p>For further work, we will analyze the correlation between elapsed time and the features whose absolute scores are below 0.5. These features may in fact correlate with elapsed time but appear uncorrelated because the correlations are computed over the dataset as a whole; we therefore also need to compute the feature correlations with the data grouped by input file. Finally, we will build a machine learning model to predict elapsed time.&lt;/p></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230616-charishulu/</link><pubDate>Fri, 16 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230616-charishulu/</guid><description>&lt;p>Hi! I&amp;rsquo;m Charis, an undergraduate student in the IT and Big Data Analytics program at the Calvin Institute of Technology. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1dFkC2A0HUVaWd6NpCbTjRZVfYxQ7jRxJ/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a> and &lt;strong>Martin Putra&lt;/strong>, aims to gain insight into the features that are highly correlated with the execution times of genomics workflows, and to build machine learning models for predicting workflow execution time.&lt;/p>
&lt;p>Genomics workflows exhibit a long-tail pattern in their execution times. According to the previous project team&amp;rsquo;s findings, approximately 2% of genomics workflows ran up to 15 times longer than the median execution time, resulting in weeks of execution. Interestingly, input quality was observed to play a role in these execution time differences. We will therefore analyze features such as the quality of the input data, as well as the amount of resources allocated during execution, to find the features that correlate with execution time. Based on these features, we will build a machine learning model that can predict the execution time of genomics workflows.&lt;/p>
&lt;p>In collaboration with Shayantan Banerjee (another contributor), who will study data quality, I will study the system metrics of genomics workflows at both the workflow level and the tool level. Metrics will be collected by running genomics workflows with the Slurm workload manager under various resource allocation conditions, on Chameleon clusters of different sizes.&lt;/p></description></item></channel></rss>