<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>computer systems | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/computer-systems/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/computer-systems/index.xml" rel="self" type="application/rss+xml"/><description>computer systems</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Wed, 07 Feb 2024 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>computer systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/computer-systems/</link></image><item><title>LAST: Let’s Adapt to System Drift</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last/</link><pubDate>Wed, 07 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Computer systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Data Science and Machine Learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The performance of computer systems is constantly evolving, a natural outcome of updating hardware, improving software, and encountering hardware quirks over time. At the same time, machine learning (ML) models are becoming increasingly popular. They are being used widely to address various challenges in computer systems, notably in speeding up decision-making. This speed is vital for a quick and flexible response, essential for meeting service-level agreements (SLAs). Yet, an interesting twist has emerged: like the computer systems they aid, ML models also experience a kind of &amp;ldquo;aging.&amp;rdquo; This results in a gradual decline in their effectiveness, a consequence of changes in their operating environment.&lt;/p>
&lt;p>Model &amp;ldquo;aging&amp;rdquo; occurs across many domains, not only in computer systems. Because it can significantly degrade a model’s performance, early detection is critical. Numerous strategies have been proposed to mitigate model aging, but their generalizability and effectiveness across domains, particularly in computer systems, remain largely unexplored. This research aims to bridge that gap by designing and implementing a comprehensive data analysis pipeline. The primary objective is to compare these strategies on how well they detect and address model aging. To that end, the research will address the following questions:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Data-Induced Model Aging&lt;/strong>: What specific variations within the data can precipitate the aging of a model? Understanding the nature and characteristics of data changes that lead to model deterioration is crucial for developing effective prevention and mitigation strategies.&lt;/li>
&lt;li>&lt;strong>Efficacy of Aging Detection Algorithms&lt;/strong>: How proficient are the current algorithms in identifying the signs of model aging? Assessing the accuracy and reliability of these algorithms will provide insights into their practical utility in real-world scenarios.&lt;/li>
&lt;li>&lt;strong>Failure Points in Detection&lt;/strong>: In what scenarios or under what data conditions do the aging detection mechanisms fail? Identifying the limitations and vulnerabilities of these algorithms is vital for refining their robustness and ensuring comprehensive coverage.&lt;/li>
&lt;li>&lt;strong>Scalability and Responsiveness&lt;/strong>: How do these algorithms perform in terms of robustness and speed, particularly when subjected to larger datasets? Evaluating the scalability and responsiveness of the algorithms will determine their feasibility and effectiveness in handling extensive and complex datasets, a common characteristic in computer systems.&lt;/li>
&lt;/ul>
&lt;p>To better understand and prevent issues related to model performance, our approach involves analyzing various datasets, both system and non-system, that have shown notable changes over time. We aim to apply machine learning (ML) models to these datasets to assess the effects of these changes on model performance. Our goal is to leverage more advanced ML techniques to create new algorithms that address these challenges effectively. This effort is expected to contribute significantly to the community, enhancing the detection of model aging and improving model performance in computer systems.&lt;/p>
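&lt;p>To make the idea of aging detection concrete, the following is a minimal sketch of one well-known drift detector, the Page-Hinkley test, applied to a stream of prediction errors. It is not the project’s pipeline: the class and function names, the &lt;code>delta&lt;/code> and &lt;code>threshold&lt;/code> values, and the synthetic error stream are all assumptions chosen for illustration.&lt;/p>

```python
from dataclasses import dataclass


@dataclass
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""

    delta: float = 0.05      # tolerated magnitude of change (assumed value)
    threshold: float = 5.0   # alarm threshold, often called lambda (assumed value)
    n: int = 0               # number of observations seen so far
    mean: float = 0.0        # running mean of the observations
    cumulative: float = 0.0  # cumulative deviation from the running mean
    minimum: float = 0.0     # smallest cumulative deviation seen so far

    def update(self, x: float) -> bool:
        """Feed one observation; return True when drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cumulative += x - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold


def first_alarm(errors, detector):
    """Return the index of the first drift alarm, or None if none fires."""
    for i, e in enumerate(errors):
        if detector.update(e):
            return i
    return None


# Synthetic prediction errors (made up): low at first, then a sustained jump.
errors = [0.1] * 200 + [1.0] * 200
alarm_at = first_alarm(errors, PageHinkley())
print("first alarm at index:", alarm_at)
```

&lt;p>With these assumed settings the alarm fires a few observations after the jump at index 200, and a stream that stays at a constant error level raises no alarm. The questions above about failure points and scalability amount to asking how such a detector behaves when the shift is gradual, noisy, or buried in a much larger stream.&lt;/p>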
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Run the pipeline on several computer systems and non-computer systems datasets&lt;/li>
&lt;li>A Trovi artifact for data preprocessing and model training shared on Chameleon Cloud&lt;/li>
&lt;li>A GitHub repository containing the pipeline source code&lt;/li>
&lt;/ul></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/</link><pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/</guid><description>&lt;p>A high-throughput workflow execution system is needed to continuously gain insights from the increasingly abundant genomics data. However, genomics workflows often have long execution times (e.g., hours to days) due to their large input files. This characteristic presents many complexities when managing systems for genomics workflow execution. Furthermore, based on our observation of a large-scale genomics data processing platform, ~2% of genomics workflows exhibit tail behavior that multiplies their execution time by up to 15x the median, resulting in weeks of execution.&lt;/p>
&lt;p>At the same time, input files for genomics workflows often vary in quality due to differences in how they are collected. Prior work suggests that these quality differences can affect genomics workflow execution time. Yet, to the best of our knowledge, input quality has never been accounted for in the design of a high-throughput workflow execution system. Even worse, there does not appear to be a consensus on what constitutes ‘input quality,’ at least from a computer systems perspective.&lt;/p>
&lt;p>In this project, we seek to analyze a large dataset from a large-scale genomics processing platform in order to gain insights into how ‘input quality’ affects genomics workflows’ execution times. Following that, we will build machine learning (ML) models for predicting workflow execution time, particularly for workflows that exhibit tail behavior. We believe these insights and models can become the foundation for designing a novel tail-resilient genomics workflow execution system. Along the way, we will ensure that each step of our analysis is reproducible (e.g., in the form of Jupyter notebooks) and make all our ML models open-source (e.g., in the form of pre-trained models). We sincerely hope our work can ease some of the burdens commonly faced by operators of systems for genomics and, at the same time, benefit future researchers who work at the intersection of computer systems and genomics.&lt;/p>
&lt;h3 id="analyze-genomics-data-quality--build-exec-time-prediction-models">Analyze genomics data quality &amp;amp; build exec. time prediction models&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> genomics, data analysis, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Linux, Python, Matplotlib, Pandas/Numpy, any ML library&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/">Charis Christopher Hulu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Analyze a large-scale trace of genomics workflow execution along with metrics from various genomics alignment tools (e.g., FastQC, Picard, and GATK metrics) and find features that correlate the most with workflow execution time and its tail behavior. Then, based on the results, we will build ML models that accurately predict genomics workflows’ execution times.&lt;/p>
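&lt;p>As a rough illustration of the feature-screening step, the sketch below ranks candidate features by the absolute value of their Spearman rank correlation with execution time. Spearman correlation is only one possible choice, not necessarily the method the project uses. The feature names (&lt;code>input_size_gb&lt;/code>, &lt;code>mean_base_quality&lt;/code>, &lt;code>duplication_rate&lt;/code>) and every number are made-up toy values, not data from the trace or actual FastQC, Picard, or GATK outputs.&lt;/p>

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i != len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 != len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Toy data (hypothetical): one value per workflow run.
exec_hours = [2, 3, 5, 8, 13, 40]
features = {
    "input_size_gb": [10, 20, 30, 40, 50, 60],
    "mean_base_quality": [38, 36, 37, 33, 30, 22],
    "duplication_rate": [0.12, 0.30, 0.08, 0.25, 0.10, 0.28],
}

# Strongest monotonic relationship first, regardless of sign.
ranked = sorted(
    ((name, spearman(values, exec_hours)) for name, values in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, rho in ranked:
    print(f"{name}: {rho:+.3f}")
```

&lt;p>A rank-based measure is used here because execution times with a heavy tail can distort a linear correlation; features that survive this kind of screening would still need the brief written justification described in the tasks below.&lt;/p>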
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Acquire a basic understanding of genomics data processing &amp;amp; workflow execution (will be guided by the mentor)&lt;/li>
&lt;li>Reproduce past analysis &amp;amp; models built by prior members of the project&lt;/li>
&lt;li>Propose features from FastQC/Picard/GATK metrics that can be used as predictors of execution time and tail behavior&lt;/li>
&lt;li>Write a brief analysis of why those features might work&lt;/li>
&lt;li>Build ML models for predicting execution time&lt;/li>
&lt;li>Package the analysis in the form of Jupyter notebooks&lt;/li>
&lt;li>Package the models in a reloadable format (e.g., pickle)&lt;/li>
&lt;/ul></description></item></channel></rss>