<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Yuyang (Roy) Huang | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/index.xml" rel="self" type="application/rss+xml"/><description>Yuyang (Roy) Huang</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/avatar_huc2a18e46e27f5ea73656c466f65d9060_27070_270x270_fill_q75_lanczos_center.jpg</url><title>Yuyang (Roy) Huang</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/</link></image><item><title>FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fep_bench/</link><pubDate>Mon, 03 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fep_bench/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/swami-sundararaman/">Swami Sundararaman&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/lihaowen-jayce-zhu/">Lihaowen (Jayce) Zhu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In the realm of machine learning (ML), preprocessing of data is a critical yet often underappreciated phase, consuming approximately 80% of the time in common ML tasks. This extensive time consumption can be attributed to various challenges encountered from both data and computation perspectives.&lt;/p>
&lt;p>From the data side, one significant challenge is the slow retrieval of data from data lakes, which are storage repositories that hold a vast amount of raw data in its native format. However, the process of extracting this data can be slow, causing computation cycles to wait for data arrival and leading to delays in the entire preprocessing phase. Furthermore, the size of the data often exceeds the memory capacity of standard computing systems. This is a frequent occurrence in ML, as datasets are typically large and complex. Handling such large datasets requires sophisticated memory management techniques to ensure efficient preprocessing without overwhelming the system&amp;rsquo;s memory.&lt;/p>
&lt;p>On the computation side, a naive solution to data operations, especially aggregation, often leads to inefficiencies. These operations may require grouping a large chunk of data as a prerequisite before performing any actual computation. This grouping, without careful configuration and management, can trigger serious data shuffling, leading to extensive remote data movement when the data is distributed across various storage systems. Such data movement is not only time-consuming but also resource-intensive.&lt;/p>
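&lt;p>As a rough illustration of why naive aggregation shuffles so much data, the following pure-Python sketch contrasts shipping every record to one place against pre-aggregating within each partition first (a map-side combine). The partition layout and record values here are made up for illustration; this is not the project workload itself.&lt;/p>

```python
from collections import defaultdict

# Hypothetical illustration: each "partition" holds (key, value) records
# that would live on a different node in a distributed setting.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("a", 5)],
]

def naive_group_sum(parts):
    # Naive approach: move every record to one place, then group and sum.
    shuffled = [rec for part in parts for rec in part]  # simulates a full shuffle
    totals = defaultdict(int)
    for key, val in shuffled:
        totals[key] += val
    return dict(totals), len(shuffled)  # records moved == all records

def preaggregated_group_sum(parts):
    # Pre-aggregate within each partition first (a map-side combine), so
    # only one partial sum per key per partition crosses the network.
    partials = []
    for part in parts:
        local = defaultdict(int)
        for key, val in part:
            local[key] += val
        partials.extend(local.items())
    totals = defaultdict(int)
    for key, val in partials:
        totals[key] += val
    return dict(totals), len(partials)  # records moved == partial sums only

naive_result, naive_moved = naive_group_sum(partitions)
combined_result, combined_moved = preaggregated_group_sum(partitions)
assert naive_result == combined_result  # same answer, less data movement
print(naive_moved, combined_moved)  # 5 records shuffled vs. 4 partial sums
```

&lt;p>Both variants produce identical totals, but the pre-aggregated one moves fewer records; this is the saving that combiner-style optimizations in frameworks such as Hadoop and Spark are designed to capture.&lt;/p>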
&lt;p>To mitigate these challenges, there is a pressing need to design better caching, prefetching, and heuristic strategies for data preprocessing. The team aims to significantly reduce the time and resources required for preprocessing by optimizing data retrieval and computational processes.&lt;/p>
&lt;p>However, prior to the design and implementation of such a system, a systematic understanding of the preprocessing workflow is essential. Hence, throughout the program, the students will need to:&lt;/p>
&lt;ul>
&lt;li>Understand the current systems used to preprocess data for ML training, such as Hadoop or Spark.&lt;/li>
&lt;li>Collect the common datasets used for different types of ML models.&lt;/li>
&lt;li>Collect the typical operations used for preprocessing these datasets.&lt;/li>
&lt;li>Benchmark the performance of these operations in the existing frameworks under various experimental settings.&lt;/li>
&lt;li>Package the benchmark such that the team can later use it for reproduction or evaluation.&lt;/li>
&lt;/ul>
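&lt;p>A minimal benchmarking harness along the lines of the last two steps might look like the following sketch. The operations, dataset, and repeat count are illustrative placeholders, not the benchmark the project would actually ship.&lt;/p>

```python
import statistics
import time

def benchmark(op, data, repeats=5):
    """Run op(data) several times and report the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        op(data)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Two toy "preprocessing operations" standing in for real ones.
def normalize(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero on constant input
    return [(v - lo) / span for v in values]

def one_hot(values):
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]

# A small synthetic dataset with 8 categories.
data = [i % 8 for i in range(10_000)]
for name, op in [("normalize", normalize), ("one_hot", one_hot)]:
    print(f"{name}: {benchmark(op, data):.6f} s")
```

&lt;p>Reporting the median rather than the mean makes the numbers less sensitive to one-off scheduling noise; a real harness would also vary dataset size and framework configuration as separate experimental settings.&lt;/p>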
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A rolodex of the commonly used datasets, their corresponding preprocessing operations, and the expected output formats/types&lt;/li>
&lt;li>A Chameleon Trovi package that preprocesses the datasets with a single-machine preprocessing framework such as pandas&lt;/li>
&lt;li>A Chameleon Trovi package that preprocesses the datasets in an existing distributed computation framework such as Hadoop or Spark&lt;/li>
&lt;/ul></description></item><item><title>EdgeRep: Reproducing and benchmarking edge analytic systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/edgerep/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/edgerep/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> video analytics, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> (contact person), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/junchen-jiang/">Junchen Jiang&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>With the flourishing of ideas like smart cities and smart manufacturing, a
massive number of edge devices (e.g., traffic or security cameras,
thermometers, flood sensors, etc.) are deployed and connected to the network.
These devices collect and analyze data across space and time, aiding
stakeholders like city governments and manufacturers in optimizing their plans
and operations. However, the sheer number of edge devices and the large amount
of communication between the devices and central servers raise significant
challenges in managing and scheduling resources. These resources include network
bandwidth between the devices and computing power on both edge devices and bare
metal servers, all to maintain the reliable service capability of running
applications.&lt;/p>
&lt;p>Moreover, given the limited resources available to edge devices, there&amp;rsquo;s an
emerging trend to reduce average compute and/or bandwidth usage. This is
achieved by leveraging the uneven distribution of interesting events with
respect to both time and space in the input data. This, in turn, introduces
further challenges in provisioning and managing the amount of resources
available to edge devices. The resource demands of running applications can
greatly depend on the input data, which is both dynamic and unpredictable.&lt;/p>
&lt;p>Keeping these challenges in mind, the team previously designed and implemented
a dynamic resource manager capable of understanding the applications and making
decisions based on this understanding at runtime. However, such a resource
manager has only been tested with a limited number and variety of video analytics
applications. Thus, through the OSRE24 project, we aim to:&lt;/p>
&lt;ul>
&lt;li>Collect a wide range of videos to form a comprehensive video dataset&lt;/li>
&lt;li>Reproduce other state-of-the-art self-adaptive video analytics applications&lt;/li>
&lt;li>Package the dataset and the applications for publication on the Chameleon
Trovi site&lt;/li>
&lt;/ul>
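&lt;p>To make the input-dependent resource demand concrete, here is a hedged sketch of one self-adaptive pattern: a cheap frame-difference check gates the expensive analytics model, so average compute tracks the uneven distribution of interesting events. Frames are flat lists of pixel values, and the run_model callable is a stand-in for a real detector.&lt;/p>

```python
# Hypothetical sketch of input-dependent compute on an edge device: a
# cheap per-frame filter decides whether to run the expensive model at all.
def mean_abs_diff(frame_a, frame_b):
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def filtered_inference(frames, run_model, threshold=10.0):
    """Run run_model only on frames that changed enough since the last
    processed frame; return the results and how many frames were skipped."""
    results, last_processed, calls = [], None, 0
    for frame in frames:
        if last_processed is None or mean_abs_diff(frame, last_processed) > threshold:
            results.append(run_model(frame))
            last_processed = frame
            calls += 1
    return results, len(frames) - calls

# A static scene with one brief "event" in the middle.
static = [0] * 64
event = [255] * 64
frames = [static] * 5 + [event] + [static] * 4
detections, skipped = filtered_inference(frames, run_model=lambda f: max(f))
print(detections, skipped)  # most static frames never reach the model
```

&lt;p>The flip side, as noted above, is that the number of model invocations now depends entirely on the video content, which is exactly what makes resource provisioning for such applications hard to predict.&lt;/p>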
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A comprehensive video dataset assembled from a wide range of videos&lt;/li>
&lt;li>Reproductions of state-of-the-art self-adaptive video analytics applications&lt;/li>
&lt;li>A package containing the dataset and the applications, published on the
Chameleon Trovi site&lt;/li>
&lt;/ul></description></item><item><title>FEP-Bench: Benchmarks for understanding featuring engineering and preprocessing bottlenecks</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ibm/fep-bench/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ibm/fep-bench/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> storage system, scheduling, distributed system, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> (contact person), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/swami-sundararaman/">Swami Sundararaman&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>In the realm of machine learning (ML), preprocessing of data is a critical yet
often underappreciated phase, consuming approximately 80% of the time in common
ML tasks. This extensive time consumption can be attributed to various
challenges encountered from both data and computation perspectives.&lt;/p>
&lt;p>From the data side, one significant challenge is the slow retrieval of data
from data lakes, which are storage repositories that hold a vast amount of raw
data in its native format. However, the process of extracting this data can be
slow, causing computation cycles to wait for data arrival and leading to delays
in the entire preprocessing phase. Furthermore, the size of the data often
exceeds the memory capacity of standard computing systems. This is a frequent
occurrence in ML, as datasets are typically large and complex. Handling such
large datasets requires sophisticated memory management techniques to ensure
efficient preprocessing without overwhelming the system&amp;rsquo;s memory.&lt;/p>
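&lt;p>One common mitigation, sketched below with only the Python standard library, is to stream the data in bounded chunks so memory use stays constant regardless of dataset size. The in-memory CSV is a stand-in for a file too large to load at once; a real pipeline would read from the data lake instead.&lt;/p>

```python
import csv
import io

# A StringIO stands in for a dataset that would not fit in memory.
raw = io.StringIO("value\n" + "\n".join(str(i) for i in range(1000)))

def chunked_sum(fileobj, chunk_size=100):
    """Aggregate a CSV column while holding at most chunk_size rows at once."""
    reader = csv.DictReader(fileobj)
    total, chunk = 0, []
    for row in reader:
        chunk.append(int(row["value"]))
        if len(chunk) == chunk_size:
            total += sum(chunk)  # process one bounded chunk, then discard it
            chunk = []
    return total + sum(chunk)  # fold in the leftover partial chunk

print(chunked_sum(raw))  # 499500, the sum of 0..999
```

&lt;p>Libraries such as pandas expose the same idea (for example, chunked CSV reading), and distributed frameworks generalize it by streaming partitions through workers instead of chunks through one process.&lt;/p>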
&lt;p>On the computation side, a naive solution to data operations, especially
aggregation, often leads to inefficiencies. These operations may require
grouping a large chunk of data as a prerequisite before performing any actual
computation. This grouping, without careful configuration and management, can
trigger serious data shuffling, leading to extensive remote data movement when
the data is distributed across various storage systems. Such data movement is
not only time-consuming but also resource-intensive.&lt;/p>
&lt;p>To mitigate these challenges, there is a pressing need to design better
caching, prefetching, and heuristic strategies for data preprocessing. The team
aims to significantly reduce the time and resources required for preprocessing
by optimizing data retrieval and computational processes.&lt;/p>
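&lt;p>As a small taste of the caching side, the following sketch keeps recently used dataset shards in an in-memory LRU cache so repeated accesses avoid the slow fetch. The shard loader, its contents, and the access pattern are hypothetical placeholders for real data-lake reads.&lt;/p>

```python
from functools import lru_cache

# Counter standing in for an expensive remote read from the data lake.
load_calls = {"count": 0}

@lru_cache(maxsize=8)
def load_shard(shard_id):
    load_calls["count"] += 1   # pretend this line is a slow remote fetch
    return [shard_id] * 4      # toy shard contents

# An access pattern with heavy reuse, as preprocessing pipelines often have.
for shard in [0, 1, 0, 2, 1, 0, 3, 0]:
    load_shard(shard)

print(load_calls["count"])  # 4 distinct shards fetched, despite 8 accesses
```

&lt;p>Prefetching is the complementary strategy: rather than waiting for a miss, the loader would speculatively fetch the shards the pipeline is predicted to need next, overlapping retrieval with computation.&lt;/p>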
&lt;p>However, prior to the design and implementation of such a system, a systematic
understanding of the preprocessing workflow is essential. Hence, throughout the
program, the students will need to:&lt;/p>
&lt;ul>
&lt;li>Understand the current systems used to preprocess data for ML training,
such as Hadoop or Spark.&lt;/li>
&lt;li>Collect the common datasets used for different types of ML models.&lt;/li>
&lt;li>Collect the typical operations used for preprocessing these datasets.&lt;/li>
&lt;li>Benchmark the performance of these operations in the existing frameworks
under various experimental settings.&lt;/li>
&lt;li>Package the benchmark such that the team can later use it for reproduction or
evaluation.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A rolodex of the commonly used datasets, their typical preprocessing
operations, and the expected output formats/types&lt;/li>
&lt;li>Benchmark results for these operations on the existing frameworks (for
example, Hadoop or Spark) under various experimental settings&lt;/li>
&lt;li>A packaged benchmark that the team can later use for reproduction or
evaluation&lt;/li>
&lt;/ul></description></item></channel></rss>