<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>storage system | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/storage-system/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/storage-system/index.xml" rel="self" type="application/rss+xml"/><description>storage system</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Wed, 12 Jun 2024 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>storage system</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/storage-system/</link></image><item><title>OpenMLEC: Open-source MLEC implementation with HDFS on top of ZFS</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/openmlec/202406012-jiajunmao/</link><pubDate>Wed, 12 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/openmlec/202406012-jiajunmao/</guid><description>&lt;p>Hello, I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jiajun-mao/">Jiajun Mao&lt;/a>, a BS/MS student at the University of Chicago studying Computer Science. I will be spending this summer working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ornl/openmlec/">OpenMLEC: Open-source MLEC implementation with HDFS on top of ZFS&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/">Meng Wang&lt;/a>
and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjus-george/">Anjus George&lt;/a>. My &lt;a href="https://docs.google.com/document/d/1nYgNlGdl0jUgW8avpu671oRpMoxaZHZPwlDfBNXRVro/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> is available online.&lt;/p>
&lt;p>How to increase data durability and reliability while decreasing storage cost has long been an interesting research topic. In recent years, erasure-coded storage systems have been seen as strong candidates to replace replication for colder storage tiers. In the paper “Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers”, the authors used theory and simulation to explore how a multi-level erasure-coded system can outperform systems using single-level erasure codes in areas such as encoding throughput and the network bandwidth consumed for repair, addressing a few pain points in adopting erasure-coded storage systems. I will turn the theoretical and simulation results of this paper into a working implementation built on top of HDFS and ZFS, and benchmark its performance.&lt;/p>
&lt;p>The project aims to achieve the following:&lt;/p>
&lt;ul>
&lt;li>HDFS understands the underlying characteristics of ZFS as the filesystem&lt;/li>
&lt;li>HDFS understands failure reports from ZFS and uses new, specialized MLEC repair logic to execute parity repair&lt;/li>
&lt;li>ZFS can accept repair data from HDFS to repair a pool that was suspended because of catastrophic data corruption&lt;/li>
&lt;/ul></description></item><item><title>FEP-Bench: Benchmarks for understanding feature engineering and preprocessing bottlenecks</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ibm/fep-bench/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ibm/fep-bench/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> storage system, scheduling, distributed system, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> (contact person), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/swami-sundararaman/">Swami Sundararaman&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>In machine learning (ML), data preprocessing is a critical yet
often underappreciated phase that consumes approximately 80% of the time in common
ML tasks. This can be attributed to challenges on both the data side and the
computation side.&lt;/p>
&lt;p>From the data side, one significant challenge is the slow retrieval of data
from data lakes, which are storage repositories that hold a vast amount of raw
data in its native format. Extracting this data can be
slow, causing computation cycles to wait for data arrival and delaying
the entire preprocessing phase. Furthermore, the size of the data often
exceeds the memory capacity of standard computing systems, a frequent
occurrence in ML, where datasets are typically large and complex. Handling such
large datasets requires sophisticated memory management techniques to ensure
efficient preprocessing without overwhelming the system&amp;rsquo;s memory.&lt;/p>
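&lt;p>One common way to handle data that does not fit in memory is to process it in fixed-size chunks and keep only running aggregates. The following is a minimal sketch of that idea, not part of the project itself; &lt;code>read_chunks&lt;/code> is a hypothetical stand-in for a reader that pulls rows from a data lake.&lt;/p>

```python
def read_chunks(n_rows, chunk_size):
    """Hypothetical data source: yields rows in fixed-size chunks."""
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        yield [float(i) for i in range(start, stop)]


def streaming_mean(chunks):
    """Compute a mean while holding only one chunk in memory at a time."""
    count, total = 0, 0.0
    for chunk in chunks:
        count += len(chunk)
        total += sum(chunk)
    return total / count if count else 0.0


# Only chunk_size rows are resident at once, regardless of n_rows.
mean = streaming_mean(read_chunks(n_rows=10_000, chunk_size=1_000))
print(mean)
```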
&lt;p>On the computation side, a naive approach to data operations, especially
aggregation, often leads to inefficiencies. These operations may require
grouping a large chunk of data as a prerequisite before performing any actual
computation. Without careful configuration and management, this grouping can
trigger heavy data shuffling, leading to extensive remote data movement when
the data is distributed across various storage systems. Such data movement is
not only time-consuming but also resource-intensive.&lt;/p>
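&lt;p>To illustrate why grouping can be expensive, the sketch below compares a naive aggregation that moves every record with one that first aggregates inside each partition and then moves one record per key. It is a simplified, single-process model under assumed data; the function names and the sample partitions are hypothetical, and real frameworks implement this differently.&lt;/p>

```python
from collections import defaultdict


def shuffle_all(partitions):
    """Naive approach: move every (key, value) record before aggregating."""
    moved = [record for part in partitions for record in part]
    totals = defaultdict(int)
    for key, value in moved:
        totals[key] += value
    return dict(totals), len(moved)


def combine_then_shuffle(partitions):
    """Aggregate within each partition first, then move one record per key."""
    moved = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        moved.extend(local.items())
    totals = defaultdict(int)
    for key, value in moved:
        totals[key] += value
    return dict(totals), len(moved)


# Hypothetical input: three partitions of (key, count) records.
partitions = [
    [("cat", 1), ("dog", 1), ("cat", 1)],
    [("dog", 1), ("dog", 1), ("cat", 1)],
    [("bird", 1), ("bird", 1), ("bird", 1)],
]
naive_totals, naive_moved = shuffle_all(partitions)
combined_totals, combined_moved = combine_then_shuffle(partitions)
print(naive_moved, combined_moved)
```

&lt;p>Both functions return the same totals; the difference is the number of records that would cross the network.&lt;/p>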
&lt;p>To mitigate these challenges, there is a pressing need to design better
caching, prefetching, and heuristic strategies for data preprocessing. The team
aims to significantly reduce the time and resources required for preprocessing
by optimizing data retrieval and computational processes.&lt;/p>
&lt;p>However, prior to the design and implementation of such a system, a systematic
understanding of the preprocessing workflow is essential. Hence, throughout the
program, the students will need to:&lt;/p>
&lt;ul>
&lt;li>Understand the current system used to preprocess data for ML training, for
example, Hadoop or Spark.&lt;/li>
&lt;li>Collect the common datasets used for different types of ML models.&lt;/li>
&lt;li>Collect the typical operations used for preprocessing these datasets.&lt;/li>
&lt;li>Benchmark the performance of these operations in the existing frameworks
under various experimental settings.&lt;/li>
&lt;li>Package the benchmark such that the team can later use it for reproduction or
evaluation.&lt;/li>
&lt;/ul>
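&lt;p>As a rough illustration of the benchmarking step, the sketch below times a single preprocessing operation over several runs and reports summary statistics. It is only a starting point under assumed inputs: &lt;code>benchmark&lt;/code> and &lt;code>normalize&lt;/code> are hypothetical names, and a real harness would target Hadoop or Spark jobs rather than an in-process Python function.&lt;/p>

```python
import statistics
import time


def benchmark(operation, data, repeats=5):
    """Run operation(data) several times and summarize wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation(data)
        timings.append(time.perf_counter() - start)
    return {
        "repeats": repeats,
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


def normalize(values):
    """Hypothetical preprocessing step: min-max scale values to [0, 1]."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero on constant input
    return [(v - low) / span for v in values]


# Synthetic input standing in for a real dataset column.
data = [float(i % 1000) for i in range(100_000)]
report = benchmark(normalize, data)
print(report)
```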
&lt;p>&lt;strong>Project Deliverables&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the current system used to preprocess data for ML training, for
example, Hadoop or Spark.&lt;/li>
&lt;li>Collect the common datasets used for different types of ML models.&lt;/li>
&lt;li>Collect the typical operations used for preprocessing these datasets.&lt;/li>
&lt;li>Benchmark the performance of these operations in the existing frameworks
under various experimental settings.&lt;/li>
&lt;li>Package the benchmark such that the team can later use it for reproduction or evaluation.&lt;/li>
&lt;/ul></description></item></channel></rss>