<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>storage systems | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/storage-systems/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/storage-systems/index.xml" rel="self" type="application/rss+xml"/><description>storage systems</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Wed, 06 Aug 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>storage systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/storage-systems/</link></image><item><title>Midterm Report: Simulation, Comparison, and Conclusion of Cache Eviction</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/harvard/cachebench/2025-08-06-haochengxia/</link><pubDate>Wed, 06 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/harvard/cachebench/2025-08-06-haochengxia/</guid><description>&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>&lt;strong>CacheBench&lt;/strong> is a benchmarking suite designed for comprehensive cache performance evaluation, with a particular focus on analyzing the miss ratios of various cache eviction algorithms.&lt;/p>
&lt;p>At the core of CacheBench lie two key components: the high-performance cache simulator, &lt;a href="https://github.com/1a1a11a/libCacheSim" target="_blank" rel="noopener">libCacheSim&lt;/a>, and the extensive &lt;a href="https://github.com/cacheMon/cache_dataset" target="_blank" rel="noopener">open-source cache datasets&lt;/a>, which collectively contain over 8,000 traces from diverse applications. This ensures broad coverage across a range of realistic workloads.&lt;/p>
&lt;p>Our primary goal is to evaluate all major and widely-used cache eviction algorithms on thousands of traces, in order to gain insights into their behaviors and design trade-offs. Additionally, we aim to identify and distill representative workloads, making benchmarking more efficient and comprehensive for future cache research.&lt;/p>
&lt;h2 id="progress-and-pain-points">Progress and Pain Points&lt;/h2>
&lt;p>We began by benchmarking prevalent eviction algorithms, including FIFO, LRU, CLOCK, LFU, Random, Belady (BeladySize), CAR, ARC, LIRS, LHD, Hyperbolic, GDSF, W-TinyLFU, 2Q, SLRU, S3-FIFO, SIEVE, and LeCaR. As we developed the suite, we made progressive improvements to both the simulator and dataset infrastructure. Our progress can be summarized as follows:&lt;/p>
&lt;ul>
&lt;li>Collected miss ratio results for all listed algorithms across 8,000+ traces.&lt;/li>
&lt;li>Identified best- and worst-performing traces for each algorithm, and conducted feature analysis of these traces.&lt;/li>
&lt;li>Developed Python bindings: To increase accessibility, we provided a Python package that allows users to easily download traces and run simulation analyses using &lt;a href="https://github.com/1a1a11a/libCacheSim" target="_blank" rel="noopener">libCacheSim&lt;/a> and the &lt;a href="https://github.com/cacheMon/cache_dataset" target="_blank" rel="noopener">cache datasets&lt;/a>.&lt;/li>
&lt;/ul>
&lt;p>However, analysis remains challenging because there is no universally accepted metric or baseline for objectively comparing cache eviction algorithms&amp;rsquo; performance across all workloads.&lt;/p>
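&lt;p>As a rough illustration of why a single ranking is hard to pin down, the sketch below computes miss ratios for two of the simpler policies, LRU and FIFO, on a tiny hand-written trace. It is a minimal pure-Python sketch, not libCacheSim code: the function names, the toy trace, and the cache size of 2 are all assumptions made for this example, and objects are assumed to have unit size.&lt;/p>

```python
from collections import OrderedDict, deque


def lru_miss_ratio(trace, capacity):
    """Replay a trace through an LRU cache and return the miss ratio."""
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)  # a hit refreshes recency
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict the least recently used key
            cache[key] = True
    return misses / len(trace)


def fifo_miss_ratio(trace, capacity):
    """Replay a trace through a FIFO cache and return the miss ratio."""
    queue = deque()
    resident = set()
    misses = 0
    for key in trace:
        if key not in resident:
            misses += 1
            if len(queue) == capacity:
                resident.discard(queue.popleft())  # evict the oldest insertion
            queue.append(key)
            resident.add(key)
    return misses / len(trace)


# Hypothetical toy trace: object "a" is re-referenced between other requests.
toy_trace = ["a", "b", "a", "c", "a", "d", "a", "e"]
lru = lru_miss_ratio(toy_trace, capacity=2)
fifo = fifo_miss_ratio(toy_trace, capacity=2)
print(f"LRU miss ratio:  {lru}")
print(f"FIFO miss ratio: {fifo}")
```

&lt;p>On this particular trace LRU misses less often than FIFO, but the gap depends on the trace and the cache size chosen, which is one reason a single number rarely settles a comparison across thousands of workloads.&lt;/p>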
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>For the second half of the project, my focus will shift to:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Evaluating More Complex Eviction Algorithms&lt;/strong>: Having concentrated mainly on static eviction policies so far (which are generally more deterministic and understandable), I will now investigate learning-based eviction algorithms such as LRB and 3L-Cache. These models incorporate learning components and incur additional computational overhead, making simulations slower and more complex.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Detailed Trace Analysis&lt;/strong>: Since eviction algorithms can have highly variable performance on the same trace, I plan to analyze why certain algorithms excel on specific traces while others do not. Understanding these factors is crucial to characterizing both the algorithms and the workload traces.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Constructing Representative Workload Sets&lt;/strong>: Based on ongoing simulations and trace analyses, I aim to identify a minimal but representative subset of traces that can serve as a basic evaluation suite, simplifying testing and improving accessibility.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="reflection">Reflection&lt;/h2>
&lt;p>This project has truly been the highlight of my summer. By evaluating a wide range of cache eviction algorithms, I&amp;rsquo;ve significantly deepened my understanding of cache design and its underlying principles.&lt;/p>
&lt;p>I&amp;rsquo;m especially grateful to my mentors for their constant support, patience, and guidance throughout this journey. It’s been a privilege to learn from you!&lt;/p>
&lt;p>I&amp;rsquo;m excited to see the final results of CacheBench!&lt;/p></description></item><item><title>Building a Benchmarking Suite for Cache Performance Evaluation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/harvard/cachebench/2025-06-21-haochengxia/</link><pubDate>Sat, 21 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/harvard/cachebench/2025-06-21-haochengxia/</guid><description>&lt;p>Hi! I&amp;rsquo;m Haocheng Xia, a Computer Science student at the &lt;strong>University of Illinois Urbana-Champaign&lt;/strong>, passionate about the intersection of &lt;strong>machine learning and storage systems&lt;/strong>. Specifically, I&amp;rsquo;m keen on &lt;strong>workload analysis&lt;/strong> and &lt;strong>KV cache management for large language models&lt;/strong>.&lt;/p>
&lt;p>This summer, I&amp;rsquo;m happy to be a part of &lt;strong>SoR 2025&lt;/strong> and &lt;strong>OSRE 2025&lt;/strong>. I&amp;rsquo;m contributing to the &lt;strong>CacheBench&lt;/strong> project. My initiative, &lt;strong>&amp;lsquo;Building a Benchmarking Suite for Cache Performance Evaluation,&amp;rsquo;&lt;/strong> will create a robust platform. This involves extensive simulation of existing eviction algorithms using &lt;a href="https://github.com/cacheMon/libCacheSim" target="_blank" rel="noopener">libCacheSim&lt;/a>, developing microbenchmarks, and building a user-friendly platform for researchers to effortlessly evaluate novel cache designs. The ultimate goal is to establish a competitive leaderboard.&lt;/p>
&lt;p>My contributions will include a comprehensive dataset detailing simulated &lt;strong>miss ratios&lt;/strong> and &lt;strong>throughput&lt;/strong> of current cache eviction algorithms, an extension to &lt;a href="https://github.com/cacheMon/libCacheSim" target="_blank" rel="noopener">libCacheSim&lt;/a> for executing microbenchmarks both locally and on our online platform, and the creation and ongoing maintenance of a public web leaderboard. I&amp;rsquo;m grateful to be mentored by &lt;strong>Juncheng Yang&lt;/strong> and &lt;strong>Yazhuo Zhang&lt;/strong>.&lt;/p>
&lt;p>I&amp;rsquo;m thrilled to be part of building tools that empower users and advance the vision of a more decentralized web. Looking forward to a productive summer!&lt;/p></description></item><item><title>CacheBench: Building a Benchmarking Suite for Cache Performance Evaluation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/harvard/cachebench/</link><pubDate>Fri, 28 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/harvard/cachebench/</guid><description>&lt;h3 id="overview">Overview&lt;/h3>
&lt;p>In this project, we aim to develop a comprehensive benchmarking suite, CacheBench, for evaluating the performance of cache systems in modern computing environments. Caches play a crucial role in enhancing system performance by reducing latency and improving data access speeds. However, evaluating cache performance is a complex task that requires a diverse set of workloads and metrics to capture the cache&amp;rsquo;s behavior accurately. The current focus is on eviction algorithms; if time permits, we will extend the suite to other components of cache design.&lt;/p>
&lt;p>This project will have three main components:&lt;/p>
&lt;ol>
&lt;li>Implementing and benchmarking existing cache eviction algorithms in &lt;a href="https://libcachesim.com/" target="_blank" rel="noopener">libCacheSim&lt;/a> using large-scale simulation. This part will mainly focus on reproducing existing works.&lt;/li>
&lt;li>Developing a set of microbenchmarks and a platform for researchers to evaluate new designs with little effort in the future. This part will focus on building the open-source infrastructure for future research.&lt;/li>
&lt;li>Developing a leaderboard for the community to submit new algorithms and workloads. This part will focus on building the community and fostering adoption and collaboration.&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> storage systems, benchmarking, performance evaluation&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C programming, web programming (e.g., node.js, React), database management&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juncheng-yang/">Juncheng Yang&lt;/a>, Yazhuo Zhang (&lt;a href="mailto:yazhuo@inf.ethz.ch">yazhuo@inf.ethz.ch&lt;/a>)&lt;/li>
&lt;/ul></description></item><item><title>FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fep_bench/</link><pubDate>Mon, 03 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fep_bench/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/swami-sundararaman/">Swami Sundararaman&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/lihaowen-jayce-zhu/">Lihaowen (Jayce) Zhu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In the realm of machine learning (ML), preprocessing of data is a critical yet often underappreciated phase, consuming approximately 80% of the time in common ML tasks. This extensive time consumption can be attributed to various challenges encountered from both data and computation perspectives.&lt;/p>
&lt;p>From the data side, one significant challenge is the slow retrieval of data from data lakes, which are storage repositories that hold a vast amount of raw data in its native format. Extracting this data can be slow, causing computation to stall while waiting for data to arrive and delaying the entire preprocessing phase. Furthermore, the size of the data often exceeds the memory capacity of standard computing systems. This is a frequent occurrence in ML, as datasets are typically large and complex. Handling such large datasets requires sophisticated memory management techniques to ensure efficient preprocessing without overwhelming the system&amp;rsquo;s memory.&lt;/p>
&lt;p>On the computation side, a naive solution to data operations, especially aggregation, often leads to inefficiencies. These operations may require grouping a large chunk of data as a prerequisite before performing any actual computation. This grouping, without careful configuration and management, can trigger serious data shuffling, leading to extensive remote data movement when the data is distributed across various storage systems. Such data movement is not only time-consuming but also resource-intensive.&lt;/p>
&lt;p>To mitigate these challenges, there is a pressing need to design better caching, prefetching, and heuristic strategies for data preprocessing. The team aims to significantly reduce the time and resources required for preprocessing by optimizing data retrieval and computational processes.&lt;/p>
&lt;p>However, prior to the design and implementation of such a system, a systematic understanding of the preprocessing workflow is essential. Hence, throughout the program, the students will need to:&lt;/p>
&lt;ul>
&lt;li>Understand the current system used to preprocess data for ML training, for example, Hadoop or Spark.&lt;/li>
&lt;li>Collect the common datasets used for different types of ML models.&lt;/li>
&lt;li>Collect the typical operations used for preprocessing these datasets.&lt;/li>
&lt;li>Benchmark the performance of these operations in existing frameworks under various experimental settings.&lt;/li>
&lt;li>Package the benchmark such that the team can later use it for reproduction or evaluation.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A rolodex of commonly used datasets with their corresponding preprocessing operations and expected output formats/types&lt;/li>
&lt;li>A Chameleon Trovi package that preprocesses the datasets with a single-machine preprocessing framework like pandas&lt;/li>
&lt;li>A Chameleon Trovi package that preprocesses the datasets in an existing distributed computation framework like Hadoop or Spark&lt;/li>
&lt;/ul></description></item><item><title>GPEC: An Open Emulation Platform to Evaluate GPU/ML Workloads on Erasure Coding Storage</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lanl/gpec/</link><pubDate>Thu, 08 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lanl/gpec/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage Systems, Machine Learning, Erasure Coding&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python, PyTorch, Bash scripting, Linux, Erasure Coding, Machine Learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/">Meng Wang&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-bent/">John Bent&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Large-scale data centers store immense amounts of user data across a multitude of disks, necessitating redundancy strategies like erasure coding (EC) to safeguard against disk failures. Numerous research efforts have sought to assess the performance and durability of various erasure coding approaches, including single-level erasure coding, locally recoverable coding, and multi-level erasure coding.&lt;/p>
&lt;p>Despite the widespread adoption of erasure coding, a significant research gap exists regarding the performance of large-scale erasure-coded storage systems when exposed to machine learning (ML) workloads. While conventional practice often leans towards replication for enhanced performance, this project seeks to explore whether cost-effective erasure coding can deliver comparable performance. In this context, several fundamental questions remain unanswered, including:&lt;/p>
&lt;ul>
&lt;li>Can a typical erasure-coded storage system deliver sufficient throughput for ML training tasks?&lt;/li>
&lt;li>Can an erasure-coded storage system maintain low-latency performance for ML training and inference workloads?&lt;/li>
&lt;li>How do disk failure and subsequent repair affect the throughput and latency of ML workloads?&lt;/li>
&lt;li>What influence do various erasure coding design choices, such as chunk placement strategies and repair methods, have on the aforementioned performance metrics?&lt;/li>
&lt;/ul>
&lt;p>To address these questions, the most straightforward approach would involve running ML workloads on large-scale erasure-coded storage systems within HPC data centers. However, this presents challenges for researchers and students due to limited access to expensive GPUs and distributed storage systems, especially when dealing with large-scale evaluations. Consequently, there is a need for a cost-effective evaluation platform.&lt;/p>
&lt;p>The objective of this project is to develop an open-source platform that facilitates cheap and reproducible evaluations of erasure-coded storage systems concerning ML workloads. This platform consists of two key components:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>GPU Emulator:&lt;/strong> This emulator is designed to simulate GPU performance for ML workloads. Development of the GPU emulator is near completion.&lt;/li>
&lt;li>&lt;strong>EC Emulator:&lt;/strong> This emulator is designed to simulate the performance characteristics of erasure-coded storage systems. It is still in the exploratory phase and requires further development.&lt;/li>
&lt;/ul>
&lt;p>The student&amp;rsquo;s responsibilities will include documenting the GPU emulator, advancing the development of the EC emulator, and packaging the experiments to ensure easy reproducibility. It is anticipated that this platform will empower researchers and students to conduct cost-effective and reproducible evaluations of large-scale erasure-coded storage systems in the context of ML workloads.&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Build an EC emulator to emulate the performance characteristics of large-scale erasure-coded storage systems&lt;/li>
&lt;li>Incorporate the EC emulator into ML workloads and GPU emulator&lt;/li>
&lt;li>Conduct reproducible experiments to evaluate the performance of erasure-coded storage systems in the context of ML workloads&lt;/li>
&lt;li>Publish a Trovi artifact shared on Chameleon Cloud and a GitHub repository with open-source code&lt;/li>
&lt;/ul></description></item><item><title>FSA: Benchmarking Fail-Slow Algorithms</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/failslowalgorithms/</link><pubDate>Tue, 06 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/failslowalgorithms/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ruidan-li/">Ruidan Li&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kexin-pei/">Kexin Pei&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In the realm of modern applications, achieving not only low but also predictable response times is a critical requirement. Performance instability, even when it amounts to just a few milliseconds of delay, can result in violations of Service Level Objectives (SLOs). Redundancy at the RAID group level provides a layer of protection; however, the early identification of potential slowdowns or failures is paramount in minimizing their impact on overall system latency.&lt;/p>
&lt;p>Fail-Slow represents a unique type of fault within storage systems, characterized by the system&amp;rsquo;s ability to continue functioning while progressively deteriorating – its performance significantly drops below expected levels. Notably, fail-slow conditions are responsible for a considerable share of latency tails. Detecting fail-slow faults is particularly challenging, as they can be easily masked by the normal fluctuations in performance. Consequently, the identification of fail-slow faults is a critical area of research, demanding meticulous attention.&lt;/p>
&lt;p>Several strategies have been developed to address the fail-slow issue, yet the question of their broad applicability remains. We plan to implement and assess various existing fail-slow detection algorithms, examining their strengths and weaknesses. Our analysis will concentrate on key questions:&lt;/p>
&lt;ul>
&lt;li>How promptly can the algorithm identify a fail-slow symptom?&lt;/li>
&lt;li>What methods does the algorithm employ to accurately distinguish fail-slow incidents, thereby minimizing false negatives?&lt;/li>
&lt;li>Through what approach does the algorithm achieve the right sensitivity level to keep false positives in check?&lt;/li>
&lt;/ul>
&lt;p>This evaluation aims to shed light on the effectiveness of current methodologies in detecting fail-slow faults, crucial for enhancing system reliability and performance.&lt;/p>
&lt;p>Building upon our evaluation of several fail-slow detection algorithms, our objective is to harness advanced machine learning (ML) models to develop a novel algorithm. This initiative seeks to address and potentially compensate for the identified weaknesses in existing methodologies. By focusing on the critical aspects of early detection, accurate differentiation, and optimal sensitivity, we aim to create a solution that reduces both false negatives and false positives, thereby enhancing overall system reliability. This approach represents a strategic effort to not only advance the current state of fail-slow detection but also to contribute significantly to the resilience and performance of storage systems.&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A Trovi artifact for the existing Fail-Slow detection algorithms on Chameleon Cloud&lt;/li>
&lt;li>A GitHub repository containing the full evaluation result&lt;/li>
&lt;li>A Google Colab notebook for quick replay&lt;/li>
&lt;/ul></description></item><item><title>OpenMLEC: Open-source MLEC implementation with HDFS on top of ZFS</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ornl/openmlec/</link><pubDate>Mon, 05 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ornl/openmlec/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage Systems, Erasure Coding&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Java, Bash scripting, Linux, HDFS, ZFS, Erasure Coding&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/">Meng Wang&lt;/a> (&lt;a href="mailto:wangm12@uchicago.edu">Main contact person&lt;/a>) and Anjus George&lt;/li>
&lt;/ul>
&lt;p>Multi-Level Erasure Coding (MLEC), which performs erasure coding at both network and local levels, has seen large deployments in practice. Our recent research work has shown that MLEC can provide high durability with higher encoding throughput and less repair network traffic compared to other erasure coding methods. This makes MLEC particularly appealing for large-scale data centers, especially high-performance computing (HPC) systems.&lt;/p>
&lt;p>However, current MLEC systems often rely on straightforward design choices, such as Clustered/Clustered (C/C) chunk placement and the Repair-All (RALL) method for catastrophic local failures. Our recent simulations [1] have revealed the potential benefits of more complex chunk placement strategies like Clustered/Declustered (C/D), Declustered/Clustered (D/C), and Declustered/Declustered (D/D). Additionally, advanced repair methods such as Repair Failed Chunks Only (RFCO), Repair Hybrid (RHYB), and Repair Minimum (RMIN) have shown promise for improving durability and performance according to our simulations. Despite promising simulation results, these optimized design choices have not been implemented in real systems.&lt;/p>
&lt;p>In this project, we propose to develop open-source MLEC implementations in real systems, offering a range of design choices from simple to complex. Our approach leverages ZFS for local-level erasure coding and HDFS for network-level erasure coding, supporting both clustered and declustered chunk placement at each level. The student&amp;rsquo;s responsibilities include setting up HDFS on top of ZFS, configuring various MLEC chunk placements (e.g., C/D, D/C, D/D), and implementing advanced repair methods within HDFS and ZFS. The project will culminate in reproducible experiments to evaluate the performance of MLEC systems under different design choices.&lt;/p>
&lt;p>We will open-source our code and aim to provide valuable insights to the community on optimizing erasure-coded systems. Additionally, we will provide comprehensive documentation of our work and share Trovi artifacts on Chameleon Cloud to facilitate easy reproducibility of our experiments.&lt;/p>
&lt;p>[1] Meng Wang, Jiajun Mao, Rajdeep Rana, John Bent, Serkay Olmez, Anjus George, Garrett Wilson Ransom, Jun Li, and Haryadi S. Gunawi. Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers. In The International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’23), 2023.&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Open-source MLEC implementations with a diverse range of design choices.&lt;/li>
&lt;li>Configuration setup for HDFS on top of ZFS, supporting various MLEC chunk placements.&lt;/li>
&lt;li>Implementation of advanced repair methods within HDFS and ZFS.&lt;/li>
&lt;li>Reproducible experiments to assess the performance of MLEC systems across distinct design choices.&lt;/li>
&lt;li>Comprehensive documentation of the project and the provision of shared Trovi artifacts on Chameleon Cloud for ease of reproducibility.&lt;/li>
&lt;/ul></description></item><item><title>FetchPipe: Data Science Pipeline for ML-based Prefetching</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fetchpipe/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fetchpipe/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniar-h.-kurniawan/">Daniar H. Kurniawan&lt;/a> (primary contact), Haryadi Gunawi&lt;/li>
&lt;/ul>
&lt;p>The contemporary landscape of high-performance servers, particularly those designed for data centers and AI/ML training, prominently features solid-state drives (SSDs) and spinning disks (HDDs) as primary storage devices. These components play a crucial role in shaping overall system performance, underscoring the importance of addressing and minimizing Input/Output (I/O) latency. This is particularly crucial given the widespread adoption of hybrid storage systems, where caching and prefetching strategies are instrumental in optimizing storage performance. Caching involves using faster but less dense memory to store frequently accessed data, while prefetching aims to reduce latency by fetching data from slower memory to cache before it is needed. Although both caching and prefetching present valid challenges, our primary emphasis is on the prefetching problem due to the inherent difficulty in predicting future access.&lt;/p>
&lt;p>Traditional prefetchers, dating back one to two decades, rely heavily on predefined rules for prefetching based on logical block address (LBA) access sequences, limiting their adaptability to complex scenarios. For instance, the read-ahead prefetcher is confined to prefetching the next data item within a file for faster sequential access. To address this limitation, recent advancements include learning-based methods, such as Long Short-Term Memory (LSTM) techniques like DeepPrefetcher and Delta LSTM, which model the LBA delta to cover a broader range of LBAs. However, they still struggle to achieve high accuracy when the workload pattern changes drastically. Although some sophisticated prefetchers can learn complex I/O access patterns using graph structures, their computational cost makes them difficult to deploy.&lt;/p>
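&lt;p>The difference between a fixed read-ahead rule and a delta-based view of the same accesses can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the function names and the two LBA sequences are hypothetical, and no learned model such as DeepPrefetcher or Delta LSTM is implemented here. The code only computes the LBA deltas such models take as input and measures how often a simple read-ahead rule would have covered the next access.&lt;/p>

```python
def lba_deltas(lbas):
    """Return the difference between each pair of consecutive LBAs."""
    return [nxt - cur for cur, nxt in zip(lbas, lbas[1:])]


def read_ahead(lba, depth=1):
    """Rule-based prefetch: always request the next `depth` blocks."""
    return [lba + step for step in range(1, depth + 1)]


def read_ahead_accuracy(lbas, depth=1):
    """Fraction of accesses that the previous access's read-ahead covered."""
    covered = sum(
        1 for cur, nxt in zip(lbas, lbas[1:]) if nxt in read_ahead(cur, depth)
    )
    return covered / (len(lbas) - 1)


# Hypothetical access sequences: one sequential, one with a constant stride.
sequential = [100, 101, 102, 103, 104]
strided = [100, 108, 116, 124, 132]

print(lba_deltas(sequential), read_ahead_accuracy(sequential))
print(lba_deltas(strided), read_ahead_accuracy(strided))
```

&lt;p>Read-ahead covers every access in the sequential sequence and none in the strided one, even though the strided deltas are perfectly regular. That regularity in the delta stream is the kind of signal a delta-based model can exploit where a fixed rule cannot.&lt;/p>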
&lt;p>In this project, our goal is to provide an end-to-end data science pipeline to empower the research on ML-based prefetchers. We believe that this pipeline is crucial for fostering active collaboration between the ML community and storage systems researchers. This collaboration aims to optimize existing ML-based prefetching solutions. Specifically, we will provide the dataset for training/testing and some samples of ML-based models that can further be developed by the community. Furthermore, we will also provide a setup for evaluating the ML model when deployed in storage systems.&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Compile I/O traces from various open traces and open systems.&lt;/li>
&lt;li>Develop a pipeline for building ML-based prefetching solutions.&lt;/li>
&lt;li>Build a setup to evaluate the model in a real hybrid storage system.&lt;/li>
&lt;li>Publish a Trovi artifact shared on Chameleon Cloud and a GitHub repository&lt;/li>
&lt;/ul></description></item><item><title>FlashNet: Towards Reproducible Data Science for Storage System</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet/</link><pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet/</guid><description>&lt;p>The Data Storage Research Vision 2025, organized in an NSF workshop, calls for more “AI for storage” research. However, performing ML-for-storage research can be a daunting task for new storage researchers. A researcher must know both the storage side and the ML side, as if studying two different fields at the same time. This project aims to answer these questions:&lt;/p>
&lt;ol>
&lt;li>How can we encourage data scientists to look into storage problems?&lt;/li>
&lt;li>How can we create a transparent platform that allows such decoupling?&lt;/li>
&lt;li>Within the storage/ML community, can we create two collaborative communities: the storage engineers and the storage data scientists?&lt;/li>
&lt;/ol>
&lt;p>In the ML/deep learning community, the large ImageNet benchmarks have spurred research in image recognition. Similarly, we would like to provide benchmarks that foster storage research in ML-based per-IO latency prediction. We therefore present FlashNet, a reproducible data science platform for storage systems. As a first case study, FlashNet targets I/O latency prediction. With FlashNet, data engineers can collect the I/O traces of various devices, and data scientists can then train ML models to predict I/O latency from those traces. All traces, results, and code will be shared on the FlashNet training ground platform, which uses Chameleon Trovi for better reproducibility.&lt;/p>
&lt;p>In this project, we plan to improve the modularity of the FlashNet pipeline and develop the Chameleon Trovi packages. We will also continue to improve the performance of our binary and multiclass classifiers and test them on the new production traces that we collected from the SNIA IOTTA public trace repository. Finally, we will optimize the deployment of our continual-learning mechanism and test it in a cloud system environment. To the best of our knowledge, we are building the world's first end-to-end data science platform for storage systems.&lt;/p>
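&lt;p>As a rough illustration of what a binary per-IO latency prediction task might look like, the sketch below labels each I/O in a trace as slow or fast and builds simple history-based feature rows. The percentile threshold, the feature choice, and the function names are assumptions made for illustration only; they are not taken from the FlashNet pipeline.&lt;/p>

```python
from statistics import quantiles


def label_slow_ios(latencies_us, percentile=90):
    """Return 1 for IOs at or above the given latency percentile, else 0.

    The percentile cut-off is a hypothetical labeling rule.
    """
    cut_points = quantiles(latencies_us, n=100, method="inclusive")
    threshold = cut_points[percentile - 1]
    return [1 if lat >= threshold else 0 for lat in latencies_us]


def history_features(latencies_us, sizes_kb, window=3):
    """Build one feature row per IO: its size plus the previous `window` latencies.

    The first `window` IOs are skipped because they lack a full history.
    """
    rows = []
    for i in range(window, len(latencies_us)):
        rows.append([sizes_kb[i]] + latencies_us[i - window:i])
    return rows


# Toy trace: per-IO latency in microseconds and request size in KB.
latencies = [100, 110, 105, 95, 2000, 120, 100, 98, 102, 3000]
sizes = [4] * len(latencies)

labels = label_slow_ios(latencies, percentile=80)
features = history_features(latencies, sizes, window=3)
```

&lt;p>Rows from &lt;code>features&lt;/code> and the matching entries of &lt;code>labels[3:]&lt;/code> could then be passed to any classifier; which model and features work best is exactly the kind of question such a platform is meant to help answer.&lt;/p>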
&lt;h3 id="building-flashnet-platform">Building FlashNet Platform&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, reproducibility, machine learning, continual learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C++, Python, PyTorch, experience with machine learning pipelines&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/justin-shin/">Justin Shin&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/maharani-ayu-putri-irawan/">Maharani Ayu Putri Irawan&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Build an open-source platform to enable collaboration between the storage and ML communities, specifically to provide a common platform for advancing data science research for storage systems. The platform will be able to reproduce and evaluate different ML models/architectures, dataset patterns, data preprocessing techniques, and feature engineering strategies.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Reproduce the FlashNet evaluation results from prior works.&lt;/li>
&lt;li>Build and improve FlashNet components based on the existing blueprint.&lt;/li>
&lt;li>Collect and analyze the FlashNet evaluation results.&lt;/li>
&lt;/ul></description></item><item><title>Reproducible Evaluation of Multi-level Erasure Coding</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ornl/multilevelerasure/</link><pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ornl/multilevelerasure/</guid><description>&lt;p>Massive storage systems rely heavily on erasure coding (EC) to protect data from drive failures and provide data durability. Existing storage systems mostly adopt single-level erasure coding (SLEC) to protect data, either performing EC at the network level or performing EC at the local level. However, both SLEC approaches have limitations, as network-only SLEC introduces heavy network traffic overhead, and local-only SLEC cannot tolerate rack failures.&lt;/p>
&lt;p>Accordingly, some data centers are starting to use multi-level erasure coding (MLEC), a hybrid approach that performs EC at both the network level and the local level. However, prior EC research and evaluations mostly focused on SLEC, and it remains an open question how MLEC compares to SLEC in terms of durability, capacity overhead, encoding throughput, network traffic, and other overheads.&lt;/p>
&lt;p>Therefore, in this project we seek to build a platform to evaluate the durability and overheads of MLEC. The platform will allow us to evaluate dozens of EC strategies in many dimensions including recovery strategies, chunk placement choices, various parity schemes, etc. To the best of our knowledge, there is no other evaluation platform like what we propose here. We seek to make the platform open-source and the evaluation reproducible, allowing future researchers to benefit from it and conduct more research on MLEC.&lt;/p>
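&lt;p>To make one of these dimensions concrete, the sketch below compares the capacity overhead of a single-level code with that of a two-level code. It assumes that every chunk produced by the network-level code is itself encoded by the local-level code, so the two expansion factors multiply. The function names and the parameter choices are hypothetical and are not drawn from any particular deployment or from the platform described here.&lt;/p>

```python
from fractions import Fraction


def slec_overhead(k, p):
    """Capacity overhead of a single-level (k+p) code: parity bytes per data byte."""
    return Fraction(p, k)


def mlec_overhead(k_net, p_net, k_loc, p_loc):
    """Capacity overhead of a (k_net+p_net) network code layered over a
    (k_loc+p_loc) local code, assuming every network-level chunk is
    locally encoded (so the expansion factors multiply).
    """
    expansion = Fraction(k_net + p_net, k_net) * Fraction(k_loc + p_loc, k_loc)
    return expansion - 1


# Hypothetical parameter choices, for illustration only.
single_level = slec_overhead(10, 2)        # Fraction(1, 5)
multi_level = mlec_overhead(10, 2, 17, 3)  # Fraction(7, 17)
```

&lt;p>With these illustrative parameters, the (10+2) single-level code stores 0.2 parity bytes per data byte, while (10+2) over (17+3) stores about 0.41. Capacity overhead is only one axis of the comparison; durability and repair traffic require a more detailed model.&lt;/p>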
&lt;h3 id="building-a-platform-to-evaluate-mlec">Building a platform to evaluate MLEC&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, reproducibility, erasure coding, evaluation&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Linux, C, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-bent/">John Bent&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjus-george/">Anjus George&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zhiyan-alex-wang/">Zhiyan &amp;quot;Alex&amp;quot; Wang&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Build a platform to evaluate the durability and overheads of MLEC. The platform will be able to evaluate different EC strategies in various dimensions including repair strategies, chunk placement choices, parity schemes, etc. Analyze the evaluation results.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Reproduce the SLEC evaluation results from prior SLEC evaluation tools.&lt;/li>
&lt;li>Based on prior SLEC evaluation tools, build a platform to evaluate the durability and overheads of MLEC under various EC strategies.&lt;/li>
&lt;li>Collect and analyze the MLEC evaluation results.&lt;/li>
&lt;/ul></description></item><item><title>CephFS</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/cephfs/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/cephfs/</guid><description>&lt;p>&lt;a href="https://docs.ceph.com/en/latest/cephfs/" target="_blank" rel="noopener">CephFS&lt;/a> is a distributed file system on top of &lt;a href="https://ceph.io" target="_blank" rel="noopener">Ceph&lt;/a>. It is implemented as a distributed metadata service (MDS) that uses dynamic subtree balancing to trade parallelism for locality under continually changing workloads. Clients that mount a CephFS file system connect to the MDS and acquire capabilities as they traverse the file namespace. Capabilities not only convey metadata but can also implement strong consistency semantics by granting and revoking the ability of clients to cache data locally.&lt;/p>
&lt;h3 id="cephfs-namespace-traversal-offloading">CephFS namespace traversal offloading&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Ceph&lt;/code>, &lt;code>filesystems&lt;/code>, &lt;code>metadata&lt;/code>, &lt;code>programmable storage&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, Ceph / MDS&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:carlosm@ucsc.edu">Carlos Maltzahn&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The frequency of metadata service (MDS) requests relative to the amount of data accessed can severely affect the performance of distributed file systems like CephFS. This is especially true for workloads that randomly access a large number of small files, as is common in machine learning, where access is purposefully randomized during training and evaluation to prevent overfitting. The datasets of these workloads are read-only and therefore do not require the strong coherence mechanisms that metadata services provide by default.&lt;/p>
&lt;p>The key idea of this project is to reduce the frequency of MDS requests by offloading namespace traversal, i.e. the need to open a directory, list its entries, open each subdirectory, and so on. Each of these operations usually requires a separate MDS request. Offloading namespace traversal refers to a client’s ability to request the metadata (and associated read-only capabilities) of an entire subtree with one request, thereby offloading the traversal work for tree discovery to the MDS.&lt;/p>
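&lt;p>The potential saving can be sketched with a deliberately simplified cost model: assume a client-side traversal costs one MDS request per directory it lists, while an offloaded traversal costs a single request for the whole subtree. The dataset layout, the cost model, and the function name below are illustrative assumptions; none of this is CephFS client code.&lt;/p>

```python
def count_directories(tree):
    """Count the directory nodes in a nested dict; files are represented by None."""
    return 1 + sum(
        count_directories(child) for child in tree.values() if child is not None
    )


# Hypothetical read-only training dataset: directories are dicts, files are None.
dataset = {
    "train": {
        "cat": {"1.jpg": None, "2.jpg": None},
        "dog": {"3.jpg": None},
    },
    "val": {
        "cat": {"4.jpg": None},
    },
}

# Simplified model: one MDS request per directory listed by the client.
per_directory_requests = count_directories(dataset)

# Simplified model: one request returns the metadata of the whole subtree.
offloaded_requests = 1
```

&lt;p>For this small tree, the client-side traversal issues six requests under the model, against one for the offloaded version. The gap grows with the number of directories, which is why datasets made of many small files in many directories stand to benefit most.&lt;/p>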
&lt;p>Once the basic functionality is implemented, this project can be expanded to address optimization opportunities, e.g. describing regular tree structures as a closed-form expression at the tree’s root, which shortcuts tree discovery.&lt;/p></description></item><item><title>HDF5</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/lbl/hdf5/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/lbl/hdf5/</guid><description>&lt;p>&lt;a href="https://portal.hdfgroup.org/display/knowledge/What&amp;#43;is&amp;#43;HDF5" target="_blank" rel="noopener">HDF5&lt;/a> is a unique technology suite that makes it possible to manage extremely large and complex data collections.&lt;/p>
&lt;p>The HDF5 technology suite includes:&lt;/p>
&lt;ul>
&lt;li>A versatile data model that can represent very complex data objects and a wide variety of metadata.&lt;/li>
&lt;li>A completely portable file format with no limit on the number or size of data objects in the collection.&lt;/li>
&lt;li>A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.&lt;/li>
&lt;li>A rich set of integrated performance features that allow for access time and storage space optimizations.&lt;/li>
&lt;li>Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.&lt;/li>
&lt;/ul>
&lt;h3 id="python-interface-to-hdf5-asynchronous-io">Python Interface to HDF5 Asynchronous I/O&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Python&lt;/code>, &lt;code>Async I/O&lt;/code>, &lt;code>HDF5&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, C, HDF5&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>, &lt;a href="mailto:htang4@lbl.gov">Houjun Tang&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>HDF5 is a well-known library for storing and accessing (known as &amp;ldquo;Input and Output&amp;rdquo; or I/O) data on high-performance computing systems. Recently, new technologies, such as asynchronous I/O and caching, have been developed to utilize fast memory and storage devices and to hide the I/O latency. Applications can take advantage of an asynchronous interface by scheduling I/O as early as possible and overlapping computation with I/O operations to improve overall performance. The existing HDF5 asynchronous I/O feature supports the C/C++ interface. This project involves the development and performance evaluation of a Python interface that would allow more Python-based scientific codes to use and benefit from the asynchronous I/O.&lt;/p></description></item></channel></rss>