<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Meng Wang | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/index.xml" rel="self" type="application/rss+xml"/><description>Meng Wang</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/avatar_hu5d444d336fed69b7fa787e98bcaaf848_8178_270x270_fill_q75_lanczos_center.jpg</url><title>Meng Wang</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/</link></image><item><title>OpenMLEC: Open-source MLEC implementation with HDFS on top of ZFS</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/openmlec/202406012-jiajunmao/</link><pubDate>Wed, 12 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/openmlec/202406012-jiajunmao/</guid><description>&lt;p>Hello, I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jiajun-mao/">Jiajun Mao&lt;/a>, a BS/MS student at the University of Chicago studying Computer Science. I will be spending this summer working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ornl/openmlec/">OpenMLEC: Open-source MLEC implementation with HDFS on top of ZFS&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/">Meng Wang&lt;/a>
and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjus-george/">Anjus George&lt;/a>, my &lt;a href="https://docs.google.com/document/d/1nYgNlGdl0jUgW8avpu671oRpMoxaZHZPwlDfBNXRVro/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;p>How to increase data’s durability and reliability while decreasing storage cost have always been interesting topics of research. Erasure coded storage systems in recent years have been seen as strong candidates to replace replications for colder storage tiers. In the paper “Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers”, the authors explored using theory and simulation on how a multiple tiered erasure coded system can out-perform systems using single level erasure codes in areas such as encoding throughput and network bandwidth consumed for repair, addressing a few pain points in adopting erasure coded storage systems. I will be implementing the theoretical and simulation result of this paper by building on top of HDFS and ZFS, and benchmarking the system performance.&lt;/p>
&lt;p>The project will aim to achieve&lt;/p>
&lt;ul>
&lt;li>HDFS understanding the underlying characteristics of ZFS as the filesystem&lt;/li>
&lt;li>HDFS understanding the failure report from ZFS, and use new and special MLEC repair logic to execute parity repair&lt;/li>
&lt;li>ZFS will be able to accept repair data from HDFS to repair a suspended pool caused by catastrophic data corruption&lt;/li>
&lt;/ul></description></item><item><title>GPEC: An Open Emulation Platform to Evaluate GPU/ML Workloads on Erasure Coding Storage</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lanl/gpec/</link><pubDate>Thu, 08 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lanl/gpec/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage Systems, Machine Learning, Erasure Coding&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python, PyTorch, Bash scripting, Linux, Erasure Coding, Machine Learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/">Meng Wang&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-bent/">John Bent&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Large-scale data centers store immense amounts of user data across a multitude of disks, necessitating redundancy strategies like erasure coding (EC) to safeguard against disk failures. Numerous research efforts have sought to assess the performance and durability of various erasure coding approaches, including single-level erasure coding, locally recoverable coding, and multi-level erasure coding.&lt;/p>
&lt;p>Despite its widespread adoption, a significant research gap exists regarding the performance of large-scale erasure-coded storage systems when exposed to machine learning (ML) workloads. While conventional practice often leans towards replication for enhanced performance, this project seeks to explore whether cost-effective erasure encoding can deliver comparable performance. In this context, several fundamental questions remain unanswered, including:
Can a typical erasure-coded storage system deliver sufficient throughput for ML training tasks?
Can an erasure-coded storage system maintain low-latency performance for ML training and inference workloads?
How does disk failure and subsequent repair impact the throughput and latency of ML workloads?
What influence do various erasure coding design choices, such as chunk placement strategies and repair methods, have on the aforementioned performance metrics?&lt;/p>
&lt;p>To address these questions, the most straightforward approach would involve running ML workloads on large-scale erasure coded storage systems within HPC data centers. However, this presents challenges for researchers and students due to limited access to expensive GPUs and distributed storage systems, especially when dealing with large-scale evaluations. Consequently, there is a need for a cost-effective evaluation platform.&lt;/p>
&lt;p>The objective of this project is to develop an open-source platform that facilitates cheap and reproducible evaluations of erasure-coded storage systems concerning ML workloads. This platform consists of two key components:
GPU Emulator: This emulator is designed to simulate GPU performance for ML workloads. Development of the GPU emulator is near completion.
EC Emulator: This emulator is designed to simulate the performance characteristics of erasure-coded storage systems. It is still in the exploratory phase and requires further development.&lt;/p>
&lt;p>The student&amp;rsquo;s responsibilities will include documenting the GPU emulator, progressing the development of the EC emulator, and packaging the experiments to ensure easy reproducibility. It is anticipated that this platform will empower researchers and students to conduct cost-effective and reproducible evaluations of large-scale erasure-coded storage systems in the context of ML workloads.&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Build an EC emulator to emulate the performance characteristics of large-scale erasure-coded storage systems&lt;/li>
&lt;li>Incorporate the EC emulator into ML workloads and GPU emulator&lt;/li>
&lt;li>Conduct reproducible experiments to evaluate the performance of erasure-coded storage systems in the context of ML workloads&lt;/li>
&lt;li>Publish a Trovi artifact shared on Chameleon Cloud and a GitHub repository with open-source code&lt;/li>
&lt;/ul></description></item><item><title>OpenMLEC: Open-source MLEC implementation with HDFS on top of ZFS</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ornl/openmlec/</link><pubDate>Mon, 05 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ornl/openmlec/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage Systems, Erasure Coding&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Java, Bash scripting, Linux, HDFS, ZFS, Erasure Coding&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/">Meng Wang&lt;/a> (&lt;a href="mailto:wangm12@uchicago.edu">Main contact person&lt;/a>) and Anjus George&lt;/li>
&lt;/ul>
&lt;p>Multi-Level Erasure Coding (MLEC), which performs erasure coding at both network and local levels, has seen large deployments in practice. Our recent research work has shown that MLEC can provide high durability with higher encoding throughput and less repair network traffic compared to other erasure coding methods. This makes MLEC particularly appealing for large-scale data centers, especially high-performance computing (HPC) systems.&lt;/p>
&lt;p>However, current MLEC systems often rely on straightforward design choices, such as Clustered/Clustered (C/C) chunk placement and the Repair-All (RALL) method for catastrophic local failures. Our recent simulations [1] have revealed the potential benefits of more complex chunk placement strategies like Clustered/Declustered (C/D), Declustered/Clustered (D/C), and Declustered/Declustered (D/D). Additionally, advanced repair methods such as Repair Failed Chunks Only (RFCO), Repair Hybrid (RHYB), and Repair Minimum (RMIN) have shown promise for improving durability and performance according to our simulations. Despite promising simulation results, these optimized design choices have not been implemented in real systems.&lt;/p>
&lt;p>In this project, we propose to develop open-source MLEC implementations in real systems, offering a range of design choices from simple to complex. Our approach leverages ZFS for local-level erasure coding and HDFS for network-level erasure coding, supporting both clustered and declustered chunk placement at each level. The student&amp;rsquo;s responsibilities include setting up HDFS on top of ZFS, configuring various MLEC chunk placements (e.g., C/D, D/C, D/D), and implementing advanced repair methods within HDFS and ZFS. The project will culminate in reproducible experiments to evaluate the performance of MLEC systems under different design choices.&lt;/p>
&lt;p>We will open-source our code and aim to provide valuable insights to the community on optimizing erasure-coded systems. Additionally, we will provide comprehensive documentation of our work and share Trovi artifacts on Chameleon Cloud to facilitate easy reproducibility of our experiments.&lt;/p>
&lt;p>[1] Meng Wang, Jiajun Mao, Rajdeep Rana, John Bent, Serkay Olmez, Anjus George, Garrett Wilson Ransom, Jun Li, and Haryadi S. Gunawi. Design Considerations and Analysis of Multi-Level Erasure Coding in Large- Scale Data Centers. In The International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’23), 2023.&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Open-source MLEC implementations with a diverse range of design choices.&lt;/li>
&lt;li>Configuration setup for HDFS on top of ZFS, supporting various MLEC chunk placements.&lt;/li>
&lt;li>Implementation of advanced repair methods within HDFS and ZFS.&lt;/li>
&lt;li>Reproducible experiments to assess the performance of MLEC systems across distinct design choices.&lt;/li>
&lt;li>Comprehensive documentation of the project and the provision of shared Trovi artifacts on Chameleon Cloud for ease of reproducibility.&lt;/li>
&lt;/ul></description></item></channel></rss>