Haryadi S. Gunawi | UCSC OSPO

FetchPipe: Data Science Pipeline for ML-based Prefetching

Fri, 02 Feb 2024 00:00:00 +0000

Project Idea Description

Topics: Storage systems, machine learning
Skills: C/C++, Python, PyTorch, Bash scripting, Linux, Machine Learning modeling
Difficulty: Hard
Size: Large (350 hours)
Mentors: Daniar H. Kurniawan (primary contact), Haryadi Gunawi

The contemporary landscape of high-performance servers, particularly those designed for data centers and AI/ML training, prominently features solid-state drives (SSDs) and spinning disks (HDDs) as primary storage devices. These components play a crucial role in shaping overall system performance, underscoring the importance of addressing and minimizing Input/Output (I/O) latency. This is particularly crucial given the widespread adoption of hybrid storage systems, where caching and prefetching strategies are instrumental in optimizing storage performance. Caching involves using faster but less dense memory to store frequently accessed data, while prefetching aims to reduce latency by fetching data from slower memory to cache before it is needed. Although both caching and prefetching present valid challenges, our primary emphasis is on the prefetching problem due to the inherent difficulty in predicting future access.

Traditional prefetchers, dating back one to two decades, rely heavily on predefined rules based on LBA (logical block address) access sequences, which limits their adaptability to complex scenarios. For instance, the read-ahead prefetcher only prefetches the next data item within a file to speed up sequential access. To address this limitation, recent work has introduced learning-based methods, such as the Long Short-Term Memory (LSTM) approaches DeepPrefetcher and Delta LSTM, which model the LBA delta to cover a broader range of LBAs. However, they still struggle to achieve high accuracy when the workload pattern changes drastically. Some more sophisticated prefetchers can learn complex I/O access patterns using graph structures, but their computational cost makes them difficult to deploy.
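To make the contrast between rule-based read-ahead and delta-based prediction concrete, here is a minimal sketch in Python. It is not taken from DeepPrefetcher or Delta LSTM: the function names, the toy "most frequent delta" predictor, and the sample trace are all assumptions made for illustration.

```python
from collections import Counter


def lba_deltas(lbas):
    """Return the difference between each LBA and the one before it."""
    return [cur - prev for prev, cur in zip(lbas, lbas[1:])]


def readahead_hit_rate(lbas, window=1):
    """Fraction of accesses that a next-`window`-blocks read-ahead would cover."""
    if len(lbas) < 2:
        return 0.0
    hits = sum(1 for d in lba_deltas(lbas) if 0 < d <= window)
    return hits / (len(lbas) - 1)


def frequent_delta_hit_rate(lbas):
    """Hit rate of a toy predictor that guesses the most frequent delta seen so far."""
    deltas = lba_deltas(lbas)
    seen = Counter()
    hits = 0
    for d in deltas:
        # Predict before updating the history, so each guess uses only past data.
        if seen and seen.most_common(1)[0][0] == d:
            hits += 1
        seen[d] += 1
    return hits / len(deltas) if deltas else 0.0


# Hypothetical strided trace: every access jumps 8 blocks ahead.
trace = [100, 108, 116, 124, 132]
print(lba_deltas(trace))               # [8, 8, 8, 8]
print(readahead_hit_rate(trace))       # 0.0
print(frequent_delta_hit_rate(trace))  # 0.75
```

On this strided trace the next-block rule never hits, while even a trivial delta model hits on every access after the first. A learned model such as an LSTM replaces the frequency count with a sequence predictor over the same delta representation.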

In this project, our goal is to provide an end-to-end data science pipeline to empower the research on ML-based prefetchers. We believe that this pipeline is crucial for fostering active collaboration between the ML community and storage systems researchers. This collaboration aims to optimize existing ML-based prefetching solutions. Specifically, we will provide the dataset for training/testing and some samples of ML-based models that can further be developed by the community. Furthermore, we will also provide a setup for evaluating the ML model when deployed in storage systems.

Project Deliverable

Compile I/O traces from various open traces and open systems.
Develop a pipeline for building ML-based prefetching solutions.
Build a setup to evaluate the model in a real hybrid storage system.
Publish a Trovi artifact shared on Chameleon Cloud and a GitHub repository.

ScaleBugs: Reproducible Scalability Bugs

Tue, 07 Feb 2023 00:00:00 +0000

Scalable systems are an essential foundation of the modern information industry. HPC data centers tend to have hundreds to thousands of nodes in their clusters. The use of “extreme-scale” distributed systems has given rise to a new type of bug: scalability bugs. As the name suggests, scalability bugs may or may not manifest depending on the scale of a run, so their symptoms may be observable only in large-scale deployments, not in small or medium ones. For example, Cassandra-6127 is a scalability bug found in the popular distributed database Cassandra. The bug causes unnecessary CPU usage, but the symptom is not observed unless ~1000 nodes are deployed. This illustrates the main challenge of studying scalability bugs: they are extremely hard to reproduce without deploying the system at a large scale.

In this project, our goal is to build a dataset of reproducible scalability bugs. To achieve this, we will go through the existing bug reports for popular distributed systems, including Cassandra, HDFS, Ignite, and Kafka. For each bug report, we will determine whether the reported bug depends on the scale of the run, such as the number of nodes used. For the scale-dependent bugs we collect, we will then craft workloads that reproduce them. Each workload will be designed to exercise specific functionality of the system under different configurations (e.g., different numbers of nodes), and we will observe the impact on performance. For example, a successful reproduction should show performance dropping as the number of nodes increases.
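One possible way to judge whether a reproduction shows a scale-dependent symptom is to fit a growth exponent to a cost metric measured at several node counts. The sketch below is only an illustration of that idea, not the project's methodology: the function names, the 1.5 threshold, and the sample measurements are assumptions.

```python
import math


def growth_exponent(measurements):
    """Least-squares slope of log(cost) against log(node count).

    `measurements` maps node count -> positive cost (e.g. CPU seconds).
    A slope near 1 means cost grows linearly with scale; larger values
    mean cost grows faster than the cluster does.
    """
    points = [(math.log(n), math.log(c)) for n, c in sorted(measurements.items())]
    if len(points) < 2:
        raise ValueError("need measurements at two or more scales")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den


def looks_scale_dependent(measurements, threshold=1.5):
    """Flag a run whose cost grows clearly faster than linearly (assumed cutoff)."""
    return growth_exponent(measurements) > threshold


# Made-up CPU-seconds per run; here cost grows quadratically with node count.
cpu_seconds = {8: 0.064, 16: 0.256, 32: 1.024, 64: 4.096}
print(round(growth_exponent(cpu_seconds), 3))
print(looks_scale_dependent(cpu_seconds))
```

Running the same check against the fixed version of the system gives a natural comparison: if the fix removes the bug, the exponent should drop back toward the expected growth rate.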

Building a Dataset of Reproducible Scalability Bugs

Topics: Scalable systems, bug patterns, reproducibility, bug dataset
Skills: Linux Shell, Docker, Java, Python
Difficulty: Medium
Size: Large (350 hours)
Mentors: Cindy Rubio González, Haryadi S. Gunawi, Hao-Nan Zhu
Contributor(s): Goodness Ayinmode, Zahra Nabila Maharani

The student will build a dataset of reproducible scalability bugs. Each bug artifact in the dataset will contain (1) the buggy and fixed versions of the scalable system, (2) a runtime environment that ensures reproducibility, and (3) a workload shell script that demonstrates the symptoms of the bug at different scales.
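The three parts of a bug artifact could be captured in a small record like the one below. This is a hypothetical layout for illustration only: the field names, the placeholder values, the default node counts, and the way the workload script takes its arguments are all assumptions, not the dataset's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BugArtifact:
    bug_id: str                # issue identifier from the bug tracker
    buggy_version: str         # revision that exhibits the symptom
    fixed_version: str         # revision that contains the fix
    runtime_image: str         # container image pinning the runtime environment
    workload_script: str       # shell script that drives the workload
    scales: tuple = (2, 4, 8)  # node counts to run (placeholder values)

    def runs(self):
        """Yield (version, nodes, argv) for every run in the reproduction matrix."""
        for version in (self.buggy_version, self.fixed_version):
            for nodes in self.scales:
                # Assumed calling convention: script <version> <node-count>.
                yield version, nodes, ["sh", self.workload_script, version, str(nodes)]


# All values except the bug id are placeholders, not real versions or images.
artifact = BugArtifact(
    bug_id="Cassandra-6127",
    buggy_version="buggy-rev",
    fixed_version="fixed-rev",
    runtime_image="example/image:tag",
    workload_script="workload.sh",
)
for version, nodes, argv in artifact.runs():
    print(version, nodes, " ".join(argv))
```

Running both versions across the same set of scales is what lets the symptom be attributed to the bug rather than to the workload itself.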

Specific Tasks

Work with the mentors to understand the context of the project.
Learn the background of scalable systems.
Inspect the bug reports from Apache JIRA and identify scale-dependent bugs.
Craft shell scripts to trigger the exact scalability bug described by the bug report.
Organize the reproducible scalability bugs and write documentation to build the code and trigger the bug.

FlashNet: Towards Reproducible Data Science for Storage System

Thu, 02 Feb 2023 00:00:00 +0000

The Data Storage Research Vision 2025, produced at an NSF workshop, calls for more “AI for storage” research. However, performing ML-for-storage research can be a daunting task for new storage researchers. The researcher must understand both the storage side and the ML side, as if studying two different fields at the same time. This project aims to answer these questions:

How can we encourage data scientists to look into storage problems?
How can we create a transparent platform that decouples the storage expertise from the ML expertise?
Within the storage/ML community, can we create two collaborating groups: the storage engineers and the storage data scientists?

In the ML/deep learning community, the large ImageNet benchmarks have spurred research in image recognition. Similarly, we would like to provide benchmarks that foster storage research in ML-based per-I/O latency prediction. To that end, we present FlashNet, a reproducible data science platform for storage systems. As a first step toward this larger goal, we use I/O latency prediction as a case study, so FlashNet has been built for I/O latency prediction tasks. With FlashNet, data engineers can collect I/O traces from various devices, and data scientists can then train ML models to predict I/O latency based on those traces. All traces, results, and code will be shared on the FlashNet training ground platform, which uses Chameleon Trovi for better reproducibility.
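As a rough illustration of how per-I/O latency prediction can be framed as a binary classification task, the sketch below labels each I/O as fast or slow using a percentile cutoff and pairs each label with the latencies that preceded it. The labeling rule, the 90th-percentile cutoff, the history length, and the sample latencies are assumptions for illustration and are not FlashNet's actual labeling or feature design.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]


def label_slow(latencies_us, pct=90):
    """Label an I/O 1 (slow) if its latency exceeds the pct-th percentile, else 0."""
    cutoff = percentile(latencies_us, pct)
    return [1 if lat > cutoff else 0 for lat in latencies_us]


def make_examples(latencies_us, labels, history=3):
    """Pair each I/O's label with the latencies of the `history` I/Os before it."""
    return [
        (latencies_us[i - history:i], labels[i])
        for i in range(history, len(latencies_us))
    ]


# Made-up latencies in microseconds with a single slow outlier.
latencies = [100, 110, 95, 105, 2000, 98, 102, 97, 101, 99]
labels = label_slow(latencies)
examples = make_examples(latencies, labels)
print(labels)       # [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
print(examples[1])  # ([110, 95, 105], 1)
```

The resulting (features, label) pairs can be fed to any classifier. A multiclass variant would replace the single cutoff with several latency bands.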

In this project, we plan to improve the modularity of the FlashNet pipeline and develop the Chameleon Trovi packages. We will also continue to improve the performance of our binary and multiclass classifiers and test them on the new production traces that we collected from the SNIA IOTTA public trace repository. Finally, we will optimize the deployment of our continual-learning mechanism and test it in a cloud environment. To the best of our knowledge, we are building the world's first end-to-end data science platform for storage systems.

Building FlashNet Platform

Topics: Storage systems, reproducibility, machine learning, continual learning
Skills: C++, Python, PyTorch, experience with machine learning pipelines
Difficulty: Medium
Size: Large (350 hours)
Mentors: Haryadi S. Gunawi
Contributor(s): Justin Shin, Maharani Ayu Putri Irawan

Build an open-source platform that enables collaboration between the storage and ML communities by providing a common platform for advancing data science research for storage systems. The platform will be able to reproduce and evaluate different ML models and architectures, dataset patterns, data preprocessing techniques, and feature engineering strategies.

Specific tasks:

Work with mentors on understanding the context of the project.
Reproduce the FlashNet evaluation results from prior works.
Build and improve FlashNet components based on the existing blueprint.
Collect and analyze the FlashNet evaluation results.