Ruidan Li | UCSC OSPO

Final Blog: FSA - Benchmarking Fail-Slow Algorithms

Wed, 14 Aug 2024 00:00:00 +0000

Introduction

Hello! I hope you’re enjoying the summer as much as I am. I’m excited to join the SOR community as a 2024 contributor. My name is Xikang Song, and I’m thrilled to collaborate with mentors Ruidan Li and Kexin Pei on the FSA-Benchmark project. This project is dedicated to exploring and benchmarking various machine learning models to identify disks at high risk of fail-slow anomalies. Throughout this journey, we tested a broad range of algorithms, from traditional approaches to state-of-the-art techniques, using a robust evaluation system to compare their effectiveness.

In the first half of the project, I focused on implementing and testing different machine learning models for detecting disks at high risk of fail-slow anomalies. This involved setting up initial models such as the Cost-Sensitive Ranking Model and Multi-Prediction Models, and beginning to explore LSTM networks for analyzing input disk data.

In the second half, I built upon this foundation by refining the evaluation processes, exploring advanced models like PatchTST, and investigating the potential of large language models (LLMs) for detecting subtle fail-slow conditions in storage systems. This blog post will summarize the key achievements, findings, and comparisons with baseline models from this phase.

Key Achievements

Comprehensive Benchmarking and Evaluation:
- I extended the benchmarking framework to evaluate multiple algorithms across 25 different data clusters on PERSEUS. This process involved generating and analyzing heatmaps that visualized the precision and recall of each model under various settings, providing a clear understanding of each approach’s strengths and limitations.
Exploration of Advanced Machine Learning Models:
- LSTM Model: I implemented a Long Short-Term Memory (LSTM) model, an architecture designed for sequential data, to capture temporal dependencies in disk performance metrics. The model predicts potential fail-slow anomalies by analyzing historical data. Using Mean Squared Error (MSE) as a risk indicator, the LSTM model outperformed baseline approaches like the Cost-Sensitive Ranking Model and Multi-Prediction Models, especially in clusters where latency patterns between faulty and normal disks were distinct, such as Cluster_P. This resulted in higher precision and fewer false positives. However, in clusters with more complex and overlapping data distributions, like Cluster_L, the LSTM model’s performance diminished, much like that of the baseline models.
- PatchTST Model: I also introduced and evaluated the PatchTST model, which is built on a transformer-based architecture known for its ability to handle sequential data by capturing long-range dependencies and intricate temporal patterns. Unlike traditional models, PatchTST processes time series data in segments or “patches,” enhancing its ability to predict disk behavior over extended periods. Like the LSTM model, PatchTST uses outlier MSE values to assess disk risk. In clusters with a clear separation between faulty and normal disks, PatchTST outperformed baseline models by effectively identifying faulty patterns. However, similar to the LSTM model, PatchTST encountered difficulties in clusters with significant data overlap, such as Cluster_L.
Investigation into Large Language Models (LLMs):
- I explored the use of GPT-4o-mini for fail-slow detection. While large language models (LLMs) showed potential, particularly in reducing false positives and improving precision over baseline models, they did not consistently outperform specialized models like LSTM and PatchTST in this context. LLMs struggled with recall, especially as thresholds increased, revealing the challenges of adapting LLMs to time series data. This limitation arises because LLMs are primarily trained for natural language generation tasks rather than time series analysis, so their ability to fully capture anomalies is limited. To improve their effectiveness, we need to develop methods that help LLMs better understand time series data. For example, incorporating statistical information about each disk’s performance could enhance LLMs’ understanding, leading to better precision in fail-slow detection.
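To make the "MSE as a risk indicator" idea used by the LSTM and PatchTST models more concrete, here is a minimal sketch in plain Python. It is not the project's implementation: the forecasts are hard-coded toy values standing in for a trained model's output, and the outlier rule (median plus `k` times the median absolute deviation) along with the names `mse` and `flag_outlier_disks` are assumptions chosen for illustration.

```python
from statistics import median


def mse(actual, predicted):
    """Mean squared error between two equal-length latency sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def flag_outlier_disks(scores, k=3.0):
    """Return disk ids whose MSE exceeds median + k * MAD.

    This robust-threshold rule is a hypothetical choice for the sketch,
    not necessarily the rule used in the benchmark.
    """
    values = list(scores.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    threshold = med + k * mad
    return sorted(d for d, s in scores.items() if s > threshold)


# Toy data: observed latencies per disk (made-up values).
observed = {
    "disk_a": [1.0, 1.1, 0.9, 1.0],
    "disk_b": [1.0, 0.9, 1.1, 1.0],
    "disk_c": [1.0, 1.2, 1.0, 0.9],
    "disk_d": [1.0, 3.0, 5.0, 6.0],  # drifts far away from the forecast
}
# Stand-in for a forecaster's output; a real model would produce these.
predicted = {d: [1.0, 1.0, 1.0, 1.0] for d in observed}

scores = {d: mse(observed[d], predicted[d]) for d in observed}
flagged = flag_outlier_disks(scores)
print(flagged)
```

The intuition is that a forecaster fitted to mostly healthy behavior predicts normal disks well, so a disk with an unusually large prediction error is a candidate fail-slow disk. When faulty and normal disks overlap heavily, as reported for Cluster_L, the error distributions overlap too and a threshold like this separates them poorly.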

Conclusion and Future Work

The work in this project demonstrated that while advanced machine learning models like LSTM and PatchTST offer significant potential for detecting fail-slow conditions, challenges remain in ensuring consistent performance across diverse clusters. Compared to baseline models, these advanced approaches generally provided better precision and recall, especially in clusters with distinct data patterns between faulty and normal disk performance time series. However, the persistent difficulties in more complex clusters indicate the need for further refinement.

Future work will focus on refining these models, particularly on improving their performance in challenging clusters like Cluster_L. Additionally, I plan to further explore techniques such as prompt engineering to better tailor LLMs to time series analysis and fail-slow detection tasks.
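As one possible starting point for the prompt engineering direction, the sketch below condenses a disk's latency series into a few summary statistics and embeds them in a text prompt, following the idea of giving LLMs statistical information about each disk. Everything here is hypothetical: the function names, the chosen statistics, the prompt wording, and the millisecond unit are assumptions, and no LLM API is called.

```python
from statistics import mean, pstdev


def summarize(latencies):
    """Summary statistics for one disk's latency series (illustrative choice)."""
    ordered = sorted(latencies)
    # Simple nearest-rank style index for the 95th percentile.
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    return {
        "mean": mean(latencies),
        "std": pstdev(latencies),
        "p95": p95,
        "max": ordered[-1],
    }


def build_prompt(disk_id, latencies):
    """Build a text prompt for an LLM; unit (ms) is assumed for the example."""
    s = summarize(latencies)
    return (
        f"Disk {disk_id} latency summary (ms): "
        f"mean={s['mean']:.2f}, std={s['std']:.2f}, "
        f"p95={s['p95']:.2f}, max={s['max']:.2f}.\n"
        "Question: does this disk show signs of a fail-slow fault? "
        "Answer yes or no, then explain briefly."
    )


prompt = build_prompt("disk_d", [1.0, 1.1, 0.9, 1.0, 6.0])
print(prompt)
```

Passing compact statistics instead of raw samples keeps the prompt short and gives the model numbers it can compare directly. Whether this actually improves precision or recall would need to be measured with the same benchmark used for the other models.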

Deliverables

Repository: All analysis code and source code can be found in the FSA_BENCHMARK GitHub Repository.
Jupyter Notebook: A notebook to reproduce the experiments and benchmarks on Chameleon: Chameleon Experiment Notebook.
Final Report: Comprehensive algorithm performance evaluation for all methods in FSA-Benchmarking Final Report.

FSA: Benchmarking Fail-Slow Algorithms

Wed, 12 Jun 2024 00:00:00 +0000

Hi everyone! I’m Xikang, a master’s CS student at UChicago. As part of the FSA Benchmarking project, I’m thrilled to be a contributor to OSRE 2024, collaborating with Kexin Pei, Assistant Professor of Computer Science at UChicago, and Ruidan, a talented PhD student at UChicago.

This summer, I will focus on integrating advanced ML techniques into our RAID slowdown analysis. Our aim is to assess whether LLMs can effectively identify RAID slowdown issues and to benchmark their performance against our current machine learning algorithms. We will run and benchmark the algorithms on Chameleon Cloud.

Additionally, we will explore optimization techniques to enhance our pipeline and improve response quality. We hope this research will be a starting point for future work, utilizing LLMs to overcome the limitations of existing algorithms and provide a comprehensive analysis that enhances RAID and other storage system performance.

I’m excited to work with all of you and look forward to your suggestions. If you are interested, here is my proposal.

FSA: Benchmarking Fail-Slow Algorithms

Tue, 06 Feb 2024 00:00:00 +0000

Project Idea Description

Topics: Storage systems, machine learning
Skills: Python, PyTorch, Bash scripting, Linux, Machine Learning modeling
Difficulty: Hard
Size: Large (350 hours)
Mentors: Ruidan Li (primary contact), Kexin Pei

In the realm of modern applications, achieving not only low but also predictable response times is a critical requirement. Performance instability, even when it amounts to just a few milliseconds of delay, can result in violations of Service Level Objectives (SLOs). Redundancy at the RAID group level provides a layer of protection; however, the early identification of potential slowdowns or failures is paramount in minimizing their impact on overall system latency.

Fail-Slow represents a unique type of fault within storage systems, characterized by the system’s ability to continue functioning while progressively deteriorating – its performance significantly drops below expected levels. Notably, fail-slow conditions are responsible for a considerable share of latency tails. Detecting fail-slow faults is particularly challenging, as they can be easily masked by the normal fluctuations in performance. Consequently, the identification of fail-slow faults is a critical area of research, demanding meticulous attention.

Several strategies have been developed to address the fail-slow issue, yet the question of their broad applicability remains. We plan to implement and assess various existing fail-slow detection algorithms, examining their strengths and weaknesses. Our analysis will concentrate on key questions:

- How promptly can the algorithm identify a fail-slow symptom?
- What methods does the algorithm employ to accurately distinguish fail-slow incidents, thereby minimizing false negatives?
- Through what approach does the algorithm achieve the right sensitivity level to keep false positives in check?

This evaluation aims to shed light on the effectiveness of current methodologies in detecting fail-slow faults, crucial for enhancing system reliability and performance.
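The three questions above map naturally onto three measurable quantities: detection delay (promptness), recall (few false negatives), and precision (few false positives). The following is a minimal sketch of how they could be computed, assuming a detector outputs a set of flagged disks and a list of alert time steps. The function names and toy inputs are illustrative and not part of any existing algorithm's interface.

```python
def precision_recall(flagged, faulty):
    """Precision and recall of flagged disks against ground-truth faulty disks."""
    flagged, faulty = set(flagged), set(faulty)
    true_positives = len(flagged & faulty)
    # Precision drops as false positives grow; recall drops as false negatives grow.
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(faulty) if faulty else 0.0
    return precision, recall


def detection_delay(onset_step, alert_steps):
    """Steps from fault onset to the first alert at or after onset.

    Returns None if the detector never alerts after the onset.
    Alerts raised before the onset are ignored here (they would be false alarms).
    """
    later = [s for s in alert_steps if s >= onset_step]
    return min(later) - onset_step if later else None


# Toy example with made-up disk ids and time steps.
flagged = ["disk_1", "disk_2", "disk_5"]
faulty = ["disk_2", "disk_5", "disk_7", "disk_9"]
precision, recall = precision_recall(flagged, faulty)
delay = detection_delay(onset_step=100, alert_steps=[40, 130, 160])
print(precision, recall, delay)
```

In this toy case two of the three flagged disks are truly faulty and two faulty disks are missed, so tightening the detector's sensitivity would trade one kind of error for the other.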

Building upon our evaluation of several fail-slow detection algorithms, our objective is to harness advanced machine learning (ML) models to develop a novel algorithm. This initiative seeks to address and potentially compensate for the identified weaknesses in existing methodologies. By focusing on the critical aspects of early detection, accurate differentiation, and optimal sensitivity, we aim to create a solution that reduces both false negatives and false positives, thereby enhancing overall system reliability. This approach represents a strategic effort to not only advance the current state of fail-slow detection but also to contribute significantly to the resilience and performance of storage systems.

Project Deliverables

A Trovi artifact for the existing Fail-Slow detection algorithms on Chameleon Cloud
A GitHub repository containing the full evaluation result
A Google Colab notebook for quick replay