Xikang Song | UCSC OSPO

Final Blog: FSA - Benchmarking Fail-Slow Algorithms

Wed, 14 Aug 2024 00:00:00 +0000

Introduction

Hello! I hope you’re enjoying the summer as much as I am. I’m excited to join the SOR community as a 2024 contributor. My name is Xikang Song, and I’m thrilled to collaborate with mentors Ruidan Li and Kexin Pei on the FSA-Benchmark project. This project is dedicated to exploring and benchmarking various machine learning models to identify disks at high risk of fail-slow anomalies. Throughout this journey, we tested a broad range of algorithms, from traditional approaches to state-of-the-art techniques, using a robust evaluation system to compare their effectiveness.

In the first half of the project, I focused on implementing and testing different machine learning models for detecting disks at high risk of fail-slow anomalies. This involved setting up initial models such as the Cost-Sensitive Ranking Model and Multi-Prediction Models, and beginning to explore LSTM networks for analyzing input disk data.

In the second half, I built upon this foundation by refining the evaluation processes, exploring advanced models like PatchTST, and investigating the potential of large language models (LLMs) for detecting subtle fail-slow conditions in storage systems. This blog post will summarize the key achievements, findings, and comparisons with baseline models from this phase.

Key Achievements

Comprehensive Benchmarking and Evaluation:
- I extended the benchmarking framework to evaluate multiple algorithms across 25 different data clusters on PERSEUS. This process involved generating and analyzing heatmaps that visualized the precision and recall of each model under various settings, providing a clear understanding of each approach’s strengths and limitations.
Exploration of Advanced Machine Learning Models:
- LSTM Model: I implemented the Long Short-Term Memory (LSTM) model, specifically designed for sequential data, to capture temporal dependencies in disk performance metrics. This model was used to predict potential fail-slow anomalies by analyzing historical data. Using Mean Squared Error (MSE) as a risk indicator, the LSTM model outperformed baseline approaches like the Cost-Sensitive Ranking Model and Multi-Prediction Models, especially in clusters where latency patterns between faulty and normal disks were distinct, such as in Cluster_P. This resulted in a higher precision and fewer false positives. However, in clusters with more complex and overlapping data distributions, like Cluster_L, the LSTM model’s performance diminished, similar to that of the baseline models
- PatchTST Model: I also introduced and evaluated the PatchTST model, which is built on a transformer-based architecture known for its ability to handle sequential data by capturing long-range dependencies and intricate temporal patterns. Unlike traditional models, PatchTST processes time series data in segments or “patches,” enhancing its ability to predict disk behavior over extended periods. Like the LSTM model, PatchTST uses outlier MSE values to assess disk risk. In clusters with a clear separation between faulty and normal disks, PatchTST outperformed baseline models by effectively identifying faulty patterns. However, similar to the LSTM model, PatchTST encountered difficulties in clusters with significant data overlap, such as Cluster_L.
Investigation into Large Language Models (LLMs):
- I explored the use of GPT-4-o-mini for fail-slow detection. While large language models (LLMs) showed potential, particularly in reducing false positives and improving precision over baseline models, they did not consistently outperform specialized models like LSTM and PatchTST in this context. LLMs struggled with recall, especially as thresholds increased, revealing the challenges of adapting LLMs to time series data. This limitation arises because LLMs are primarily trained for natural language generation tasks, not for analyzing time series data. As a result, their ability to fully capture anomalies is limited. To improve their effectiveness, we need to develop methods that help LLMs better understand time series data. For example, incorporating statistical information about each disk’s performance could enhance LLMs’ understanding, leading to better precision in fail-slow detection.

Conclusion and Future Work

The work in this project demonstrated that while advanced machine learning models like LSTM and PatchTST offer significant potential for detecting fail-slow conditions, challenges remain in ensuring consistent performance across diverse clusters. Compared to baseline models, these advanced approaches generally provided better precision and recall, especially in clusters with distinct data patterns between faulty and normal disk performance time series. However, the persistent difficulties in more complex clusters indicate the need for further refinement.

Moving forward, future work will focus on refining these models, particularly in improving their performance in challenging clusters like Cluster_L. Additionally, I plan to further explore techniques such as prompt engineering for LLMs to better tailor them for time series analysis and fail-slow detection tasks.

Deliverables

Repository: All comprehensive analysis code and source code can be found in the FSA_BENCHMARK GitHub Repository.
Jupyter Notebook: A notebook to reproduce the experiments and benchmarks on Chameleon: Chameleon Experiment Notebook.
Final Report: Comprehensive algorithm performance evaluation for all methods in FSA-Benchmarking Final Report.

Halfway Blog: FSA: Benchmarking Fail-Slow Algorithms

Tue, 23 Jul 2024 00:00:00 +0000

Introduction

Hi, I’m Xikang Song, a 2024 SoR contributor to the project, working with mentors Ruidan Li and Kexin Pei. Our FSA-Benchmark project is dedicated to exploring and benchmarking various machine learning models to identify disks at high risk of fail-slow anomalies. We will benchmark a range of machine learning algorithms, from traditional to advanced methods, and compare the results using a comprehensive evaluation system. This will provide a clear view of how machine learning impacts critical error detection in RAID systems.

Motivation

Fail-slow issues in storage systems , where a disk operates at a significantly reduced speed without completely failing, are subtle and can manifest as consistently higher latency compared to peer disks or recurrent abnormal latency spikes. These issues are challenging to detect but can significantly degrade overall system performance over time. Fixed thresholds are ineffective because latency distributions vary across different clusters, leading to thresholds that are either too low or too high, resulting in numerous false alerts. Therefore, we are enthusiastic about using machine learning models to analyze disk performance data. Machine learning algorithms can deeply learn the trends in the data, providing better detection capabilities.

Current Progress and Challenges

Algorithm Implementation:

Cost-Sensitive Ranking Model: Inspired by the paper “Improving Service Availability of Cloud Systems by Predicting Disk Error” presented at the USENIX ATC ‘18 conference, this model ranks disks based on fail-slow risk.
Multi-Prediction Models: Drawing from “Improving Storage System Reliability with Proactive Error Prediction” presented at the USENIX ATC ‘17 conference, this approach uses multiple traditional machine learning models to evaluate disk health using diverse features. Various models were tested, with the Random Forest classifier proving most effective.
LSTM Model: This model employs Long Short-Term Memory (LSTM) networks, trained on the first day’s data for each cluster and evaluated on data spanning all days. It captures temporal dependencies to accurately predict fail-slow anomalies over time.

Comprehensive Evaluation:

Collected outputs from all algorithms on Chameleon for Perseus data A to Y (25 clusters).
Parsed the outputs through a comprehensive evaluation system, recording the true/false positives/negatives.
Plotted heat maps to show precision and recall with different look-back days and alert threshold settings.
Compared the performance across different clusters to draw conclusions.

Packaging Code:

Packaged all the code into a Trovi Jupyter notebook, including the Chameleon server setup, to provide clear steps for running the code and reproducing the experiments. All algorithm testing and result parsing can be easily done here.

Challenges

Initially, I was unsure how to evaluate the performance of different algorithms. Ruidan Li provided comprehensive guidance on collecting all the results uniformly and parsing them to gather true/false positives/negatives. This approach enabled us to derive meaningful metrics and plot heatmaps for precision and recall. I learned the scientific method of benchmarking performance, and I am grateful for the guidance.

Future Steps

Further Investigation of Advanced Algorithms

We plan to explore advanced algorithms such as PatchTST. This will involve systematically collecting outputs and conducting comprehensive benchmarking to assess their performance in identifying fail-slow anomalies.

Transition to Large Language Models (LLMs)

Recognizing the limitations of traditional machine learning methods, we intend to transition to utilizing Large Language Models (LLMs). LLMs have demonstrated superior capabilities in understanding complex patterns and making accurate predictions. We anticipate that incorporating LLMs into our analysis will enhance our ability to detect and predict fail-slow anomalies more accurately, leading to better overall system reliability.

FSA: Benchmarking Fail-Slow Algorithms

Wed, 12 Jun 2024 00:00:00 +0000

Hi everyone! I’m Xikang, a master’s CS student at UChicago. As a part of FSA benchmarking Project, I’m thrilled to be a contributor to OSRE 2024, collaborating with Kexin Pei, the assistant Professor of Computer Science at Uchicago and Ruidan, a talented PhD student at UChicago.

This summer, I will focus on integrating some advanced ML into our RAID slowdown analysis. Our aim is to assess whether LLMs can effectively identify RAID slowdown issues and to benchmark their performance against our current machine learning algorithms. We will test the algorithms on Chameleon Cloud and benchmark them.

Additionally, we will explore optimization techniques to enhance our pipeline and improve response quality. We hope this research will be a start point for future work, ultilizing LLMs to overcome the limitations of existing algorithms and provide a comprehensive analysis that enhances RAID and other storage system performance.

I’m excited to work with all of you and look forward to your suggestions. if you are interested, Here is my proposal