<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>William Nixon | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/william-nixon/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/william-nixon/index.xml" rel="self" type="application/rss+xml"/><description>William Nixon</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/william-nixon/avatar_hu4026e8c5e01655b242b46336f5a042f4_409166_270x270_fill_q75_lanczos_center.jpg</url><title>William Nixon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/william-nixon/</link></image><item><title>Final Blogpost: Drift Management Strategies Benchmark</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240827-williamn/</link><pubDate>Sat, 24 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240827-williamn/</guid><description>&lt;h1 id="background">Background&lt;/h1>
&lt;p>Hello there! I&amp;rsquo;m William, and this is my final blog post for my proposal &amp;ldquo;Developing A Comprehensive Pipeline to Benchmark Drift Management Approaches,&amp;rdquo; written under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a> as part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last">LAST&lt;/a> project.&lt;/p>
&lt;p>If you&amp;rsquo;re not familiar with it, this project aims to address the issue of model aging, where machine learning (ML) models experience a decline in effectiveness over time due to environmental changes, known as drift. My goal is to design an extensible pipeline that evaluates and benchmarks the robustness of state-of-the-art algorithms in addressing these drifts.&lt;/p>
&lt;h1 id="deliverables">Deliverables&lt;/h1>
&lt;p>You can find my list of deliverables here:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://docs.google.com/document/d/14tSmBndX1RBv_d3luRcqFDmbuMk6XsGOB8G7tzTcHnE/edit" target="_blank" rel="noopener">Final report&lt;/a>, this blog is a summarized version of my final report, so do take a look if you&amp;rsquo;d like to know more!&lt;/li>
&lt;li>&lt;a href="https://github.com/williamnixon20/osre-drift" target="_blank" rel="noopener">Github repository&lt;/a>, contains code as well as the raw experiment results.&lt;/li>
&lt;li>&lt;a href="https://www.chameleoncloud.org/experiment/share/e3ae5f07-4340-48c0-94e8-ba99ee2bf691" target="_blank" rel="noopener">Trovi artifact&lt;/a>&lt;/li>
&lt;/ul>
&lt;h1 id="evaluation">Evaluation&lt;/h1>
&lt;p>Here are some graphs showing each algorithm&amp;rsquo;s performance on the generated datasets; for more graphs and figures, check out my final report:&lt;/p>
&lt;ul>
&lt;li>CIRCLE: AUE demonstrates stability, maintaining high accuracy even as the data drifts, likely due to its ensemble nature; it is even more stable than the baseline retraining algorithms. Matchmaker also recovers quickly upon drift, which may again be because it ranks and selects the best-performing models at inference time, letting it recover faster than RetrainWin. DriftSurf, on the other hand, experiences several sudden drops in accuracy, indicating that it can be somewhat unstable.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Circle" srcset="
/report/osre24/anl/last/20240827-williamn/circle_hubbc4ef0a01f86beba0bcc28be93ed90c_288185_1f197cacfff5e655fb7250e99000891a.webp 400w,
/report/osre24/anl/last/20240827-williamn/circle_hubbc4ef0a01f86beba0bcc28be93ed90c_288185_f3747fe006d5ab30df409c38f9f518fd.webp 760w,
/report/osre24/anl/last/20240827-williamn/circle_hubbc4ef0a01f86beba0bcc28be93ed90c_288185_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240827-williamn/circle_hubbc4ef0a01f86beba0bcc28be93ed90c_288185_1f197cacfff5e655fb7250e99000891a.webp"
width="760"
height="481"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;li>SINE: Similar to CIRCLE, AUE demonstrates stability throughout the dataset, maintaining high accuracy even as the data drifts. Matchmaker, however, struggled to adapt to such a sudden drift, needing several windows to recover from the drop. DriftSurf performed notably better than the baselines, recovering fairly quickly after each drift.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Sine" srcset="
/report/osre24/anl/last/20240827-williamn/sine_huc62276117407137b51fc43d9c9e20c37_686961_16a2edab20232bdbb4c23e0fa37398dd.webp 400w,
/report/osre24/anl/last/20240827-williamn/sine_huc62276117407137b51fc43d9c9e20c37_686961_cc1b5a8e845047c50f95da38dbf5a262.webp 760w,
/report/osre24/anl/last/20240827-williamn/sine_huc62276117407137b51fc43d9c9e20c37_686961_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240827-williamn/sine_huc62276117407137b51fc43d9c9e20c37_686961_16a2edab20232bdbb4c23e0fa37398dd.webp"
width="760"
height="500"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;li>CovCon: In CovCon, Matchmaker achieved the best accuracy, as it can select the model most relevant to each incoming batch (the model trained on the most similar features), performing comparably to RetrainWindow. Most of the other algorithms suffered on this dataset, particularly AUE, whose performance dropped to be merely comparable to the rest of the algorithms and baselines.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="CovCon" srcset="
/report/osre24/anl/last/20240827-williamn/covcon_hua1f3c6557283d861ec9482d15a04c8d4_668622_658cedab5102506c34376dd9e6fe748e.webp 400w,
/report/osre24/anl/last/20240827-williamn/covcon_hua1f3c6557283d861ec9482d15a04c8d4_668622_9bd5024236a880f53d616f4ed4a7294f.webp 760w,
/report/osre24/anl/last/20240827-williamn/covcon_hua1f3c6557283d861ec9482d15a04c8d4_668622_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240827-williamn/covcon_hua1f3c6557283d861ec9482d15a04c8d4_668622_658cedab5102506c34376dd9e6fe748e.webp"
width="760"
height="502"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;li>IOAdmission: Performance on this dataset was led by AUE, which maintained impressive stability among all the algorithms tested, followed closely by Matchmaker. The other algorithms fluctuated considerably in accuracy.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="IOAdmission" srcset="
/report/osre24/anl/last/20240827-williamn/ioadm_hubb89d2e4deb0623f90d37221d48664dd_505735_6022f696d548c8dea5e78a1144b71820.webp 400w,
/report/osre24/anl/last/20240827-williamn/ioadm_hubb89d2e4deb0623f90d37221d48664dd_505735_b58b4d4753cc231ef4ada6a72fbb773a.webp 760w,
/report/osre24/anl/last/20240827-williamn/ioadm_hubb89d2e4deb0623f90d37221d48664dd_505735_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240827-williamn/ioadm_hubb89d2e4deb0623f90d37221d48664dd_505735_6022f696d548c8dea5e78a1144b71820.webp"
width="760"
height="500"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;h1 id="findings--discussion">Findings / Discussion&lt;/h1>
&lt;p>From the experiments conducted, the findings are as follows:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Matchmaker performed particularly well on the CovCon dataset. This may be due to its ability to choose the most relevant trained model from its ensemble at inference time. Its training time is also the best among the algorithms compared, even though it retains data to train an additional random forest for ranking its models. However, its inference time was the longest of all the algorithms, likely because every inference must traverse all of the leaf nodes of the random forest used for ranking (to compute covariate shift).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>AUE performed particularly well on the CIRCLE and IOAdmission datasets, and it remains quite competitive on the other datasets too. Its weighting function, which rewards highly relevant models and evicts less relevant ones, may be the key. Its inference time is decent: slower than most baselines and DriftSurf, but faster than Matchmaker. However, its training time was the longest among the competitors, as its expensive weighting function must weight, evict, or retrain models on every retraining step.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>DriftSurf performed very similarly to the RetrainWindow baseline on almost all datasets, except IOAdmission and SINE, where it did better. This may be because it maintains at most two models per iteration, so it could not compete with the multi-model approaches used by Matchmaker and AUE. On the plus side, its inference time is comparable to the single-model baseline, adding almost no inference overhead relative to most competitors. Another plausible explanation for the weaker performance is the lack of tuning of parameters such as the number of retained windows, the length of its reactive period, and its reactivity sensitivity threshold; better performance might be achieved if these were tuned further.&lt;/p>
&lt;/li>
&lt;/ul>
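&lt;p>To give a feel for why AUE&amp;rsquo;s weighting and eviction help, here is a simplified, hypothetical sketch of a chunk-based weighted ensemble using scikit-learn. The class name, the accuracy-based weights, and the eviction rule are illustrative simplifications of the idea, not the real AUE (which, for instance, uses MSE-based weighting):&lt;/p>

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleWeightedEnsemble:
    """Toy AUE-flavoured ensemble for binary labels {0, 1}:
    one member per data chunk, members re-weighted by accuracy
    on the newest chunk, and the worst member evicted when the
    ensemble is full."""

    def __init__(self, max_members=5):
        self.max_members = max_members
        self.members, self.weights = [], []

    def partial_fit_chunk(self, X, y):
        # Re-weight existing members by their accuracy on the new chunk,
        # so stale models contribute less to the vote.
        self.weights = [float(np.mean(m.predict(X) == y)) for m in self.members]
        # Train a fresh member on the chunk and give it full weight.
        new = DecisionTreeClassifier(max_depth=3).fit(X, y)
        self.members.append(new)
        self.weights.append(1.0)
        # Evict the lowest-weighted member if over capacity.
        if len(self.members) > self.max_members:
            worst = int(np.argmin(self.weights))
            self.members.pop(worst)
            self.weights.pop(worst)

    def predict(self, X):
        # Weighted vote over members; threshold at half the total weight.
        votes = np.zeros(len(X))
        for w, m in zip(self.weights, self.members):
            votes += w * m.predict(X)
        return (votes >= 0.5 * sum(self.weights)).astype(int)
```

The eviction step is what keeps inference cost bounded even as drift keeps producing new chunks.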
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;p>These are some of the potential extensions for this project:&lt;/p>
&lt;ol>
&lt;li>Optimizing Matchmaker&amp;rsquo;s inference time: improving its efficiency, especially in covariate-shift ranking, could reduce inference latency. Simplifying the random forest traversal could make Matchmaker faster without hurting accuracy.&lt;/li>
&lt;li>Extending the work to other frameworks such as TensorFlow or PyTorch, as the pipeline currently supports only scikit-learn base models.&lt;/li>
&lt;/ol>
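&lt;p>For the second extension, one plausible route is a thin adapter that exposes the fit/predict contract the pipeline already expects from scikit-learn models, so that any backend able to train and predict on array data could plug in. The sketch below is a hypothetical illustration of that contract, not the pipeline&amp;rsquo;s actual API:&lt;/p>

```python
class EstimatorAdapter:
    """Minimal duck-typed wrapper: any backend (TensorFlow, PyTorch, ...)
    can participate in the benchmark by supplying a train function and a
    predict function that work on plain arrays. Names are illustrative."""

    def __init__(self, train_fn, predict_fn):
        self.train_fn = train_fn      # (X, y) -> opaque model state
        self.predict_fn = predict_fn  # (state, X) -> predictions
        self.state = None

    def fit(self, X, y):
        self.state = self.train_fn(X, y)
        return self  # scikit-learn convention: fit returns self

    def predict(self, X):
        return self.predict_fn(self.state, X)
```

With such a wrapper, the retraining and evaluation loops never need to know which framework produced the underlying model.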
&lt;p>Thank you for reading!&lt;/p></description></item><item><title>Midterm Blogpost: Drift Management Strategies Benchmark</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/</link><pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/</guid><description>&lt;p>Hello there! I&amp;rsquo;m William, and I&amp;rsquo;m thrilled to share the progress on my project, &amp;ldquo;Developing A Comprehensive Pipeline to Benchmark Drift Management Approaches,&amp;rdquo; under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a> as part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last">LAST&lt;/a> project.&lt;/p>
&lt;h1 id="project-overview">Project Overview&lt;/h1>
&lt;p>If you&amp;rsquo;re not familiar with it, this project aims to address the issue of model aging, where machine learning (ML) models experience a decline in effectiveness over time due to environmental changes, known as drift. My goal is to design an extensible pipeline that evaluates and benchmarks the robustness of state-of-the-art algorithms in addressing these drifts.&lt;/p>
&lt;h1 id="progress">Progress&lt;/h1>
&lt;p>So far, I&amp;rsquo;ve generated various synthetic datasets, which include:&lt;/p>
&lt;ul>
&lt;li>CIRCLE: This dataset contains two features x1, x2 drawn uniformly from the interval [0, 1]. Each data point is labeled according to the condition (x1 − c1)^2 + (x2 − c2)^2 &amp;lt;= r, where the center (c1, c2) and radius r of the circular decision boundary change gradually over time, introducing gradual concept drift.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Circle" srcset="
/report/osre24/anl/last/20240718-williamn/circlestream_hued24f09167f45c96aadcbb5d8e5dff3c_90965_f33edae3375218d9e5b25a95221be6a3.webp 400w,
/report/osre24/anl/last/20240718-williamn/circlestream_hued24f09167f45c96aadcbb5d8e5dff3c_90965_8c85ac190035be7be37c64958f68df3c.webp 760w,
/report/osre24/anl/last/20240718-williamn/circlestream_hued24f09167f45c96aadcbb5d8e5dff3c_90965_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/circlestream_hued24f09167f45c96aadcbb5d8e5dff3c_90965_f33edae3375218d9e5b25a95221be6a3.webp"
width="760"
height="315"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;li>COVCON: This 2-dimensional dataset exhibits both covariate shift and concept drift. The decision boundary at each point is given by α ∗ sin(πx1) &amp;gt; x2. We use 10,000 points (100 batches of 1,000 points each). Covariate shift is introduced by shifting the distributions of x1 and x2 from batch to batch, and concept drift by alternating the value of α.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="CovCon" srcset="
/report/osre24/anl/last/20240718-williamn/covcon_hu4a191415f686125b6694a5c631bd0a53_84580_71acf32897baf4ad41a1f28aa45ddbe3.webp 400w,
/report/osre24/anl/last/20240718-williamn/covcon_hu4a191415f686125b6694a5c631bd0a53_84580_2b3b3648f56332fa89453ae08fff34a0.webp 760w,
/report/osre24/anl/last/20240718-williamn/covcon_hu4a191415f686125b6694a5c631bd0a53_84580_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/covcon_hu4a191415f686125b6694a5c631bd0a53_84580_71acf32897baf4ad41a1f28aa45ddbe3.webp"
width="760"
height="314"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;li>SINE: This dataset contains two features x1, x2 drawn uniformly from the interval [0, 1]. In the first context, all points below the curve y = sin(x) are classified as positive; the class labels are then flipped, introducing sudden concept drift.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Sine" srcset="
/report/osre24/anl/last/20240718-williamn/sinestream_hu58fb0770a1ec970143181b61b6554e14_112143_9cf339d9d956964ac3129aed212bbbb4.webp 400w,
/report/osre24/anl/last/20240718-williamn/sinestream_hu58fb0770a1ec970143181b61b6554e14_112143_f77fa8ca9a78666905791587c0ea3cd5.webp 760w,
/report/osre24/anl/last/20240718-williamn/sinestream_hu58fb0770a1ec970143181b61b6554e14_112143_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/sinestream_hu58fb0770a1ec970143181b61b6554e14_112143_9cf339d9d956964ac3129aed212bbbb4.webp"
width="760"
height="317"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
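&lt;p>The CIRCLE and COVCON definitions above can be sketched as small batch generators. The drift schedules below (the center path, radius growth, batch means, and the α alternation period) are illustrative guesses for demonstration, not the exact parameters used in the pipeline:&lt;/p>

```python
import numpy as np

def circle_stream(n_batches=100, batch_size=1000, seed=0):
    """CIRCLE: gradual concept drift via a slowly moving circular
    boundary. Labels follow (x1 - c1)^2 + (x2 - c2)^2 <= r."""
    rng = np.random.default_rng(seed)
    centers = np.linspace([0.2, 0.5], [0.8, 0.5], n_batches)  # assumed path
    radii = np.linspace(0.05, 0.2, n_batches)                 # assumed growth
    for (c1, c2), r in zip(centers, radii):
        X = rng.uniform(0.0, 1.0, size=(batch_size, 2))
        y = ((X[:, 0] - c1) ** 2 + (X[:, 1] - c2) ** 2 <= r).astype(int)
        yield X, y

def covcon_stream(n_batches=100, batch_size=1000, seed=0):
    """COVCON: covariate shift (batch-dependent feature means) plus
    concept drift (alternating alpha in alpha * sin(pi * x1) > x2)."""
    rng = np.random.default_rng(seed)
    for t in range(n_batches):
        mean = 0.2 + 0.05 * (t % 5)                    # assumed shift schedule
        X = rng.normal(loc=mean, scale=0.1, size=(batch_size, 2))
        alpha = 1.0 if (t // 25) % 2 == 0 else -1.0    # assumed period
        y = (alpha * np.sin(np.pi * X[:, 0]) > X[:, 1]).astype(int)
        yield X, y
```

Generating the data as a stream of (X, y) batches makes it easy to feed any retraining strategy one window at a time.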
&lt;p>Additionally, I&amp;rsquo;ve also curated drifting data from the Tencent I/O block trace. These datasets will be used to benchmark model performance under different drift conditions.&lt;/p>
&lt;p>The pipeline can take a base scikit-learn model and evaluate its performance on these datasets prequentially. Here are some initial results for model performance on these drifting datasets, under never-retraining and retraining settings using 1 and 7 past windows. As you can see, model performance degrades upon encountering extreme drift.&lt;/p>
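&lt;p>The test-then-train (prequential) loop with a configurable retraining window can be sketched as follows. The function and parameter names are illustrative, not the pipeline&amp;rsquo;s actual API:&lt;/p>

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def prequential_eval(model_factory, batches, retrain_windows=1):
    """Test-then-train: each incoming batch is first used to score the
    current model, then folded into the training history. The model is
    refit on the last `retrain_windows` batches (e.g. 1 or 7)."""
    history, accuracies = [], []
    model = None
    for X, y in batches:
        if model is not None:
            # Evaluate on the batch before the model ever sees it.
            accuracies.append(accuracy_score(y, model.predict(X)))
        history.append((X, y))
        # Retrain on a sliding window of recent batches.
        recent = history[-retrain_windows:]
        Xtr = np.vstack([b[0] for b in recent])
        ytr = np.concatenate([b[1] for b in recent])
        model = model_factory().fit(Xtr, ytr)
    return accuracies
```

Plotting the returned per-batch accuracies is what produces drift curves like the ones below.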
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Circle" srcset="
/report/osre24/anl/last/20240718-williamn/featured_hu314786216917ab73ec11106e3fbdbfd6_47344_cd621ffcf8e0112adfbd5a4b18eed098.webp 400w,
/report/osre24/anl/last/20240718-williamn/featured_hu314786216917ab73ec11106e3fbdbfd6_47344_70229554177ef52d038b9d63d9ddea31.webp 760w,
/report/osre24/anl/last/20240718-williamn/featured_hu314786216917ab73ec11106e3fbdbfd6_47344_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/featured_hu314786216917ab73ec11106e3fbdbfd6_47344_cd621ffcf8e0112adfbd5a4b18eed098.webp"
width="760"
height="459"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="CovCon" srcset="
/report/osre24/anl/last/20240718-williamn/covconmodel_hud438b5726b48ee443ea38ddb26d42afb_81013_71acdd86c329186013de6ffb21b90cd6.webp 400w,
/report/osre24/anl/last/20240718-williamn/covconmodel_hud438b5726b48ee443ea38ddb26d42afb_81013_252db232f127d0cb909594060e49d831.webp 760w,
/report/osre24/anl/last/20240718-williamn/covconmodel_hud438b5726b48ee443ea38ddb26d42afb_81013_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/covconmodel_hud438b5726b48ee443ea38ddb26d42afb_81013_71acdd86c329186013de6ffb21b90cd6.webp"
width="760"
height="482"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Sine" srcset="
/report/osre24/anl/last/20240718-williamn/sinemodel_hufe9ad7a15a34440d5b75d411ad8cc905_51948_e28e854413249c9d3e580bd86e496391.webp 400w,
/report/osre24/anl/last/20240718-williamn/sinemodel_hufe9ad7a15a34440d5b75d411ad8cc905_51948_04c9aca01ba0c5693cdfe730cc25bc07.webp 760w,
/report/osre24/anl/last/20240718-williamn/sinemodel_hufe9ad7a15a34440d5b75d411ad8cc905_51948_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/sinemodel_hufe9ad7a15a34440d5b75d411ad8cc905_51948_e28e854413249c9d3e580bd86e496391.webp"
width="760"
height="479"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h1 id="findings">Findings&lt;/h1>
&lt;p>From the experiments conducted so far, the findings are as follows:&lt;/p>
&lt;ul>
&lt;li>A model without retraining struggles to maintain performance when drift occurs.&lt;/li>
&lt;li>Retraining on data from previous drifting windows, whether the drift is abrupt (SINE) or gradual (CIRCLE), leads to poorer performance; this is especially evident for RetrainWindow, which incorporates data from up to 7 windows prior.&lt;/li>
&lt;li>However, retraining on previous data proves beneficial in cases of covariate shift (CovCon), allowing the model to better align with the evolving real-world feature distributions.&lt;/li>
&lt;/ul>
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;p>With the base template for the pipeline and the dataset curation done, my focus moving forward will be on:&lt;/p>
&lt;ul>
&lt;li>Implementing three advanced algorithms: AUE (Accuracy Updated Ensemble), MATCHMAKER, and Driftsurf, then integrating them into the pipeline.&lt;/li>
&lt;li>Enhancing the benchmarking process by adding more metrics and plots, such as training time and inference time, to better evaluate the strategies.&lt;/li>
&lt;li>Packaging the entire experiment into a Chameleon Trovi Artifact, ensuring ease of reproducibility and extension.&lt;/li>
&lt;/ul>
&lt;p>Stay tuned for my final blog as I delve deeper into this project!&lt;/p></description></item><item><title>Developing a Pipeline to Benchmark Drift Management Strategies</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240610-williamn/</link><pubDate>Mon, 10 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240610-williamn/</guid><description>&lt;p>With guidance from mentors &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a> under the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last">LAST&lt;/a> project, I aim to develop a pipeline to benchmark the efficacy of various drift management algorithms.&lt;/p>
&lt;p>Despite the abundance of literature on this subject, reproducibility remains a challenge due to the lack of available source code. By crafting this pipeline, I aim to create a standardized platform for researchers and practitioners to compare several state-of-the-art drift management approaches. Through rigorous testing and benchmarking, we seek to identify the most effective algorithms across a spectrum of drift scenarios, including gradual, sudden, and recurring drift.&lt;/p>
&lt;p>The final deliverable of this project will be packaged into a Chameleon Trovi Artifact. The pipeline will also be made easily extensible to accommodate additional datasets or custom drift-mitigation methods. This is my &lt;a href="https://docs.google.com/document/d/1biPUKMiKrNSegPVFDIyhjKkYeyiyD4hYqQghdsaU4IE/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> for the project.&lt;/p>
&lt;p>See you around!&lt;/p></description></item></channel></rss>