<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>SummerofReproducibility24 | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/summerofreproducibility24/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/category/summerofreproducibility24/index.xml" rel="self" type="application/rss+xml"/><description>SummerofReproducibility24</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Wed, 06 Aug 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>SummerofReproducibility24</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/summerofreproducibility24/</link></image><item><title>Midterm Report: Simulation, Comparison, and Conclusion of Cache Eviction</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/harvard/cachebench/2025-08-06-haochengxia/</link><pubDate>Wed, 06 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/harvard/cachebench/2025-08-06-haochengxia/</guid><description>&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>&lt;strong>CacheBench&lt;/strong> is a benchmarking suite designed for comprehensive cache performance evaluation, with a particular focus on analyzing the miss ratios of various cache eviction algorithms.&lt;/p>
&lt;p>At the core of CacheBench lie two key components: the high-performance cache simulator, &lt;a href="https://github.com/1a1a11a/libCacheSim" target="_blank" rel="noopener">libCacheSim&lt;/a>, and the extensive &lt;a href="https://github.com/cacheMon/cache_dataset" target="_blank" rel="noopener">open-source cache datasets&lt;/a>, which collectively contain over 8,000 traces from diverse applications. This ensures broad coverage across a range of realistic workloads.&lt;/p>
&lt;p>Our primary goal is to evaluate all major and widely-used cache eviction algorithms on thousands of traces, in order to gain insights into their behaviors and design trade-offs. Additionally, we aim to identify and distill representative workloads, making benchmarking more efficient and comprehensive for future cache research.&lt;/p>
&lt;h2 id="progress-and-pain-points">Progress and Pain Points&lt;/h2>
&lt;p>We began by benchmarking prevalent eviction algorithms, including FIFO, LRU, CLOCK, LFU, Random, Belady (BeladySize), CAR, ARC, LIRS, LHD, Hyperbolic, GDSF, W-TinyLFU, 2Q, SLRU, S3-FIFO, SIEVE, and LeCaR. As we developed the suite, we made progressive improvements to both the simulator and dataset infrastructure. Our progress can be summarized as follows:&lt;/p>
&lt;ul>
&lt;li>Collected miss ratio results for all listed algorithms across 8,000+ traces.&lt;/li>
&lt;li>Identified best- and worst-performing traces for each algorithm, and conducted feature analysis of these traces.&lt;/li>
&lt;li>Developed Python bindings: To increase accessibility, we provided a Python package that allows users to easily download traces and run simulation analyses using &lt;a href="https://github.com/1a1a11a/libCacheSim" target="_blank" rel="noopener">libCacheSim&lt;/a> and the &lt;a href="https://github.com/cacheMon/cache_dataset" target="_blank" rel="noopener">cache datasets&lt;/a>.&lt;/li>
&lt;/ul>
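&lt;p>To make the measurement concrete, here is a toy miss-ratio simulation for a single LRU cache in pure Python. This is only an illustrative sketch, not the libCacheSim implementation; the actual evaluation replays thousands of real traces through the simulator.&lt;/p>

```python
from collections import OrderedDict

def lru_miss_ratio(trace, cache_size):
    """Replay a request trace through a fixed-size LRU cache
    and return the fraction of requests that miss."""
    cache = OrderedDict()  # keys kept in recency order, oldest first
    misses = 0
    for obj in trace:
        if obj in cache:
            cache.move_to_end(obj)         # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict the least recently used key
            cache[obj] = True
    return misses / len(trace)

trace = ["a", "b", "a", "c", "a", "b", "d", "a"]
ratio = lru_miss_ratio(trace, cache_size=2)  # 6 misses out of 8 requests
```

&lt;p>FIFO, CLOCK, and the other policies listed above differ mainly in which object they evict on a miss, so the same replay loop generalizes across algorithms.&lt;/p>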
&lt;p>However, analysis remains challenging because there is no universally accepted metric or baseline for objectively comparing cache eviction algorithms&amp;rsquo; performance across all workloads.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>For the second half of the project, my focus will shift to:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Evaluating More Complex Eviction Algorithms&lt;/strong>: Having concentrated mainly on static eviction policies so far (which are generally more deterministic and understandable), I will now investigate learning-based eviction algorithms such as LRB and 3L-Cache. These models incorporate learning components and incur additional computational overhead, making simulations slower and more complex.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Detailed Trace Analysis&lt;/strong>: Since eviction algorithms can have highly variable performance on the same trace, I plan to analyze why certain algorithms excel on specific traces while others do not. Understanding these factors is crucial to characterizing both the algorithms and the workload traces.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Constructing Representative Workload Sets&lt;/strong>: Based on ongoing simulations and trace analyses, I aim to identify a minimal but representative subset of traces that can serve as a basic evaluation suite, simplifying testing and improving accessibility.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="reflection">Reflection&lt;/h2>
&lt;p>This project has truly been the highlight of my summer. By evaluating a wide range of cache eviction algorithms, I&amp;rsquo;ve significantly deepened my understanding of cache design and its underlying principles.&lt;/p>
&lt;p>I&amp;rsquo;m especially grateful to my mentors for their constant support, patience, and guidance throughout this journey. It’s been a privilege to learn from you!&lt;/p>
&lt;p>I&amp;rsquo;m excited to see the final results of CacheBench!&lt;/p></description></item><item><title>Building a Benchmarking Suite for Cache Performance Evaluation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/harvard/cachebench/2025-06-21-haochengxia/</link><pubDate>Sat, 21 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/harvard/cachebench/2025-06-21-haochengxia/</guid><description>&lt;p>Hi! I&amp;rsquo;m Haocheng Xia, a Computer Science student at the &lt;strong>University of Illinois Urbana-Champaign&lt;/strong>, passionate about the intersection of &lt;strong>machine learning and storage systems&lt;/strong>. Specifically, I&amp;rsquo;m keen on &lt;strong>workload analysis&lt;/strong> and &lt;strong>KV cache management for large language models&lt;/strong>.&lt;/p>
&lt;p>This summer, I&amp;rsquo;m happy to be a part of &lt;strong>SoR 2025&lt;/strong> and &lt;strong>OSRE 2025&lt;/strong>. I&amp;rsquo;m contributing to the &lt;strong>CacheBench&lt;/strong> project. My initiative, &lt;strong>&amp;lsquo;Building a Benchmarking Suite for Cache Performance Evaluation,&amp;rsquo;&lt;/strong> will create a robust platform. This involves extensive simulation of existing eviction algorithms using &lt;a href="https://github.com/cacheMon/libCacheSim" target="_blank" rel="noopener">libCacheSim&lt;/a>, developing microbenchmarks, and building a user-friendly platform for researchers to effortlessly evaluate novel cache designs. The ultimate goal is to establish a competitive leaderboard.&lt;/p>
&lt;p>My contributions will include a comprehensive dataset detailing simulated &lt;strong>miss ratios&lt;/strong> and &lt;strong>throughput&lt;/strong> of current cache eviction algorithms, an extension to &lt;a href="https://github.com/cacheMon/libCacheSim" target="_blank" rel="noopener">libCacheSim&lt;/a> for executing microbenchmarks both locally and on our online platform, and the creation and ongoing maintenance of a public web leaderboard. I&amp;rsquo;m grateful to be mentored by &lt;strong>Juncheng Yang&lt;/strong> and &lt;strong>Yazhuo Zhang&lt;/strong>.&lt;/p>
&lt;p>I&amp;rsquo;m thrilled to be part of building tools that make cache research easier and more reproducible. Looking forward to a productive summer!&lt;/p></description></item><item><title>[Final] ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240918-imzahra/</link><pubDate>Wed, 18 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240918-imzahra/</guid><description>&lt;p>Hello everyone,&lt;/p>
&lt;p>In my SoR 2024 project, the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/osu/scalerep/">ScaleRep project&lt;/a>, I worked under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bogdan-bo-stoica/">Bogdan &amp;quot;Bo&amp;quot; Stoica&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a> to tackle the reproducibility challenges posed by scalability bugs in large-scale distributed systems. I&amp;rsquo;m excited to share the final progress and insights we&amp;rsquo;ve gathered; below is a detailed summary of our investigations and findings.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>As you may recall, our project, ScaleRep, aimed to tackle the challenge of scalability bugs—those insidious issues that often arise in large-scale distributed systems under heavy workloads. These bugs, when triggered, can lead to significant system issues such as downtime, performance bottlenecks, and even data loss. They are particularly difficult to catch using traditional testing methods.&lt;/p>
&lt;p>Our primary focus was on reproducing these bugs, documenting the challenges involved, and providing insights into how these bugs manifest under various conditions. This documentation will help researchers identify, benchmark, and resolve similar issues in the future.&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;p>Since the midterm update, several Apache Ignite bugs have been investigated, some of which have been successfully reproduced and uploaded to Trovi for the research community to access and reuse. Below is the progress on the bugs investigated:&lt;/p>
&lt;h3 id="bugs-investigated">Bugs Investigated&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20614" target="_blank" rel="noopener">IGNITE-20614&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-17407" target="_blank" rel="noopener">IGNITE-17407&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20602" target="_blank" rel="noopener">IGNITE-20602&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16600" target="_blank" rel="noopener">IGNITE-16600&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16072" target="_blank" rel="noopener">IGNITE-16072&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16582" target="_blank" rel="noopener">IGNITE-16582&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16581" target="_blank" rel="noopener">IGNITE-16581&lt;/a>&lt;/strong>&lt;/li>
&lt;/ol>
&lt;h2 id="key-insights--challenges">Key Insights &amp;amp; Challenges&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Complexity of Scalability Bugs&lt;/strong>: Many scalability bugs involve subtle and complex interactions that are not easily detected in standard testing environments. For instance, IGNITE-20602 only manifested under certain high-load conditions and required a specific workload and environment to reliably trigger the issue. This highlights the importance of large-scale testing when investigating scalability issues.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Dependency and Documentation Gaps&lt;/strong>: We encountered significant challenges with outdated dependencies and incomplete documentation, particularly in older bugs like IGNITE-16072. In these cases, reproducing the bug required extensive modifications or wasn’t feasible without investing disproportionate effort in updating dependencies.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Effectiveness of Trovi and Chameleon&lt;/strong>: Packaging and sharing our reproducible investigations through Trovi and Chameleon have proven highly effective. By providing researchers with pre-configured environments and detailed documentation, we’ve laid the groundwork for future collaboration and further research on these bugs. We expect this to greatly benefit others attempting to reproduce similar issues.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Impact of Speed-Based Throttling&lt;/strong>: Our investigation into IGNITE-16600 revealed several important insights into speed-based throttling and its impact on system performance under high-load conditions. By analyzing the checkpoint starvation and thread throttling mechanisms, we were able to identify areas for improvement in the latest Ignite releases.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>&lt;strong>Expanding Collaboration&lt;/strong>: The packaged bugs and replayable Trovi experiments will be made available to the broader research community, encouraging further investigation and enhancements to large-scale distributed systems.&lt;/p>
&lt;p>The ScaleRep project has been an exciting journey into the world of scalability bugs, pushing the boundaries of what’s possible in terms of reproducibility and benchmarking. Through this project, we’ve demonstrated the importance of rigorous testing and comprehensive documentation in improving the reliability of distributed systems.&lt;/p></description></item><item><title>Understanding Data Leakage in Machine Learning: A Focus on TF-IDF</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240905-kyrillosishak/</link><pubDate>Thu, 05 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240905-kyrillosishak/</guid><description>&lt;p>Hello again!&lt;/p>
&lt;p>This is my final blog post, and I will be discussing the second material I created for the 2024 Summer of Reproducibility Fellowship. As you may recall from my first post, I am working on the &lt;strong>Exploring Data Leakage in Applied ML: Reproducing Examples of Irreproducibility&lt;/strong> project with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a> as my mentors.&lt;/p>
&lt;p>This blog post will explore how data leakage can occur during feature extraction, particularly with the commonly used &lt;strong>TF-IDF&lt;/strong> vectorizer, and its impact on model generalization.&lt;/p>
&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>In machine learning, data leakage is a critical issue that can severely impact model performance. It occurs when information from outside the training dataset is improperly used to create the model, leading to overly optimistic performance during evaluation. One common source of leakage comes from how features, such as those extracted using &lt;strong>TF-IDF&lt;/strong> (Term Frequency-Inverse Document Frequency), are handled. In this post, we&amp;rsquo;ll explore how data leakage can happen during feature extraction with TF-IDF and how it affects model accuracy.&lt;/p>
&lt;h1 id="what-is-tf-idf">What is TF-IDF?&lt;/h1>
&lt;p>TF-IDF is a method used to evaluate how important a word is in a document relative to a collection of documents. It consists of two components:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Term Frequency (TF)&lt;/strong>: Measures how frequently a term appears in a document.&lt;/li>
&lt;li>&lt;strong>Inverse Document Frequency (IDF)&lt;/strong>: Reduces the importance of terms that appear frequently across many documents.&lt;/li>
&lt;/ol>
&lt;p>Together, they provide a weighted value for each word, reflecting its importance relative to the dataset.&lt;/p>
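&lt;p>As a concrete illustration, the two components can be computed directly in a few lines of Python. Note that this uses the plain log(N/df) formulation; real libraries such as scikit-learn apply smoothing terms, so exact values differ.&lt;/p>

```python
import math

def tf(term, doc):
    # term frequency: occurrences of term divided by total terms in the doc
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # inverse document frequency over a list of tokenized documents
    # (unsmoothed variant: log of corpus size over document frequency)
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

corpus = [["the", "cat", "sat"],
          ["the", "dog", "ran"],
          ["a", "cat", "and", "a", "dog"]]

# "cat" appears in 2 of the 3 documents, so its IDF is log(3/2)
score = tf("cat", corpus[0]) * idf("cat", corpus)
```

&lt;p>A word appearing in every document gets an IDF of zero under this formulation, which is exactly the downweighting of ubiquitous terms described above.&lt;/p>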
&lt;h1 id="how-data-leakage-occurs-with-tf-idf">How Data Leakage Occurs with TF-IDF&lt;/h1>
&lt;p>Data leakage with TF-IDF happens when the inverse document frequency (IDF) is calculated using the entire dataset (including the test set) before splitting it into training and test sets. This means the model has access to information from the test set during training, leading to artificially inflated results. This is a subtle form of data leakage, as it often goes unnoticed.&lt;/p>
&lt;p>For example, if the word &amp;ldquo;banana&amp;rdquo; appears frequently in the test set and those occurrences are counted when computing the IDF, its IDF score drops, so the model downplays its significance during training. As a result, the model may fail to predict correctly when &amp;ldquo;banana&amp;rdquo; is important in the test data.&lt;/p>
&lt;h1 id="why-does-this-matter">Why Does This Matter?&lt;/h1>
&lt;p>If the test data is included when calculating the IDF, the model gains unintended insight into the test set&amp;rsquo;s word distribution. In real-world scenarios, the test data is supposed to be unseen during training. By allowing the model to see this information, you&amp;rsquo;re essentially reducing the uncertainty that the model should have about future data.&lt;/p>
&lt;h1 id="impact-of-data-leakage-on-model-performance">Impact of Data Leakage on Model Performance&lt;/h1>
&lt;p>Let&amp;rsquo;s consider two cases to understand the impact of data leakage in detail:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>When a word is rare in the training set but common in the test set&lt;/strong>: The model will underestimate the importance of this word during training, leading to poor performance when the word is critical in test documents.&lt;/li>
&lt;li>&lt;strong>When a word is common in the training set but rare in the test set&lt;/strong>: The model will overemphasize the word during training, leading to poor predictions when the word doesn’t appear as often in unseen data.&lt;/li>
&lt;/ol>
&lt;h3 id="case-study-data-leakage-in-tf-idf">Case Study: Data Leakage in TF-IDF&lt;/h3>
&lt;p>To see this effect in action, consider a small toy dataset where the presence of the word &amp;ldquo;banana&amp;rdquo; determines the label. If the word &amp;ldquo;banana&amp;rdquo; appears in a sentence, the label is 1; otherwise, the label is 0. Using &lt;strong>TF-IDF&lt;/strong> to vectorize the text, we train a machine learning model to predict this label.&lt;/p>
&lt;p>In the &lt;strong>first scenario&lt;/strong>, we calculate the &lt;strong>TF-IDF&lt;/strong> using the entire dataset before splitting it into training and testing sets. This causes data leakage since the model now knows the distribution of words across both sets. For instance, if &amp;ldquo;banana&amp;rdquo; is more common in the test set than the training set, the &lt;strong>IDF&lt;/strong> score for &amp;ldquo;banana&amp;rdquo; will be lower across the entire dataset, leading the model to downplay its importance.&lt;/p>
&lt;p>In the &lt;strong>second scenario&lt;/strong>, we calculate &lt;strong>TF-IDF&lt;/strong> only on the training set, ensuring that the test set remains unseen. This preserves the integrity of the test set, giving us a more realistic evaluation of the model&amp;rsquo;s performance.&lt;/p>
&lt;p>Between the two scenarios, the model&amp;rsquo;s accuracy differs drastically. When leakage is present, performance is artificially high during training but poor when tested on unseen data. Without leakage, the model generalizes better, as it is evaluated on truly unseen data.&lt;/p>
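&lt;p>The IDF shift at the heart of this case study is easy to verify with a hand-rolled IDF function (a sketch; library formulations add smoothing but show the same direction of change).&lt;/p>

```python
import math

def idf(term, docs):
    # unsmoothed inverse document frequency over tokenized documents
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

# "banana" is rare in the training split but common in the test split
train = [["apple", "pie"], ["banana", "bread"], ["cherry", "tart"]]
test  = [["banana", "split"], ["banana", "shake"], ["banana", "cake"]]

clean_idf = idf("banana", train)          # fit on training data only
leaky_idf = idf("banana", train + test)   # fit on everything: leakage

# Leakage lowers the IDF, so "banana" is downweighted at training time
assert clean_idf > leaky_idf
```

&lt;p>Here the leaky fit cuts the IDF of &amp;ldquo;banana&amp;rdquo; from log(3) to log(1.5), precisely the downplaying described in the first scenario.&lt;/p>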
&lt;h1 id="avoiding-data-leakage">Avoiding Data Leakage&lt;/h1>
&lt;p>Avoiding data leakage is essential for building reliable machine learning models that generalize well to new data. Here are a few guidelines to help prevent leakage:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Split the dataset before feature extraction&lt;/strong>: Always divide your data into training and test sets before applying any feature engineering techniques.&lt;/li>
&lt;li>&lt;strong>Ensure proper cross-validation&lt;/strong>: When using cross-validation, ensure that the training and test splits do not overlap in any way that can leak information between them.&lt;/li>
&lt;li>&lt;strong>Be cautious with time-series data&lt;/strong>: In time-series models, avoid using future data to predict past events, as this can lead to leakage.&lt;/li>
&lt;/ol>
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>Avoiding data leakage is crucial for building robust machine learning models. In the case of TF-IDF, ensuring that feature extraction is done &lt;strong>only on the training set&lt;/strong> and not on the entire dataset is key to preventing leakage. Properly addressing this issue leads to better generalization and more reliable models in real-world applications.&lt;/p>
&lt;p>This blog post provided a case study on how TF-IDF can introduce data leakage and why it&amp;rsquo;s important to carefully handle your dataset before feature extraction. By splitting your data properly and ensuring that no test data &amp;ldquo;leaks&amp;rdquo; into the training process, you can build models that truly reflect real-world performance.&lt;/p>
&lt;p>Thanks for reading!&lt;/p></description></item><item><title>Reproducing and addressing Data Leakage issue : Duplicates in dataset</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240823-kyrillosishak/</link><pubDate>Fri, 23 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240823-kyrillosishak/</guid><description>&lt;p>Hello!&lt;/p>
&lt;p>In this blog post, I will explore a common issue in machine learning called data leakage, using an example from the paper:&lt;/p>
&lt;blockquote>
&lt;p>Benedetti, P., Perri, D., Simonetti, M., Gervasi, O., Reali, G., Femminella, M. (2020). Skin Cancer Classification Using Inception Network and Transfer Learning. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2020. ICCSA 2020. Lecture Notes in Computer Science(), vol 12249. Springer, Cham. &lt;a href="https://doi.org/10.1007/978-3-030-58799-4_39" target="_blank" rel="noopener">https://doi.org/10.1007/978-3-030-58799-4_39&lt;/a> &lt;a href="https://arxiv.org/pdf/2111.02402v1" target="_blank" rel="noopener">arXiv&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;h1 id="overview-of-the-paper">Overview of the Paper&lt;/h1>
&lt;p>In this paper, the authors use transfer learning on a pretrained convolutional neural network (CNN) to classify skin lesions in dermatoscopic images from the HAM10000 (Human Against Machine with 10,000 training images) dataset. The paper reports a final accuracy of 78.9% on the validation set.&lt;/p>
&lt;p>While this reported result appears to be impressive, there are concerns regarding the validity of this performance metric due to data leakage. Data leakage occurs when the model is trained or evaluated on data that it would not have access to during real-world deployment, leading to an overestimation of the model&amp;rsquo;s true performance.&lt;/p>
&lt;h1 id="identifying-data-leakage-in-the-original-paper">Identifying Data Leakage in the Original Paper&lt;/h1>
&lt;p>Upon closer inspection, it appears that the original experiment suffers from data leakage in two significant ways:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Duplicate Images in Training and Validation Sets:&lt;/p>
&lt;p>The HAM10000 dataset contains near-duplicate images of the same lesions in both the training and validation sets. This results in the model seeing very similar images during training and then again during validation. Consequently, the model&amp;rsquo;s performance is artificially inflated because it has already been &amp;ldquo;trained&amp;rdquo; on images similar to those in the validation set, making the task easier than it should be.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Lesions" srcset="
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate_HAM10000_hu729e02c2ef4cc1a337c6f61174a87df8_81762_dd4057c4bb0e4dc6092a43699881c4f4.webp 400w,
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate_HAM10000_hu729e02c2ef4cc1a337c6f61174a87df8_81762_95478285e99e3f18a8b724b5a0a3dbb5.webp 760w,
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate_HAM10000_hu729e02c2ef4cc1a337c6f61174a87df8_81762_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate_HAM10000_hu729e02c2ef4cc1a337c6f61174a87df8_81762_dd4057c4bb0e4dc6092a43699881c4f4.webp"
width="620"
height="104"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Lesions2" srcset="
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate2_HAM10000_hu1c922099c4dc23532306de6197bf4d86_99960_a0e705292db1f7f683a77cd92f29edc0.webp 400w,
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate2_HAM10000_hu1c922099c4dc23532306de6197bf4d86_99960_fae997e88feee11018435aebd9ed6c88.webp 760w,
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate2_HAM10000_hu1c922099c4dc23532306de6197bf4d86_99960_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate2_HAM10000_hu1c922099c4dc23532306de6197bf4d86_99960_a0e705292db1f7f683a77cd92f29edc0.webp"
width="620"
height="104"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Using the Validation Set for Early Stopping and Final Evaluation:&lt;/p>
&lt;p>Another critical issue is the use of the validation set for both early stopping and final model evaluation. Early stopping is a technique where training is halted when the model&amp;rsquo;s performance on a validation set no longer improves, preventing overfitting. However, if this same validation set is later used to evaluate the model&amp;rsquo;s final performance, it can lead to overfitting on the validation data itself, resulting in an overly optimistic estimate of model accuracy.&lt;/p>
&lt;/li>
&lt;/ol>
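&lt;p>A first line of defense against the duplicate-image issue is a byte-level content hash over both splits. This is only a sketch: exact hashing catches byte-identical copies, while the near-duplicates in HAM10000 require perceptual hashing (e.g. the third-party &lt;code>imagehash&lt;/code> package).&lt;/p>

```python
import hashlib

def find_cross_split_duplicates(train_files, val_files):
    """Flag validation items whose bytes exactly match a training item.
    Exact hashing only catches byte-identical copies; near-duplicate
    images need perceptual hashing instead."""
    train_hashes = {hashlib.sha256(data).hexdigest() for data in train_files}
    return [i for i, data in enumerate(val_files)
            if hashlib.sha256(data).hexdigest() in train_hashes]

# Toy stand-ins for raw image bytes
train = [b"lesion-001", b"lesion-002"]
val   = [b"lesion-003", b"lesion-002"]   # second item duplicates training data
dupes = find_cross_split_duplicates(train, val)
```

&lt;p>Any flagged indices should be removed from the validation set (or, better, the split redone per lesion) before training begins.&lt;/p>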
&lt;h1 id="our-reproduction-and-results">Our Reproduction and Results&lt;/h1>
&lt;p>To demonstrate the impact of these data leakage issues, we reproduced the experiment with corrected methodologies:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Corrected Data Split: We ensured that there were no duplicate images between the training and validation sets. This setup is crucial to simulate a realistic scenario where the model encounters completely unseen data during validation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Separate Validation and Test Sets: We introduced a distinct test set to evaluate the final model performance, independent of the data used for early stopping.&lt;/p>
&lt;/li>
&lt;/ul>
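&lt;p>The corrected protocol, with disjoint training, validation, and test sets, can be sketched as follows. The per-lesion grouping that HAM10000 actually requires is noted in the comments as an assumption of this toy version.&lt;/p>

```python
import random

def three_way_split(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve out disjoint validation and test sets.
    Assumption: one entry per lesion. For HAM10000, split per lesion ID
    so near-duplicate images of the same lesion never straddle splits."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_frac)
    n_test = int(len(items) * test_frac)
    val = items[:n_val]
    test = items[n_val:n_val + n_test]
    train = items[n_val + n_test:]
    return train, val, test

train, val, test = three_way_split(range(100))
# The three sets are pairwise disjoint by construction
assert set(train).isdisjoint(val) and set(val).isdisjoint(test)
```

&lt;p>With this split, early stopping consults only the validation set and the test set is touched exactly once, for the final accuracy number.&lt;/p>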
&lt;p>&lt;strong>Results Comparison&lt;/strong>&lt;/p>
&lt;table>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>Original results&lt;/td>
&lt;td>Our results&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>
Accuracy
&lt;/td>
&lt;td>
78.9%
&lt;/td>
&lt;td>
78.6%
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>
Number of epochs
&lt;/td>
&lt;td>
Approx. 42 epochs
&lt;/td>
&lt;td>
40 epochs
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>
Training size
&lt;/td>
&lt;td>
Unknown
&lt;/td>
&lt;td>
7000 samples
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>
Validation size
&lt;/td>
&lt;td>
478 samples
&lt;/td>
&lt;td>
478 samples
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>
Confusion matrix
&lt;/td>
&lt;td>
&lt;img src="https://raw.githubusercontent.com/kyrillosishak/re-SkinCancer/main/assets/paper's_results.jpeg" />
&lt;/td>
&lt;td>
&lt;img src="https://raw.githubusercontent.com/kyrillosishak/re-SkinCancer/main/assets/Our_results.jpeg" />
&lt;/td>
&lt;/tr>
&lt;/table>
&lt;h1 id="analysis-of-the-results">Analysis of the Results&lt;/h1>
&lt;p>While our reproduced accuracy of 78.6% is close to the original reported accuracy, it is based on a properly separated training and validation set, avoiding the data leakage pitfalls of the original paper. The slight drop in accuracy further highlights the overestimation of the original model&amp;rsquo;s performance due to data leakage.&lt;/p>
&lt;p>Moreover, using a separate test set for final evaluation provides a more reliable measure of the model&amp;rsquo;s ability to generalize to new, unseen data. The confusion matrices show that our model&amp;rsquo;s performance is consistent across different lesion classes, confirming the robustness of the evaluation.&lt;/p>
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>Data leakage is a common and often overlooked problem in applied machine learning, leading to misleading performance metrics and irreproducible results. By carefully examining and correcting these issues in our reproduction, we hope to provide a clearer understanding of the importance of proper data handling and validation practices.&lt;/p>
&lt;p>It is crucial for researchers and practitioners to be vigilant about data leakage and ensure that their models are trained, validated, and tested under realistic conditions. This not only ensures the credibility of their results but also enhances the real-world applicability of their models.&lt;/p>
&lt;p>Thank you for reading, and stay tuned for more insights on machine learning reproducibility!&lt;/p></description></item><item><title>[MidTerm] ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240801-imzahra/</link><pubDate>Thu, 01 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240801-imzahra/</guid><description>&lt;p>Hey there, scalability enthusiasts and fellow researchers! I’m excited to share my progress on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/osu/scalerep/">ScaleRep project&lt;/a> for SoR 2024 under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bogdan-bo-stoica/">Bogdan &amp;quot;Bo&amp;quot; Stoica&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a>. Here’s a glimpse into how we’re tackling scalability bugs in large-scale distributed systems.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>Large-scale distributed systems are the backbone of modern computing, powering various applications and services. However, these systems often face challenges related to reliability and performance, particularly scalability bugs. These bugs manifest in large-scale deployments, causing issues such as system downtime, reduced responsiveness, and data loss. Traditional bug-finding methods fall short in detecting these bugs, which are triggered by factors like component count, system load, workload size, recovery protocol reliability, and intermediate failure magnitude.&lt;/p>
&lt;p>Our project, ScaleRep, aims to address these challenges by analyzing recent scalability issues from ten popular open-source large-scale systems. We are providing detailed accounts of bug reproduction experiences, identifying common challenges, and developing protocols for triggering and quantifying the impact of scalability bugs.&lt;/p>
&lt;h2 id="progress-highlights">Progress Highlights&lt;/h2>
&lt;p>So far, I have been working on the following bugs and have successfully uploaded some of them to Trovi. Here’s a brief overview of my progress:&lt;/p>
&lt;h3 id="bugs-worked-on">Bugs Worked On:&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20614" target="_blank" rel="noopener">IGNITE-20614&lt;/a>&lt;/strong>: Uploaded to Trovi &lt;a href="https://www.chameleoncloud.org/experiment/share/9f045059-011e-4089-90d4-0f5845ef3c73" target="_blank" rel="noopener">Trovi Link&lt;/a>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-17407" target="_blank" rel="noopener">IGNITE-17407&lt;/a>&lt;/strong>: Uploaded to Trovi &lt;a href="https://www.chameleoncloud.org/experiment/share/9cfd42b7-c7c9-4b6b-a538-b6c496eb1bed" target="_blank" rel="noopener">Trovi Link&lt;/a>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20692" target="_blank" rel="noopener">IGNITE-20692&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16600" target="_blank" rel="noopener">IGNITE-16600&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16072" target="_blank" rel="noopener">IGNITE-16072&lt;/a>&lt;/strong>&lt;/li>
&lt;/ol>
&lt;h2 id="what-is-chameleon-and-trovi">What is Chameleon and Trovi?&lt;/h2>
&lt;p>&lt;strong>&lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a>&lt;/strong> is a configurable experimental environment for large-scale cloud research. It provides a platform for running and testing distributed systems at scale, allowing researchers to reproduce and study scalability issues in a controlled setting.&lt;/p>
&lt;p>&lt;strong>&lt;a href="https://chameleoncloud.org/experiment/share/" target="_blank" rel="noopener">Trovi&lt;/a>&lt;/strong> is a platform that facilitates the sharing of reproducible artifacts. By uploading our bug reproduction artifacts to Trovi, we enable other researchers to easily reproduce scalability bugs, fostering collaboration and advancing the field of distributed systems research.&lt;/p>
&lt;h2 id="short-description-of-the-bugs">Short Description of the Bugs&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20614" target="_blank" rel="noopener">IGNITE-20614&lt;/a>
This bug refers to an issue where the Ignite service grid experiences degradation or hangs under specific conditions related to service deployment and node restarts.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Root Causes&lt;/strong>: The root cause is a race condition during the deployment and undeployment of services in the service grid, particularly when nodes are restarted or when there is a significant amount of concurrent service deployment and undeployment activity.&lt;/p>
&lt;p>&lt;strong>Impact&lt;/strong>: The impact of this bug includes potential service grid hangs, degraded performance, and possible inability to deploy or undeploy services as expected, which can disrupt the overall operation of the Ignite cluster.&lt;/p>
&lt;p>&lt;strong>Fix&lt;/strong>: The fix involves adding proper synchronization mechanisms to handle concurrent service deployment and undeployment operations more gracefully, ensuring that race conditions are avoided.&lt;/p>
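&lt;p>To make the race concrete, here is a minimal sketch (not Ignite&amp;rsquo;s actual patch; the &lt;code>ServiceLifecycle&lt;/code> class and its methods are hypothetical): an unsynchronized check-then-act on shared deployment state is the classic shape of such a race, and making each state transition atomic, here via &lt;code>synchronized&lt;/code> methods, is the general shape of the fix.&lt;/p>

```java
// Hypothetical sketch of the fix's general shape, NOT Ignite's code:
// each deploy/undeploy transition is made atomic so that concurrent
// check-then-act sequences cannot interleave and corrupt the state.
public class ServiceLifecycle {
    private String state = "UNDEPLOYED"; // shared mutable deployment state

    /** Atomic transition: succeeds only if the service is not yet deployed. */
    public synchronized boolean deploy() {
        if (state.equals("DEPLOYED")) {
            return false; // another thread won the race; do nothing
        }
        state = "DEPLOYED";
        return true;
    }

    /** Atomic transition: succeeds only if the service is currently deployed. */
    public synchronized boolean undeploy() {
        if (state.equals("UNDEPLOYED")) {
            return false;
        }
        state = "UNDEPLOYED";
        return true;
    }

    public synchronized String state() {
        return state;
    }
}
```

&lt;p>Without the synchronization, two threads could both observe the service as undeployed and both proceed to deploy it, which is the kind of interleaving that can leave the service grid hung.&lt;/p>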
&lt;ol start="2">
&lt;li>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-17407" target="_blank" rel="noopener">IGNITE-17407&lt;/a>
This issue pertains to the incorrect behavior of the Ignite thin client protocol, particularly when dealing with binary objects and schema changes.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Root Causes&lt;/strong>: The root cause lies in the way the thin client handles binary object schema changes. The thin client was not correctly updating the schema cache, leading to inconsistencies and incorrect behavior when deserializing binary objects.&lt;/p>
&lt;p>&lt;strong>Impact&lt;/strong>: Users of the thin client may experience issues with binary object deserialization, leading to potential data corruption, incorrect query results, and overall application instability.&lt;/p>
&lt;p>&lt;strong>Fix&lt;/strong>: The fix involves updating the thin client protocol to properly handle schema changes by ensuring that the schema cache is correctly updated and synchronized with the server.&lt;/p>
&lt;ol start="3">
&lt;li>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20692" target="_blank" rel="noopener">IGNITE-20692&lt;/a>
This bug is related to the performance degradation observed in the Ignite SQL engine when executing certain complex queries.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Root Causes&lt;/strong>: The root cause is identified as inefficient query planning and execution strategies for specific types of complex SQL queries, leading to excessive resource consumption and slow query performance.&lt;/p>
&lt;p>&lt;strong>Impact&lt;/strong>: Users running complex SQL queries may experience significant performance degradation, leading to slower response times, increased CPU and memory usage, and potentially impacting the overall performance of the Ignite cluster.&lt;/p>
&lt;p>&lt;strong>Fix&lt;/strong>: The fix involves optimizing the SQL query planner and executor to handle complex queries more efficiently, including better indexing strategies, improved query plan caching, and more effective resource management during query execution.&lt;/p>
&lt;ol start="4">
&lt;li>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16600" target="_blank" rel="noopener">IGNITE-16600&lt;/a>
This bug involves an issue with speed-based throttling in the checkpoint process, leading to possible starvation of the checkpoint thread under heavy load.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Root Causes&lt;/strong>: The root cause is the absence of proper mechanisms to wake up throttled threads when they no longer need to be throttled, resulting in unnecessary waiting and potential starvation of the checkpoint thread.&lt;/p>
&lt;p>&lt;strong>Impact&lt;/strong>: Under heavy load, the checkpoint process can be significantly delayed, leading to slower checkpoint completion times, increased risk of data loss, and overall degraded performance of the Ignite cluster.&lt;/p>
&lt;p>&lt;strong>Fix&lt;/strong>: The fix adds methods (&lt;code>tryWakeupThrottledThreads&lt;/code> and &lt;code>shouldThrottle&lt;/code>) that wake up throttled threads once they no longer need to be throttled, ensuring that the checkpoint process can proceed without unnecessary delays.&lt;/p>
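&lt;p>The wake-up idea can be sketched as follows (a minimal illustration, not the actual Ignite patch; the &lt;code>ThrottleSketch&lt;/code> class and its dirty-page counter are assumptions, while &lt;code>shouldThrottle&lt;/code> and the wake-up call mirror the method names mentioned above): writer threads park on a monitor while pressure is high, and checkpoint progress notifies them instead of leaving them to wait out a fixed sleep.&lt;/p>

```java
// Hypothetical sketch, NOT Ignite's code: writer threads park while a
// pressure metric (here, a dirty-page count) exceeds a limit, and the
// checkpoint's progress callback wakes them -- the missing wake-up that
// the fix's tryWakeupThrottledThreads addresses.
public class ThrottleSketch {
    private final Object mux = new Object();
    private volatile int dirtyPages; // assumed pressure metric
    private final int limit;

    public ThrottleSketch(int limit) {
        this.limit = limit;
    }

    /** Throttle only while the pressure metric exceeds the limit. */
    public boolean shouldThrottle() {
        return dirtyPages > limit;
    }

    /** Called by writer threads before a page write: park until woken. */
    public void maybeThrottle() throws InterruptedException {
        synchronized (mux) {
            while (shouldThrottle()) {
                mux.wait(); // woken by onCheckpointProgress, not a fixed sleep
            }
        }
    }

    /** Checkpoint progress lowers pressure and wakes parked writers. */
    public void onCheckpointProgress(int newDirtyPages) {
        synchronized (mux) {
            dirtyPages = newDirtyPages;
            if (!shouldThrottle()) {
                mux.notifyAll(); // analogue of tryWakeupThrottledThreads
            }
        }
    }

    public void setDirtyPages(int n) {
        dirtyPages = n;
    }
}
```

&lt;p>The key design point is that parked writers are released by an explicit notification as soon as throttling is no longer needed, rather than starving until a timeout expires.&lt;/p>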
&lt;ol start="5">
&lt;li>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16072" target="_blank" rel="noopener">IGNITE-16072&lt;/a>
This issue pertains to the incorrect handling of SQL queries involving NULL values in the Ignite SQL engine, leading to unexpected query results.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Root Causes&lt;/strong>: The root cause is an incorrect implementation of SQL semantics for handling NULL values in certain query conditions, particularly in the presence of complex joins and subqueries.&lt;/p>
&lt;p>&lt;strong>Impact&lt;/strong>: Users may experience incorrect query results when NULL values are involved, leading to potential data inconsistencies and incorrect application behavior.&lt;/p>
&lt;p>&lt;strong>Fix&lt;/strong>: The fix involves correcting the SQL engine&amp;rsquo;s implementation to properly handle NULL values according to the SQL standard, ensuring that queries involving NULL values produce the expected results.&lt;/p>
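&lt;p>The SQL-standard behavior at stake can be illustrated with a small model of three-valued logic (a hedged sketch, not Ignite&amp;rsquo;s engine code; the class and method names are hypothetical): a comparison involving NULL yields UNKNOWN rather than false, and a WHERE clause keeps a row only when its predicate is exactly TRUE.&lt;/p>

```java
// Hypothetical model of SQL three-valued logic, NOT Ignite's engine code:
// UNKNOWN is represented here by a Java null Boolean.
public class SqlNullSemantics {
    /** SQL equality: any comparison with NULL is UNKNOWN, not false. */
    public static Boolean eq(Integer a, Integer b) {
        if (a == null) {
            return null; // UNKNOWN
        }
        if (b == null) {
            return null; // UNKNOWN
        }
        return a.intValue() == b.intValue();
    }

    /** A WHERE clause keeps a row only when the predicate is TRUE. */
    public static boolean wherePasses(Boolean predicate) {
        return Boolean.TRUE.equals(predicate);
    }
}
```

&lt;p>Under these semantics even comparing NULL with NULL is UNKNOWN, so the row is filtered out; an engine that instead collapses UNKNOWN to true or false will return different rows than the standard requires, which matches the incorrect results described above.&lt;/p>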
&lt;h2 id="whats-next">What&amp;rsquo;s Next?&lt;/h2>
&lt;h4 id="continued-bug-reproduction">Continued Bug Reproduction:&lt;/h4>
&lt;ul>
&lt;li>Focus on reproducing more scalability bugs&lt;/li>
&lt;/ul>
&lt;h4 id="documentation-of-challenges">Documentation of Challenges:&lt;/h4>
&lt;ul>
&lt;li>Break down the specific challenges encountered during attempts to reproduce scalability bugs.&lt;/li>
&lt;li>Categorize challenges, including technical complexities, environmental dependencies, and lack of documentation in bug reports.&lt;/li>
&lt;/ul>
&lt;h4 id="finalizing-project-deliverables">Finalizing Project Deliverables:&lt;/h4>
&lt;ul>
&lt;li>Package artifacts using Jupyter notebook scripts for convenient replay of investigation steps.&lt;/li>
&lt;li>Upload the packaged artifacts to Trovi, enabling other researchers to easily reproduce the scalability bugs in our benchmark applications.&lt;/li>
&lt;/ul>
&lt;h3 id="conclusion">Conclusion&lt;/h3>
&lt;p>The ScaleRep project has made significant strides in reproducing and benchmarking scalability bugs in large-scale distributed systems. By successfully reproducing and documenting scalability bugs, we are contributing valuable insights to the research community, aiding in the development of more robust distributed systems. The protocols and methodologies devised in this project will serve as valuable tools for researchers exploring similar issues.&lt;/p>
&lt;p>Stay tuned for more updates as we continue to tackle scalability bugs and improve the reliability and performance of large-scale distributed systems.&lt;/p></description></item><item><title>Data leakage in applied ML: reproducing examples of irreproducibility</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240614-kyrillosishak/</link><pubDate>Fri, 14 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240614-kyrillosishak/</guid><description>&lt;p>Hello,&lt;/p>
&lt;p>I am Kyrillos Ishak, and I am happy to be part of SoR 2024. I am working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/data-leakage/">Data leakage in applied ML: reproducing examples of irreproducibility&lt;/a> project, and my &lt;a href="https://drive.google.com/file/d/1u9FGQqxlPMhceKwS_NJxIhkIrQVGIp-0/view" target="_blank" rel="noopener">proposal&lt;/a> was accepted.&lt;/p>
&lt;p>I am excited to work with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a> as my mentors. The objective of the project is to develop educational resources that can be adjusted by professors/instructors to explain specific data leakage problems. This involves ensuring the reproducibility of certain research papers that contain data preprocessing issues, then fixing these issues to demonstrate how they can affect the results.&lt;/p>
&lt;p>Data leakage occurs when information from outside the training dataset is used to build the model. It can lead to overly optimistic performance estimates and, ultimately, models that do not perform well on new, unseen data.&lt;/p>
&lt;p>Despite the importance of addressing data leakage, many people from fields not closely related to computer science are often unfamiliar with it, even if they are aware of best practices for data preprocessing. Developing educational materials on this topic will greatly benefit them.&lt;/p>
&lt;p>I am excited to dive into the topic of data leakage in machine learning. Throughout the summer, I will be sharing regular updates and insightful blog posts on this subject. Stay tuned for more information!&lt;/p></description></item><item><title>ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240608-imzahra/</link><pubDate>Sat, 08 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240608-imzahra/</guid><description>&lt;p>Hi! I&amp;rsquo;m Zahra, an undergraduate at Universitas Dian Nuswantoro, Indonesia.
As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/osu/scalerep/">ScaleRep&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1Jfk7lRNIWfhFLkVHHN_ZkNQiTg4f8-xp/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, written under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bogdan-bo-stoica/">Bogdan &amp;quot;Bo&amp;quot; Stoica&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a>, aims to systematically understand, characterize, and document the challenges associated with scalability bugs in large-scale distributed systems.&lt;/p>
&lt;p>ScaleRep proposes a two-fold strategy to address scalability bugs in large-scale distributed systems. First, Bug Analysis and Documentation involves studying recent scalability issues across popular open-source systems such as Cassandra, Hadoop, HDFS, Ignite, and Spark to understand bug causes, symptoms, and solutions. This includes pinpointing common challenges hindering bug reproduction and devising protocols to trigger and measure scalability bug impacts. Second, Implementation and Artifact Packaging focuses on identifying, reproducing, and documenting scalability bugs, then packaging artifacts with &lt;a href="https://chameleoncloud.org/experiment/share/" target="_blank" rel="noopener">Chameleon Trovi&lt;/a>. This method emphasizes precise bug analysis, establishing reproducible environments, and detailed documentation to ensure artifact reliability and usability.&lt;/p></description></item></channel></rss>