<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>SoR'24 | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/sor24/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/category/sor24/index.xml" rel="self" type="application/rss+xml"/><description>SoR'24</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Mon, 02 Sep 2024 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>SoR'24</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/sor24/</link></image><item><title>Reflecting on the ScaleRep Project: Achievements and Insights</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240902-shuangliang/</link><pubDate>Mon, 02 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240902-shuangliang/</guid><description>&lt;p>Hello everyone,&lt;/p>
&lt;p>As we reach the conclusion of our ScaleRep project, I want to take a moment to reflect on the journey we’ve undertaken and the significant milestones we’ve achieved. Throughout this project, our primary focus was on identifying, reproducing, and analyzing scalability bugs in cloud systems such as Cassandra, HDFS, and Hadoop. Under the mentorship of Professor Yang Wang and Bogdan “Bo” Stoica, we have gained valuable insights into the complexities of scalability issues and their impact on large-scale distributed systems.&lt;/p>
&lt;h1 id="key-accomplishments">Key Accomplishments&lt;/h1>
&lt;p>Over the course of the project, we delved into various aspects of scalability bugs, reproducing some of the most challenging issues faced by cloud systems. One of our notable accomplishments was the successful reproduction and validation of developer fixes for several critical bugs in HDFS. These included:&lt;/p>
&lt;h2 id="1-throttling-bugs-in-hdfs">1. Throttling Bugs in HDFS:&lt;/h2>
&lt;p>We investigated HDFS-17087, where the absence of a throttler in DataXceiver#readBlock led to unregulated data reads, causing potential performance degradation. By reproducing the bug and applying the developer’s patch, we were able to observe significant improvements in system stability.&lt;/p>
&lt;h2 id="2-reducing-datanode-load">2. Reducing DataNode Load:&lt;/h2>
&lt;p>HDFS-16386 was another crucial bug we worked on, which involved reducing the load on DataNodes while FsDatasetAsyncDiskService was working. By analyzing the effects of high CPU and memory usage, we proposed and validated a solution that reduced the number of concurrent threads, ultimately improving the DataNode’s performance.&lt;/p>
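&lt;p>The general mitigation pattern (capping the number of concurrent worker threads rather than letting them grow unbounded) can be sketched in a few lines of Python. This is an illustrative toy, not the actual Java fix in FsDatasetAsyncDiskService; the MAX_WORKERS value and the delete_block function are hypothetical:&lt;/p>

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Cap the pool size so queued work cannot spawn unbounded threads.
MAX_WORKERS = 4

peak = 0     # highest number of simultaneously active workers observed
active = 0
lock = threading.Lock()

def delete_block(block_id):
    """Stand-in for an async disk task (e.g. unlinking a block file)."""
    global peak, active
    with lock:
        active += 1
        peak = max(peak, active)
    # ... real disk work would happen here ...
    with lock:
        active -= 1

# 100 queued tasks, but never more than MAX_WORKERS threads at once.
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    for block_id in range(100):
        pool.submit(delete_block, block_id)

print(MAX_WORKERS >= peak)  # True
```

&lt;p>Bounding the pool trades some deletion latency for predictable CPU and memory usage on the DataNode.&lt;/p>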
&lt;h2 id="3-improving-log-throttling">3. Improving Log Throttling:&lt;/h2>
&lt;p>In HDFS-16872, we addressed excessive logging caused by unshared instances of LogThrottlingHelper. By making LogThrottlingHelper a static member, we were able to share throttling across instances, reducing unnecessary log entries and improving system efficiency.&lt;/p>
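&lt;p>The essence of the fix (sharing one throttle state across all instances instead of giving each instance its own) can be sketched in Python. The class and method names below are hypothetical analogues, not the actual Hadoop LogThrottlingHelper API:&lt;/p>

```python
import time

class LogThrottler:
    """Emit a given message key at most once per min_interval seconds."""

    def __init__(self, min_interval=5.0):
        self.min_interval = min_interval
        self._last = {}

    def should_log(self, key, now=None):
        now = time.monotonic() if now is None else now
        last = self._last.get(key)
        if last is None or now - last >= self.min_interval:
            self._last[key] = now
            return True
        return False

class EditLogReplayer:
    # The analogue of making the helper a *static* member: one
    # class-level throttler shared by every instance, so N concurrent
    # replayers collectively emit one line per interval instead of N.
    throttler = LogThrottler(min_interval=5.0)

    def report_progress(self, now):
        return EditLogReplayer.throttler.should_log("replay-progress", now=now)

replayers = [EditLogReplayer() for _ in range(4)]
emitted = [r.report_progress(now=100.0) for r in replayers]
print(emitted)  # [True, False, False, False]
```

&lt;p>With per-instance throttlers, all four calls above would have returned True, which is exactly the log flood the patch eliminates.&lt;/p>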
&lt;h1 id="insights-and-learnings">Insights and Learnings&lt;/h1>
&lt;h2 id="1-systematic-bug-reproduction">1. Systematic Bug Reproduction:&lt;/h2>
&lt;p>One of the most critical aspects of our work was developing a systematic approach to bug reproduction. This involved carefully setting up the environment, applying patches, and validating results through detailed monitoring and analysis. Our reproducible artifacts and investigation scripts will serve as a resource for future researchers and developers.&lt;/p>
&lt;h2 id="2-impact-of-throttling-mechanisms">2. Impact of Throttling Mechanisms:&lt;/h2>
&lt;p>Our exploration of throttling bugs highlighted the importance of accurate throttling mechanisms in maintaining system performance and stability. Small issues, such as incorrect data rate calculations, can have significant ripple effects on system behavior, emphasizing the need for precise and effective solutions.&lt;/p>
&lt;h2 id="3-collaboration-and-open-source-contribution">3. Collaboration and Open Source Contribution:&lt;/h2>
&lt;p>Working on an open-source project like ScaleRep underscored the importance of collaboration within the community. The bugs we analyzed and fixed not only improved the systems we worked on but also contributed to the broader effort of enhancing the reliability of cloud systems.&lt;/p>
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>As we wrap up the ScaleRep project, I am proud of the progress we have made and the contributions we have delivered to the open-source community. The knowledge and experience gained from this project will undoubtedly shape our future endeavors in the field of distributed systems and cloud computing. I am grateful for the guidance and support provided by Professor Yang Wang and Bogdan “Bo” Stoica throughout this journey.&lt;/p>
&lt;p>Thank you for following along, and I look forward to continuing to explore the future of scalable and reliable cloud systems!&lt;/p></description></item><item><title>Static and Interactive Visualization Capture</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20250301-aryas/</link><pubDate>Fri, 30 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20250301-aryas/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/arya-sarkar/">Arya Sarkar&lt;/a>, a machine learning engineer and researcher based out of Kolkata, a city in Eastern India dubbed the City of Joy.
During the summer of 2024, I worked closely with Professor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-koop/">David Koop&lt;/a> on the project titled &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/niu/repro-vis/">Reproducibility in Data Visualization&lt;/a>.
We explored several existing solutions, tested different strategies, and made great progress in capturing visualizations using a relatively underused method: embedding visualization meta-information into the final rendered .png as a JSON object.&lt;/p>
&lt;h2 id="progress-and-challenges">Progress and Challenges&lt;/h2>
&lt;h3 id="static-visualization-capture">Static Visualization Capture&lt;/h3>
&lt;p>We successfully developed a method to capture static visualizations as .png files along with embedded metadata in a JSON format.
This approach enables seamless reproducibility of the visualization by storing all necessary metadata within the image file itself.
Our method supports both Matplotlib and Bokeh libraries and demonstrated near-perfect reproducibility, with only a minimal 1-2% pixel difference in cases where jitter (randomness) was involved.&lt;/p>
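&lt;p>As a rough sketch of the embedding mechanism (assuming Matplotlib and Pillow are installed; the "repro" key and the spec fields are illustrative choices, not a fixed schema), the metadata can ride along inside the PNG file itself:&lt;/p>

```python
import json
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend for scripted rendering
import matplotlib.pyplot as plt
from PIL import Image

# Hypothetical minimal "visualization spec"; a real one would record
# library versions, styling, axis configuration, and so on.
spec = {"library": "matplotlib", "plot_type": "hist",
        "column": "sepal_length", "bins": 10}

fig, ax = plt.subplots()
ax.hist([4.9, 5.0, 5.1, 5.8, 6.1, 6.3, 7.0], bins=spec["bins"])

# savefig's metadata kwarg writes PNG tEXt chunks into the file.
path = os.path.join(tempfile.gettempdir(), "histogram.png")
fig.savefig(path, metadata={"repro": json.dumps(spec)})

# Anyone holding only the .png can recover the spec:
recovered = json.loads(Image.open(path).info["repro"])
print(recovered["plot_type"])  # hist
```

&lt;p>Because the payload lives in standard PNG text chunks, any PNG-aware tool can read it back without our code.&lt;/p>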
&lt;h3 id="interactive-visualization-capture">Interactive Visualization Capture&lt;/h3>
&lt;p>For interactive visualizations, our focus shifted to capturing state changes in Plotly visualizations on the web.
We developed a script that tracks user interactions (e.g., zoom, box, lasso, slider) using event listeners and automatically captures the visualization state as both image and metadata files.
This script also maintains a history of interactions to ensure reproducibility of all interaction states.&lt;/p>
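&lt;p>The interaction-history idea can be sketched independently of the browser (our actual script runs as JavaScript event listeners); here is a hypothetical Python recorder where each event appends its type and parameters so every intermediate state can be replayed in order:&lt;/p>

```python
import json
import time

class InteractionRecorder:
    """Hypothetical sketch: append each user event (zoom, lasso,
    slider move, ...) so all intermediate states can be replayed."""

    def __init__(self):
        self.history = []

    def record(self, event, **params):
        self.history.append({"event": event, "params": params,
                             "t": time.time()})

    def export(self):
        # One JSON document per session, saved alongside the images.
        return json.dumps(self.history)

rec = InteractionRecorder()
rec.record("zoom", x_range=[2.0, 8.0], y_range=[0.0, 5.0])
rec.record("slider", name="year", value=2007)
print(len(rec.history))  # 2
```

&lt;p>Replaying the exported list against the original figure reproduces every state the user saw, not just the final one.&lt;/p>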
&lt;p>The challenge of capturing web-based visualizations from platforms like ObservableHq remains, as iframe restrictions prevent direct access to SVG elements.
Further exploration is needed to create a more robust capture method for these environments.&lt;/p>
&lt;p align="center">
&lt;img src="./bokeh_interactive.png" alt="bokeh interactive capture" style="width: 80%; height: auto;">
&lt;/p>
&lt;h1 id="future-work">Future Work&lt;/h1>
&lt;p>We aim to package our interactive capture script into a Google Chrome extension that will:&lt;/p>
&lt;ul>
&lt;li>Temporarily store interaction session files in the browser’s local storage.&lt;/li>
&lt;li>Enable users to download captured files as a zip archive, using base64 encoding for images.&lt;/li>
&lt;/ul>
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>This past summer, we made significant strides in enhancing data visualization reproducibility.
Our innovative approach to embedding metadata directly into visualization files offers a streamlined method for recreating static visualizations.
The progress in capturing interactive visualization states opens new possibilities for tackling a long-standing challenge in the field of reproducibility.&lt;/p></description></item><item><title>Exploring Throttling Bugs in HDFS: Reproducing Developer Fixes</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240722-shuangliang/</link><pubDate>Mon, 22 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240722-shuangliang/</guid><description>&lt;p>Scalability is a critical concern for large-scale distributed systems like the Hadoop Distributed File System (HDFS). Throttling bugs, which affect the system&amp;rsquo;s ability to manage data transfer rates effectively, can lead to performance issues and system instability. In my recent work, I focused on reproducing the effects of two specific throttling bugs in HDFS, which were fixed by developers. This blog provides an overview of these bugs and the process of reproducing their effects to validate the fixes.&lt;/p>
&lt;h1 id="hdfs-17087-missing-throttler-in-dataxceiverreadblock">HDFS-17087: Missing Throttler in DataXceiver#readBlock&lt;/h1>
&lt;p>One of the throttling bugs I explored was HDFS-17087. The DataXceiver#readBlock function in HDFS lacked a throttler, resulting in unregulated data reads. This absence could lead to potential performance degradation under heavy loads. The developer fixed this issue by adding a throttler to regulate the data transfer rate. In my work, I reproduced the bug and observed the system&amp;rsquo;s behavior both before and after applying the developer&amp;rsquo;s patch. The results showed a significant improvement in stability and performance post-fix.&lt;/p>
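&lt;p>For intuition, a transfer throttler of this kind can be sketched in a few lines of Python. This is a simplified illustration of the idea, not the real DataTransferThrottler used in HDFS; the class name and rate below are made up:&lt;/p>

```python
import time

class SimpleThrottler:
    """Illustrative rate limiter: keeps cumulative throughput at or
    below bytes_per_sec by sleeping when the caller runs ahead."""

    def __init__(self, bytes_per_sec):
        self.bytes_per_sec = bytes_per_sec
        self.start = time.monotonic()
        self.sent = 0

    def throttle(self, num_bytes):
        self.sent += num_bytes
        elapsed = time.monotonic() - self.start
        # Time this many bytes *should* have taken at the target rate.
        expected = self.sent / self.bytes_per_sec
        if expected > elapsed:
            time.sleep(expected - elapsed)

# A read loop would call throttle() once per chunk it streams out.
throttler = SimpleThrottler(bytes_per_sec=10 * 1024 * 1024)
throttler.throttle(65536)
print(throttler.sent)  # 65536
```

&lt;p>Without such a call in the read path, a single heavy reader can saturate the disk and network, which is exactly the degradation HDFS-17087 describes.&lt;/p>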
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./Figure1.png" alt="Figure 1" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h1 id="hdfs-17216-incorrect-data-rate-calculation">HDFS-17216: Incorrect Data Rate Calculation&lt;/h1>
&lt;p>Another crucial bug was HDFS-17216. The issue stemmed from the use of integer division in the getBytesPerSec function, which caused incorrect speed calculations and failed to trigger the throttle, resulting in overspeed. The developer addressed this by switching from integer to float for calculating the elapsed time, ensuring accurate speed measurements. I reproduced the conditions that highlighted the bug&amp;rsquo;s effects and compared the system&amp;rsquo;s performance with and without the fix. The post-fix results confirmed that the throttling mechanism worked correctly, effectively preventing overspeed.&lt;/p>
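&lt;p>To make the truncation concrete, here is a hypothetical Python analogue of the two variants (the real code is Java inside HDFS; these function names are illustrative). With integer division, a 500 ms window is truncated to whole seconds, so the computed rate is wrong and the throttle check that depends on it misfires:&lt;/p>

```python
def bytes_per_sec_int(total_bytes, elapsed_ms):
    # Buggy variant: integer division truncates the elapsed time to
    # whole seconds, so sub-second windows yield a wrong rate.
    elapsed_sec = max(elapsed_ms // 1000, 1)
    return total_bytes // elapsed_sec

def bytes_per_sec_float(total_bytes, elapsed_ms):
    # Fixed variant: floating-point elapsed time keeps the rate exact.
    return total_bytes / (elapsed_ms / 1000.0)

# 64 MiB transferred in 500 ms is really 128 MiB/s:
print(bytes_per_sec_int(64 * 2**20, 500))    # 67108864  (reports 64 MiB/s)
print(bytes_per_sec_float(64 * 2**20, 500))  # 134217728.0  (128 MiB/s)
```

&lt;p>Because the buggy variant underreports the rate by half here, the throttler believes the transfer is within its limit and never engages, which matches the observed overspeed.&lt;/p>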
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./Figure2.png" alt="Figure 2" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>Reproducing these throttling bugs and validating the developer fixes was a vital step in understanding their impact on HDFS&amp;rsquo;s scalability. The improvements observed in system stability and performance underscore the importance of accurate throttling mechanisms. This work contributes to the broader effort of maintaining robust and scalable distributed systems, ensuring they can handle increasing loads efficiently.&lt;/p></description></item><item><title> Reproducibility in Data Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240718-aryas/</link><pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240718-aryas/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/arya-sarkar/">Arya Sarkar&lt;/a>, a machine learning engineer and researcher based out of Kolkata, a city in Eastern India dubbed the City of Joy.
For the last month and a half, I have been working closely with Professor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-koop/">David Koop&lt;/a> on the project titled &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/niu/repro-vis/">Reproducibility in Data Visualization&lt;/a>. I’m thrilled to be able to make my own little mark on this amazing project and to help explore solutions for capturing visualizations, in the hope of making reproducibility easier in this domain.&lt;/p>
&lt;h2 id="progress-and-challenges">Progress and Challenges&lt;/h2>
&lt;p>The last month and a half have mostly been spent exploring the best possible solutions for facilitating the reproducibility of STATIC visualizations from local sources and/or the web.
We have taken inspiration from existing work in the domain and successfully captured the meta-information required to regenerate visualizations faithfully from that metadata. The extracted metadata is saved into the generated .png figure of the visualization, allowing reproducibility as long as you have (a) the original dataset and (b) the generated .png of the visualization. All other information is stored inside the .png file as a JSON object and can be used to regenerate the original image with very high accuracy.&lt;/p>
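&lt;p>The regeneration side can be sketched as a function that dispatches on the embedded spec plus the original dataset (assuming Matplotlib; the spec keys and the regenerate helper are hypothetical illustrations of the approach, not our final format):&lt;/p>

```python
import json

import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

def regenerate(spec, data):
    """Rebuild a figure from an embedded spec and the original data."""
    fig, ax = plt.subplots()
    if spec["plot_type"] == "hist":
        ax.hist(data[spec["column"]], bins=spec["bins"])
    ax.set_title(spec.get("title", ""))
    return fig

# The spec would normally be read back out of the .png's metadata:
spec = json.loads('{"plot_type": "hist", "column": "sepal_length",'
                  ' "bins": 10, "title": "Sepal length"}')
data = {"sepal_length": [4.9, 5.0, 5.1, 5.8, 6.1, 6.3, 7.0]}
fig = regenerate(spec, data)
print(fig.axes[0].get_title())  # Sepal length
```

&lt;p>Pixel-level comparison of the regenerated figure against the original is then a straightforward way to score reproducibility.&lt;/p>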
&lt;p>The problem, however, remains with visualizations that involve randomness such as jitter. Capturing the randomness has not been fully successful so far, and we are looking into options to ensure the capture of plots that contain randomness.&lt;/p>
&lt;p>The following images highlight some results from our reproducibility experiments.
Original histogram using Matplotlib on the iris dataset:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="original_figure4" srcset="
/report/osre24/niu/repro-vis/20240718-aryas/original_histogram_hua04132746cb0ed26b86c32673b823c8f_29642_4d5ccda2a3e4409f5fb5bfccad4abae9.webp 400w,
/report/osre24/niu/repro-vis/20240718-aryas/original_histogram_hua04132746cb0ed26b86c32673b823c8f_29642_3d4477374e3469fd72bbb32675129816.webp 760w,
/report/osre24/niu/repro-vis/20240718-aryas/original_histogram_hua04132746cb0ed26b86c32673b823c8f_29642_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240718-aryas/original_histogram_hua04132746cb0ed26b86c32673b823c8f_29642_4d5ccda2a3e4409f5fb5bfccad4abae9.webp"
width="760"
height="468"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Reproduced histogram using meta-information from the original:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure4" srcset="
/report/osre24/niu/repro-vis/20240718-aryas/Reproduced_histogram_hub205e2d6c877abb784c35befc8616823_26597_9ca3975509f66dbedf2746a253660ec4.webp 400w,
/report/osre24/niu/repro-vis/20240718-aryas/Reproduced_histogram_hub205e2d6c877abb784c35befc8616823_26597_ca77d573979d523935009285864d087b.webp 760w,
/report/osre24/niu/repro-vis/20240718-aryas/Reproduced_histogram_hub205e2d6c877abb784c35befc8616823_26597_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240718-aryas/Reproduced_histogram_hub205e2d6c877abb784c35befc8616823_26597_9ca3975509f66dbedf2746a253660ec4.webp"
width="760"
height="490"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="the-next-steps">The next steps&lt;/h2>
&lt;p>We have already started looking into ways to capture visualizations from the web, i.e., from platforms such as ObservableHq, and to use these experiments as a stepping stone toward capturing interactive visualizations from the web.&lt;/p>
&lt;p>Capturing user interactions and all states in an interactive visualization can prove very useful, as this is a well-known pain point in the reproducibility community and a long-standing challenge. My next steps involve finding a solution for capturing these interactive visualizations, especially those living on the web, and ensuring their reproducibility.&lt;/p>
&lt;p>I am Shuang Liang, a third-year student studying Computer and Information Science at The Ohio State University. My passion lies in cloud computing and high-performance computing, areas I have explored extensively during my academic journey. I have participated in various projects and competitions, which have honed my technical skills and deepened my interest in distributed systems.&lt;/p>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/osu/scalerep">ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems&lt;/a> project, my &lt;a href="https://threadeater.github.io/files/Understanding_and_Addressing_Scalability_Bugs_in_Large_Scale_Distributed_Systems%20%281%29.pdf" target="_blank" rel="noopener">proposal&lt;/a>, developed under the mentorship of Professor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bogdan-bo-stoica/">Bogdan &amp;quot;Bo&amp;quot; Stoica&lt;/a>, aims to tackle the critical challenges posed by scalability bugs in systems like Cassandra, HDFS, and Hadoop. These bugs can lead to severe operational issues such as system downtime and data loss, particularly as systems scale up.&lt;/p>
&lt;p>The project goals include systematically analyzing and documenting scalability bugs, developing protocols to effectively trigger and quantify the impact of these bugs, and creating reproducible artifacts and detailed investigation scripts to aid in bug analysis.&lt;/p>
&lt;p>Our project will involve rigorous bug report analysis, reproduction of scalability bugs, and a comparative study of system behaviors before and after bug fixes. We aim to develop methodologies that enhance the reliability and performance of large-scale distributed systems, providing valuable insights and resources to the open-source community.&lt;/p>
&lt;p>Stay tuned to explore the future of reliable and scalable distributed systems!&lt;/p></description></item></channel></rss>