<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>osre24 | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/osre24/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/category/osre24/index.xml" rel="self" type="application/rss+xml"/><description>osre24</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Sat, 02 Nov 2024 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>osre24</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/osre24/</link></image><item><title>[Final Report] Automated Reproducibility Checklist support within StatWrap</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20241102-adi/</link><pubDate>Sat, 02 Nov 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20241102-adi/</guid><description>&lt;p>Namaste🙏🏻! I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/adi-akhilesh-singh/">Adi Akhilesh Singh&lt;/a>, and I&amp;rsquo;m excited to share my final updates on the &lt;a href="https://drive.google.com/file/d/1xV7eHL9lIWGKueQJxBks6OB_rcXCr8JY/view?usp=sharing" target="_blank" rel="noopener">Reproducibility Checklists project&lt;/a> by StatWrap, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>This project introduces customizable reproducibility checklists in StatWrap, enabling metadata-driven and user-guided generation of checklists. The goal is to enhance the reproducibility of research projects by providing researchers with a structured and comprehensive checklist to ensure their work is reproducible.&lt;/p>
&lt;h2 id="project-links">Project Links&lt;/h2>
&lt;p>Explore the StatWrap project repository and my contributions during GSoC &amp;rsquo;24:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/StatTag/StatWrap" target="_blank" rel="noopener">StatWrap&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/StatTag/StatWrap/tree/gsoc24" target="_blank" rel="noopener">GSoC &amp;rsquo;24 Contributions&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="progress-and-achievements">Progress And Achievements&lt;/h2>
&lt;p>During the timeline of this project, I worked on designing the interface for the checklist page and the data structure to support the project needs.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Checklist Interface" srcset="
/report/osre24/ucsc/statwrap/20241102-adi/interface_hu5405d4f4fe0fcc5c29037ce596b14456_175744_0e20d5ebd32af685d0d2ccea73085611.webp 400w,
/report/osre24/ucsc/statwrap/20241102-adi/interface_hu5405d4f4fe0fcc5c29037ce596b14456_175744_3b54d1a2b420d4de3f33b717849e243e.webp 760w,
/report/osre24/ucsc/statwrap/20241102-adi/interface_hu5405d4f4fe0fcc5c29037ce596b14456_175744_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20241102-adi/interface_hu5405d4f4fe0fcc5c29037ce596b14456_175744_0e20d5ebd32af685d0d2ccea73085611.webp"
width="760"
height="432"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
The interface was designed with user needs in mind, featuring components such as:&lt;/p>
&lt;ul>
&lt;li>URLs component to manage external links or file URIs attached to the project.&lt;/li>
&lt;li>Images component to display project image files.&lt;/li>
&lt;li>Checklist Notes component to manage user-added notes.&lt;/li>
&lt;/ul>
&lt;p>All these assets (files, URLs, images) can be added to each checklist statement using the existing assets and external resources (URLs) present in the project.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Add Asset Dialog" srcset="
/report/osre24/ucsc/statwrap/20241102-adi/addasset_huf5f93b812eac7fe9e6235b66e18b25cf_152586_046cbc4227c33853dde195b066b2af19.webp 400w,
/report/osre24/ucsc/statwrap/20241102-adi/addasset_huf5f93b812eac7fe9e6235b66e18b25cf_152586_2076fca94822416fc8dfb0806ae54833.webp 760w,
/report/osre24/ucsc/statwrap/20241102-adi/addasset_huf5f93b812eac7fe9e6235b66e18b25cf_152586_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20241102-adi/addasset_huf5f93b812eac7fe9e6235b66e18b25cf_152586_046cbc4227c33853dde195b066b2af19.webp"
width="760"
height="432"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Additionally, for each checklist item, StatWrap runs relevant scans to provide meaningful data based on its requirements. For example, for the item “All the software dependencies for the project are documented,” StatWrap scans project files to list the detected languages and dependencies.
For each checklist statement supported in StatWrap, we implemented methods that retrieve specific information by scanning project data. StatWrap currently supports six such checklist statements, identified as foundational for ensuring research reproducibility.
Additionally, the checklist can be exported as a PDF summary, generated by StatWrap using the checklist data, with options to include notes.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Checklist Report" srcset="
/report/osre24/ucsc/statwrap/20241102-adi/report_hu7a01407dc27c71052bc56e4eb6e3d4fb_270768_06eff861f558dd904b00349a9a2d2717.webp 400w,
/report/osre24/ucsc/statwrap/20241102-adi/report_hu7a01407dc27c71052bc56e4eb6e3d4fb_270768_70ef332c2c3871d3a097995a59a7dd65.webp 760w,
/report/osre24/ucsc/statwrap/20241102-adi/report_hu7a01407dc27c71052bc56e4eb6e3d4fb_270768_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20241102-adi/report_hu7a01407dc27c71052bc56e4eb6e3d4fb_270768_06eff861f558dd904b00349a9a2d2717.webp"
width="760"
height="432"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
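&lt;p>As an illustration of the dependency-scan idea, a language scan can be as simple as walking the project tree and bucketing files by extension. This is a sketch only: StatWrap itself is written in JavaScript, and the extension-to-language mapping and the &lt;code>detect_languages&lt;/code> helper below are hypothetical, not StatWrap's actual implementation.&lt;/p>

```python
# Illustrative sketch (not StatWrap's actual code): infer the languages used
# in a project by walking the project tree and counting file extensions.
from collections import Counter
from pathlib import Path

# Hypothetical mapping for illustration only.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python", ".r": "R", ".R": "R", ".sas": "SAS",
    ".do": "Stata", ".js": "JavaScript", ".ipynb": "Jupyter Notebook",
}

def detect_languages(project_root):
    """Return a Counter of languages detected under project_root."""
    counts = Counter()
    for path in Path(project_root).rglob("*"):
        if path.is_file():
            language = EXTENSION_TO_LANGUAGE.get(path.suffix)
            if language:
                counts[language] += 1
    return counts
```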
&lt;h2 id="future-prospects">Future Prospects&lt;/h2>
&lt;p>As the project concludes, several areas for growth have emerged:&lt;/p>
&lt;ul>
&lt;li>Expanding language support within StatWrap. While StatWrap already includes key languages used in research, there is always scope to extend compatibility to cover even more technologies.&lt;/li>
&lt;li>Options to export a data-extensive report that includes checklists and their associated scan results.&lt;/li>
&lt;/ul>
&lt;p>These and other enhancements, like adding new checklist statements with their scanning methods, will extend StatWrap’s impact on reproducibility in research.&lt;/p>
&lt;h2 id="earlier-blogs">Earlier Blogs&lt;/h2>
&lt;p>If you’re interested in seeing the project’s evolution, check out my earlier posts:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240614-adi/">Intro Blog&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240916-adi/">MidTerm Blog&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Thank you for reading!&lt;/p></description></item><item><title>[MidTerm] StatWrap: Automated Reproducibility Checklists Generation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240916-adi/</link><pubDate>Mon, 16 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240916-adi/</guid><description>&lt;p>Namaste🙏🏻! I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/adi-akhilesh-singh/">Adi Akhilesh Singh&lt;/a>, and I&amp;rsquo;m excited to share progress updates on the &lt;a href="https://drive.google.com/file/d/1xV7eHL9lIWGKueQJxBks6OB_rcXCr8JY/view?usp=sharing" target="_blank" rel="noopener">Reproducibility Checklists project&lt;/a> by StatWrap, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>The project aims to integrate customizable reproducibility checklists into StatWrap, using metadata and user input to automate their generation. The goal is to enhance the reproducibility of research projects by providing researchers with structured and comprehensive checklists to ensure their work is reproducible.&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;p>Over the past few months, my mentors and I have worked on developing the interface for the checklists page and designed key components to support our project goals. We’ve implemented logic that iterates over each checklist item, displaying its statement along with Boolean controls (Yes/No buttons) for user interaction.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Checklists Page" srcset="
/report/osre24/ucsc/statwrap/20240916-adi/checklist1_hu49ca38eb7e3448bf4ed2dfab22f3668a_108784_5fcb2c29a07fa3c85a9668932f8201f8.webp 400w,
/report/osre24/ucsc/statwrap/20240916-adi/checklist1_hu49ca38eb7e3448bf4ed2dfab22f3668a_108784_0a0b05429d19c5fac3cf4bd2f233cd58.webp 760w,
/report/osre24/ucsc/statwrap/20240916-adi/checklist1_hu49ca38eb7e3448bf4ed2dfab22f3668a_108784_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240916-adi/checklist1_hu49ca38eb7e3448bf4ed2dfab22f3668a_108784_5fcb2c29a07fa3c85a9668932f8201f8.webp"
width="760"
height="416"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>We’ve also developed components to display attached images and URLs linked to each checklist item. Additionally, we’ve integrated a notes feature that allows users to add, edit, and view project-related notes. Currently, we are writing methods to integrate real-time project data into the checklists. For example, one method we’ve implemented scans project files (assets) to detect the languages used.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Checklists Details" srcset="
/report/osre24/ucsc/statwrap/20240916-adi/checklist2_hu44aad6e1d0078aeaafbbf946cadf1130_201385_ed67e0f337158d4ded72db604d4b14df.webp 400w,
/report/osre24/ucsc/statwrap/20240916-adi/checklist2_hu44aad6e1d0078aeaafbbf946cadf1130_201385_c2fdd3677c5b7099074f7846753c15ff.webp 760w,
/report/osre24/ucsc/statwrap/20240916-adi/checklist2_hu44aad6e1d0078aeaafbbf946cadf1130_201385_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240916-adi/checklist2_hu44aad6e1d0078aeaafbbf946cadf1130_201385_ed67e0f337158d4ded72db604d4b14df.webp"
width="760"
height="416"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="whats-next">What&amp;rsquo;s Next?&lt;/h2>
&lt;p>As we move closer to the final evaluation phase, our focus will be on the following objectives:&lt;/p>
&lt;ul>
&lt;li>Implement methods for each checklist item, integrating real-time project data to auto-populate checklist answers.&lt;/li>
&lt;li>Enhance the &lt;code>Attached Images&lt;/code> component to allow users to select and attach existing image assets from the project.&lt;/li>
&lt;li>Display the results of the scans for each checklist item, providing users with detailed outputs based on the automated analysis.&lt;/li>
&lt;/ul>
&lt;p>Stay tuned for further updates as we continue developing this feature set! 🚀&lt;/p></description></item><item><title>Final Blog: BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240829-qianru/</link><pubDate>Thu, 29 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240829-qianru/</guid><description>&lt;p>Hello! I&amp;rsquo;m Qianru! I have been contributing to the BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking project under the mentorship of Ziheng Duan. My project aims to provide a standardized, easily accessible evaluation framework for gene imputation in spatial transcriptomics.&lt;/p>
&lt;h1 id="motivation-and-overview">Motivation and Overview&lt;/h1>
&lt;p>The &amp;ldquo;BenchmarkST&amp;rdquo; project was driven by the need to address a critical challenge in spatial transcriptomics: the impact of sparse data on downstream tasks, such as spatial domain identification. Sparse data can significantly degrade the performance of these tasks. For example, in a 10X Visium dataset of human brain Dorsolateral Prefrontal Cortex (DLPFC), using the complete dataset with GraphST (a state-of-the-art clustering method) for clustering resulted in an ARI (Adjusted Rand Index) of 0.6347. However, when using only 20% of the data—a common scenario—the performance dropped dramatically to 0.1880. This stark difference highlights the importance of effective gene imputation, which can help restore the lost information and improve the accuracy of downstream analyses.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="fig1" srcset="
/report/osre24/uci/benchmarkst/20240829-qianru/fig1_hu72c585df7604f28a748aa64a85602fac_159578_1bdac9436ddd84b83023a2cd20d76fb3.webp 400w,
/report/osre24/uci/benchmarkst/20240829-qianru/fig1_hu72c585df7604f28a748aa64a85602fac_159578_8a97a3a52a0fad3fb5d2dbf596e883a9.webp 760w,
/report/osre24/uci/benchmarkst/20240829-qianru/fig1_hu72c585df7604f28a748aa64a85602fac_159578_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240829-qianru/fig1_hu72c585df7604f28a748aa64a85602fac_159578_1bdac9436ddd84b83023a2cd20d76fb3.webp"
width="760"
height="496"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
To tackle this issue, the BenchmarkST project led to the creation of the Impeller package. This package provides a standardized, easily accessible evaluation framework for gene imputation in spatial transcriptomics, offering preprocessed datasets, reproducible evaluation methods, and flexible inference interfaces. It spans different platforms, species, and organs, aiming to enhance the integrity and usability of spatial transcriptomics data.&lt;/p>
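&lt;p>For readers unfamiliar with the metric, the ARI values quoted above compare predicted clusters against ground-truth spatial domains. They can be computed with scikit-learn; the labels below are toy data, not the DLPFC results:&lt;/p>

```python
# Minimal sketch: computing the Adjusted Rand Index (ARI) used above to
# quantify clustering quality. Labels here are toy data for illustration.
from sklearn.metrics import adjusted_rand_score

true_domains = [0, 0, 1, 1, 2, 2]   # ground-truth spatial domains
predicted    = [0, 0, 1, 2, 2, 2]   # clusters from a method such as GraphST

ari = adjusted_rand_score(true_domains, predicted)
print(round(ari, 4))
```

An ARI of 1.0 means perfect agreement, while values near 0 indicate chance-level clustering, which is why the drop from 0.6347 to 0.1880 under sparsity is so significant.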
&lt;h1 id="what-was-accomplished">What Was Accomplished&lt;/h1>
&lt;h2 id="development-of-the-impeller-package">Development of the Impeller Package&lt;/h2>
&lt;h4 id="data-aggregation-and-preprocessing">Data Aggregation and Preprocessing:&lt;/h4>
&lt;p>We aggregated and preprocessed spatial transcriptomic datasets from multiple platforms (10X Visium, StereoSeq, SlideSeqV2), species (human, mouse), and organs (Dorsolateral Prefrontal Cortex, olfactory bulb). These datasets are readily available for download within the package.&lt;/p>
&lt;h4 id="unified-evaluation-framework">Unified Evaluation Framework:&lt;/h4>
&lt;p>A reproducible framework was developed, integrating methods such as K-Nearest Neighbors (KNN) and the deep learning-based Impeller method, enabling users to easily evaluate the performance of different gene imputation techniques.&lt;/p>
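&lt;p>Conceptually, the KNN baseline estimates a masked spot's expression from its spatial neighbors. A minimal sketch of that idea follows; the &lt;code>knn_impute&lt;/code> function and its signature are illustrative and not part of the Impeller API:&lt;/p>

```python
# Illustrative sketch of KNN-based gene imputation over spatial coordinates:
# a masked spot's expression is estimated as the mean expression of its k
# nearest observed neighbors. Conceptual only; not Impeller's actual code.
import numpy as np
from scipy.spatial import cKDTree

def knn_impute(coords, expression, masked_idx, k=3):
    """Impute expression for masked spots from their k nearest observed spots."""
    observed = np.setdiff1d(np.arange(len(coords)), masked_idx)
    tree = cKDTree(coords[observed])      # index only the observed spots
    imputed = expression.copy()
    for i in masked_idx:
        _, nn = tree.query(coords[i], k=k)
        imputed[i] = expression[observed[np.atleast_1d(nn)]].mean(axis=0)
    return imputed
```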
&lt;h4 id="inference-interfaces">Inference Interfaces:&lt;/h4>
&lt;p>We provided interfaces that allow users to apply gene imputation on custom datasets, offering the flexibility to predict any gene in any cell, maximizing the utility for diverse research needs.&lt;/p>
&lt;h2 id="code-contributions-and-documentation">Code Contributions and Documentation&lt;/h2>
&lt;h4 id="repository">Repository:&lt;/h4>
&lt;p>All code related to the Impeller package has been committed to the &lt;a href="https://pypi.org/project/impeller/0.1.2/#files" target="_blank" rel="noopener">Impeller&lt;/a> repository.&lt;/p>
&lt;h4 id="link-to-versions">Link to Versions:&lt;/h4>
&lt;p>&lt;a href="https://pypi.org/project/impeller/0.1.2/#history" target="_blank" rel="noopener">Here&lt;/a> you can find all the versions made during the project, with detailed descriptions of each change.&lt;/p>
&lt;h4 id="readmemdhttpspypiorgprojectimpeller012description">&lt;a href="https://pypi.org/project/impeller/0.1.2/#description" target="_blank" rel="noopener">README.md&lt;/a>:&lt;/h4>
&lt;p>Detailed documentation on how to use the Impeller package, including installation instructions, usage examples, and explanations of the key components.&lt;/p></description></item><item><title>Final Blog: ML in Detecting and Addressing System Drift</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240829-joanna/</link><pubDate>Thu, 29 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240829-joanna/</guid><description>&lt;p>Hello! I&amp;rsquo;m Joanna! I have been contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last">ML in Detecting and Addressing System Drift&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a>. My project aims to design a pipeline to evaluate drift detection algorithms on system traces.&lt;/p>
&lt;h1 id="methodology">Methodology&lt;/h1>
&lt;p>Here is some background on my project: Model drift, or the degradation of model performance, is typically caused by data drift, which is a shift in the input distribution, and concept drift, which is a change in the relationship between input and output. This project focuses specifically on data drift, aiming to design a pipeline for evaluating drift detection algorithms on system traces. The goal is to benchmark different drift detection algorithms and have a better understanding of the features of system traces. The project is divided into two main parts: dataset construction and algorithm benchmarking.&lt;/p>
&lt;h3 id="part-1-dataset-construction">PART 1: Dataset Construction&lt;/h3>
&lt;p>To benchmark drift detection algorithms in system data, it&amp;rsquo;s important to recognize that system trace data is inherently different from other data types, often containing more noise, which can complicate detection efforts. Therefore, constructing a labeled dataset specific to system data is crucial. In our case, we utilize the Tencent I/O block trace data as the dataset. This raw data was processed to extract timestamps along with various features such as IOPS, write size ratio, read/write ratio, and so on, which were then used to create a data drift dataset.&lt;/p>
&lt;p>I constructed this dataset by labeling segments of the trace data as either exhibiting drift or not. To identify where the drift occurs and to help construct the dataset, I employed several offline drift detection algorithms, including Kolmogorov-Smirnov, Cramer-von Mises, KL-Divergence, and Jensen-Shannon Distance.&lt;/p>
&lt;p>To enhance the accuracy of the drift detection, especially in the presence of noise common in trace data, I applied additional preprocessing steps such as Fourier transform and moving average. These techniques help to smooth the data, making it easier to detect true drift signals. Finally, a voting strategy was used in combination with post-processing methods to build and refine the final datasets.&lt;/p>
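&lt;p>The smoothing-plus-testing step can be sketched as follows. The window size, significance level, and the use of a Kolmogorov-Smirnov test between adjacent windows are illustrative choices here, not the exact pipeline configuration:&lt;/p>

```python
# Sketch of the offline-detection idea: smooth a noisy trace feature with a
# moving average, then compare adjacent windows with a two-sample KS test.
# Window size and alpha are illustrative, not the pipeline's actual settings.
import numpy as np
from scipy.stats import ks_2samp

def moving_average(x, w=10):
    """Smooth a 1-D signal with a simple moving average of width w."""
    return np.convolve(x, np.ones(w) / w, mode="valid")

def drift_between_windows(a, b, alpha=0.05):
    """Return True if the KS test rejects 'same distribution' at level alpha."""
    result = ks_2samp(a, b)
    return result.pvalue < alpha

# Synthetic IOPS-like trace with an abrupt mean shift halfway through.
rng = np.random.default_rng(0)
iops = np.concatenate([rng.normal(100, 5, 500), rng.normal(140, 5, 500)])
smooth = moving_average(iops)
mid = len(smooth) // 2
print(drift_between_windows(smooth[:mid], smooth[mid:]))  # prints True
```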
&lt;p>The first figure below illustrates the segments of IOPS where drift has been detected. The second figure shows the segments of data where no drift occurs.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Drift Data" srcset="
/report/osre24/anl/last/20240829-joanna/drift_hueed6613a6bb326df79ee6a6125caea30_218453_ed8c1284ad85bf6b4049e6c666e015b1.webp 400w,
/report/osre24/anl/last/20240829-joanna/drift_hueed6613a6bb326df79ee6a6125caea30_218453_8be8f041f7c86965792bf781e2489836.webp 760w,
/report/osre24/anl/last/20240829-joanna/drift_hueed6613a6bb326df79ee6a6125caea30_218453_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240829-joanna/drift_hueed6613a6bb326df79ee6a6125caea30_218453_ed8c1284ad85bf6b4049e6c666e015b1.webp"
width="715"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Non-Drift Data" srcset="
/report/osre24/anl/last/20240829-joanna/nondrift_hu71453c92de8e4df0dd4aefaf6b160e99_327249_ac2898dbe6747b2a53de6ee136def2e4.webp 400w,
/report/osre24/anl/last/20240829-joanna/nondrift_hu71453c92de8e4df0dd4aefaf6b160e99_327249_046f624ccca1c3537b820060909a7bd2.webp 760w,
/report/osre24/anl/last/20240829-joanna/nondrift_hu71453c92de8e4df0dd4aefaf6b160e99_327249_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240829-joanna/nondrift_hu71453c92de8e4df0dd4aefaf6b160e99_327249_ac2898dbe6747b2a53de6ee136def2e4.webp"
width="734"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="part-2-benchmark-drift-detection-algorithms">PART 2: Benchmark Drift Detection Algorithms&lt;/h3>
&lt;p>This part focuses on benchmarking the Jensen-Shannon and Wasserstein drift detection methods using system trace data. The evaluation metrics are categorized into three main areas:&lt;/p>
&lt;ol>
&lt;li>Detection Accuracy Metrics&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>True Positive Rate (Recall)&lt;/li>
&lt;li>True Negative Rate (Specificity)&lt;/li>
&lt;li>Precision&lt;/li>
&lt;li>F1-Score&lt;/li>
&lt;/ul>
&lt;ol start="2">
&lt;li>Detection Overhead Metrics&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>Time Taken: the computational time required to detect drifts, which is critical for practical use.&lt;/li>
&lt;/ul>
&lt;ol start="3">
&lt;li>Stability Metrics&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>False Positive Rate&lt;/li>
&lt;li>False Negative Rate&lt;/li>
&lt;/ul>
&lt;p>(Additional) Comparative Analysis:&lt;/p>
&lt;ul>
&lt;li>Accuracy Across Different Features: How well the detection algorithms perform when applied to various features within the system trace data.&lt;/li>
&lt;/ul>
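&lt;p>All of the accuracy and stability metrics listed above derive from the same four confusion-matrix counts (true/false positives and negatives over labeled segments); a minimal sketch:&lt;/p>

```python
# The evaluation metrics above all follow from confusion-matrix counts of
# detected vs. labeled drift segments. Example counts are illustrative.
def detection_metrics(tp, fp, tn, fn):
    recall      = tp / (tp + fn)          # True Positive Rate
    specificity = tn / (tn + fp)          # True Negative Rate
    precision   = tp / (tp + fp)
    f1          = 2 * precision * recall / (precision + recall)
    fpr         = fp / (fp + tn)          # stability: False Positive Rate
    fnr         = fn / (fn + tp)          # stability: False Negative Rate
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "f1": f1, "fpr": fpr, "fnr": fnr}

print(detection_metrics(tp=8, fp=2, tn=9, fn=1))
```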
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Jensen-Shannon Distance Results" srcset="
/report/osre24/anl/last/20240829-joanna/js-result_hufb49199342183a3232a30a04b1d40959_183762_a7d269c0f217c0b79c79d4f011f54fd9.webp 400w,
/report/osre24/anl/last/20240829-joanna/js-result_hufb49199342183a3232a30a04b1d40959_183762_06f6c204e8a4457868c4b2bc43fb7c28.webp 760w,
/report/osre24/anl/last/20240829-joanna/js-result_hufb49199342183a3232a30a04b1d40959_183762_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240829-joanna/js-result_hufb49199342183a3232a30a04b1d40959_183762_a7d269c0f217c0b79c79d4f011f54fd9.webp"
width="760"
height="607"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Wasserstein Distance Results" srcset="
/report/osre24/anl/last/20240829-joanna/wd-result_hufb49199342183a3232a30a04b1d40959_190137_36d4aa25ff595624ac289c635f82a085.webp 400w,
/report/osre24/anl/last/20240829-joanna/wd-result_hufb49199342183a3232a30a04b1d40959_190137_efd489c01a8f75e3d059b819fc51eb25.webp 760w,
/report/osre24/anl/last/20240829-joanna/wd-result_hufb49199342183a3232a30a04b1d40959_190137_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240829-joanna/wd-result_hufb49199342183a3232a30a04b1d40959_190137_36d4aa25ff595624ac289c635f82a085.webp"
width="760"
height="607"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h1 id="discussion">Discussion&lt;/h1>
&lt;p>The results clearly demonstrate that the Jensen-Shannon distance method outperforms the Wasserstein distance method in detecting drift. Additionally, the write size ratio proves to be a more effective feature for representing the variations in the data, offering a more nuanced understanding of the underlying changes.&lt;/p>
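&lt;p>Both distances are available in SciPy. A toy comparison on two synthetic windows (the data, bin count, and magnitudes are illustrative, not the benchmark's actual traces):&lt;/p>

```python
# Sketch of the two compared distances on toy windows. jensenshannon operates
# on probability vectors (histograms); wasserstein_distance on raw samples.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
window_a = rng.normal(100, 5, 1000)   # e.g. a feature before a shift
window_b = rng.normal(120, 5, 1000)   # the same feature after a shift

# Shared bin edges so the two histograms are comparable.
bins = np.histogram_bin_edges(np.concatenate([window_a, window_b]), bins=30)
p, _ = np.histogram(window_a, bins=bins, density=True)
q, _ = np.histogram(window_b, bins=bins, density=True)

js = jensenshannon(p, q)                       # Jensen-Shannon distance
wd = wasserstein_distance(window_a, window_b)  # roughly the mean shift here
print(js, wd)
```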
&lt;h1 id="conclusion-and-next-steps">Conclusion and Next Steps&lt;/h1>
&lt;p>In conclusion, this project establishes a pipeline that encompasses data labeling, data processing, and the benchmarking of drift detection algorithms. This serves as a first step toward detecting drift in system data.&lt;/p>
&lt;p>There is significant potential for further improvement. Future work should focus on enhancing dataset construction by incorporating large language models (LLMs) and other advanced techniques to further clean and refine the datasets. Additionally, the evaluation of drift detection methods should be expanded beyond the current benchmarks, which only include two statistical methods. Incorporating additional statistical methods, as well as machine learning (ML) and deep learning (DL) approaches, could provide a more comprehensive analysis. Furthermore, exploring a broader range of evaluation metrics will ensure a more robust and accurate assessment of drift detection performance. These steps will help to advance the accuracy and reliability of drift detection in system trace data.&lt;/p>
&lt;h1 id="deliverables">Deliverables&lt;/h1>
&lt;p>The following are the deliverables of this project:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://www.chameleoncloud.org/experiment/share/384ee2bd-853c-427b-877b-3af2993fb502" target="_blank" rel="noopener">Trovi Artifact&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/JoannaCCJH/drift-detection-OSRE24" target="_blank" rel="noopener">GitHub Repository&lt;/a>: This repository contains the code for generating labeled drift datasets and notebooks with benchmarking results.&lt;/li>
&lt;/ul></description></item><item><title>Midterm Blog: ML in Detecting and Addressing System Drift</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240721-joanna/</link><pubDate>Sun, 21 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240721-joanna/</guid><description>&lt;p>Hello! I&amp;rsquo;m Joanna! Over the past month, I have been contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last">ML in Detecting and Addressing System Drift&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a>. My project aims to design a pipeline to evaluate drift detection algorithms on system traces. The goal is to characterize different drifts, understand how they affect model performance, and evaluate the performance of state-of-the-art (SOTA) drift detection algorithms.&lt;/p>
&lt;h1 id="progress">Progress&lt;/h1>
&lt;p>Here is some background on my project: Model drift, or the degradation of model performance, is typically caused by data drift, which is a shift in the input distribution, and concept drift, which is a change in the relationship between input and output. The project aims to detect both data drifts and concept drifts, analyze these drifts, and try to improve model performance in computer systems.&lt;/p>
&lt;p>Over the past month, I’ve primarily been constructing a data drift dataset from the Tencent I/O block trace, which includes both drift and non-drift data. By combining offline drift detection algorithms such as Maximum Mean Discrepancy, Cramér-von Mises, and Kolmogorov-Smirnov, I am developing a dataset that contains segments with and without drifts for features such as IOPS (Input/Output Operations Per Second), read/write size ratio, write size, and other relevant performance metrics. The diagrams below illustrate the data segments identified with and without drifts, respectively.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Drift Data" srcset="
/report/osre24/anl/last/20240721-joanna/drift_hucf4bc0bd843fb60be6a646f8116e435c_702790_3e192a352fe3c303df3195f4c92fe970.webp 400w,
/report/osre24/anl/last/20240721-joanna/drift_hucf4bc0bd843fb60be6a646f8116e435c_702790_174dedff6f5c0b72b3d10b82cb9d1a86.webp 760w,
/report/osre24/anl/last/20240721-joanna/drift_hucf4bc0bd843fb60be6a646f8116e435c_702790_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240721-joanna/drift_hucf4bc0bd843fb60be6a646f8116e435c_702790_3e192a352fe3c303df3195f4c92fe970.webp"
width="760"
height="752"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Non-Drift Data" srcset="
/report/osre24/anl/last/20240721-joanna/nondrift_hue8c599cc81a10cc02482d676af5d2cf8_455606_3d133279bb8e32779c8b94396ec0ef5d.webp 400w,
/report/osre24/anl/last/20240721-joanna/nondrift_hue8c599cc81a10cc02482d676af5d2cf8_455606_c450f1394af9661bb2d86f0232d340d1.webp 760w,
/report/osre24/anl/last/20240721-joanna/nondrift_hue8c599cc81a10cc02482d676af5d2cf8_455606_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240721-joanna/nondrift_hue8c599cc81a10cc02482d676af5d2cf8_455606_3d133279bb8e32779c8b94396ec0ef5d.webp"
width="760"
height="757"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>In addition to constructing the datasets, I have begun evaluating some online drift detection algorithms and designing metrics to assess their performance. I have tested the performance of online drift detection algorithms such as Online Maximum Mean Discrepancy and Online Cramér-von Mises under various settings, including different window lengths and sensitivity levels. The following diagrams illustrate the drift points detected for the IOPS feature under these different settings.&lt;/p>
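&lt;p>A generic sliding-window online detector can be sketched as below. Here a KS test stands in for the Online Maximum Mean Discrepancy and Online Cramér-von Mises detectors actually evaluated, and the window length and threshold are exactly the "settings" varied in the experiments; the values shown are illustrative:&lt;/p>

```python
# Generic sketch of online, sliding-window drift detection: keep a reference
# window, slide a test window over the stream, and flag drift when a KS test
# rejects. A KS test stands in for Online MMD / Online CvM; window length
# and alpha are illustrative settings.
import numpy as np
from scipy.stats import ks_2samp

def online_detect(stream, window=100, alpha=0.01):
    """Return stream indices (window ends) where drift is flagged."""
    reference = stream[:window]
    flagged = []
    for end in range(2 * window, len(stream) + 1, window):
        test = stream[end - window:end]
        if ks_2samp(reference, test).pvalue < alpha:
            flagged.append(end)
            reference = test          # reset the reference after a drift
    return flagged

# Synthetic stream with a genuine shift at index 400.
rng = np.random.default_rng(2)
stream = np.concatenate([rng.normal(0, 1, 400), rng.normal(3, 1, 400)])
print(online_detect(stream))
```

Smaller windows react faster but raise the false-alarm rate, which is the accuracy/overhead trade-off the evaluation above measures.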
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Evaluation" srcset="
/report/osre24/anl/last/20240721-joanna/evaluation_hu21317d2d7888d01b0cbee9e7a13940af_724895_51a49130183b2dfa7ad977d297aa0f3b.webp 400w,
/report/osre24/anl/last/20240721-joanna/evaluation_hu21317d2d7888d01b0cbee9e7a13940af_724895_a3c8bf9bdc160fc4323a73bc0ac837b7.webp 760w,
/report/osre24/anl/last/20240721-joanna/evaluation_hu21317d2d7888d01b0cbee9e7a13940af_724895_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240721-joanna/evaluation_hu21317d2d7888d01b0cbee9e7a13940af_724895_51a49130183b2dfa7ad977d297aa0f3b.webp"
width="760"
height="584"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;p>Here are my plans for the next month:&lt;/p>
&lt;ul>
&lt;li>Complete the experiments on data drift and generate improved visualizations to summarize the performance of these online drift detection algorithms, including their overhead and accuracy over time.&lt;/li>
&lt;li>Characterize drifts by identifying the types of drifts that lead to model performance degradation.&lt;/li>
&lt;li>Evaluate drift detection algorithms in the context of concept drifts.&lt;/li>
&lt;/ul>
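&lt;p>For the planned accuracy summaries, one simple way to score a detector is by its detection delay for each true drift onset and the number of false alarms raised before any drift occurs. The helper below is a hedged sketch: the function name and the matching rule (each onset is credited with the first alarm at or after it) are my own simplifications, not the project&amp;rsquo;s final metrics.&lt;/p>

```python
def drift_metrics(detected, true_onsets):
    """Score detected drift indices against known drift onsets.

    Simplified rule (an assumption for illustration): each onset is
    credited with the first alarm at or after it; alarms before the
    first onset count as false alarms; a missed onset yields None.
    """
    detected = sorted(detected)
    first_onset = min(true_onsets)
    delays = []
    for onset in sorted(true_onsets):
        hits = [d for d in detected if d >= onset]
        delays.append(hits[0] - onset if hits else None)
    false_alarms = sum(1 for d in detected if d < first_onset)
    return delays, false_alarms

# Example: alarms at 322, 323, and 900 against true onsets at 300 and 850.
delays, fa = drift_metrics([322, 323, 900], [300, 850])
```

&lt;p>Here the detector has delays of 22 and 50 samples for the two onsets and no false alarms; averaging such delays over runs gives one axis of the accuracy-versus-overhead comparison.&lt;/p>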
&lt;p>Stay tuned for my future updates on this project!&lt;/p></description></item><item><title>StatWrap: Automated Reproducibility Checklists Generation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240614-adi/</link><pubDate>Fri, 14 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240614-adi/</guid><description>&lt;p>Namaste🙏🏻! I am &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/adi-akhilesh-singh/">Adi Akhilesh Singh&lt;/a>, currently pursuing a degree in Computer Science and Engineering at IIT(BHU). This summer, I will be working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/northwestern/statwrap/">StatWrap: Automated Reproducibility Checklists Generation&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>. You can view my &lt;a href="https://drive.google.com/file/d/1xV7eHL9lIWGKueQJxBks6OB_rcXCr8JY/view?usp=sharing" target="_blank" rel="noopener">project proposal&lt;/a> for more details.&lt;/p>
&lt;p>My project aims to integrate customizable reproducibility checklists into StatWrap, using metadata and user input to automate their generation. The goal is to enhance the reproducibility of research projects by providing researchers with structured and comprehensive checklists to ensure their work is reproducible.&lt;/p>
&lt;p>Stay tuned for updates on my progress in the coming weeks! 🚀&lt;/p></description></item><item><title>Optimizing Scientific Data Streaming: Developing Reproducible Benchmarks for High-Speed Memory-to-Memory Data Transfer over SciStream</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/scistream/20240613-kraislaik/</link><pubDate>Thu, 13 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/scistream/20240613-kraislaik/</guid><description>&lt;p>Hello, I am Acheme, currently a PhD student in Computer Engineering at Clemson University. I will be working on &lt;a href="https://github.com/ucsc-ospo/ucsc-ospo.github.io/blob/main/content/project/osre24/anl/scistream/index.md" target="_blank" rel="noopener">SciStream&lt;/a>, mentored by &lt;a href="https://github.com/ucsc-ospo/ucsc-ospo.github.io/blob/main/content/authors/chungmiranda/_index.md" target="_blank" rel="noopener">Joaquin Chung&lt;/a> and &lt;a href="https://github.com/ucsc-ospo/ucsc-ospo.github.io/blob/main/content/authors/fcastro/_index.md" target="_blank" rel="noopener">Flavio Castro&lt;/a> over this summer. Here is my &lt;a href="https://docs.google.com/document/d/1w78mE484kfDmWygPCausv6aZxUQwGeI07ohCjvE3TYk" target="_blank" rel="noopener">proposal&lt;/a> for this project.&lt;/p>
&lt;p>This project aims to develop SciStream-bench, a set of benchmarks and artifacts designed to precisely evaluate the performance of scientific streaming applications across diverse traffic patterns when running over the SciStream framework.&lt;/p>
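&lt;p>To make the benchmarking idea concrete, here is a minimal memory-to-memory throughput measurement over a loopback TCP connection. This is only a generic sketch of the kind of measurement involved; it is not SciStream-bench itself and does not use SciStream&amp;rsquo;s API, and the chunk count and chunk size are illustrative assumptions.&lt;/p>

```python
import socket
import threading
import time

def _serve(server, chunks, chunk_size):
    """Accept one connection and stream chunks of zero bytes to it."""
    conn, _ = server.accept()
    with conn:
        payload = b"\0" * chunk_size
        for _ in range(chunks):
            conn.sendall(payload)

def benchmark(chunks=256, chunk_size=1 << 16):
    """Measure memory-to-memory TCP throughput over loopback, in MB/s."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))  # ephemeral port
    server.listen(1)
    port = server.getsockname()[1]
    t = threading.Thread(target=_serve, args=(server, chunks, chunk_size))
    t.start()

    total = chunks * chunk_size
    received = 0
    client = socket.socket()
    client.connect(("127.0.0.1", port))
    start = time.perf_counter()
    while received < total:
        data = client.recv(1 << 16)
        if not data:
            break
        received += len(data)
    elapsed = time.perf_counter() - start

    client.close()
    t.join()
    server.close()
    return received / elapsed / 1e6  # MB/s

rate = benchmark()
```

&lt;p>A real benchmark would additionally vary message sizes and traffic patterns and run across the SciStream control and data planes rather than a single loopback socket, but the core measurement loop looks much like this.&lt;/p>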
&lt;p>I am excited to meet everyone and contribute to this project!&lt;/p></description></item><item><title>LAST: ML in Detecting and Addressing System Drift</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240611-joanna/</link><pubDate>Tue, 11 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240611-joanna/</guid><description>&lt;p>Hello! I am Joanna, currently an undergraduate student studying Computer Science and Applied Mathematics and Statistics at Johns Hopkins University. I will be working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last">ML in Detecting and Addressing System Drift&lt;/a>, mentored by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a> over this summer. Here is my &lt;a href="https://drive.google.com/file/d/10RJhuOBMjIKQSg1PklL3ukDlBHUjdt2i/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> for this project.&lt;/p>
&lt;p>This project aims to build a data analysis pipeline to analyze various datasets, both system and non-system, that have shown notable changes over time. The goal is to understand the characteristics of these datasets (specifically their drifts), evaluate the efficacy of aging detection algorithms, and identify their limitations in computer system tasks.&lt;/p>
&lt;p>I am excited to meet everyone and contribute to this project!&lt;/p></description></item></channel></rss>