<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>machine learning | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/machine-learning/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/machine-learning/index.xml" rel="self" type="application/rss+xml"/><description>machine learning</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Wed, 21 Jan 2026 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>machine learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/machine-learning/</link></image><item><title>CauST: Causal Gene Intervention for Robust Spatial Domain Identification</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/caust/</link><pubDate>Wed, 21 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/caust/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> spatial transcriptomics, spatial domain identification, causal inference, gene intervention&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong> Python (PyTorch preferred)&lt;/li>
&lt;li>&lt;strong>Machine Learning:&lt;/strong> causal inference, representation learning, clustering&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong> spatial transcriptomics preprocessing and evaluation (ARI, cross-slice generalization)&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (preferred):&lt;/strong> spatial transcriptomics, scRNA-seq, gene perturbation analysis&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/lijinghua-zhang/">Lijinghua Zhang&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Spatial domain identification is a core task in spatial transcriptomics (ST), aiming to segment tissue sections into biologically meaningful regions based on spatially resolved gene expression profiles. These spatial domains often correspond to anatomical layers, functional niches, or microenvironmental states, and are widely used as the basis for downstream biological interpretation.&lt;/p>
&lt;p>Despite strong empirical performance, most existing spatial domain identification methods rely on &lt;strong>purely correlational gene signals&lt;/strong>. Genes are selected or weighted based on association with spatial patterns, without distinguishing whether they &lt;em>causally drive&lt;/em> domain formation or merely reflect downstream or confounded effects. As a result, current models often suffer from limited robustness and poor generalization across tissue sections or donors.&lt;/p>
&lt;h3 id="problem-correlation-driven-gene-usage-and-limited-generalization">&lt;strong>Problem: Correlation-Driven Gene Usage and Limited Generalization&lt;/strong>&lt;/h3>
&lt;p>In standard pipelines, gene expression features are typically used wholesale or filtered using heuristic criteria (e.g., highly variable genes). However, many genes that are strongly correlated with spatial domains are not causally responsible for domain structure. Including such non-causal or confounded genes can:&lt;/p>
&lt;ul>
&lt;li>Reduce robustness across slices and donors&lt;/li>
&lt;li>Obscure true domain-driving biological signals&lt;/li>
&lt;li>Limit interpretability of spatial domain assignments&lt;/li>
&lt;/ul>
&lt;p>Empirically, domain identification performance often degrades substantially in cross-slice or cross-donor evaluation settings, underscoring the need for causally informed feature selection.&lt;/p>
&lt;h3 id="proposed-solution-caust">&lt;strong>Proposed Solution: CauST&lt;/strong>&lt;/h3>
&lt;p>This project proposes &lt;strong>CauST&lt;/strong>, a &lt;strong>Causal Gene Intervention framework&lt;/strong> for robust spatial domain identification.&lt;/p>
&lt;p>CauST aims to identify &lt;strong>domain-driving genes&lt;/strong> by estimating their causal influence on spatial domain assignments via &lt;strong>in-silico gene interventions&lt;/strong>. Instead of relying on observational correlations, CauST approximates counterfactual gene knockouts by perturbing individual gene expressions while controlling for confounding factors.&lt;/p>
&lt;p>In addition, CauST leverages &lt;strong>cross-slice invariance&lt;/strong> as a practical criterion for causal gene discovery, prioritizing genes whose effects on spatial domain identification remain stable across tissue sections and donors.&lt;/p>
&lt;p>By filtering or reweighting genes based on estimated causal influence, CauST improves the robustness, generalizability, and interpretability of spatial domain identification models.&lt;/p>
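&lt;p>To make the intervention idea concrete, the sketch below estimates the effect of a gene via an in-silico knockout, assuming a fitted domain identification model that exposes a predict method; all names are illustrative, not part of a finalized CauST API:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def causal_effect_score(model, X, gene_idx):
    """Estimate a gene's influence on spatial domain assignments via an
    in-silico knockout: zero the gene's expression and measure how many
    spots change their predicted domain.

    X: (n_spots, n_genes) expression matrix; model: any fitted domain
    identification model with a predict(X) method returning labels.
    """
    baseline = model.predict(X)
    X_ko = X.copy()
    X_ko[:, gene_idx] = 0.0  # counterfactual knockout of one gene
    perturbed = model.predict(X_ko)
    # Fraction of spots whose domain assignment flips under knockout.
    return float(np.mean(baseline != perturbed))

# Rank genes by estimated effect (higher = more domain-driving); genes
# whose scores stay stable across slices would be prioritized by CauST.
# scores = [causal_effect_score(model, X, g) for g in range(X.shape[1])]
&lt;/code>&lt;/pre>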
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Causal Gene Effect Estimation&lt;/strong>
&lt;ul>
&lt;li>Design in-silico intervention strategies to estimate gene-level causal effects on spatial domain assignments.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Invariant Effect Analysis&lt;/strong>
&lt;ul>
&lt;li>Identify genes with stable effects across tissue sections or donors.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Causal Gene Filtering&lt;/strong>
&lt;ul>
&lt;li>Develop filtering or reweighting schemes based on estimated causal influence.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Integration with Existing Methods&lt;/strong>
&lt;ul>
&lt;li>Integrate CauST into state-of-the-art spatial domain identification pipelines.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Evaluation and Validation&lt;/strong>
&lt;ul>
&lt;li>Benchmark robustness, cross-slice generalization, and interpretability on public spatial transcriptomics datasets.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>CauST Framework Implementation&lt;/strong>
&lt;ul>
&lt;li>Open-source Python implementation compatible with common spatial transcriptomics toolchains.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Causal Gene Benchmarks&lt;/strong>
&lt;ul>
&lt;li>Quantitative evaluation of causal gene filtering and its impact on domain identification.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Visualization Tools&lt;/strong>
&lt;ul>
&lt;li>Tools for visualizing gene interventions, causal scores, and spatial effects.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation and Tutorials&lt;/strong>
&lt;ul>
&lt;li>Clear examples enabling adoption of CauST by the broader community.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>CauST introduces a causally grounded perspective to spatial domain identification by explicitly modeling gene-level interventions. By shifting from correlation-driven gene usage to causal gene selection, this project improves robustness, generalizability, and biological interpretability in spatial transcriptomics analysis. CauST has the potential to serve as a foundational framework for integrating causal reasoning into spatial omics representation learning.&lt;/p></description></item><item><title>Agent4Target: An Agent-based Evidence Aggregation Toolkit for Therapeutic Target Identification</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/agent4target/</link><pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/agent4target/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> therapeutic target identification, drug discovery, evidence aggregation, AI agents, biomedical knowledge integration&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong> Python; experience with modern ML tooling preferred&lt;/li>
&lt;li>&lt;strong>Machine Learning / AI:&lt;/strong> agent-based systems, workflow orchestration, weak supervision (basic), representation learning&lt;/li>
&lt;li>&lt;strong>Software Engineering:&lt;/strong> modular system design, APIs, CLI tools, documentation&lt;/li>
&lt;li>&lt;strong>Biomedical Knowledge (preferred):&lt;/strong> familiarity with drug–target databases (e.g., PHAROS, DepMap, Open Targets)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Identifying and prioritizing high-quality therapeutic targets is a foundational yet challenging task in drug discovery. Modern target identification relies on aggregating heterogeneous evidence from multiple sources, including genetic perturbation screens, disease associations, chemical biology, and biomedical literature. These evidence sources are highly fragmented, noisy, and heterogeneous in both format and reliability.&lt;/p>
&lt;p>While large language models and AI agents have recently shown promise in automating scientific workflows, many existing approaches focus on end-to-end prediction or conversational interfaces. Such systems are often difficult to reproduce, extend, or integrate into existing research pipelines, limiting their practical adoption by the biomedical community.&lt;/p>
&lt;p>This project proposes &lt;strong>Agent4Target&lt;/strong>, an &lt;strong>agent-based evidence aggregation toolkit&lt;/strong> that reframes therapeutic target identification as a &lt;strong>structured, modular workflow&lt;/strong>. Instead of using agents for free-form reasoning, Agent4Target employs agents as &lt;strong>orchestrated components&lt;/strong> that systematically collect, normalize, score, and explain evidence supporting candidate therapeutic targets.&lt;/p>
&lt;p>The goal is to deliver a &lt;strong>reusable, open-source toolchain&lt;/strong> that can be integrated into diverse drug discovery workflows, independent of any single downstream prediction model or publication.&lt;/p>
&lt;hr>
&lt;h3 id="key-idea-and-technical-approach">&lt;strong>Key Idea and Technical Approach&lt;/strong>&lt;/h3>
&lt;p>Agent4Target models target identification as a multi-stage, agent-driven pipeline, coordinated by a central orchestrator:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Evidence Collector Agents&lt;/strong>&lt;br>
Specialized agents retrieve target-level evidence from heterogeneous sources, such as:&lt;/p>
&lt;ul>
&lt;li>Genetic perturbation and dependency data (e.g., DepMap)&lt;/li>
&lt;li>Target annotation and development status (e.g., PHAROS)&lt;/li>
&lt;li>Disease association scores (e.g., Open Targets)&lt;/li>
&lt;li>Automatically summarized literature evidence&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Normalization &amp;amp; Scoring Agent&lt;/strong>&lt;br>
Collected evidence is converted into a unified, structured schema using typed data models (e.g., JSON / Pydantic).&lt;br>
This agent performs:&lt;/p>
&lt;ul>
&lt;li>Evidence normalization across sources&lt;/li>
&lt;li>Confidence-aware scoring and aggregation&lt;/li>
&lt;li>Optional weighting or calibration strategies&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Explanation Agent&lt;/strong>&lt;br>
Rather than free-text generation, this agent produces &lt;strong>structured explanations&lt;/strong> that explicitly link scores to supporting evidence, enabling transparency and interpretability for downstream users.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Workflow Orchestrator&lt;/strong>&lt;br>
A lightweight orchestration layer (e.g., LangGraph or a state-machine-based controller) manages agent execution, dependencies, and failure handling, ensuring reproducibility and extensibility.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>This modular design allows individual agents to be replaced, extended, or reused without altering the overall system.&lt;/p>
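&lt;p>As a concrete illustration of the unified schema idea, here is a minimal sketch using Pydantic; the field names and the aggregation rule are assumptions for illustration, not a finalized Agent4Target schema:&lt;/p>
&lt;pre>&lt;code class="language-python">from pydantic import BaseModel, Field

class EvidenceRecord(BaseModel):
    """One normalized piece of target-level evidence from any source."""
    target: str          # e.g., a gene symbol such as "EGFR"
    source: str          # e.g., "DepMap", "PHAROS", "OpenTargets"
    evidence_type: str   # e.g., "dependency", "disease_association"
    score: float = Field(ge=0.0, le=1.0)       # normalized to [0, 1]
    confidence: float = Field(ge=0.0, le=1.0)  # source reliability
    provenance: str      # URL or identifier, kept for explanations

def aggregate(records: list[EvidenceRecord]) -> float:
    """Confidence-weighted aggregation of normalized evidence scores."""
    total = sum(r.confidence for r in records)
    if total == 0:
        return 0.0
    return sum(r.score * r.confidence for r in records) / total
&lt;/code>&lt;/pre>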
&lt;hr>
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Design a Modular Agent-based Architecture&lt;/strong>
&lt;ul>
&lt;li>Define clear interfaces for evidence collection, normalization, scoring, and explanation agents.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Implement a Standardized Evidence Schema&lt;/strong>
&lt;ul>
&lt;li>Develop a unified data model for heterogeneous target-level evidence.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Build a Reproducible Orchestration Framework&lt;/strong>
&lt;ul>
&lt;li>Implement a deterministic, inspectable workflow for agent coordination.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Deliver a Community-Ready Toolkit&lt;/strong>
&lt;ul>
&lt;li>Provide CLI tools, example notebooks, and clear documentation to support adoption.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Benchmark and Case Studies&lt;/strong>
&lt;ul>
&lt;li>Demonstrate the toolkit on representative target identification scenarios using public datasets.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Open-Source Agent4Target Codebase&lt;/strong>
&lt;ul>
&lt;li>A well-documented Python package with modular agent components.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Command-Line Interface (CLI)&lt;/strong>
&lt;ul>
&lt;li>Tools for running end-to-end evidence aggregation pipelines.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Standardized Output Schema&lt;/strong>
&lt;ul>
&lt;li>Machine-readable evidence summaries suitable for downstream modeling.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Example Notebooks and Benchmarks&lt;/strong>
&lt;ul>
&lt;li>Demonstrations of usage and performance on real-world target identification tasks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation&lt;/strong>
&lt;ul>
&lt;li>Installation guides, extension tutorials, and developer documentation.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>Agent4Target provides a practical bridge between AI agents and real-world drug discovery workflows. By emphasizing structured evidence aggregation, reproducibility, and interpretability, this project enables researchers to systematically reason about therapeutic targets rather than relying on opaque, end-to-end models. The resulting toolkit can serve as a foundation for future work in AI-assisted drug discovery, weak supervision, and biomedical knowledge integration.&lt;/p></description></item><item><title>HistoMoE: A Histology-Guided Mixture-of-Experts Framework for Gene Expression Prediction</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/histomoe/</link><pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/histomoe/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> computational pathology, spatial transcriptomics, gene expression prediction, mixture-of-experts, multimodal learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong> Python; experience with PyTorch preferred&lt;/li>
&lt;li>&lt;strong>Machine Learning:&lt;/strong> CNNs / vision encoders, mixture-of-experts, multimodal representation learning&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong> handling large-scale histology image patches and gene expression matrices&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (preferred):&lt;/strong> familiarity with spatial transcriptomics or scRNA-seq data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Histology imaging is one of the most widely available data modalities in biomedical research and clinical practice, capturing rich morphological information about tissues and disease states. In parallel, spatial transcriptomics (ST) technologies provide spatially resolved gene expression measurements, enabling unprecedented insights into tissue organization and cellular heterogeneity. However, the high cost and limited accessibility of ST experiments remain a major barrier to their widespread adoption.&lt;/p>
&lt;p>Predicting gene expression directly from histology images offers a promising alternative, enabling molecular-level inference from routinely collected pathology data. Existing approaches typically rely on a single global model that maps image embeddings to gene expression profiles. While effective to some extent, these models struggle to capture the strong organ-, tissue-, and cancer-specific heterogeneity that underlies gene expression patterns.&lt;/p>
&lt;p>This project proposes &lt;strong>HistoMoE&lt;/strong>, a &lt;strong>histology-guided mixture-of-experts (MoE) framework&lt;/strong> that explicitly models biological heterogeneity by learning &lt;strong>specialized expert models&lt;/strong> for different cancer types or organs, and dynamically routing histology image patches to the most relevant experts.&lt;/p>
&lt;h3 id="key-idea-and-technical-approach">&lt;strong>Key Idea and Technical Approach&lt;/strong>&lt;/h3>
&lt;p>HistoMoE integrates multiple data modalities and learning components:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Vision Encoder&lt;/strong>&lt;br>
Histology image patches are encoded into high-dimensional visual representations using a convolutional or transformer-based vision backbone.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Text / Metadata Encoder&lt;/strong>&lt;br>
Sample-level metadata (e.g., tissue type, organ, disease context) is encoded using a lightweight text or embedding model.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Gating Network&lt;/strong>&lt;br>
A gating network jointly considers image and metadata embeddings to infer routing weights over multiple &lt;strong>cancer- or organ-specific expert models&lt;/strong>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Expert Models&lt;/strong>&lt;br>
Each expert specializes in modeling gene expression patterns for a specific biological context (e.g., CCRCC, COAD, LUAD), producing patch-level gene expression predictions.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>By explicitly modeling biological structure through expert specialization, HistoMoE aims to improve both &lt;strong>prediction accuracy&lt;/strong> and &lt;strong>interpretability&lt;/strong>, allowing researchers to understand which biological experts drive each prediction.&lt;/p>
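&lt;p>A minimal PyTorch sketch of this routing mechanism follows; the dimensions, module choices, and names are illustrative assumptions rather than the actual HistoMoE implementation:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import torch.nn as nn

class HistoMoE(nn.Module):
    """Toy mixture-of-experts head: a gating network routes each patch
    embedding, combined with a metadata embedding, over context-specific
    experts that predict gene expression."""

    def __init__(self, img_dim=512, meta_dim=64, n_genes=2000, n_experts=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(img_dim + meta_dim, 128), nn.ReLU(),
            nn.Linear(128, n_experts),
        )
        # One expert per biological context (e.g., CCRCC, COAD, LUAD).
        self.experts = nn.ModuleList(
            [nn.Linear(img_dim, n_genes) for _ in range(n_experts)]
        )

    def forward(self, img_emb, meta_emb):
        gate_in = torch.cat([img_emb, meta_emb], dim=-1)
        weights = torch.softmax(self.gate(gate_in), dim=-1)         # (B, E)
        preds = torch.stack([e(img_emb) for e in self.experts], 1)  # (B, E, G)
        # Routing weights double as interpretable per-patch expert usage.
        return (weights.unsqueeze(-1) * preds).sum(dim=1)           # (B, G)
&lt;/code>&lt;/pre>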
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Design and Implement the HistoMoE Framework&lt;/strong>
&lt;ul>
&lt;li>Build a modular MoE architecture with pluggable vision encoders, gating networks, and expert models.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Multimodal Routing and Expert Specialization&lt;/strong>
&lt;ul>
&lt;li>Explore how image features and metadata jointly inform expert selection.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Benchmarking and Evaluation&lt;/strong>
&lt;ul>
&lt;li>Compare HistoMoE against single-model baselines on multiple cancer and organ-specific spatial transcriptomics datasets.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Interpretability Analysis&lt;/strong>
&lt;ul>
&lt;li>Analyze expert routing behavior to reveal biologically meaningful patterns.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Open-Source HistoMoE Codebase&lt;/strong>
&lt;ul>
&lt;li>Well-documented Python implementation with training, evaluation, and visualization tools.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Benchmark Results&lt;/strong>
&lt;ul>
&lt;li>Quantitative comparisons demonstrating improvements over non-expert baselines.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Visualization and Analysis Tools&lt;/strong>
&lt;ul>
&lt;li>Tools for inspecting expert usage, routing weights, and gene-level predictions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation and Tutorials&lt;/strong>
&lt;ul>
&lt;li>Clear instructions and examples to enable adoption by the research community.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>HistoMoE introduces an expert-system perspective to histology-based gene expression prediction, bridging morphological and molecular representations through biologically informed specialization. By combining multimodal learning with mixture-of-experts modeling, this project advances the interpretability and accuracy of computational pathology methods and contributes toward scalable, cost-effective alternatives to spatial transcriptomics experiments.&lt;/p></description></item><item><title>StaR: A Stability-Aware Representation Learning Framework for Spatial Domain Identification</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/star/</link><pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/star/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> spatial transcriptomics, spatial domain identification, representation learning, model robustness&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong> Python; PyTorch experience preferred&lt;/li>
&lt;li>&lt;strong>Machine Learning:&lt;/strong> representation learning, clustering, robustness and stability analysis&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong> spatial transcriptomics preprocessing and evaluation (ARI, clustering metrics)&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (preferred):&lt;/strong> familiarity with spatial transcriptomics or scRNA-seq data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Spatial domain identification is a fundamental task in spatial transcriptomics (ST), aiming to partition tissue sections into biologically meaningful regions based on spatially resolved gene expression profiles. These spatial domains often correspond to distinct anatomical structures, cellular compositions, or functional microenvironments, and serve as a critical foundation for downstream biological analysis.&lt;/p>
&lt;p>Despite rapid methodological progress, &lt;strong>most existing spatial domain identification methods are highly sensitive to random initialization&lt;/strong>. In practice, simply changing the random seed can lead to substantially different clustering results and large performance fluctuations, even when using identical hyperparameters and datasets. This instability severely undermines the reliability, reproducibility, and interpretability of spatial transcriptomics analyses.&lt;/p>
&lt;h3 id="problem-seed-sensitivity-and-unstable-representations">&lt;strong>Problem: Seed Sensitivity and Unstable Representations&lt;/strong>&lt;/h3>
&lt;p>Empirical evidence shows that state-of-the-art spatial domain identification models can exhibit substantial performance variance across random seeds. For example, the Adjusted Rand Index (ARI) may vary from relatively strong performance (e.g., ARI ≈ 0.65) to noticeably degraded results (e.g., ARI ≈ 0.50) solely due to different random initializations.&lt;/p>
&lt;p>By systematically evaluating models across &lt;strong>hundreds to thousands of random seeds&lt;/strong>, we observe that:&lt;/p>
&lt;ul>
&lt;li>Model performance landscapes are highly &lt;strong>rugged&lt;/strong>, with sharp cliffs and isolated high-performing regions.&lt;/li>
&lt;li>Standard training objectives implicitly favor brittle representations that are not robust to small perturbations in initialization or optimization trajectories.&lt;/li>
&lt;/ul>
&lt;p>These observations suggest that instability is not a peripheral issue, but rather a &lt;strong>structural limitation of current representation learning approaches&lt;/strong> for spatial transcriptomics.&lt;/p>
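&lt;p>This kind of instability can be quantified with a simple seed sweep. The sketch below uses k-means as a stand-in for the clustering stage; any spatial domain identification model could be substituted:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def seed_sensitivity(embeddings, true_domains, n_domains, n_seeds=100):
    """Cluster the same embeddings under many random seeds and report
    the spread of ARI scores against ground-truth domain labels."""
    aris = []
    for seed in range(n_seeds):
        labels = KMeans(
            n_clusters=n_domains, random_state=seed, n_init=10
        ).fit_predict(embeddings)
        aris.append(adjusted_rand_score(true_domains, labels))
    aris = np.array(aris)
    return aris.mean(), aris.std(), aris.min(), aris.max()
&lt;/code>&lt;/pre>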
&lt;h3 id="proposed-solution-star">&lt;strong>Proposed Solution: StaR&lt;/strong>&lt;/h3>
&lt;p>This project proposes &lt;strong>StaR&lt;/strong>, a &lt;strong>Stability-Aware Representation Learning framework&lt;/strong> designed to explicitly address seed sensitivity in spatial domain identification.&lt;/p>
&lt;p>The core idea of StaR is to &lt;strong>learn representations that are robust to perturbations in model parameters and training dynamics&lt;/strong>, rather than optimizing solely for peak performance under a single random seed. Concretely, StaR introduces controlled noise or perturbations into the training process and encourages consistency across multiple perturbed model instances, guiding the model toward flatter and more stable regions of the parameter space.&lt;/p>
&lt;p>By prioritizing stability during representation learning, StaR aims to produce embeddings that:&lt;/p>
&lt;ul>
&lt;li>Yield consistent spatial domain assignments across random seeds&lt;/li>
&lt;li>Maintain competitive or improved clustering accuracy&lt;/li>
&lt;li>Better reflect underlying biological structure&lt;/li>
&lt;/ul>
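&lt;p>One plausible instantiation of this idea is a parameter-perturbation consistency loss; the sketch below is an assumption for illustration, not necessarily the final StaR objective:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import torch.nn.functional as F

def stability_loss(encoder, x, noise_std=0.01):
    """Consistency regularizer: compare embeddings from the original and
    a randomly perturbed copy of the weights, steering training toward
    flat, seed-robust regions of parameter space."""
    z_ref = encoder(x)  # differentiable reference pass
    with torch.no_grad():
        # Perturb weights in place, run a second pass, then restore.
        eps = [torch.randn_like(p) * noise_std for p in encoder.parameters()]
        for p, e in zip(encoder.parameters(), eps):
            p.add_(e)
        z_pert = encoder(x)
        for p, e in zip(encoder.parameters(), eps):
            p.sub_(e)
    return F.mse_loss(z_ref, z_pert)

# Typical usage: total = task_loss + lam * stability_loss(encoder, batch_x)
&lt;/code>&lt;/pre>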
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Characterize Instability in Existing Methods&lt;/strong>
&lt;ul>
&lt;li>Systematically quantify seed sensitivity across popular spatial domain identification models.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Develop Stability-Aware Training Objectives&lt;/strong>
&lt;ul>
&lt;li>Design perturbation-based or consistency-driven losses that encourage robust representations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Integrate StaR into Existing Pipelines&lt;/strong>
&lt;ul>
&lt;li>Apply StaR to widely used spatial transcriptomics workflows with minimal architectural changes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Evaluation and Benchmarking&lt;/strong>
&lt;ul>
&lt;li>Evaluate StaR using clustering metrics (e.g., ARI) and stability metrics across multiple datasets and random seeds.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Biological Validation&lt;/strong>
&lt;ul>
&lt;li>Assess whether stability-aware representations preserve biologically meaningful spatial patterns.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>StaR Framework Implementation&lt;/strong>
&lt;ul>
&lt;li>An open-source Python implementation compatible with common spatial transcriptomics toolchains.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Stability Benchmarks&lt;/strong>
&lt;ul>
&lt;li>Comprehensive evaluations demonstrating reduced performance variance across seeds.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Visualization Tools&lt;/strong>
&lt;ul>
&lt;li>Tools for visualizing performance landscapes, stability surfaces, and spatial domain consistency.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation and Tutorials&lt;/strong>
&lt;ul>
&lt;li>Clear examples enabling researchers to adopt StaR in their own analyses.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>StaR addresses a critical yet underexplored challenge in spatial transcriptomics: &lt;strong>model instability and poor reproducibility&lt;/strong>. By shifting the focus from single-run performance to stability-aware representation learning, this project improves the reliability and trustworthiness of spatial domain identification methods. StaR has the potential to become a foundational component in robust spatial transcriptomics pipelines and to inspire broader adoption of stability-aware principles in biological representation learning.&lt;/p></description></item><item><title>Final Report for Smart Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20251105-sam_huang/</link><pubDate>Wed, 05 Nov 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20251105-sam_huang/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>The process of creating the necessary software environment for code to run is a significant challenge in software development. Given a piece of open-source software intended for research, setting up the environmental dependencies to run the software could take significant manual effort. Existing automation methods struggle due to the complexity of managing diverse languages, dependencies, and hardware. In Smart Environments, I have created ENVAGENT, a general multi-agent framework designed to automate the construction of executable environments for reproducing research prototypes from top-tier conferences and journals. While reproducibility has become a growing concern in the research community, the process of setting up environments remains time-consuming, error-prone, and often poorly documented.&lt;/p>
&lt;p>To assess this capability, a new benchmark, ENVBENCH, was created, containing 54 popular projects across seven languages. Results show ENVAGENT dramatically improves environment construction compared to current agents (+16.2%). Furthermore, the system shows initial promise in dynamically adjusting cloud-based hardware resources based on the code’s needs.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="EnvGym Cover" srcset="
/report/osre25/uchicago/smart_environments/20251105-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_30e3b2194be140fa608780847e6c7fa1.webp 400w,
/report/osre25/uchicago/smart_environments/20251105-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_d39b2369b5df80ffa715197c993f0681.webp 760w,
/report/osre25/uchicago/smart_environments/20251105-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20251105-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_30e3b2194be140fa608780847e6c7fa1.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="method">Method&lt;/h2>
&lt;h3 id="envagent">EnvAgent&lt;/h3>
&lt;p>The EnvAgent I created during my time at OSRE utilizes a multi-agent workflow to automatically build software execution environments. The process is structured into three phases: preparation, construction, and refinement.&lt;/p>
&lt;p>Phase 1 (Preparation): Specialized agents collect information about the software repository – its structure, relevant files, and the host system’s hardware specifications (CPU, memory, etc.). This data is then used by a planning agent to generate a detailed, step-by-step instruction set for creating a functional Dockerfile.&lt;/p>
&lt;p>Phase 2 (Construction): Two agents work in tandem: one generates or modifies the Dockerfile based on the plan, while the other executes the Dockerfile within an isolated container, capturing any errors.&lt;/p>
&lt;p>Phase 3 (Refinement): A final agent analyzes the container execution data, identifying areas for improvement in the Dockerfile. This process repeats until a stable, executable environment is achieved.&lt;/p>
&lt;p>To improve efficiency, EnvAgent incorporates rule-based tools for predictable tasks like directory setup and log management, reducing the need for complex agent reasoning. This combination of intelligent agents and automated routines (&amp;ldquo;scaffolding&amp;rdquo;) ensures a robust and adaptive system.&lt;/p>
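&lt;p>The three-phase loop can be summarized in code. In the sketch below, the agent callables are hypothetical stand-ins for the real LLM-backed agents:&lt;/p>
&lt;pre>&lt;code class="language-python">from dataclasses import dataclass
from typing import Callable

@dataclass
class BuildResult:
    success: bool
    errors: str = ""

def build_environment(plan_agent: Callable, write_agent: Callable,
                      run_container: Callable, refine_agent: Callable,
                      repo_context: dict, max_rounds: int = 20):
    """Skeleton of the prepare/construct/refine loop; every callable is
    a hypothetical stand-in for one of the specialized agents."""
    plan = plan_agent(repo_context)  # Phase 1: preparation
    dockerfile = None
    for _ in range(max_rounds):
        dockerfile = write_agent(plan, dockerfile)  # Phase 2: construction
        result = run_container(dockerfile)          # isolated build + run
        if result.success:
            return dockerfile
        plan = refine_agent(plan, result.errors)    # Phase 3: refinement
    return None  # no stable environment found within the budget
&lt;/code>&lt;/pre>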
&lt;h3 id="enveval-benchmark">EnvEval Benchmark&lt;/h3>
&lt;p>In addition to the agent, another significant contribution is a manually curated benchmark that measures the quality of generated environments. EnvEval is designed to assess environment setup quality across 54 carefully curated open-source repositories, drawn from Chameleon reproducible artifacts and the Multi-SWE-bench dataset. EnvEval contains JSON rubrics that can be used to automatically determine the quality of constructed environments.&lt;/p>
&lt;p>Each rubric is divided into three parts, corresponding to three major objectives that a successfully constructed environment should have:&lt;/p>
&lt;ol>
&lt;li>Structure: Checks for basic directory structure, file presence, and environment variables.&lt;/li>
&lt;li>Configuration: Asks &amp;ldquo;Is this configured?&amp;rdquo; and checks whether dependencies have been correctly configured.&lt;/li>
&lt;li>Functionality: Asks &amp;ldquo;Is this usable?&amp;rdquo; and runs actual tests to verify that the expected functionality works.&lt;/li>
&lt;/ol>
&lt;p>There are many tests in each category, and their weights are adjusted based on their importance.&lt;/p>
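&lt;p>The rubric format and scoring might look roughly like the sketch below; this is a hypothetical illustration of the weighted three-part structure, not the actual rubric schema:&lt;/p>
&lt;pre>&lt;code class="language-python"># Hypothetical rubric with the three weighted categories; the real
# JSON schema may differ.
rubric = {
    "structure": {"weight": 0.2, "checks": [
        {"name": "repo_dir_exists", "weight": 1.0, "passed": True},
        {"name": "env_vars_set", "weight": 1.0, "passed": True},
    ]},
    "configuration": {"weight": 0.3, "checks": [
        {"name": "python_deps_installed", "weight": 2.0, "passed": True},
        {"name": "cuda_version_matches", "weight": 1.0, "passed": False},
    ]},
    "functionality": {"weight": 0.5, "checks": [
        {"name": "unit_tests_pass", "weight": 3.0, "passed": True},
        {"name": "demo_script_runs", "weight": 1.0, "passed": False},
    ]},
}

def rubric_score(rubric):
    """Weighted average of per-category pass rates, scaled to 0-100."""
    total = 0.0
    for category in rubric.values():
        full = sum(c["weight"] for c in category["checks"])
        passed = sum(c["weight"] for c in category["checks"] if c["passed"])
        total += category["weight"] * (passed / full)
    return 100.0 * total

print(round(rubric_score(rubric), 2))  # 77.5 for the toy rubric above
&lt;/code>&lt;/pre>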
&lt;h2 id="evaluation">Evaluation&lt;/h2>
&lt;p>Baseline Systems:&lt;/p>
&lt;p>The study compared EnvAgent to two established automated code generation systems: one utilizing Anthropic’s advanced reasoning models and the other employing OpenAI’s code-focused models. These systems were chosen for their strong performance in creating software code and their prevalence in automated engineering processes. Both baselines were given full access to the target software repositories and complete details about the host system’s hardware.&lt;/p>
&lt;p>Evaluation Metrics:&lt;/p>
&lt;p>The performance of EnvAgent was assessed using three key metrics: the ability to create working environments, the quality of those environments, and a single combined score. Results showed EnvAgent significantly outperformed the baselines, achieving a 33.91% improvement in the final overall score, reaching 74.01 compared with the best baseline score of 30.10. This suggests EnvAgent both produced more functional environments and ensured greater accuracy through extensive testing.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>The process of creating the necessary software environments for code agents is a major hurdle in scaling up research and development. Currently, this task relies heavily on manual labor. To address this, a new system, ENVAGENT, was created to automatically build these environments using intelligent agents and by understanding dependencies. A new benchmark, ENVBENCH, was also developed to assess this system’s effectiveness. Preliminary results demonstrate a significant improvement – ENVAGENT achieved a 33.91% increase in success rates compared to existing automated agents, representing a substantial step towards more efficient and reproducible research.&lt;/p>
&lt;h1 id="thank-you">Thank you!&lt;/h1>
</description></item><item><title>Midterm for Smart Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20250724-sam_huang/</link><pubDate>Thu, 24 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20250724-sam_huang/</guid><description>&lt;h2 id="what-is-envgym">What is EnvGym?&lt;/h2>
&lt;p>EnvGym is a general multi-agent framework designed to automate the construction of executable environments for reproducing research prototypes from top-tier conferences and journals. While reproducibility has become a growing concern in the research community, the process of setting up environments remains time-consuming, error-prone, and often poorly documented.&lt;/p>
&lt;p>EnvGym addresses this gap by leveraging LLM-powered agents to analyze project instructions, resolve dependencies, configure execution environments, and validate results—thereby reducing human overhead and improving reproducibility at scale.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="EnvGym Cover" srcset="
/report/osre25/uchicago/smart_environments/20250724-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_30e3b2194be140fa608780847e6c7fa1.webp 400w,
/report/osre25/uchicago/smart_environments/20250724-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_d39b2369b5df80ffa715197c993f0681.webp 760w,
/report/osre25/uchicago/smart_environments/20250724-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20250724-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_30e3b2194be140fa608780847e6c7fa1.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;h3 id="new-tools">New Tools&lt;/h3>
&lt;p>Initially, our agent had access to only one tool: the command line. This constrained the agent’s ability to decompose complex tasks and respond flexibly to failures. Over the last few weeks, we introduced a modular tool system, enabling the agent to handle specific subtasks more effectively.&lt;/p>
&lt;p>The new toolset includes:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>dockerrun: Executes Dockerfiles.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>hardware_checking, hardware_adjustment: Tailor builds to available resources.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>history_manager, stats: Tracks historical data for improvement and reproducibility.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>planning: Generates high-level execution plans.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>summarize: Interprets build results to adjust subsequent iterations.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>writing_docker_initial, writing_docker_revision: Generate and refine Dockerfiles.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>While some of these tools, such as dockerrun, are purely programmatic scripts, others, such as planning, are more complex and use LLMs themselves.&lt;/p>
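&lt;p>One simple way to wire such a modular tool system together is a registry that maps tool names to callables, as in the sketch below; this is illustrative, and the actual implementation may differ:&lt;/p>
&lt;pre>&lt;code class="language-python">import subprocess

TOOLS = {}

def tool(name):
    """Decorator registering a callable under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("dockerrun")
def dockerrun(dockerfile_path: str) -> str:
    """Programmatic tool: build a Dockerfile and return the build log."""
    proc = subprocess.run(
        ["docker", "build", "-f", dockerfile_path, "."],
        capture_output=True, text=True,
    )
    return proc.stdout + proc.stderr

@tool("planning")
def planning(repo_summary: str) -> str:
    """LLM-backed tool (stubbed here): would prompt a model for a plan."""
    return "1. inspect repo\n2. write Dockerfile\n3. build and test"

def dispatch(name, *args, **kwargs):
    """The agent invokes subtasks by tool name."""
    return TOOLS[name](*args, **kwargs)
&lt;/code>&lt;/pre>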
&lt;h3 id="agent-re-architecture-moving-beyond-codex">Agent Re-Architecture: Moving Beyond Codex&lt;/h3>
&lt;p>We transitioned away from OpenAI&amp;rsquo;s Codex agent implementation. While powerful, Codex&amp;rsquo;s framework was overly reliant on its CLI frontend, which added unnecessary complexity and limited customizability for our research context.&lt;/p>
&lt;p>We implemented our own lightweight, customizable agent pipeline that integrates LLM-based planning with iterative execution. Conceptually, the agent executes the following loop:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Repo Scanning&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hardware Check&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Planning &amp;amp; Initial Dockerfile Generation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Docker Execution&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Progress Summarization &amp;amp; Adjustment&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Iterative Dockerfile Refinement (up to 20 rounds)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Success Check &amp;amp; Logging&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>This new agent design is easier to control, extend, and debug—aligning better with the needs of reproducibility research.&lt;/p>
&lt;h3 id="prompt-engineering">Prompt Engineering&lt;/h3>
&lt;p>For each tool that requires LLMs to function, we created a set of custom prompts that outline the task and break down the goals. For instance, the prompt used in summarize differs from the one in planning, allowing us to optimize the behavior of LLM agents per context.&lt;/p>
&lt;h3 id="performance-gains">Performance Gains&lt;/h3>
&lt;p>With these improvements, EnvGym now successfully replicates 9 repositories, surpassing our baseline Codex agent, which struggled with the same set. We’ve observed more reliable planning, better handling of edge-case dependencies, and faster convergence in iterative Dockerfile revisions.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;h3 id="granular-evaluation-metric">Granular Evaluation Metric&lt;/h3>
&lt;p>We plan to adopt a tree-structured, rubric-based evaluation, inspired by PaperBench. Instead of binary success/failure, each repo will be assigned a reproducibility score from 0 to 100.&lt;/p>
&lt;p>Key tasks include:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Rubric Design: Define a hierarchical rubric with criteria like dependency resolution, test success rate, runtime match, etc.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Manual Annotation: Build a dataset of ground-truth rubrics for a subset of repos to calibrate our automatic judge.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Judge Implementation: Develop an LLM-based judge function that takes (i) rubric and (ii) environment state, and returns a reproducibility score.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Example of a rubric tree" srcset="
/report/osre25/uchicago/smart_environments/20250724-sam_huang/rubric-tree_hu9020427fa0020bc8ab99a7f01a351cd0_70521_ae181d659b85544bd98fa2bbdbe0c09d.webp 400w,
/report/osre25/uchicago/smart_environments/20250724-sam_huang/rubric-tree_hu9020427fa0020bc8ab99a7f01a351cd0_70521_700416bce638eba7acc49573f12b11b0.webp 760w,
/report/osre25/uchicago/smart_environments/20250724-sam_huang/rubric-tree_hu9020427fa0020bc8ab99a7f01a351cd0_70521_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20250724-sam_huang/rubric-tree_hu9020427fa0020bc8ab99a7f01a351cd0_70521_ae181d659b85544bd98fa2bbdbe0c09d.webp"
width="557"
height="497"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Source: Starace, Giulio, et al. &amp;ldquo;PaperBench: Evaluating AI&amp;rsquo;s Ability to Replicate AI Research.&amp;rdquo; arXiv preprint arXiv:2504.01848 (2025).&lt;/p>
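&lt;p>A rubric tree like the one above can be scored by recursively propagating weighted child scores up to the root; a toy sketch with a hypothetical node layout:&lt;/p>
&lt;pre>&lt;code class="language-python">def tree_score(node):
    """Recursively score a rubric tree: leaves carry a 0-1 score, and
    internal nodes average their children weighted by 'weight'."""
    if "children" not in node:
        return node["score"]
    total_w = sum(child["weight"] for child in node["children"])
    return sum(
        child["weight"] * tree_score(child) for child in node["children"]
    ) / total_w

rubric = {"weight": 1.0, "children": [
    {"weight": 2.0, "children": [
        {"weight": 1.0, "score": 1.0},  # dependencies resolved
        {"weight": 1.0, "score": 0.5},  # partial test success
    ]},
    {"weight": 1.0, "score": 0.0},      # runtime match failed
]}

print(round(100 * tree_score(rubric), 1))  # reproducibility score: 50.0
&lt;/code>&lt;/pre>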
&lt;p>This will make EnvGym suitable for benchmarking. We will run our new method and obtain a score to compare with baseline methods!&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>EnvGym has made strong progress toward automating reproducibility in computational research. Through modularization, agentic design, and prompt optimizations, we’ve surpassed existing baselines and laid the groundwork for even more improvement.&lt;/p>
&lt;p>The upcoming focus on metrics and benchmarking will elevate EnvGym from a functional prototype to a standardized reproducibility benchmark tool and also quantitatively prove that our new agentic method is better than existing tools such as Codex. Excited for what&amp;rsquo;s to come!&lt;/p>
</description></item><item><title>RAG-ST: Retrieval-Augmented Generation for Spatial Transcriptomics</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uci/rag-st/06192025-zeyu/</link><pubDate>Thu, 19 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uci/rag-st/06192025-zeyu/</guid><description>&lt;p>Hi everyone! My name is Zeyu, and I will be working on a retrieval-enhanced generative framework for spatial transcriptomics during Google Summer of Code 2025. My project is called &lt;a href="https://ucsc-ospo.github.io/project/osre25/uci/rag-st/" target="_blank" rel="noopener">&lt;strong>RAG-ST: Retrieval-Augmented Generation for Spatial Transcriptomics&lt;/strong>&lt;/a> and is supervised by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a>. The goal is to develop a retrieval-enhanced generative framework for predicting spatial gene expression from histological images, making spatial transcriptomics more affordable and easier to implement. &lt;a href="https://drive.google.com/file/d/1_yUf1NlVRpBXERCqnOby7pgP4WrWrZsr/view?usp=sharing" target="_blank" rel="noopener">You can view my full proposal here!&lt;/a>&lt;/p>
&lt;p>Spatial transcriptomics enables the capture of gene expression profiles with spatial resolution, providing unprecedented insights into cellular organization and the tissue microenvironment. However, its widespread application is limited by high costs and technical complexity. In contrast, histological imaging is inexpensive and widely accessible. If we can accurately predict gene expression from histology images, then high-resolution spatial information can be inferred without costly experiments.&lt;/p>
&lt;p>My project will:&lt;/p>
&lt;ul>
&lt;li>Create a large-scale paired dataset combining HEST histology images with reference gene expression profiles from CellxGene.&lt;/li>
&lt;li>Design a novel RAG-ST architecture that enables both &lt;strong>interpretable&lt;/strong> and &lt;strong>controllable&lt;/strong> generation of spatial gene expression.&lt;/li>
&lt;li>Benchmark RAG-ST against current state-of-the-art models for image-based gene expression inference.&lt;/li>
&lt;li>Open-source the full codebase and provide comprehensive tutorials to support future research and development.&lt;/li>
&lt;/ul>
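&lt;p>As a toy illustration of the retrieval component at the heart of RAG-ST, the sketch below embeds a histology patch, retrieves the most similar reference profiles, and aggregates them into a prior for generation; all names and dimensions are assumptions:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def retrieve_expression_prior(patch_emb, ref_embs, ref_profiles, k=5):
    """Toy retrieval step: find the k reference entries whose image
    embeddings are closest to the query patch (cosine similarity) and
    average their gene expression profiles into a generation prior.

    patch_emb: (d,), ref_embs: (n, d), ref_profiles: (n, n_genes)
    """
    q = patch_emb / np.linalg.norm(patch_emb)
    refs = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sims = refs @ q                      # cosine similarity to each reference
    top = np.argsort(sims)[-k:]          # indices of the k nearest references
    weights = sims[top] / sims[top].sum()
    return weights @ ref_profiles[top]   # (n_genes,) retrieval prior
&lt;/code>&lt;/pre>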
&lt;p>I am excited to contribute to this project and help broaden access to spatial transcriptomics insights through machine learning–powered predictions!&lt;/p>
&lt;p>Zeyu Zou&lt;/p>
&lt;p>Graduate student, Northeastern University&lt;/p>
&lt;p>Zeyu Zou is a graduate student at Northeastern University, majoring in Analytics.&lt;/p></description></item><item><title>EnvGym – An AI System for Reproducible Custom Computing Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/envgym/</link><pubDate>Mon, 16 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/envgym/</guid><description>&lt;p>Hello, my name is Yiming Cheng. I am a pre-doctoral researcher in Computer Science at the University of Chicago. I&amp;rsquo;m excited to be working with the Summer of Reproducibility and the Chameleon Cloud community as a project leader. My project is &lt;a href="https://github.com/eaminc/envgym" target="_blank" rel="noopener">EnvGym&lt;/a>, which focuses on developing an AI-driven system to automatically generate and configure reproducible computing environments based on natural language descriptions from artifact descriptions, Trovi artifacts, and research papers.&lt;/p>
&lt;p>The complexity of environment setup often hinders reproducibility in scientific computing. My project aims to bridge the knowledge gap between experiment authors and reviewers by translating natural language requirements into actionable, reproducible configurations using AI and NLP techniques.&lt;/p>
&lt;h3 id="project-overview">Project Overview&lt;/h3>
&lt;p>EnvGym addresses fundamental reproducibility barriers by:&lt;/p>
&lt;ul>
&lt;li>Using AI to translate natural language environment requirements into actionable configurations&lt;/li>
&lt;li>Automatically generating machine images deployable on bare metal and VM instances&lt;/li>
&lt;li>Bridging the knowledge gap between experiment authors and reviewers&lt;/li>
&lt;li>Standardizing environment creation across different hardware platforms&lt;/li>
&lt;/ul>
&lt;h3 id="june-10--june-16-2025">June 10 – June 16, 2025&lt;/h3>
&lt;p>Getting started with the project setup and initial development:&lt;/p>
&lt;ul>
&lt;li>I began designing the NLP pipeline architecture to parse plain-English descriptions (e.g., &amp;ldquo;I need Python 3.9, CUDA 11, and scikit-learn&amp;rdquo;) into structured environment &amp;ldquo;recipes&amp;rdquo; (see the toy sketch after this list)&lt;/li>
&lt;li>I set up the initial project repository and development environment&lt;/li>
&lt;li>I met with my mentor Prof. Kexin Pei to discuss the project roadmap and technical approach&lt;/li>
&lt;li>I started researching existing artifact descriptions from conferences and Trovi to understand common patterns in environment requirements&lt;/li>
&lt;li>I began prototyping the backend environment builder logic that will convert parsed requirements into machine-image definitions&lt;/li>
&lt;li>I explored Chameleon&amp;rsquo;s APIs for provisioning servers and automated configuration&lt;/li>
&lt;/ul>
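&lt;p>A toy illustration of that parsing step, regex-based and entirely hypothetical; the real pipeline would use an NLP model:&lt;/p>
&lt;pre>&lt;code class="language-python">import re

def parse_requirements(text: str) -> dict:
    """Toy parser: extract versioned package mentions from a plain-English
    description into a structured environment 'recipe'."""
    recipe = {"packages": {}}
    # Matches phrases like "Python 3.9" or "CUDA 11".
    for name, version in re.findall(r"([A-Za-z][\w-]*)\s+(\d+(?:\.\d+)*)", text):
        recipe["packages"][name.lower()] = version
    return recipe

print(parse_requirements("I need Python 3.9, CUDA 11, and scikit-learn"))
# {'packages': {'python': '3.9', 'cuda': '11'}}
# Note: unversioned packages like scikit-learn slip through this toy parser.
&lt;/code>&lt;/pre>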
&lt;h3 id="next-steps">Next Steps&lt;/h3>
&lt;ul>
&lt;li>Continue developing the NLP component for requirement parsing&lt;/li>
&lt;li>Implement the core backend logic for environment generation&lt;/li>
&lt;li>Begin integration with Chameleon Cloud APIs&lt;/li>
&lt;li>Start building the user interface for environment specification&lt;/li>
&lt;/ul>
&lt;p>This is an exciting and challenging project that combines my interests in AI systems and reproducible research. I&amp;rsquo;m looking forward to building a system that will help researchers focus on their science rather than struggling with environment setup issues.&lt;/p>
&lt;p>Thanks for reading; I will keep you updated as I make progress on EnvGym!&lt;/p></description></item><item><title>Smart Environments – An AI System for Reproducible Custom Computing Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20250616-sam_huang/</link><pubDate>Mon, 16 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20250616-sam_huang/</guid><description>&lt;p>Hi everyone, I&amp;rsquo;m Sam! I&amp;rsquo;m excited to be working with Argonne National Laboratory and SoR this summer on Smart Environments. Have you ever encountered a great open-source project and wanted to run or use it locally, only to find that setting up all the dependencies is a headache? Maybe your system version wasn&amp;rsquo;t correct, or a piece of software was outdated, or the dependencies were incompatible with something already on your machine?&lt;/p>
&lt;p>In comes EnvGym to save the day! We want EnvGym to be an agent that helps reproduce open-source projects by automatically setting up the environmental dependencies required to get them running. That&amp;rsquo;s what I will be working on for the rest of the summer! To make EnvGym work, we will leverage LLM agents to read documentation, understand code structure, run commands to set up environments, and reflectively react to any errors and warnings.&lt;/p>
&lt;p>To build EnvGym, I have the following to-do&amp;rsquo;s in mind:&lt;/p>
&lt;ul>
&lt;li>Building a dataset that includes repos to be reproduced&lt;/li>
&lt;li>Establishing a baseline using current methods&lt;/li>
&lt;li>Implementing the actual EnvGym algorithm&lt;/li>
&lt;li>Testing EnvGym against baseline performance and iteratively improving it&lt;/li>
&lt;li>Deploying EnvGym to real-world use cases and gathering feedback&lt;/li>
&lt;/ul>
&lt;p>Here is the repo that we are working on:
&lt;a href="https://github.com/EaminC/EnvGym/tree/main" target="_blank" rel="noopener">https://github.com/EaminC/EnvGym/tree/main&lt;/a>&lt;/p>
&lt;p>More updates to come, thanks for reading!&lt;/p></description></item><item><title>Applying MLOps to overcome reproducibility barriers in machine learning research</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/nyu/mlops/</link><pubDate>Sat, 01 Mar 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/nyu/mlops/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> machine learning, MLOps, reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, machine learning, GitOps, systems, Linux, data, Docker&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>Reproducibility remains a significant problem in machine learning research, both in core ML and in the application of ML to other areas of science. In many cases, due to inadequate experiment tracking, dependency capturing, source code versioning, data versioning, and artifact sharing, even the authors of a paper may find it challenging to reproduce their own study several years later. This makes it difficult to validate and build on previous work, and raises concerns about its trustworthiness.&lt;/p>
&lt;p>In contrast, outside of academic research, MLOps tools and frameworks have been identified as a key enabler of reliable, reproducible, and trustworthy machine learning systems in production. A good reference on this topic is:&lt;/p>
&lt;blockquote>
&lt;p>Firas Bayram and Bestoun S. Ahmed. 2025. Towards Trustworthy Machine Learning in Production: An Overview of the Robustness in MLOps Approach. ACM Comput. Surv. 57, 5, Article 121 (May 2025), 35 pages. &lt;a href="https://doi.org/10.1145/3708497" target="_blank" rel="noopener">https://doi.org/10.1145/3708497&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;p>This project seeks to bridge the gap between widely adopted practices in industry and academic research:&lt;/p>
&lt;ul>
&lt;li>by making it easier for researchers and scientists to use MLOps tools to support reproducibility. To achieve this, we will develop starter templates and recipes for research in computer vision, NLP, and ML for science, that have reproducibility &amp;ldquo;baked in&amp;rdquo; thanks to the integration of MLOps tools and frameworks. Researchers will launch these templates on open access research facilities like &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a>.&lt;/li>
&lt;li>and, by developing complementary education and training materials to emphasize the importance of reproducibility in ML, and how the tools and frameworks used in the starter templates can support this goal.&lt;/li>
&lt;/ul>
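&lt;p>As one concrete illustration of what &amp;ldquo;baked-in&amp;rdquo; reproducibility could look like in a starter template, here is a minimal experiment-tracking snippet using MLflow; MLflow is one plausible tool choice for illustration, not a confirmed part of the project:&lt;/p>
&lt;pre>&lt;code class="language-python">import mlflow

# Pattern a starter template might bake in: every run records its
# parameters, metrics, and artifacts so it can be audited and reproduced.
mlflow.set_experiment("reproducible-cv-baseline")

with mlflow.start_run():
    params = {"lr": 3e-4, "batch_size": 64, "seed": 42}
    mlflow.log_params(params)

    # ... training would happen here ...
    val_accuracy = 0.91  # placeholder result for illustration

    mlflow.log_metric("val_accuracy", val_accuracy)
    mlflow.log_artifact("requirements.txt")  # capture the dependency list
&lt;/code>&lt;/pre>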
&lt;p>&lt;strong>Writing a successful proposal for this project&lt;/strong>&lt;/p>
&lt;p>A good proposal for this project should -&lt;/p>
&lt;ul>
&lt;li>demonstrate a good understanding of the current barriers to reproducibility in machine learning research (specific examples are welcome),&lt;/li>
&lt;li>describe a &amp;ldquo;base&amp;rdquo; starter template, including the platforms and tools that will be integrated, as well as specific adaptations of this template for computer vision, NLP, and ML for science,&lt;/li>
&lt;li>explain the &amp;ldquo;user flow&amp;rdquo; - how a researcher would use the template to conduct an experiment or series of experiments, what the lifecycle of that experiment would look like, and how it would be made reproducible,&lt;/li>
&lt;li>include the contributor&amp;rsquo;s own ideas about how to make the starter templates more usable, and how to make the education and training materials relatable and useful,&lt;/li>
&lt;li>and show that the contributor has the necessary technical background and soft skills to contribute to this project. In particular, the contributor will need to create education and training materials that are written in a clear, straightforward, and concise manner, without unnecessary jargon. The proposal should show evidence of the contributor&amp;rsquo;s writing abilities.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Github link&lt;/strong>&lt;/p>
&lt;p>There is no pre-existing Git repository for this project - at the beginning of the summer, the contributor will create a new repository in the &lt;a href="https://github.com/teaching-on-testbeds/" target="_blank" rel="noopener">Teaching on Testbeds&lt;/a> organization, and the project materials will &amp;ldquo;live&amp;rdquo; there.&lt;/p></description></item><item><title>Smart Environments – An AI System for Reproducible Custom Computing Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/envgym/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/envgym/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>The complexity of environment setup and the expertise required to configure specialized software stacks can often hinder efforts to reproduce important scientific achievements in HPC and systems studies. Researchers often struggle with incomplete or ambiguous artifact descriptions that make assumptions about &amp;ldquo;common knowledge&amp;rdquo; that is actually specific domain expertise. When trying to reproduce experiments, reviewers may spend excessive time debugging environment inconsistencies rather than evaluating the actual research. These challenges are compounded when experiments need to run on different hardware configurations.&lt;/p>
&lt;p>This project seeks to address these fundamental reproducibility barriers by using AI to translate the natural language environment requirements commonly found in papers and artifact descriptions into actionable, reproducible configurations—bridging the knowledge gap between experiment authors and reviewers while standardizing environment creation across different hardware platforms. We will develop an AI-driven system that automatically generates and configures reproducible computing environments based on artifact descriptions from conferences, Trovi artifacts on the &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a> testbed, and other reliable sources for scientific experiment code and associated documentation. Leveraging Natural Language Processing (NLP), the system will allow researchers to describe desired environments in plain English, then map those descriptions onto predefined configuration templates. By simplifying environment creation and ensuring reproducibility, the system promises to eliminate duplicate setup efforts, accelerate research workflows, and promote consistent experimentation practices across diverse hardware.&lt;/p>
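&lt;p>As a rough illustration of the intended input and output of such a system (not its actual design, which would use proper NLP rather than regular expressions), consider the toy parser below; all names in it are hypothetical.&lt;/p>
&lt;pre>&lt;code class="language-python"># Toy sketch: map a plain-English requirement such as
# "I need Python 3.9, CUDA 11, and scikit-learn" to a recipe dict
# that could later be matched against configuration templates.
import re

KNOWN_PACKAGES = {"scikit-learn", "pytorch", "tensorflow", "numpy"}

def parse_requirements(text):
    recipe = {"python": None, "cuda": None, "packages": []}
    m = re.search(r"python\s*([\d.]+)", text, re.IGNORECASE)
    if m:
        recipe["python"] = m.group(1)
    m = re.search(r"cuda\s*([\d.]+)", text, re.IGNORECASE)
    if m:
        recipe["cuda"] = m.group(1)
    for pkg in sorted(KNOWN_PACKAGES):
        if pkg in text.lower():
            recipe["packages"].append(pkg)
    return recipe

print(parse_requirements("I need Python 3.9, CUDA 11, and scikit-learn"))
# {'python': '3.9', 'cuda': '11', 'packages': ['scikit-learn']}
&lt;/code>&lt;/pre>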
&lt;h2 id="key-outcomes">Key Outcomes&lt;/h2>
&lt;ul>
&lt;li>Working Prototype: A system that automatically generates machine images deployable on bare metal and VM instances, based on user-provided requirements.&lt;/li>
&lt;li>Comprehensive Documentation: Detailed user manuals, guides, and best practices tailored to researchers, ensuring a smooth adoption process.&lt;/li>
&lt;li>Live Demo: A demonstration environment (e.g., a web app or Jupyter notebook) that shows how to request, configure, and launch reproducible cloud environments on both hardware profiles.&lt;/li>
&lt;li>Long-Term Impact: Building blocks for future AI-driven automation of cloud infrastructure, reducing human error and enabling fast, repeatable research pipelines.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Topics&lt;/strong>: Reproducibility, AI &amp;amp; NLP, Cloud Computing, DevOps and Automation&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Machine Learning / AI: Familiarity with NLP methods to interpret user requirements.&lt;/li>
&lt;li>Python: Primary language for backend services and cloud interactions.&lt;/li>
&lt;li>Cloud API Integration: Experience with OpenStack or similar APIs to provision and configure images on both bare metal and virtual machines.&lt;/li>
&lt;li>DevOps: Automated environment configuration, CI/CD workflows, and containerization.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Large&lt;/p>
&lt;p>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/paul-marshall/">Paul Marshall&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Requirement Gathering &amp;amp; NLP Design
&lt;ul>
&lt;li>Research the specific needs of researchers building experimental setups.&lt;/li>
&lt;li>Design an NLP pipeline to parse plain-English descriptions (e.g., “I need Python 3.9, CUDA 11, and scikit-learn”) into environment “recipes.”&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Backend Environment Builder
&lt;ul>
&lt;li>Implement logic that converts parsed user requirements into machine-image definitions for bare metal and VM instances.&lt;/li>
&lt;li>Integrate with Chameleon’s APIs to provision servers, install software, and run configuration validation automatically.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Front-End &amp;amp; User Experience
&lt;ul>
&lt;li>Develop an intuitive web or CLI interface that researchers can use to capture experiment environment requirements.&lt;/li>
&lt;li>Provide real-time status updates during environment setup, along with meaningful error messages and quick-start templates.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Testing &amp;amp; Validation
&lt;ul>
&lt;li>Conduct end-to-end tests using diverse software stacks (e.g., HPC libraries, machine learning frameworks) on bare metal and VM instances.&lt;/li>
&lt;li>Ensure reproducibility by re-creating the same environment multiple times and comparing configurations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Documentation &amp;amp; Demonstration
&lt;ul>
&lt;li>Produce user-facing documentation, including tutorials and best practices for researchers who frequently run experiments on Chameleon Cloud.&lt;/li>
&lt;li>Create a short live demo or screencast showcasing how to configure an environment for a specific research workflow.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Disentangled Generation and Editing of Pathology Images</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uci/pathology_image_disentanglement/</link><pubDate>Fri, 07 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uci/pathology_image_disentanglement/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> computational pathology, image generation, disentangled representations, latent space manipulation, deep learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong>
&lt;ul>
&lt;li>Proficient in Python, with experience in machine learning libraries such as PyTorch or TensorFlow.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Generative Models:&lt;/strong>
&lt;ul>
&lt;li>Familiarity with Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and contrastive learning methods.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong>
&lt;ul>
&lt;li>Image processing techniques, statistical analysis, and working with histopathology datasets.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Biomedical Knowledge (preferred):&lt;/strong>
&lt;ul>
&lt;li>Basic understanding of histology, cancer pathology, and biological image annotation.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours). The project involves substantial computational work, model development, and evaluation of generated pathology images.&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/xi-li/">Xi Li&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>The project aims to advance the &lt;strong>generation and disentanglement of pathology images&lt;/strong>, focusing on precise control over key histological features. By leveraging generative models, we seek to create synthetic histological images where specific pathological characteristics can be independently controlled.&lt;/p>
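&lt;p>As a rough sketch of what such independent control could look like, assume a trained encoder/decoder pair whose latent space is already disentangled so that a single coordinate tracks a single histological factor; editing then reduces to shifting that coordinate and decoding. The module names and indices below are hypothetical.&lt;/p>
&lt;pre>&lt;code class="language-python"># Sketch of latent-space editing, assuming a trained (auto)encoder whose
# latent space is disentangled so one coordinate tracks one factor
# (e.g., tumor grade). The PyTorch modules here are hypothetical.
import torch

@torch.no_grad()
def edit_image(encoder, decoder, image, factor_idx, delta):
    """Shift a single disentangled latent coordinate and decode."""
    z = encoder(image.unsqueeze(0))  # image -> latent code, shape (1, d)
    z[0, factor_idx] += delta        # move only the targeted factor
    return decoder(z).squeeze(0)     # decode back to image space

# Usage sketch: increase the "tumor grade" factor while leaving staining
# and cell layout untouched (the index and step size are illustrative):
# edited = edit_image(encoder, decoder, img, factor_idx=3, delta=1.5)
&lt;/code>&lt;/pre>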
&lt;h3 id="challenges-in-current-approaches">&lt;strong>Challenges in Current Approaches&lt;/strong>&lt;/h3>
&lt;p>Current methods in histopathology image generation often struggle with:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Feature Entanglement:&lt;/strong> Difficulty in isolating individual factors such as cancer presence, severity, or staining variations.&lt;/li>
&lt;li>&lt;strong>Lack of Control:&lt;/strong> Limited capability to manipulate specific pathological attributes without affecting unrelated features.&lt;/li>
&lt;li>&lt;strong>Consistency Issues:&lt;/strong> Generated images often fail to maintain realistic cellular distributions, affecting biological validity.&lt;/li>
&lt;/ol>
&lt;h3 id="project-motivation">&lt;strong>Project Motivation&lt;/strong>&lt;/h3>
&lt;p>This project proposes a &lt;strong>disentangled representation framework&lt;/strong> to address these limitations. By separating key features within the latent space, we aim to:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Control Histological Features:&lt;/strong> Adjust factors such as cancer presence, tumor grade, number of malignant cells, and staining methods.&lt;/li>
&lt;li>&lt;strong>Ensure Spatial Consistency:&lt;/strong> Maintain the natural distribution of cells during image reconstruction and editing.&lt;/li>
&lt;li>&lt;strong>Enable Latent Space Manipulation:&lt;/strong> Provide interpretable controls for editing and generating realistic histopathology images.&lt;/li>
&lt;/ul>
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Disentangled Representation Learning:&lt;/strong>
&lt;ul>
&lt;li>Develop generative models (e.g., VAEs, GANs) to separate and control histological features.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Latent Space Manipulation:&lt;/strong>
&lt;ul>
&lt;li>Design mechanisms for intuitive editing of pathology images through latent space adjustments.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Spatial Consistency Validation:&lt;/strong>
&lt;ul>
&lt;li>Implement evaluation metrics to ensure that cell distribution remains biologically consistent during image generation.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Generative Model Framework:&lt;/strong>
&lt;ul>
&lt;li>An open-source Python implementation for pathology image generation and editing.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Disentangled Latent Space Tools:&lt;/strong>
&lt;ul>
&lt;li>Tools for visualizing and manipulating latent spaces to control specific pathological features.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Evaluation Metrics:&lt;/strong>
&lt;ul>
&lt;li>Comprehensive benchmarks assessing image quality, feature disentanglement, and biological realism.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation and Tutorials:&lt;/strong>
&lt;ul>
&lt;li>Clear guidelines and code examples for the research community to adopt and build upon this work.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>By enabling precise control over generated histology images, this project will contribute to &lt;strong>data augmentation&lt;/strong>, &lt;strong>model interpretability&lt;/strong>, and &lt;strong>biological insight&lt;/strong> in computational pathology. The disentangled approach offers new opportunities for researchers to explore disease mechanisms, develop robust diagnostic models, and improve our understanding of cancer progression and tissue morphology.&lt;/p>
&lt;hr></description></item><item><title>RAG-ST: Retrieval-Augmented Generation for Spatial Transcriptomics</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uci/rag-st/</link><pubDate>Wed, 15 Jan 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uci/rag-st/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> bioinformatics, spatial transcriptomics, gene expression generation, retrieval-augmented generation, large models&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong>
&lt;ul>
&lt;li>Proficient in Python, and familiarity with machine learning libraries such as PyTorch.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong>
&lt;ul>
&lt;li>Experience with spatial transcriptomics datasets and statistical modeling.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Machine Learning:&lt;/strong>
&lt;ul>
&lt;li>Understanding of vision models, retrieval-based systems, and MLP architectures.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (preferred):&lt;/strong>
&lt;ul>
&lt;li>Familiarity with scRNA-seq data integration and computational biology tools.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours). Given the scope of integrating RAG models, building a robust database, and ensuring interpretable predictions, this project involves substantial computational and data preparation work.&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Spatial transcriptomics (ST) is a revolutionary technology that provides spatially resolved gene expression measurements, enabling researchers to study cellular behaviour within tissues with unprecedented detail. This technology has transformed our understanding of complex biological systems, such as disease progression, tissue development, and cellular heterogeneity. However, the widespread adoption of ST is limited by its high cost and technical requirements.&lt;/p>
&lt;p>Histology imaging, on the other hand, is far more accessible and cost-effective. If gene expression could be accurately predicted from histology images, it would enable researchers to leverage these abundant images for high-resolution biological insights without the need for expensive spatial transcriptomics experiments. This task has immense potential to democratize spatial transcriptomics research and significantly reduce costs.&lt;/p>
&lt;h3 id="challenges-in-current-approaches">&lt;strong>Challenges in Current Approaches&lt;/strong>&lt;/h3>
&lt;p>Current methods for predicting gene expression from histology images typically involve:&lt;/p>
&lt;ol>
&lt;li>Using large vision models to encode histology image patches into embeddings.&lt;/li>
&lt;li>Employing Multi-Layer Perceptrons (MLPs) to map these embeddings to gene expression profiles.&lt;/li>
&lt;/ol>
&lt;p>While these approaches have shown promise, they suffer from two critical limitations:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Accuracy&lt;/strong>: The MLP-based mappings often fail to fully capture the biological complexity encoded in the histology images, leading to suboptimal predictions.&lt;/li>
&lt;li>&lt;strong>Interpretability&lt;/strong>: These models act as black boxes, providing no insight into the underlying biological rationale for the predictions. Researchers cannot determine why a specific gene expression profile was generated, limiting trust and utility in biological contexts.&lt;/li>
&lt;/ul>
&lt;h3 id="project-motivation">&lt;strong>Project Motivation&lt;/strong>&lt;/h3>
&lt;p>To overcome these limitations, this project proposes a novel &lt;strong>Retrieval-Augmented Generation (RAG)&lt;/strong> framework for spatial transcriptomics. Instead of relying solely on black-box MLPs, RAG-ST will:&lt;/p>
&lt;ul>
&lt;li>Retrieve relevant examples from a curated database of paired histology images, scRNA-seq data, and gene expression profiles.&lt;/li>
&lt;li>Use these retrieved examples to inform and enhance the generation process, resulting in predictions that are both more accurate and biologically interpretable.&lt;/li>
&lt;/ul>
&lt;p>This approach not only grounds predictions in biologically meaningful data but also provides transparency by revealing which database entries influenced the results.&lt;/p>
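&lt;p>A minimal sketch of this retrieve-then-generate idea follows. The actual RAG-ST model would learn how to combine retrieved examples; this toy version uses a fixed softmax-weighted average, and all arrays are hypothetical.&lt;/p>
&lt;pre>&lt;code class="language-python"># Toy sketch of retrieval-augmented prediction: embed a query patch,
# retrieve its nearest neighbors from a paired database, and combine
# the neighbors' expression profiles. Returning the neighbor indices
# is what makes the prediction inspectable.
import numpy as np

def predict_expression(query_emb, db_embs, db_expr, k=5, temp=0.1):
    # cosine similarity between the query and every database embedding
    q = query_emb / np.linalg.norm(query_emb)
    d = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = d @ q
    top = np.argsort(sims)[-k:]   # indices of the k nearest neighbors
    w = np.exp(sims[top] / temp)
    w /= w.sum()                  # softmax weights over the neighbors
    return w @ db_expr[top], top  # weighted average expression + rationale

# db_embs: (N, d) patch embeddings; db_expr: (N, G) expression profiles.
&lt;/code>&lt;/pre>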
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Database Construction&lt;/strong>:
&lt;ul>
&lt;li>Curate a large and diverse database of histology images paired with scRNA-seq and gene expression data.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Model Development&lt;/strong>:
&lt;ul>
&lt;li>Develop a RAG framework combining vision-based encoders and retrieval-enhanced generation techniques.&lt;/li>
&lt;li>Incorporate interpretability mechanisms to link predicted gene expressions to retrieved examples.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Evaluation and Benchmarking&lt;/strong>:
&lt;ul>
&lt;li>Assess RAG-ST against state-of-the-art methods, focusing on accuracy, interpretability, and biological validity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Curated Database&lt;/strong>:
&lt;ul>
&lt;li>A publicly available, well-documented database of histology images and gene expression profiles.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>RAG-ST Framework&lt;/strong>:
&lt;ul>
&lt;li>An open-source Python implementation of the RAG-ST model, with retrieval, generation, and visualization tools.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Benchmark Results&lt;/strong>:
&lt;ul>
&lt;li>Comprehensive evaluations demonstrating the benefits of RAG-ST over conventional pipelines.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation and Tutorials&lt;/strong>:
&lt;ul>
&lt;li>User-friendly guides to facilitate adoption by the spatial transcriptomics research community.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>By integrating retrieval-augmented generation with large models, RAG-ST represents a paradigm shift in spatial transcriptomics. It offers a cost-effective, accurate, and interpretable solution for gene expression prediction, democratizing access to high-quality spatial transcriptomic insights and fostering advancements in biological research.&lt;/p>
&lt;hr></description></item><item><title>ML-Powered Problem Detection in Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20241018-syed/</link><pubDate>Fri, 18 Oct 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20241018-syed/</guid><description>&lt;p>Hello! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/syed-mohammad-qasim/">Syed Mohammad Qasim&lt;/a>, a PhD candidate at the Department of Electrical and Computer Engineering, Boston University.
This summer I worked on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/ml_detect_chameleon/">ML-Powered Problem Detection in Chameleon&lt;/a>
as part of the Summer of Reproducibility (SoR) program with the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ayse-coskun/">Ayse Coskun&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/">Michael Sherman&lt;/a>.&lt;/p>
&lt;p>Chameleon is an open testbed that has supported over 5,000 users working on more than 500 projects.
It provides access to over 538 bare metal nodes across various sites, offering approximately 15,000 CPU cores and 5 petabytes of storage.
Each site runs independent OpenStack services to deliver its offerings.
Currently, Chameleon Cloud comprehensively monitors the sites at the Texas Advanced Computing Center (TACC) and the University of Chicago.
Metrics are collected using Prometheus at each site and fed into a central Mimir cluster.
All logs are sent to a central Loki, with Grafana used for visualization and alerting.
Chameleon currently collects around 3,000 metrics. Manually reviewing and setting alerts for them is time-consuming and labor-intensive.
This project aims to help Chameleon operators monitor their systems more effectively and improve overall reliability by creating an anomaly detection service to augment the existing alerting framework.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="High level data flow" srcset="
/report/osre24/uchicago/chameleoncloud/20241018-syed/ad_hubea58d1d850a610a2195fe38eece12fb_58199_deb097bd50da0d94a76fc0dc7719233e.webp 400w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/ad_hubea58d1d850a610a2195fe38eece12fb_58199_deeb0941e942a319e1cc5a8b743b6993.webp 760w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/ad_hubea58d1d850a610a2195fe38eece12fb_58199_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20241018-syed/ad_hubea58d1d850a610a2195fe38eece12fb_58199_deb097bd50da0d94a76fc0dc7719233e.webp"
width="760"
height="412"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Over the summer, we focused on analyzing the data and, after discussions with Chameleon operators, identified 33 key metrics from the Prometheus Node Exporter that serve as leading indicators of resource usage on the nodes. For example:&lt;/p>
&lt;ul>
&lt;li>CPU usage: Metrics like node_load1, node_load5, and node_load15.&lt;/li>
&lt;li>Memory usage: Including buffer utilization.&lt;/li>
&lt;li>Disk usage: Metrics for I/O time, and read/write byte rates.&lt;/li>
&lt;li>Network activity: Rate of bytes received and transmitted.&lt;/li>
&lt;li>Filesystem metrics: Such as inode_utilization_ratio and node_procs_blocked.&lt;/li>
&lt;li>System-level metrics: Including node forks, context switches, and interrupts.&lt;/li>
&lt;/ul>
&lt;p>Collected every 5 minutes, these metrics provide a comprehensive view of node performance and resource consumption.
After finalizing the metrics we wanted to monitor, we selected the following four anomaly detection methods, chosen primarily for their popularity in academia and recent publication at high-impact venues such as SIGKDD and SC.&lt;/p>
&lt;ul>
&lt;li>OmniAnomaly [KDD 2019] (without POT threshold selection, as it requires labels)&lt;/li>
&lt;li>USAD [KDD 2020]&lt;/li>
&lt;li>TranAD [KDD 2022]&lt;/li>
&lt;li>Prodigy [SC 2023] (only the VAE; we do not use their feature selection, as it requires labels)&lt;/li>
&lt;/ul>
&lt;p>We collected 75 days of healthy data from Chameleon, and after applying min-max scaling, we trained the models.
We then used these models to run inference on the metrics collected during outages, as marked by Chameleon operators.
The goal was to determine whether the outage data revealed something interesting or anomalous.
We can verify our approach by manually reviewing the results generated by these four anomaly detection methods.
Below are the results from the four methods on different outages, followed by an example of how these methods identified the root cause of an anomaly.&lt;/p>
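&lt;p>To illustrate the common scoring recipe behind these reconstruction-based methods (not any one paper&amp;rsquo;s exact procedure): fit min-max scaling on the healthy window, score each window by reconstruction error, and flag scores above a high percentile of the healthy-data errors. The &lt;code>model&lt;/code> below is a placeholder.&lt;/p>
&lt;pre>&lt;code class="language-python"># Sketch of the scoring step shared by these reconstruction-based methods.
# `model` is a placeholder for OmniAnomaly / USAD / TranAD / a VAE.
import numpy as np

def fit_minmax(healthy):  # healthy: (T, n_metrics)
    lo, hi = healthy.min(axis=0), healthy.max(axis=0)
    return lo, np.maximum(hi - lo, 1e-9)  # avoid division by zero

def score(model, windows, lo, span):  # windows: (B, W, n_metrics)
    x = (windows - lo) / span                    # min-max scaling
    recon = model(x)                             # placeholder reconstruction
    return ((x - recon) ** 2).mean(axis=(1, 2))  # per-window MSE

# threshold = np.percentile(score(model, healthy_windows, lo, span), 99)
# anomalous = score(model, outage_windows, lo, span) > threshold
&lt;/code>&lt;/pre>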
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Resulsts of different approaches" srcset="
/report/osre24/uchicago/chameleoncloud/20241018-syed/comparison_plot_huca4b68c1c2625c3b2e86230c54612ea2_129034_adb242a18524d714dae87d46b29e1612.webp 400w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/comparison_plot_huca4b68c1c2625c3b2e86230c54612ea2_129034_9dcdbbc6bac285c06195f54d49bd5ffe.webp 760w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/comparison_plot_huca4b68c1c2625c3b2e86230c54612ea2_129034_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20241018-syed/comparison_plot_huca4b68c1c2625c3b2e86230c54612ea2_129034_adb242a18524d714dae87d46b29e1612.webp"
width="760"
height="355"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The above figure shows the percentage of outage data that was flagged as anomalous by different models.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="cause of anomaly according to each model" srcset="
/report/osre24/uchicago/chameleoncloud/20241018-syed/partial-authentication-outage_plot_hu92456580e56fddf4f3c592621d13c105_392593_6edd22782678b48ce3a7cebad859b982.webp 400w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/partial-authentication-outage_plot_hu92456580e56fddf4f3c592621d13c105_392593_3da0e020cdc4ddfd508b77b6a0adc3d2.webp 760w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/partial-authentication-outage_plot_hu92456580e56fddf4f3c592621d13c105_392593_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20241018-syed/partial-authentication-outage_plot_hu92456580e56fddf4f3c592621d13c105_392593_6edd22782678b48ce3a7cebad859b982.webp"
width="760"
height="532"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="cause of anomaly according to each model" srcset="
/report/osre24/uchicago/chameleoncloud/20241018-syed/chiuc-uplink-networking_plot_hu92456580e56fddf4f3c592621d13c105_376789_03e6f344d24d9b37a7d615ee3207586b.webp 400w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/chiuc-uplink-networking_plot_hu92456580e56fddf4f3c592621d13c105_376789_6193a4b7b9107cb2693435514d80d21d.webp 760w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/chiuc-uplink-networking_plot_hu92456580e56fddf4f3c592621d13c105_376789_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20241018-syed/chiuc-uplink-networking_plot_hu92456580e56fddf4f3c592621d13c105_376789_03e6f344d24d9b37a7d615ee3207586b.webp"
width="760"
height="532"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The above two plots show two examples of the top 5 metrics that contributed to the anomaly score for each anomaly detection model.&lt;/p>
&lt;p>Although the methods seem to indicate anomalies during outages, they are not able to pinpoint the affected service or the exact cause.
For example, the first partial authentication outage was due to a DNS error, which can manifest in various ways, such as reduced CPU, memory, or network usage.
This work is still in progress, and we are conducting the same analysis on container-level metrics for each service, allowing us to narrow the scope to the affected service and more effectively identify the root cause of anomalies.
We will share the next set of results soon.&lt;/p>
&lt;p>Thanks for your time, please feel free to reach out to me for any details or questions.&lt;/p></description></item><item><title>Final Blog: BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240829-qianru/</link><pubDate>Thu, 29 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240829-qianru/</guid><description>&lt;p>Hello! I&amp;rsquo;m Qianru! I have been contributing to the BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking project under the mentorship of Ziheng Duan. My project aims to provide a standardized, easily accessible evaluation framework for gene imputation in spatial transcriptomics.&lt;/p>
&lt;h1 id="motivation-and-overview">Motivation and Overview&lt;/h1>
&lt;p>The &amp;ldquo;BenchmarkST&amp;rdquo; project was driven by the need to address a critical challenge in spatial transcriptomics: the impact of sparse data on downstream tasks, such as spatial domain identification. Sparse data can significantly degrade the performance of these tasks. For example, in a 10X Visium dataset of human brain Dorsolateral Prefrontal Cortex (DLPFC), using the complete dataset with GraphST (a state-of-the-art clustering method) for clustering resulted in an ARI (Adjusted Rand Index) of 0.6347. However, when using only 20% of the data—a common scenario—the performance dropped dramatically to 0.1880. This stark difference highlights the importance of effective gene imputation, which can help restore the lost information and improve the accuracy of downstream analyses.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="fig1" srcset="
/report/osre24/uci/benchmarkst/20240829-qianru/fig1_hu72c585df7604f28a748aa64a85602fac_159578_1bdac9436ddd84b83023a2cd20d76fb3.webp 400w,
/report/osre24/uci/benchmarkst/20240829-qianru/fig1_hu72c585df7604f28a748aa64a85602fac_159578_8a97a3a52a0fad3fb5d2dbf596e883a9.webp 760w,
/report/osre24/uci/benchmarkst/20240829-qianru/fig1_hu72c585df7604f28a748aa64a85602fac_159578_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240829-qianru/fig1_hu72c585df7604f28a748aa64a85602fac_159578_1bdac9436ddd84b83023a2cd20d76fb3.webp"
width="760"
height="496"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
To tackle this issue, the BenchmarkST project led to the creation of the Impeller package. This package provides a standardized, easily accessible evaluation framework for gene imputation in spatial transcriptomics, offering preprocessed datasets, reproducible evaluation methods, and flexible inference interfaces. It spans different platforms, species, and organs, aiming to enhance the integrity and usability of spatial transcriptomics data.&lt;/p>
&lt;h1 id="what-was-accomplished">What Was Accomplished&lt;/h1>
&lt;h2 id="development-of-the-impeller-package">Development of the Impeller Package&lt;/h2>
&lt;h4 id="data-aggregation-and-preprocessing">Data Aggregation and Preprocessing:&lt;/h4>
&lt;p>We aggregated and preprocessed spatial transcriptomic datasets from multiple platforms (10X Visium, StereoSeq, SlideSeqV2), species (human, mouse), and organs (Dorsolateral Prefrontal Cortex, olfactory bulb). These datasets are readily available for download within the package.&lt;/p>
&lt;h4 id="unified-evaluation-framework">Unified Evaluation Framework:&lt;/h4>
&lt;p>A reproducible framework was developed, integrating methods such as K-Nearest Neighbors (KNN) and the deep learning-based Impeller method, enabling users to easily evaluate the performance of different gene imputation techniques.&lt;/p>
&lt;h4 id="inference-interfaces">Inference Interfaces:&lt;/h4>
&lt;p>We provided interfaces that allow users to apply gene imputation on custom datasets, offering the flexibility to predict any gene in any cell, maximizing the utility for diverse research needs.&lt;/p>
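&lt;p>To illustrate the kind of evaluation the framework standardizes, here is a sketch using a generic KNN baseline for clarity (this is not the actual Impeller API): hide a fraction of expression values, impute them from spatial neighbors, and compare against the held-out ground truth.&lt;/p>
&lt;pre>&lt;code class="language-python"># Sketch of a masked-imputation evaluation with a generic KNN baseline.
# This illustrates the evaluation loop only, not the Impeller API.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_impute(coords, expr, mask, k=6):
    """coords: (S, 2) spot locations; expr: (S, G) expression;
    mask: (S, G) bool, True where a value is hidden and must be imputed."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)
    _, idx = nn.kneighbors(coords)  # idx[:, 0] is the spot itself
    imputed = expr.copy()
    for s, g in zip(*np.nonzero(mask)):
        imputed[s, g] = expr[idx[s, 1:], g].mean()  # mean of k neighbors
    return imputed

# Correlation between imputed and true values on the masked entries gives
# a score comparable across methods, platforms, and species.
&lt;/code>&lt;/pre>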
&lt;h2 id="code-contributions-and-documentation">Code Contributions and Documentation&lt;/h2>
&lt;h4 id="repository">Repository:&lt;/h4>
&lt;p>All code related to the Impeller package has been committed to the &lt;a href="https://pypi.org/project/impeller/0.1.2/#files" target="_blank" rel="noopener">Impeller&lt;/a> repository.&lt;/p>
&lt;h4 id="link-to-versions">Link to Versions:&lt;/h4>
&lt;p>&lt;a href="https://pypi.org/project/impeller/0.1.2/#history" target="_blank" rel="noopener">Here&lt;/a> you can find all the versions made during the project, with detailed descriptions of each change.&lt;/p>
&lt;h4 id="readmemdhttpspypiorgprojectimpeller012description">&lt;a href="https://pypi.org/project/impeller/0.1.2/#description" target="_blank" rel="noopener">README.md&lt;/a>:&lt;/h4>
&lt;p>Detailed documentation on how to use the Impeller package, including installation instructions, usage examples, and explanations of the key components.&lt;/p></description></item><item><title>Final Blog: FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fep_bench/20240816-jaycezhu/</link><pubDate>Fri, 16 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fep_bench/20240816-jaycezhu/</guid><description>&lt;h2 id="background">Background&lt;/h2>
&lt;p>Hello, I’m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/lihaowen-jayce-zhu/">Lihaowen (Jayce) Zhu&lt;/a>, a 2024 SoR contributor for the FEP-bench project, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a>. Before diving in, let&amp;rsquo;s recap the goal of our project and our progress up to the midterm. The FEP-Bench project addresses the significant bottlenecks encountered during the data preprocessing phase of machine learning, particularly the challenges posed by data retrieval from data lakes and computational inefficiencies in data operations. To tackle these challenges, we have collected basic information on various common datasets for different machine learning tasks, along with their corresponding preprocessing pipelines.&lt;/p>
&lt;h2 id="methodology">Methodology&lt;/h2>
&lt;p>Our goal is to improve the efficiency of the machine learning preprocessing pipeline and keep the training process of the deep learning model busy; in other words, we need to raise the preprocessing throughput, i.e., the feed rate from the preprocessing stage into the training stage. Following previous work, we take a new view of deep learning preprocessing pipelines: the pipeline can be split into two parts. The first part contains the steps that run only once (S1-Sm); we call it the &amp;ldquo;offline&amp;rdquo; part. The second part contains all remaining steps, which run at every iteration of training; we call it the &amp;ldquo;online&amp;rdquo; part. After the offline preprocessing steps, the output data is written back to disk, and the online preprocessing steps must first load that data from storage before performing the remaining operations. The pipeline can be split at any step, and each split is a preprocessing strategy; some strategies achieve a much higher final preprocessing throughput than others. Our project adopts this method to profile the performance of different strategies, with the goal of maximizing the final preprocessing throughput into training for a given pipeline. We want this to be an automatic process that requires no extra user instructions or parameters.&lt;/p>
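&lt;p>A toy version of this strategy search is sketched below; the per-step costs and output sizes are made-up numbers standing in for profiled measurements.&lt;/p>
&lt;pre>&lt;code class="language-python"># Toy strategy search: given measured per-sample costs (seconds) and
# per-sample output sizes (bytes) for each step, compute the online
# throughput and storage overhead of every split point. All numbers
# below are made up for illustration.
steps = ["decode", "normalize", "pad", "augment", "log_mel"]
cost = [0.020, 0.004, 0.002, 0.006, 0.010]   # sec/sample per step
size = [4.0e5, 4.0e5, 4.2e5, 4.2e5, 0.5e5]   # bytes/sample after each step
load_cost_per_byte = 2e-9                    # cost of re-reading from disk

def evaluate_split(i):
    """Steps [0, i) run offline once; steps [i, m) run every epoch."""
    read = size[i - 1] * load_cost_per_byte if i > 0 else 0.0
    online_time = read + sum(cost[i:])
    storage = size[i - 1] if i > 0 else 0.0
    return 1.0 / online_time, storage

for i in range(len(steps) + 1):
    thr, storage = evaluate_split(i)
    print(f"offline={steps[:i]}: {thr:.0f} samples/s, {storage:.0f} B/sample")
# An automatic system would pick the split with the highest throughput
# that fits the storage budget.
&lt;/code>&lt;/pre>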
&lt;h2 id="experiment">Experiment&lt;/h2>
&lt;p>Next, we ran the data preprocessing strategy experiment on the LibriSpeech dataset, an audio dataset for ML tasks such as automatic speech recognition. The dataset is 6.3 GB with almost 30000 samples, and each audio file is stored in the binary FLAC format. As a result, the first step of the preprocessing pipeline we use is decoding, which converts the binary data into arrays of floats. We then applied some typical audio preprocessing steps of transformation (normalization, padding, extracting the loudest section) and augmentation (random cut, random shift, random mask, random added noise) to the audio data. Finally, the audio data is converted to a log-Mel spectrogram, which is commonly used in audio tasks like speech recognition and speaker identification.&lt;/p>
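&lt;p>For readers unfamiliar with these steps, below is a simplified sketch of such a pipeline using librosa; the augmentations shown are simplified stand-ins for the ones used in our experiments.&lt;/p>
&lt;pre>&lt;code class="language-python"># Simplified sketch of the audio pipeline: decode, transform, augment,
# then convert to a log-Mel spectrogram. Parameters are illustrative.
import numpy as np
import librosa

def preprocess(path, rng):
    y, sr = librosa.load(path, sr=16000)   # decode FLAC to a float array
    y = librosa.util.normalize(y)          # transformation: normalization
    y = np.roll(y, rng.integers(-sr, sr))  # augmentation: random shift
    y = y + rng.normal(0, 0.005, y.shape)  # augmentation: random noise
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    return librosa.power_to_db(mel)        # log-Mel spectrogram

# rng = np.random.default_rng(0); feats = preprocess("sample.flac", rng)
&lt;/code>&lt;/pre>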
&lt;p>We benchmarked the throughput performance and storage overhead of all possible strategy split points and observed trade-offs between them. Both storage overhead and throughput speed-up use the fully online method as the baseline. What we observed from our results is that the speed-up keeps increasing as we move operations into the offline part, and the storage consumption is very low for the strategies that split after audio decoding. We also analysed the performance of the individual transformation and augmentation methods. We found that the speed-up is quite stable between 1.0 and 1.2 across these methods, but some methods can have a high storage overhead, such as normalization and random added noise.&lt;/p>
&lt;p>Another thing we observed during our experiments is that the dataset size can influence the preprocessing pipeline throughput. We found that the throughput speed-up with 10000 samples is almost double the speed-up with 5000 samples, which suggests that a larger dataset size may lead to a higher speed-up. This raised the question of whether every operation follows this pattern, or whether only certain operations gain throughput as the dataset size increases, so we ran experiments measuring the throughput speed-ups at different dataset sizes for all operations in the audio preprocessing pipeline. The results showed that only the audio decoding step sees a large increase in speed-up at larger dataset sizes; for the transformation, augmentation, and log-Mel spectrogram steps, throughput stays at a steady level. This indicates that only the audio decoding step becomes faster and faster as the dataset size grows.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In our work, we have built up a collection of common datasets and their preprocessing pipelines for different machine-learning tasks. For the audio dataset LibriSpeech, we have run experiments on the trade-offs between throughput speed-ups, storage overhead, and dataset sizes. We found that speed-ups keep increasing as more operations are moved into the offline part, and that only the audio decoding step becomes faster and faster as the dataset size grows.&lt;/p>
&lt;h2 id="future-works">Future works&lt;/h2>
&lt;p>In the near future, we want to find the optimal preprocessing strategy by profiling only a small part of the original, enormous dataset. Second, beyond the audio dataset, we must expand our experiments to other datasets and ML tasks. Finally, we need to realize our goal of building an automatic system that decides the optimal strategy for a preprocessing pipeline.&lt;/p></description></item><item><title>Final Blog: FSA - Benchmarking Fail-Slow Algorithms</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fsa-benchmarking/20240814-xikangsong/</link><pubDate>Wed, 14 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fsa-benchmarking/20240814-xikangsong/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello! I hope you&amp;rsquo;re enjoying the summer as much as I am. I&amp;rsquo;m excited to join the SOR community as a 2024 contributor. My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/xikang-song/">Xikang Song&lt;/a>, and I&amp;rsquo;m thrilled to collaborate with mentors &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ruidan-li/">Ruidan Li&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kexin-pei/">Kexin Pei&lt;/a> on the FSA-Benchmark project. This project is dedicated to exploring and benchmarking various machine learning models to identify disks at high risk of fail-slow anomalies. Throughout this journey, we tested a broad range of algorithms, from traditional approaches to state-of-the-art techniques, using a robust evaluation system to compare their effectiveness.&lt;/p>
&lt;p>In the first half of the project, I focused on implementing and testing different machine learning models for detecting disks at high risk of fail-slow anomalies. This involved setting up initial models such as the Cost-Sensitive Ranking Model and Multi-Prediction Models, and beginning to explore LSTM networks for analyzing input disk data.&lt;/p>
&lt;p>In the second half, I built upon this foundation by refining the evaluation processes, exploring advanced models like PatchTST, and investigating the potential of large language models (LLMs) for detecting subtle fail-slow conditions in storage systems. This blog post will summarize the key achievements, findings, and comparisons with baseline models from this phase.&lt;/p>
&lt;h2 id="key-achievements">Key Achievements&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Comprehensive Benchmarking and Evaluation:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>I extended the benchmarking framework to evaluate multiple algorithms across 25 different data clusters on PERSEUS. This process involved generating and analyzing heatmaps that visualized the precision and recall of each model under various settings, providing a clear understanding of each approach&amp;rsquo;s strengths and limitations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Exploration of Advanced Machine Learning Models:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>LSTM Model:&lt;/strong> I implemented the Long Short-Term Memory (LSTM) model, specifically designed for sequential data, to capture temporal dependencies in disk performance metrics. This model was used to predict potential fail-slow anomalies by analyzing historical data. Using Mean Squared Error (MSE) as a risk indicator, the LSTM model outperformed baseline approaches like the Cost-Sensitive Ranking Model and Multi-Prediction Models, especially in clusters where latency patterns between faulty and normal disks were distinct, such as in Cluster_P. This resulted in higher precision and fewer false positives. However, in clusters with more complex and overlapping data distributions, like Cluster_L, the LSTM model&amp;rsquo;s performance diminished, similar to that of the baseline models. (A minimal sketch of this MSE-based scoring appears after this list.)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>PatchTST Model:&lt;/strong> I also introduced and evaluated the PatchTST model, which is built on a transformer-based architecture known for its ability to handle sequential data by capturing long-range dependencies and intricate temporal patterns. Unlike traditional models, PatchTST processes time series data in segments or &amp;ldquo;patches,&amp;rdquo; enhancing its ability to predict disk behavior over extended periods. Like the LSTM model, PatchTST uses outlier MSE values to assess disk risk. In clusters with a clear separation between faulty and normal disks, PatchTST outperformed baseline models by effectively identifying faulty patterns. However, similar to the LSTM model, PatchTST encountered difficulties in clusters with significant data overlap, such as Cluster_L.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Investigation into Large Language Models (LLMs):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>I explored the use of GPT-4o-mini for fail-slow detection. While large language models (LLMs) showed potential, particularly in reducing false positives and improving precision over baseline models, they did not consistently outperform specialized models like LSTM and PatchTST in this context. LLMs struggled with recall, especially as thresholds increased, revealing the challenges of adapting LLMs to time series data. This limitation arises because LLMs are primarily trained for natural language generation tasks, not for analyzing time series data. As a result, their ability to fully capture anomalies is limited. To improve their effectiveness, we need to develop methods that help LLMs better understand time series data. For example, incorporating statistical information about each disk’s performance could enhance LLMs&amp;rsquo; understanding, leading to better precision in fail-slow detection.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
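&lt;p>As noted above, here is a minimal sketch of the MSE-as-risk-indicator idea shared by the LSTM and PatchTST experiments; the architecture, hyperparameters, and thresholding are illustrative, not the exact models used.&lt;/p>
&lt;pre>&lt;code class="language-python"># Sketch: an LSTM forecasts the next latency sample; disks whose recent
# forecast errors are unusually large are flagged as fail-slow risks.
import torch
import torch.nn as nn

class Forecaster(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x):             # x: (batch, window, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # predict the next sample

@torch.no_grad()
def risk_scores(model, windows, targets):
    pred = model(windows)
    return ((pred - targets) ** 2).mean(dim=1)  # per-window forecast MSE

# Train on healthy traces only; at inference, scores above a high
# percentile of the healthy-data error distribution mark risky disks.
&lt;/code>&lt;/pre>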
&lt;h2 id="conclusion-and-future-work">Conclusion and Future Work&lt;/h2>
&lt;p>The work in this project demonstrated that while advanced machine learning models like LSTM and PatchTST offer significant potential for detecting fail-slow conditions, challenges remain in ensuring consistent performance across diverse clusters. Compared to baseline models, these advanced approaches generally provided better precision and recall, especially in clusters with distinct data patterns between faulty and normal disk performance time series. However, the persistent difficulties in more complex clusters indicate the need for further refinement.&lt;/p>
&lt;p>Moving forward, future work will focus on refining these models, particularly in improving their performance in challenging clusters like Cluster_L. Additionally, I plan to further explore techniques such as prompt engineering for LLMs to better tailor them for time series analysis and fail-slow detection tasks.&lt;/p>
&lt;h2 id="deliverables">Deliverables&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Repository:&lt;/strong> All comprehensive analysis code and source code can be found in the &lt;a href="https://github.com/songxikang/FSA_BENCHMARK" target="_blank" rel="noopener">FSA_BENCHMARK GitHub Repository&lt;/a>.&lt;/li>
&lt;li>&lt;strong>Jupyter Notebook:&lt;/strong> A notebook to reproduce the experiments and benchmarks on Chameleon: &lt;a href="https://chameleoncloud.org/experiment/share/585c1fc0-924c-4501-b143-ad6476339aa8" target="_blank" rel="noopener">Chameleon Experiment Notebook&lt;/a>.&lt;/li>
&lt;li>&lt;strong>Final Report:&lt;/strong> Comprehensive algorithm performance evaluation for all methods in &lt;a href="https://docs.google.com/document/d/1NONl23sXK-qE4Krx3JwG7gCrNiNmaaW1t4WVzMmomLQ/edit?usp=sharing" target="_blank" rel="noopener">FSA-Benchmarking Final Report&lt;/a>.&lt;/li>
&lt;/ul></description></item><item><title>Final Blog: FetchPipe: Data Science Pipeline for ML-based Prefetching</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fetchpipe/20240918-peiranqin/</link><pubDate>Sat, 27 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fetchpipe/20240918-peiranqin/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello, I’m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/peiran-qin/">Peiran Qin&lt;/a>, a CS student at the University of Chicago. This summer I worked on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fetchpipe/">FetchPipe: Data Science Pipeline for ML-based Prefetching&lt;/a> under the mentorship of Prof. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>. The FetchPipe project focuses on building a unified Python simulator and evaluating existing cache-eviction policies and ML-based prefetchers under this simulator. Through this project, we made the following contributions and gained several insights to share with the community:&lt;/p>
&lt;ol>
&lt;li>We built a simulator to evaluate various prefetchers under a unified framework, using production-level traces from Alibaba, Microsoft Research, and Tencent.&lt;/li>
&lt;li>Through the evaluation, we discovered several downsides of existing heuristic-based prefetchers.&lt;/li>
&lt;li>We drew several insights that can guide the design of future prefetchers.&lt;/li>
&lt;/ol>
&lt;h2 id="methodology">Methodology&lt;/h2>
&lt;p>In the first half of the SoR project, I mainly focused on &lt;strong>building the I/O prefetching simulator&lt;/strong>. The simulator should mimic real OS-level prefetching as closely as possible. First, we developed a mechanism that mimics users sending I/O requests to the underlying system. Then, we simulated the process of page division and memory management inside the system. Finally, we designed a sleep-based mechanism to mimic the I/O latency of backend storage. The resulting system can simulate the data path of I/O requests and prefetching in real systems, and collect crucial metrics such as hit rate, total prefetched data, bandwidth usage, prefetch accuracy, and total cache evictions.&lt;/p>
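&lt;p>A toy version of the simulator&amp;rsquo;s core loop is sketched below: an LRU page cache combined with a simple next-block prefetcher, collecting hit rate and prefetch accuracy. The real simulator is far more detailed and also models I/O latency with its sleep-based backend.&lt;/p>
&lt;pre>&lt;code class="language-python"># Toy core loop: LRU page cache + next-block prefetcher, collecting
# hit rate and prefetch accuracy. Illustrative only.
from collections import OrderedDict

def simulate(trace, cache_size=1024):
    cache, prefetched = OrderedDict(), set()
    hits = used = issued = 0
    for block in trace:
        if block in cache:
            hits += 1
            if block in prefetched:
                used += 1                # the prefetch was useful
            cache.move_to_end(block)     # LRU update
        else:
            cache[block] = True          # miss: fetch from "storage"
        nxt = block + 1                  # next-block prefetch policy
        if nxt not in cache:
            cache[nxt] = True
            prefetched.add(nxt)
            issued += 1
        while len(cache) > cache_size:   # evict least recently used
            evicted, _ = cache.popitem(last=False)
            prefetched.discard(evicted)
        prefetched.discard(block)        # accessed, so no longer pending
    return hits / len(trace), used / max(issued, 1)

# hit_rate, prefetch_accuracy = simulate([1, 2, 3, 10, 11, 12, 3])
&lt;/code>&lt;/pre>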
&lt;p>In the second half of the SoR project, I concentrated on the &lt;strong>evaluation of existing prefetchers&lt;/strong>. First, we surveyed existing state-of-the-art prefetchers and divided them into two categories: (1) heuristic-based prefetchers and (2) ML-based prefetchers. Next, for each category, we picked several representative prefetchers and implemented them within our simulator. Then, we evaluated those prefetchers using over 600 production-level traces from Alibaba, Tencent, and Microsoft Research. Finally, we analyzed the performance of those prefetchers and discovered some interesting insights that might guide the design of future prefetchers.&lt;/p>
&lt;p>Finally, building on the achievements of the SoR project, I will continue working on this interesting project with Prof. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>. We are leveraging the insights we have gained to build an I/O prefetcher that mitigates the downsides of existing prefetchers.&lt;/p>
&lt;h2 id="insights">Insights&lt;/h2>
&lt;p>Based on our experiments on the existing prefetchers, we would like to share the following insights:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Heuristic-based prefetchers, including Linux Readahead and the Stride prefetcher, rely on strict predefined rules and detect only straightforward access patterns. These prefetchers are too conservative to recognize increasingly complex access patterns. In particular, in real-world applications, sequential accesses are interleaved with random accesses, creating a level of complexity that Linux Readahead and Stride prefetchers find difficult to recognize.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Offline learning-based prefetchers learn access patterns by training machine learning models on pre-collected historical access traces. Blessed by the representational power of machine learning, these prefetchers excel at recognizing complex access patterns. However, their effectiveness is constrained by their dependence on the patterns encountered during offline training, making them less adaptable to previously unseen patterns in online scenarios. Moreover, because they do not rely on predefined prefetching rules, offline learning-based prefetchers are more prone to prefetching useless data, which causes cache pollution and extra pressure on backend storage.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We argue that a good prefetcher under today&amp;rsquo;s complex and changing workloads should have three properties: (1) Complexity recognition: the prefetcher should be able to recognize the complex access patterns of a complex workload. (2) Reliability: the prefetcher should rarely prefetch useless data and cause cache pollution. (3) Adaptability: the prefetcher should adapt itself to changing workloads.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="future-works">Future Works&lt;/h2>
&lt;p>Based on the above insights, we are now designing our own prefetchers that can mitigate the downsides of existing prefetchers. We will make our code public after we finalize our design.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Through the SoR project, I delved into the research area of I/O prefetching by reproducing the related works, characterizing their performance, and designing our own prefetcher. We contribute to the community with a comprehensive simulator, evaluation results of related prefetchers, and insights that can guide the future prefetchers&amp;rsquo; design. In the future, I will continue working on the research area of prefetcher and keep making contributions.&lt;/p></description></item><item><title>Mid Term Blog: FetchPipe: Data Science Pipeline for ML-based Prefetching</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fetchpipe/20240727-peiranqin/</link><pubDate>Sat, 27 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fetchpipe/20240727-peiranqin/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello, I’m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/peiran-qin/">Peiran Qin&lt;/a>, a CS student at the University of Chicago, currently working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fetchpipe/">FetchPipe: Data Science Pipeline for ML-based Prefetching&lt;/a> under the mentorship of Prof. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>. The FetchPipe project focuses on building a unified Python simulator and evaluating existing cache-eviction algorithms and ML-based prefetchers under this simulator.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Existing prefetching algorithms can be categorized into (a) heuristic-based methods such as the Linux readahead prefetcher and (b) machine-learning-based methods like Long Short-Term Memory (LSTM) models. However, there is a research gap in comprehensively comparing all existing ML solutions, such as Leap and the LSTM Prefetcher, under a consistent evaluation setup. To ensure the fairness of evaluations, it is essential to integrate all baselines and our prefetcher into a homogeneous evaluation environment. Additionally, there is a need to evaluate cache eviction algorithms under prefetching scenarios.&lt;/p>
&lt;p>Therefore, in this project, we aim to build a fair simulator, deploy state-of-the-art prefetchers and cache eviction algorithms onto this platform, and then evaluate them using comprehensive metrics. The state-of-the-art prefetchers we consider include Pythia (MICRO'21), SGDP (arXiv), and the Markov-Chain prefetcher. For cache eviction algorithms, we consider S3FIFO (SOSP'23) and SIEVE (NSDI'24). Our focus is on implementing these algorithms on our simulator and evaluating their performance using block storage datasets from Alibaba, Tencent, and MSR. Besides evaluating the prefetchers and eviction algorithms individually, we also aim to combine prefetchers with cache eviction algorithms to test overall performance.&lt;/p>
&lt;h2 id="current-progress">Current Progress&lt;/h2>
&lt;p>In the past one and a half months, I have focused on (1) implementing our Python simulator and (2) deploying state-of-the-art prefetchers and cache eviction algorithms on this simulator. The implementation phase is now complete. The detailed progress is as follows:&lt;/p>
&lt;ol>
&lt;li>The Python simulator for evaluating both ML-based and heuristic-based prefetchers, together with cache eviction algorithms, is done.&lt;/li>
&lt;li>Evaluation metrics collection, covering hit rate, total prefetched data, prefetch overhead, and prefetch accuracy, is implemented in the simulator (a minimal sketch of such a loop follows this list).&lt;/li>
&lt;li>The ML-based prefetchers SGDP and Pythia, together with the Markov-Chain prefetcher, are deployed on the simulator. SGDP is a graph-neural-network-based prefetcher, and Pythia is a reinforcement-learning-based prefetcher.&lt;/li>
&lt;li>State-of-the-art heuristic-based eviction algorithms are implemented in the simulator, including S3FIFO and SIEVE.&lt;/li>
&lt;/ol>
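&lt;p>As a rough illustration of how these metrics can be tallied, here is a hedged sketch of a simulator loop; the &lt;code>prefetcher.access()&lt;/code> interface and the naive FIFO eviction are assumptions made for the example, not our simulator&amp;rsquo;s actual API.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python"># Hypothetical simulator loop tallying the metrics listed above.
# `cache` maps block address to a flag (1 if the block was prefetched);
# `prefetcher.access(addr)` may return one address to prefetch.
def simulate(trace, cache, prefetcher, capacity):
    hits = demand = prefetched = useful = 0
    for addr in trace:
        demand += 1
        if addr in cache:
            hits += 1
            useful += cache[addr]        # hit on a prefetched block
        cache[addr] = 0                  # now counts as demand-fetched
        cand = prefetcher.access(addr)
        if cand is not None and cand not in cache:
            cache[cand] = 1              # speculative fill
            prefetched += 1
        while len(cache) > capacity:     # naive FIFO eviction
            cache.pop(next(iter(cache)))
    return {"hit_rate": hits / demand,
            "total_prefetched": prefetched,
            "prefetch_accuracy": useful / prefetched if prefetched else 0.0}
&lt;/code>&lt;/pre>&lt;/div>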
&lt;p>With the simulator and state-of-the-art ML-based prefetchers and eviction algorithms in place, the next steps are to (1) organize a large-scale dataset (including over 600 traces from real storage servers) for testing performance and (2) evaluate the implemented prefetchers and eviction algorithms on this dataset. Finally, I will analyze the evaluation results and provide insights from the experimental outcomes. For the ML-based prefetchers, I will analyze both ML-related metrics such as accuracy and F1-score, and system metrics such as hit rate and various overheads.&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>The biggest challenge is implementing existing prefetchers correctly and fairly. Since some state-of-the-art prefetchers are designed for DRAM prefetching, adapting them for SSD prefetching in the simulator is challenging. Additionally, the lack of source code for some works makes it difficult to reproduce their algorithms accurately based solely on their paper descriptions.&lt;/p></description></item><item><title>Halfway Blog: FSA: Benchmarking Fail-Slow Algorithms</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fsa-benchmarking/20240723-xikangsong/</link><pubDate>Tue, 23 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fsa-benchmarking/20240723-xikangsong/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hi, I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/xikang-song/">Xikang Song&lt;/a>, a 2024 SoR contributor to the project, working with mentors &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ruidan-li/">Ruidan Li&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kexin-pei/">Kexin Pei&lt;/a>. Our FSA-Benchmark project is dedicated to exploring and benchmarking various machine learning models to identify disks at high risk of fail-slow anomalies. We will benchmark a range of machine learning algorithms, from traditional to advanced methods, and compare the results using a comprehensive evaluation system. This will provide a clear view of how machine learning impacts critical error detection in RAID systems.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Fail-slow issues in storage systems, where a disk operates at a significantly reduced speed without completely failing, are subtle and can manifest as consistently higher latency compared to peer disks or recurrent abnormal latency spikes. These issues are challenging to detect but can significantly degrade overall system performance over time. Fixed thresholds are ineffective because latency distributions vary across clusters, leading to thresholds that are either too low or too high and resulting in numerous false alerts. Therefore, we are enthusiastic about using machine learning models to analyze disk performance data. Machine learning algorithms can deeply learn the trends in the data, providing better detection capabilities.&lt;/p>
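&lt;p>As a toy illustration of peer comparison (not our actual detection pipeline), the pandas sketch below flags disks whose latency sits far above the median of their peers in the same cluster; the column names and the 2x cutoff are hypothetical.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import pandas as pd

# Hypothetical layout: one latency sample per disk per time window.
df = pd.DataFrame({
    "cluster": ["A"] * 6,
    "disk": ["d0", "d1", "d2", "d0", "d1", "d2"],
    "latency_ms": [1.1, 1.0, 4.8, 1.2, 0.9, 5.1],
})

# Compare each disk against its peers in the same cluster rather than
# against one fixed global threshold.
peer_median = df.groupby("cluster")["latency_ms"].transform("median")
df["slowdown"] = df["latency_ms"] / peer_median
print(df[df["slowdown"] > 2.0]["disk"].unique())  # ['d2']
&lt;/code>&lt;/pre>&lt;/div>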
&lt;h2 id="current-progress-and-challenges">Current Progress and Challenges&lt;/h2>
&lt;h3 id="algorithm-implementation">Algorithm Implementation:&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Cost-Sensitive Ranking Model&lt;/strong>: Inspired by the paper &amp;ldquo;Improving Service Availability of Cloud Systems by Predicting Disk Error&amp;rdquo; presented at the USENIX ATC &amp;lsquo;18 conference, this model ranks disks based on fail-slow risk.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Multi-Prediction Models&lt;/strong>: Drawing from &amp;ldquo;Improving Storage System Reliability with Proactive Error Prediction&amp;rdquo; presented at the USENIX ATC &amp;lsquo;17 conference, this approach uses multiple traditional machine learning models to evaluate disk health using diverse features. Various models were tested, with the Random Forest classifier proving most effective.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>LSTM Model&lt;/strong>: This model employs Long Short-Term Memory (LSTM) networks, trained on the first day&amp;rsquo;s data for each cluster and evaluated on data spanning all days. It captures temporal dependencies to accurately predict fail-slow anomalies over time (a minimal sketch follows this list).&lt;/p>
&lt;/li>
&lt;/ul>
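&lt;p>For readers unfamiliar with this setup, here is a minimal, illustrative PyTorch sketch of the LSTM idea mentioned above; the feature count, window length, and classification head are assumptions for the example, not the exact model we trained.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import torch
import torch.nn as nn

# Sketch: map a window of per-disk performance samples to a fail-slow
# probability. Shapes and feature choices are illustrative only.
class FailSlowLSTM(nn.Module):
    def __init__(self, n_features=4, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):  # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return torch.sigmoid(self.head(out[:, -1]))  # one probability per disk

model = FailSlowLSTM()
window = torch.randn(8, 60, 4)  # 8 disks, 60 time steps, 4 features
print(model(window).shape)      # torch.Size([8, 1])
&lt;/code>&lt;/pre>&lt;/div>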
&lt;h3 id="comprehensive-evaluation">Comprehensive Evaluation:&lt;/h3>
&lt;ol>
&lt;li>Collected outputs from all algorithms on Chameleon for Perseus data A to Y (25 clusters).&lt;/li>
&lt;li>Parsed the outputs through a comprehensive evaluation system, recording the true/false positives/negatives.&lt;/li>
&lt;li>Plotted heat maps to show precision and recall under different look-back days and alert threshold settings (sketched schematically after this list).&lt;/li>
&lt;li>Compared the performance across different clusters to draw conclusions.&lt;/li>
&lt;/ol>
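&lt;p>Schematically, the sweep behind those heat maps can be written as follows; &lt;code>detect&lt;/code> is a hypothetical stand-in for any of the algorithms above, returning the set of disks it flags.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import numpy as np

# Sweep look-back days and alert thresholds, recording precision and
# recall per cell; the two matrices can then be plotted as heat maps
# (e.g. with matplotlib's imshow).
def precision_recall_grid(detect, samples, labels, lookbacks, thresholds):
    P = np.zeros((len(lookbacks), len(thresholds)))
    R = np.zeros_like(P)
    for i, lb in enumerate(lookbacks):
        for j, th in enumerate(thresholds):
            pred = detect(samples, lookback=lb, threshold=th)  # set of disks
            tp = len(pred.intersection(labels))
            P[i, j] = tp / len(pred) if pred else 0.0
            R[i, j] = tp / len(labels) if labels else 0.0
    return P, R
&lt;/code>&lt;/pre>&lt;/div>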
&lt;h3 id="packaging-code">Packaging Code:&lt;/h3>
&lt;ul>
&lt;li>Packaged all the code into a Trovi Jupyter notebook, including the Chameleon server setup, to provide clear steps for running the code and reproducing the experiments. All algorithm testing and result parsing can be easily done here.&lt;/li>
&lt;/ul>
&lt;h3 id="challenges">Challenges&lt;/h3>
&lt;p>Initially, I was unsure how to evaluate the performance of different algorithms. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ruidan-li/">Ruidan Li&lt;/a> provided comprehensive guidance on collecting all the results uniformly and parsing them to gather true/false positives/negatives. This approach enabled us to derive meaningful metrics and plot heatmaps for precision and recall. I learned the scientific method of benchmarking performance, and I am grateful for the guidance.&lt;/p>
&lt;h2 id="future-steps">Future Steps&lt;/h2>
&lt;h3 id="further-investigation-of-advanced-algorithms">Further Investigation of Advanced Algorithms&lt;/h3>
&lt;p>We plan to explore advanced algorithms such as PatchTST. This will involve systematically collecting outputs and conducting comprehensive benchmarking to assess their performance in identifying fail-slow anomalies.&lt;/p>
&lt;h3 id="transition-to-large-language-models-llms">Transition to Large Language Models (LLMs)&lt;/h3>
&lt;p>Recognizing the limitations of traditional machine learning methods, we intend to transition to utilizing Large Language Models (LLMs). LLMs have demonstrated superior capabilities in understanding complex patterns and making accurate predictions. We anticipate that incorporating LLMs into our analysis will enhance our ability to detect and predict fail-slow anomalies more accurately, leading to better overall system reliability.&lt;/p></description></item><item><title>Enabling VAA Execution: Environment and VAA Preparation and/or Reproducibility for Dynamic Bandwidth Allocation (CONCIERGE)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/edgerep/20240720-rafaelsw/</link><pubDate>Sat, 20 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/edgerep/20240720-rafaelsw/</guid><description>&lt;p>Hi there!&lt;/p>
&lt;p>I am Rafael Sinjunatha Wulangsih, a Telecommunication Engineering graduate from the Bandung Institute of Technology (ITB), Bandung, Indonesia. I&amp;rsquo;m currently contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/edgerep">&amp;ldquo;EdgeRep: Reproducing and benchmarking edge analytic systems&amp;rdquo;&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> and Prof. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/junchen-jiang/">Junchen Jiang&lt;/a>. You can find more details about the project proposal &lt;a href="https://drive.google.com/file/d/1GUMiglFqezOqEeQiMaL4QVgsXZOHYoEK/view?usp=drive_link" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>This project addresses the challenges posed by the massive deployment of edge devices, such as traffic or security cameras, in smart cities and other environments. In the previous Edgebench project, the team proposed a solution to dynamically allocate bandwidth and compute resources to video analytic applications (VAAs) running on edge devices. However, that project was limited to a single VAA, which may not represent the diverse applications running on edge devices. Therefore, the main goal of this project, &amp;ldquo;EdgeRep,&amp;rdquo; is to diversify the VAAs running on edge devices while utilizing a solution similar to that of the Edgebench project. EdgeRep aims to reproduce state-of-the-art self-adaptive VAAs (with seven candidates) and maintain self-adaptation in these video analytics pipelines. We will implement it ourselves if the video analytics applications do not support self-adaptation.&lt;/p></description></item><item><title>Halfway Through GSOC: Heterogeneous Graph Neural Networks for I/O Performance Bottleneck Diagnosis</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/aiio/20240720-mahdi/</link><pubDate>Sat, 20 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/aiio/20240720-mahdi/</guid><description>&lt;p>Hello, I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mahdi-banisharifdehkordi/">Mahdi Banisharifdehkordi&lt;/a>, a Ph.D. student in Computer Science at Iowa State University. I&amp;rsquo;m currently working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/aiio/">AIIO / Graph Neural Network&lt;/a> project under the guidance of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a> and Suren Byna. Our project focuses on enhancing the AIIO framework to automatically diagnose I/O performance bottlenecks in high-performance computing (HPC) systems using Graph Neural Networks (GNNs).&lt;/p>
&lt;h1 id="project-overview">Project Overview&lt;/h1>
&lt;p>Our primary goal is to tackle the persistent issue of I/O bottlenecks in HPC applications. Identifying these bottlenecks manually is often labor-intensive and prone to errors. By integrating GNNs into the AIIO framework, we aim to create an automated solution that can diagnose these bottlenecks with high accuracy, ultimately improving the efficiency and reliability of HPC systems.&lt;/p>
&lt;h1 id="progress-and-challenges">Progress and Challenges&lt;/h1>
&lt;p>Over the past few weeks, my work has been centered on developing a robust data pre-processing pipeline. This pipeline is crucial for converting raw I/O log data into a graph format suitable for GNN analysis. The data pre-processing involves extracting relevant features from Darshan I/O logs, which include job-related information and performance metrics. One of the main challenges has been dealing with the heterogeneity and sparsity of the data, which can affect the accuracy of our models. To address this, we&amp;rsquo;ve focused on using correlation analysis to identify and select the most relevant features, ensuring that the dataset is well-structured and informative for GNN processing.&lt;/p>
&lt;p>We&amp;rsquo;ve also started constructing the GNN model. The model is designed to capture the complex relationships between different I/O operations and their impact on system performance. This involves defining nodes and edges in the graph that represent job IDs, counter types, and their values. We explored different graph structures, including those that focus on counter types and those that incorporate more detailed information. While more detailed graphs offer better accuracy, they also require more computational resources.&lt;/p>
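&lt;p>For illustration, a heterogeneous graph of this kind can be expressed with PyTorch Geometric&amp;rsquo;s &lt;code>HeteroData&lt;/code>; the node counts, feature sizes, and edge layout below are invented for the example and are not our project&amp;rsquo;s actual schema.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import torch
from torch_geometric.data import HeteroData

# Toy heterogeneous graph: job nodes connected to the I/O counters they
# report. In practice the counter features would come from Darshan logs
# after the correlation-based feature selection described above.
data = HeteroData()
data["job"].x = torch.randn(3, 8)      # 3 jobs, 8 job-level features each
data["counter"].x = torch.randn(5, 1)  # 5 counters, value as the feature
data["job", "reports", "counter"].edge_index = torch.tensor(
    [[0, 0, 1, 2, 2],                  # source job ids
     [0, 1, 2, 3, 4]])                 # target counter ids
print(data)
&lt;/code>&lt;/pre>&lt;/div>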
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Overview" srcset="
/report/osre24/lbl/aiio/20240720-mahdi/overview_hu3b8a0374313d077c49f26c894c548b00_437453_efa6bf6f7434ca74fff6a35fcb540861.webp 400w,
/report/osre24/lbl/aiio/20240720-mahdi/overview_hu3b8a0374313d077c49f26c894c548b00_437453_de1d11a65f3f46dfd75b1bc00e8e6406.webp 760w,
/report/osre24/lbl/aiio/20240720-mahdi/overview_hu3b8a0374313d077c49f26c894c548b00_437453_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/aiio/20240720-mahdi/overview_hu3b8a0374313d077c49f26c894c548b00_437453_efa6bf6f7434ca74fff6a35fcb540861.webp"
width="760"
height="566"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h1 id="current-achievements">Current Achievements&lt;/h1>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Data Pre-processing Pipeline&lt;/strong>: We have successfully developed and tested the pipeline to transform Darshan I/O logs into graph-structured data. This was a significant milestone, as it sets the foundation for all subsequent GNN modeling efforts.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>GNN Model Construction&lt;/strong>: The initial version of our GNN model has been implemented. This model is now capable of learning from the graph data and making predictions about I/O performance bottlenecks.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Correlation Analysis for Graph Structure Design&lt;/strong>: We have used correlation analysis on the dataset to understand the relationships between I/O counters. This analysis has been instrumental in designing a more effective graph structure, helping to better capture the dependencies and interactions critical for accurate performance diagnosis.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Correlation Analysis1" srcset="
/report/osre24/lbl/aiio/20240720-mahdi/correlation1_huf0ba9e5fcd08c89560bf3e668ac22994_763024_211eb50374f4febd5aee688644797792.webp 400w,
/report/osre24/lbl/aiio/20240720-mahdi/correlation1_huf0ba9e5fcd08c89560bf3e668ac22994_763024_fd5992e42a60d6cb85be9cd136a5d93b.webp 760w,
/report/osre24/lbl/aiio/20240720-mahdi/correlation1_huf0ba9e5fcd08c89560bf3e668ac22994_763024_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/aiio/20240720-mahdi/correlation1_huf0ba9e5fcd08c89560bf3e668ac22994_763024_211eb50374f4febd5aee688644797792.webp"
width="760"
height="614"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Correlation Analysis2" srcset="
/report/osre24/lbl/aiio/20240720-mahdi/correlation2_hu550eeb7f303ef774f36732146058c5a3_277912_b05324cc90f73bd1b2ff53c9d2d04ecb.webp 400w,
/report/osre24/lbl/aiio/20240720-mahdi/correlation2_hu550eeb7f303ef774f36732146058c5a3_277912_0115179de349c5834c2b3fc2636ecd23.webp 760w,
/report/osre24/lbl/aiio/20240720-mahdi/correlation2_hu550eeb7f303ef774f36732146058c5a3_277912_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/aiio/20240720-mahdi/correlation2_hu550eeb7f303ef774f36732146058c5a3_277912_b05324cc90f73bd1b2ff53c9d2d04ecb.webp"
width="760"
height="309"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ol start="4">
&lt;li>&lt;strong>Training for Different Graph Structures&lt;/strong>: We are currently training our model using various graph structures to determine the most effective configuration for accurate I/O performance diagnosis. This ongoing process aims to refine our approach and improve the model&amp;rsquo;s predictive accuracy.&lt;/li>
&lt;/ol>
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;p>Looking ahead, we plan to focus on several key areas:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Refinement and Testing&lt;/strong>: We&amp;rsquo;ll continue refining the GNN model, focusing on improving its accuracy and efficiency. This includes experimenting with different graph structures and training techniques.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>SHAP Analysis&lt;/strong>: To enhance the interpretability of our model, we&amp;rsquo;ll incorporate SHAP (SHapley Additive exPlanations) values. This will help us understand the contribution of each feature to the model&amp;rsquo;s predictions, making it easier to identify critical factors in I/O performance (an illustrative example follows this list).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Documentation and Community Engagement&lt;/strong>: As we make progress, we&amp;rsquo;ll document our methods and findings, sharing them with the broader community. This includes contributing to open-source repositories and engaging with other researchers in the field.&lt;/p>
&lt;/li>
&lt;/ol>
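&lt;p>As an illustrative example only (SHAP applied to a tabular surrogate rather than to the GNN itself, which requires graph-specific tooling), the sketch below attributes a synthetic target to hypothetical per-job features.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Synthetic data where feature 0 dominates by construction, so its mean
# absolute SHAP value should rank highest.
X = np.random.randn(100, 6)                   # 6 hypothetical I/O features
y = 2 * X[:, 0] + 0.1 * np.random.randn(100)
model = RandomForestRegressor().fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)        # (n_samples, n_features)
print(np.abs(shap_values).mean(axis=0))       # per-feature importance
&lt;/code>&lt;/pre>&lt;/div>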
&lt;p>This journey has been both challenging and rewarding, and I am grateful for the support and guidance from my mentors and the community. I look forward to sharing more updates as we continue to advance this exciting project.&lt;/p></description></item><item><title>Halfway Through GSOC: My Experience and Learnings</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240718-qianru/</link><pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240718-qianru/</guid><description>&lt;p>Hello there! I&amp;rsquo;m Qianru, and this is my mid-term blog post for the 2024 Google Summer of Code. I am working on the BenchmarkST project, focusing on benchmarking gene imputation methods in spatial transcriptomics. My goal is to create a comprehensive, reproducible platform for evaluating these methods across various datasets and conditions.&lt;/p>
&lt;p>In this post, I will share some of the progress I have made so far, the challenges I have faced, and how I overcame them. I will also highlight some specific accomplishments and what I plan to do next.&lt;/p>
&lt;hr>
&lt;h3 id="achievements">Achievements:&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Developed the Python Package:&lt;/strong> I created the &amp;ldquo;Impeller&amp;rdquo; Python package, which includes tools for downloading example data, processing it, and training models. This package aims to standardize gene imputation tasks in spatial transcriptomics.&lt;/li>
&lt;li>&lt;strong>Example Data Integration:&lt;/strong> Successfully integrated various spatial transcriptomics datasets into the package for benchmarking purposes.&lt;/li>
&lt;li>&lt;strong>Benchmarking Framework:&lt;/strong> Established a framework for objective comparison of different gene imputation methodologies.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Python Package: Installation and Usage&lt;/strong>&lt;/p>
&lt;p>You can install the package using pip:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">pip install Impeller
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Download Example Data&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">from Impeller import download_example_data
download_example_data()
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Load and Process Data&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">from Impeller import load_and_process_example_data
data, val_mask, test_mask, x, original_x = load_and_process_example_data()
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Train Model&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">from Impeller import create_args, train
args = create_args()
test_l1_distance, test_cosine_sim, test_rmse = train(args, data, val_mask, test_mask, x, original_x)
&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h3 id="challenges">Challenges:&lt;/h3>
&lt;p>Reproducing the results of various gene imputation methods was not an easy task. I faced several challenges along the way:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Lack of Standardized Data:&lt;/strong> Some methods had incomplete or missing code, making it difficult to reproduce their results accurately.&lt;/li>
&lt;li>&lt;strong>Reproducibility Issues:&lt;/strong> Some published results were difficult to reproduce exactly, since they depended on unspecified parameters, random seeds, or software environments.&lt;/li>
&lt;li>&lt;strong>Resource Limitations:&lt;/strong> Running large-scale experiments required significant computational resources, which posed constraints on the project timeline.&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h3 id="future-work">Future Work:&lt;/h3>
&lt;p>Moving forward, I plan to:&lt;/p>
&lt;ol>
&lt;li>Extend the package&amp;rsquo;s functionalities to include more datasets and imputation methods.&lt;/li>
&lt;li>Enhance the benchmarking framework for more comprehensive evaluations.&lt;/li>
&lt;li>Collaborate with other researchers to validate and improve the package&amp;rsquo;s utility in the bioinformatics community.&lt;/li>
&lt;/ol>
&lt;hr>
&lt;p>I hope you found this update informative and interesting. If you have any questions or feedback, please feel free to contact me. Thank you for your attention and support!&lt;/p></description></item><item><title>Mid Term Blog: FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fep_bench/20240718-jaycezhu/</link><pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fep_bench/20240718-jaycezhu/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello, I’m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/lihaowen-jayce-zhu/">Lihaowen (Jayce) Zhu&lt;/a>, a 2024 SoR contributor for the FEP-Bench project, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a>. The FEP-Bench project addresses the significant bottlenecks encountered during the feature engineering and data preprocessing phase of ML, particularly focusing on the challenges posed by data retrieval from data lakes and computational inefficiencies in data operations. By exploring innovative caching, prefetching, and heuristic strategies, the project aims to optimize the preprocessing workflow, thereby enhancing efficiency and reducing the resources required by ML projects.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Our research project is set in the context of deep neural networks. To train a DNN, we first need a large amount of data. All raw data must be preprocessed by a data preprocessing pipeline, which is specific to the ML task at hand. Typically, in a preprocessing pipeline, the data is loaded from disk, converted to the correct format, transformed, and augmented before it can be fed into the training stage. In common ML training tasks and datasets, the data preprocessing stage can consume almost 65% of total training time. However, compared with the fast development of computing hardware, including GPUs and TPUs, the speed of data preprocessing pipelines has not improved nearly as much and cannot keep up with these hardware innovations, which creates a bottleneck in the efficiency of deep neural network training.&lt;/p>
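&lt;p>As a concrete (and deliberately simple) example of such a pipeline, the torchvision sketch below decodes, augments, normalizes, and batches images; &lt;code>FakeData&lt;/code> stands in for a real dataset, and every transform runs on the CPU per sample, which is exactly where the bottleneck described above appears.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import torch
from torchvision import datasets, transforms

# Typical CV preprocessing pipeline: augment, decode to tensor, normalize.
pipeline = transforms.Compose([
    transforms.RandomResizedCrop(224),        # augmentation
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),                    # PIL image to float tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
dataset = datasets.FakeData(transform=pipeline)  # stand-in for real data
loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=4)
images, labels = next(iter(loader))
print(images.shape)  # torch.Size([32, 3, 224, 224])
&lt;/code>&lt;/pre>&lt;/div>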
&lt;p>The bottlenecks fall into two categories: the data side and the computation side. The data-side bottleneck is mainly caused by data transfer within the system, including data fetching, I/O limits, very large data sizes, and complex data formats. The computation-side bottleneck can occur during data preprocessing operations and data shuffling. For distributed machine learning training systems, gathering the distributed data can also cause a computation-side bottleneck.&lt;/p>
&lt;h2 id="current-progress">Current Progress&lt;/h2>
&lt;p>In order to improve the efficiency of the machine learning preprocessing pipeline, we first need to understand and document the preprocessing workflows commonly used in machine learning, including pipelines of Natural Language Processing, Computer Vision, and Audio datasets. As a result, for the past month, we have built up a collection of common datasets for different machine learning tasks. The dataset types include NLP, CV, Audio, Linear Regression, Video and LiDAR. The machine learning job types are collected based on the dataset types, such as sentiment analysis for NLP, and image classification for CV. The data has either a structured or unstructured format. In addition, our collection contains the following attributes:&lt;/p>
&lt;ul>
&lt;li>Data/Sample size&lt;/li>
&lt;li>Typical preprocessing operations&lt;/li>
&lt;li>Preprocessing difficulty: hard/easy&lt;/li>
&lt;li>Input splittable&lt;/li>
&lt;li>Output reusable&lt;/li>
&lt;li>CPU/GPU/IO Bound&lt;/li>
&lt;li>Dataset and preprocessing links.&lt;/li>
&lt;/ul>
&lt;p>By collecting all this data, we can gain an overview of all common preprocessing pipelines in the current machine learning research field, and build up a solid basis for the next phase of our project, which requires hard work on benchmark profiling. For example, for the Audio datasets, we focus on the LibriSpeech dataset. It contains 1000 hours of speech sampled at 16kHz, making it one of the largest publicly available datasets for speech recognition tasks. The typical preprocessing steps of the LibriSpeech dataset include feature extraction, label to integer conversion, and padding.&lt;/p>
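&lt;p>A minimal sketch of those three LibriSpeech steps might look like the following; the character vocabulary and tensor shapes are illustrative assumptions rather than our benchmark code.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import torch
import torchaudio
from torch.nn.utils.rnn import pad_sequence

# Feature extraction, label-to-integer conversion, and padding.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)
vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz '")}

def preprocess(batch):  # batch of (waveform, transcript) pairs
    feats = [mel(w).squeeze(0).T for w, t in batch]  # (time, n_mels)
    labels = [torch.tensor([vocab[c] for c in t.lower()]) for w, t in batch]
    # Pad every sequence in the batch to the longest one.
    return pad_sequence(feats, batch_first=True), \
           pad_sequence(labels, batch_first=True)

waves = [(torch.randn(1, 16000), "hello"), (torch.randn(1, 24000), "world")]
x, y = preprocess(waves)
print(x.shape, y.shape)
&lt;/code>&lt;/pre>&lt;/div>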
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>During the first phase of the project, I encountered many challenges, as I had not previously been exposed to topics similar to this project. The first big problem was that I needed to learn the concepts of some machine learning tasks, such as NLP, from scratch, so that I could have a better understanding of the common datasets and pipelines. Also, I needed to review many different preprocessing pipelines for each machine learning task in depth, to make the table more comprehensive.&lt;/p></description></item><item><title>FetchPipe: Data Science Pipeline for ML-based Prefetching</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fetchpipe/20240625-peiranqin/</link><pubDate>Tue, 25 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fetchpipe/20240625-peiranqin/</guid><description>&lt;p>Hello, I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/peiran-qin/">Peiran Qin&lt;/a>, a first-year Pre-Doctoral student in Computer Science at the University of Chicago. This summer I will be
working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fetchpipe/">FetchPipe: Data Science Pipeline for ML-based Prefetching&lt;/a> under the mentorship of Prof. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>. This is my &lt;a href="https://docs.google.com/document/d/1Bq4tulf6bd9HuKyy3mxC-LRKwe9e7YAOVNYQNJTPsys/edit#heading=h.pwfhd8ioumbq" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;p>Caching and prefetching are integral components of modern storage systems, aimed at reducing I/O latency by utilizing faster but less dense memory for storing data that is accessed frequently. Traditional prefetching strategies, which primarily rely on heuristic-based methods, often fall short in performance, particularly in complex scenarios. To address these complex scenarios, machine learning solutions have emerged in recent years as a promising alternative, offering the ability to learn and predict complicated data access patterns. However, each existing ML prefetcher may be biased toward different scenarios and distinct evaluation metrics. There is still a need to evaluate the state-of-the-art machine-learning-based literature comprehensively and fairly under an aligned evaluation framework with extensive performance metrics. This motivated me to spend my summer on this interesting project!&lt;/p></description></item><item><title>Heterogeneous Graph Neural Networks for I/O Performance Bottleneck Diagnosis</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/aiio/20240614-mahdi/</link><pubDate>Fri, 14 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/aiio/20240614-mahdi/</guid><description>&lt;p>Hello, I am &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mahdi-banisharifdehkordi/">Mahdi Banisharifdehkordi&lt;/a>, a Ph.D. student in Computer Science at Iowa State University, specializing in Artificial Intelligence. This summer, I will be working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/aiio/">AIIO / Graph Neural Network&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a> and Suren Byna.&lt;/p>
&lt;p>High-Performance Computing (HPC) applications often face performance issues due to I/O bottlenecks. Manually identifying these bottlenecks is time-consuming and error-prone. My project aims to enhance the AIIO framework by integrating a Graph Neural Network (GNN) model to automatically diagnose I/O performance bottlenecks at the job level. This involves developing a comprehensive data pre-processing pipeline, constructing and validating a tailored GNN model, and rigorously testing the model&amp;rsquo;s accuracy using test cases from the AIIO dataset.&lt;/p>
&lt;p>Through this project, I seek to provide a sophisticated, AI-driven approach to understanding and improving I/O performance in HPC systems, ultimately contributing to more efficient and reliable HPC applications.&lt;/p></description></item><item><title>FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fep_bench/20240612-jaycezhu/</link><pubDate>Wed, 12 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fep_bench/20240612-jaycezhu/</guid><description>&lt;p>Hello, I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/lihaowen-jayce-zhu/">Lihaowen (Jayce) Zhu&lt;/a>, currently pursuing my Master of Science in Computer Science at the University of Chicago. I will be spending my
summer working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fep_bench/">FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a>
and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/swami-sundararaman/">Swami Sundararaman&lt;/a>, my &lt;a href="https://docs.google.com/document/d/1ta-AgK6Dom25OingMkIR1tRzd2Yk78PZa776Wb3oFQ8/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;p>The landscape of machine learning (ML) is profoundly impacted by the initial stages of feature engineering and data preprocessing. This phase, critical for the success of ML projects, is often the most time-consuming, representing about 80% of the effort in typical ML workflows. The FEP-Bench project proposes to address the significant bottlenecks encountered during this phase, particularly focusing on the challenges posed by data retrieval from data lakes and computational inefficiencies in data operations. By exploring innovative caching, prefetching, and heuristic strategies, this proposal aims to optimize the preprocessing workflow, thereby enhancing efficiency and reducing the required resources of ML projects.&lt;/p></description></item><item><title>FSA: Benchmarking Fail-Slow Algorithms</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fsa-benchmarking/20240612-xikangsong/</link><pubDate>Wed, 12 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fsa-benchmarking/20240612-xikangsong/</guid><description>&lt;p>Hi everyone! I&amp;rsquo;m Xikang, a master&amp;rsquo;s CS student at UChicago. As a part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/failslowalgorithms/">FSA benchmarking Project&lt;/a>, I&amp;rsquo;m thrilled to be a contributor to OSRE 2024, collaborating with Kexin Pei, an Assistant Professor of Computer Science at UChicago, and Ruidan, a talented PhD student at UChicago.&lt;/p>
&lt;p>This summer, I will focus on integrating some advanced ML into our RAID slowdown analysis. Our aim is to assess whether LLMs can effectively identify RAID slowdown issues and to benchmark their performance against our current machine learning algorithms. We will test the algorithms on Chameleon Cloud and benchmark them.&lt;/p>
&lt;p>Additionally, we will explore optimization techniques to enhance our pipeline and improve response quality. We hope this research will be a starting point for future work, utilizing LLMs to overcome the limitations of existing algorithms and provide a comprehensive analysis that enhances RAID and other storage system performance.&lt;/p>
&lt;p>I&amp;rsquo;m excited to work with all of you and look forward to your suggestions.
If you are interested, here is my &lt;a href="https://docs.google.com/document/d/1KpodnahgQDNf1-05TF2BdYXiV0lT_oYEnC0oaatHRoc/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p></description></item><item><title>ML-Powered Problem Detection in Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20240612-syed/</link><pubDate>Wed, 12 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20240612-syed/</guid><description>&lt;p>Hello, I am &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/syed-mohammad-qasim/">Syed Mohammad Qasim&lt;/a>, a PhD candidate in Electrical and Computer Engineering at Boston University. I will be spending my
summer working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/ml_detect_chameleon/">ML-Powered Problem Detection in Chameleon&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ayse-coskun/">Ayse Coskun&lt;/a>
and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/">Michael Sherman&lt;/a>.&lt;/p>
&lt;p>Currently, Chameleon Cloud monitors sites at the Texas Advanced Computing Center (TACC), University of Chicago,
Northwestern University, and Argonne National Lab. They collect metrics using Prometheus at each site and feed them
all to a central Mimir cluster. All the logs go to a central Loki, and Grafana is used to visualize and set alerts.
Chameleon currently collects around 3000 metrics. Manually reviewing and setting alerts on them is time-consuming
and labor-intensive. This project aims to help Chameleon operators monitor their systems more effectively and improve overall
reliability by creating an anomaly detection service that can augment the existing alerting framework.&lt;/p></description></item><item><title>BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240609-qianru/</link><pubDate>Sun, 09 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240609-qianru/</guid><description>&lt;p>Hello! My name is Qianru, and I will be working on a project to improve spatial transcriptomics during Google Summer of Code 2024. My project, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uci/benchmarkst/">Benchmarking Gene Imputation Methods for Spatial Transcriptomics&lt;/a>, is mentored by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> and &lt;a href="https://users.soe.ucsc.edu/~cormac/" target="_blank" rel="noopener">Cormac Flanagan&lt;/a>. The goal is to create a standard platform to evaluate methods for filling in missing gene data, which is a big challenge in spatial transcriptomics. &lt;a href="https://drive.google.com/file/d/1ydqGuuzpNgPpVUBvTiFvF1q7qV9gA_wm/view?usp=sharing" target="_blank" rel="noopener">My proposal can be viewed here!&lt;/a>&lt;/p>
&lt;p>Spatial transcriptomics lets us see where genes are active in tissues, giving us insight into how cells interact in their natural environment. However, current methods often miss some gene data, making it hard to get a complete picture. Gene imputation can help fill in these gaps.&lt;/p>
&lt;p>My project will:&lt;/p>
&lt;p>Create a benchmark dataset to standardize gene imputation tasks across different platforms, species, and organs.&lt;/p>
&lt;p>Compare various gene imputation methods to see how well they work in different scenarios.&lt;/p>
&lt;p>Develop a user-friendly Python package with tools for gene imputation to help researchers improve their data.&lt;/p>
&lt;p>I&amp;rsquo;m excited to contribute to this project and help advance the field of spatial transcriptomics by making data analysis more accurate and comprehensive.&lt;/p></description></item><item><title>FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fep_bench/</link><pubDate>Mon, 03 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fep_bench/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/swami-sundararaman/">Swami Sundararaman&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/lihaowen-jayce-zhu/">Lihaowen (Jayce) Zhu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In the realm of machine learning (ML), preprocessing of data is a critical yet often underappreciated phase, consuming approximately 80% of the time in common ML tasks. This extensive time consumption can be attributed to various challenges encountered from both data and computation perspectives.&lt;/p>
&lt;p>From the data side, one significant challenge is the slow retrieval of data from data lakes, which are storage repositories that hold a vast amount of raw data in its native format. However, the process of extracting this data can be slow, causing computation cycles to wait for data arrival and leading to delays in the entire preprocessing phase. Furthermore, the size of the data often exceeds the memory capacity of standard computing systems. This is a frequent occurrence in ML, as datasets are typically large and complex. Handling such large datasets requires sophisticated memory management techniques to ensure efficient preprocessing without overwhelming the system&amp;rsquo;s memory.&lt;/p>
&lt;p>On the computation side, a naive solution to data operations, especially aggregation, often leads to inefficiencies. These operations may require grouping a large chunk of data as a prerequisite before performing any actual computation. This grouping, without careful configuration and management, can trigger serious data shuffling, leading to extensive remote data movement when the data is distributed across various storage systems. Such data movement is not only time-consuming but also resource-intensive.&lt;/p>
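&lt;p>A tiny PySpark example makes the shuffle visible: the &lt;code>groupBy&lt;/code> below forces an Exchange stage in the query plan, because rows with the same key must be co-located before the aggregate can run. The data and key layout are arbitrary.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(1_000_000).withColumn("key", F.col("id") % 1000)

# Aggregation by key: Spark must first move rows with the same key to
# the same partition (a shuffle) before computing each group's average.
agg = df.groupBy("key").agg(F.avg("id").alias("mean_id"))
agg.explain()  # the physical plan shows an Exchange (shuffle) stage
&lt;/code>&lt;/pre>&lt;/div>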
&lt;p>To mitigate these challenges, there is a pressing need to design better caching, prefetching, and heuristic strategies for data preprocessing. The team aims to significantly reduce the time and resources required for preprocessing by optimizing data retrieval and computational processes.&lt;/p>
&lt;p>However, prior to the design and implementation of such a system, a systematic understanding of the preprocessing workflow is essential. Hence, throughout the program, the students will need to:&lt;/p>
&lt;ul>
&lt;li>Understand the current system used to preprocess data for ML training, for example, Hadoop or Spark.&lt;/li>
&lt;li>Collect the common datasets used for different types of ML models.&lt;/li>
&lt;li>Collect the typical operations used for preprocessing these datasets.&lt;/li>
&lt;li>Benchmark the performance of these operations in the existing frameworks under various experimental settings.&lt;/li>
&lt;li>Package the benchmark such that the team can later use it for reproduction or evaluation.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A rolodex of the commonly used datasets and their corresponding preprocessing operations and expected output formats/types&lt;/li>
&lt;li>A Chameleon Trovi package that preprocesses the dataset with a single-machine preprocessing framework like pandas&lt;/li>
&lt;li>A Chameleon Trovi package that preprocesses the dataset in an existing distributed computation framework like Hadoop or Spark&lt;/li>
&lt;/ul></description></item><item><title>ML-Powered Problem Detection in Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/ml_detect_chameleon/</link><pubDate>Wed, 06 Mar 2024 16:33:57 -0600</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/ml_detect_chameleon/</guid><description>&lt;p>Today’s Continuous Integration/Continuous Development (CI/CD) trends encourage
rapid design of software using a wide range of software components, followed by
frequent updates that are immediately deployed on the cloud. The complexity of
cloud systems along with the component diversity and break-neck pace of
development amplify the difficulty in identifying or fixing problems related to
performance, resilience, and security. Furthermore, existing approaches that
rely on human experts—e.g., methods involving manually-written
rules/scripts—have limited applicability to modern CI/CD processes, as they are
fragile, costly, and often not scalable. Consequently, there is growing
interest in applying machine learning (ML) based methods for identifying
vulnerabilities in code, non-compliant or otherwise problematic software, and
resilience problems in systems and networks. However, despite some success
stories in applying AI for cloud operations (e.g., in resource management),
much of cloud operations still rely on human-centric methods, which require
updates as the cloud undergoes CI/CD cycles. The goal of this summer project is
to explore methods of automation for the Chameleon Cloud to enable faster
detection and diagnosis of problems. Overall, the project will contribute to an
overarching vision of building an infrastructure that collects and synthesizes
cross-layer data from large-scale cloud systems, applying ML-powered methods to
automate cloud ops, and, further, making this data available to researchers
through coherent APIs and analytics engines.&lt;/p>
&lt;p>Currently, Chameleon uses runbooks as manual guides for operational tasks,
including routine maintenance and troubleshooting. However, these traditional
runbooks often fall short in dynamic and fast-paced CI/CD environments, as they
lack the flexibility to adapt to changes in software versions, deployment
configurations, and the unique challenges of emerging issues. To overcome these
challenges, the project will leverage ML to automate anomaly detection based on
telemetry data collected from Chameleon Cloud&amp;rsquo;s monitoring frameworks. This
method will not only facilitate rapid identification of performance anomalies
but also enable automated generation of runbooks. These runbooks can then offer
operators actionable steps to resolve issues efficiently, thereby making the
anomaly mitigation process more efficient. Furthermore, this approach supports
the automatic creation of targeted runbooks for newly generated support
tickets, enhancing response times and system reliability.&lt;/p>
&lt;p>Time-permitting, using a collection of automated runbooks (each targeting a
specific problem), we will analyze support tickets, common problems, and their
frequency to offer insights and suggestions to help roadmapping for Chameleon
Cloud to offer the best return on investment on fixing problems.&lt;/p>
&lt;p>A key aspect of this summer project is enhancing the reproducibility of
experiments in the cloud and improving data accessibility. We plan to design
infrastructures and APIs so that the telemetry data that is essential for
anomaly detection and automated runbooks is systematically documented and made
available. We also aim to collect and share insights and modules on applying ML
for cloud operations, including ML pipelines, data labeling strategies, data
preprocessing techniques, and feature engineering. By sharing these insights,
we aim to promote best practices and support reproducible experiments on public
clouds, thus fostering future ML-based practices within the Chameleon Cloud
community and beyond. Time permitting, we will explore applying lightweight
privacy-preserving approaches on telemetry data as well.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Machine Learning&lt;/code>, &lt;code>Anomaly Detection&lt;/code>, &lt;code>Automated Runbooks&lt;/code>, &lt;code>Telemetry Data&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>:
&lt;ul>
&lt;li>Proficiency in Machine Learning: Understanding of ML algorithms for anomaly detection and automation.&lt;/li>
&lt;li>Cloud Computing Knowledge: Familiarity with CI/CD environments and cloud architectures.&lt;/li>
&lt;li>Programming Skills: Proficiency in languages such as Python, especially in cloud and ML contexts.&lt;/li>
&lt;li>Data Analysis: Ability to analyze telemetry data using data analytics tools and libraries.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/">Michael Sherman&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Data leakage in applied ML: reproducing examples of irreproducibility</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/data-leakage/</link><pubDate>Wed, 21 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/data-leakage/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> applied machine learning, data leakage, reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, data analysis, machine learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>Data leakage &lt;a href="https://www.cell.com/patterns/pdfExtended/S2666-3899%2823%2900159-9" target="_blank" rel="noopener">has been identified&lt;/a> as a major cause of irreproducibility of a paper&amp;rsquo;s findings, when machine learning techniques are applied to problems in science. Data leakage includes errors such as:&lt;/p>
&lt;ul>
&lt;li>pre-processing before splitting into training/test sets&lt;/li>
&lt;li>feature selection before splitting into training/test sets&lt;/li>
&lt;li>duplicated data points in both training and test sets&lt;/li>
&lt;li>temporal leakage (e.g. shuffled K-fold cross validation with temporal data)&lt;/li>
&lt;li>group leakage (e.g. shuffled K-fold cross validation with data that has group structure)&lt;/li>
&lt;/ul>
&lt;p>and leads to an overly optimistic evaluation of model performance, such that the finding may no longer be the same when the error is corrected.&lt;/p>
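&lt;p>As a small, self-contained illustration of the first error type, the scikit-learn sketch below contrasts fitting a scaler on all the data before the split (leaky) with fitting it inside a pipeline on the training fold only; the data here is synthetic.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)

# Leaky: the scaler is fit on ALL rows, so test-set statistics leak into
# the training features before the split happens.
X_leaky = StandardScaler().fit_transform(X)
Xl_tr, Xl_te, yl_tr, yl_te = train_test_split(X_leaky, y, random_state=0)

# Correct: split first, then let a Pipeline fit the scaler on the
# training fold only, so the test fold stays unseen.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
print(model.fit(X_tr, y_tr).score(X_te, y_te))
&lt;/code>&lt;/pre>&lt;/div>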
&lt;p>Despite the seriousness of this problem, data leakage is often not covered in introductory machine learning courses, and many users of machine learning across varied science domains are unaware of it. Even those who have learned &amp;ldquo;rules&amp;rdquo; for avoiding data leakage (e.g. &amp;ldquo;never do feature selection on the test set&amp;rdquo;) may not understand the reasons for these &amp;ldquo;rules&amp;rdquo;, and how important they are for ensuring that the final result is valid and reproducible.&lt;/p>
&lt;p>The goal of this project is to create &lt;em>learning materials&lt;/em> demonstrating how instances of data leakage invalidate a result. These materials should be easily adoptable by instructors teaching machine learning in a wide variety of contexts, including those teaching a non-CS audience. To achieve this, the project proposes to re-implement published results that have been affected by data leakage, and package these implementations along with supporting material in a format suitable for use in classrooms and by independent learners. For each &amp;ldquo;irreproducible result&amp;rdquo;, the &amp;ldquo;package&amp;rdquo; should include -&lt;/p>
&lt;ul>
&lt;li>a re-implementation of the original result&lt;/li>
&lt;li>an explanation of the data leakage problem affecting the result, with an implementation of a &amp;ldquo;toy example&amp;rdquo; on synthetic data&lt;/li>
&lt;li>a re-implementation of the result without the data analysis error, to show how the finding is affected&lt;/li>
&lt;li>and examples of exam or homework questions that an instructor adopting this package may use to assess understanding.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Writing a successful proposal for this project&lt;/strong>&lt;/p>
&lt;p>A good proposal for this project should include, for at least a few &amp;ldquo;types&amp;rdquo; of data leakage mentioned above -&lt;/p>
&lt;ul>
&lt;li>a specific published result that could be used as an exemplar (you may find ideas among the review papers listed &lt;a href="https://reproducible.cs.princeton.edu/#rep-failures" target="_blank" rel="noopener">here&lt;/a>)&lt;/li>
&lt;li>a brief description of the details of the experiment that will reproduce that result (e.g. what data is used, what machine learning technique is used, what are the hyperparameters used for training)&lt;/li>
&lt;li>and an explanation of why this result is suitable for this use (it uses a publicly available dataset, a machine learning technique that is familiar and accessible to students in an introductory course, the paper has sufficient detail to reproduce the result, etc.)&lt;/li>
&lt;/ul>
&lt;p>The contributor will need to create learning materials that are written in a clear, straightforward, and concise manner, without unnecessary jargon. The proposal should show evidence of the contributor&amp;rsquo;s writing abilities.&lt;/p>
&lt;p>&lt;strong>Github link&lt;/strong>&lt;/p>
&lt;p>There is no pre-existing Git repository for this project - at the beginning of the summer, the contributor will create a new repository in the &lt;a href="https://github.com/teaching-on-testbeds/" target="_blank" rel="noopener">Teaching on Testbeds&lt;/a> organization, and the project materials will &amp;ldquo;live&amp;rdquo; there.&lt;/p>
&lt;p>To get a sense of the type of code you would be writing, here is an example of a learning module related to data leakage (however, it is not in the format described above): &lt;a href="https://colab.research.google.com/github/ffund/ml-notebooks/blob/master/notebooks/4-linear-regression-case-study-part-2.ipynb" target="_blank" rel="noopener">Beauty in the Classroom&lt;/a>&lt;/p>
&lt;p>&lt;strong>Project Deliverables&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&amp;ldquo;Packages&amp;rdquo; of learning materials for teaching about common types of data leakage&lt;/li>
&lt;li>&lt;a href="https://chameleoncloud.org/experiment/share/" target="_blank" rel="noopener">Trovi&lt;/a> artifacts for &amp;ldquo;playing back&amp;rdquo; each of the &amp;ldquo;packages&amp;rdquo;&lt;/li>
&lt;/ul></description></item><item><title>BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uci/benchmarkst/</link><pubDate>Sat, 17 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uci/benchmarkst/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> bioinformatics, spatial transcriptomics, gene imputation, benchmarking, cross-platform/species analysis&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong>
&lt;ul>
&lt;li>Proficiency in Python and/or R, both commonly used in bioinformatics.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong>
&lt;ul>
&lt;li>Experience with statistical data analysis and machine learning models.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (not required but preferred):&lt;/strong>
&lt;ul>
&lt;li>Proficiency in bioinformatics and computational biology.&lt;/li>
&lt;li>Familiarity with spatial transcriptomics datasets and platforms.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours). Given the scope of integrating multi-platform, multi-species datasets and the complexity of benchmarking gene imputation methods, this project is substantial. It requires extensive data preparation, analysis, and validation phases, making it suitable for a larger time investment.&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>The orchestration of cellular life is profoundly influenced by the precise control of gene activation and silencing across different spatial and temporal contexts. Understanding these complex spatiotemporal gene expression patterns is vital for advancing our knowledge of biological processes, from development and disease progression to adaptation. While single-cell RNA sequencing (scRNA-seq) has revolutionized our ability to profile gene expression across thousands of cells simultaneously, its requirement for cell dissociation strips away the critical spatial context, limiting our comprehension of cellular interactions within their native environments. Recent strides in spatial transcriptomics have started to bridge this gap by enabling spatially resolved gene expression measurements at single-cell or even sub-cellular resolutions. These advancements offer unparalleled opportunities to delineate the intricate tapestry of gene expression within tissues, shedding light on the dynamic interactions between cells and their surroundings.&lt;/p>
&lt;p>Despite these technological advances, a significant challenge remains: the datasets generated by spatial transcriptomic technologies are often incomplete, marred by missing gene expression values due to various technical and biological constraints. This limitation severely impedes our ability to fully interpret these rich datasets and extract meaningful insights from them. Gene imputation emerges as a pivotal solution to this problem, aiming to fill in these missing data points, thereby enhancing the resolution, quality, and interpretability of spatial transcriptomic datasets.&lt;/p>
&lt;p>Recognizing the critical importance of this task, there is a pressing need for a unified benchmarking platform that can facilitate the evaluation and comparison of gene imputation methods across a diverse array of samples, spanning multiple sampling platforms, species, and organs. Currently, the bioinformatics and spatial transcriptomics fields lack such a standardized framework, hindering progress and innovation. To address this gap, our project aims to establish a comprehensive gene imputation dataset that encompasses a wide range of conditions and parameters. We intend to reproduce known methods and assess their efficacy, providing a solid and reproducible foundation for future advancements in this domain.&lt;/p>
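&lt;p>As a rough illustration of what a standardized benchmark could score, here is a minimal sketch under the common mask-and-recover protocol (an assumption for illustration, not the project&amp;rsquo;s prescribed design): known expression values are hidden, an imputation method fills them in, and recovery is measured at the masked entries.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def mask_entries(expr, frac=0.1, seed=0):
    """Hide a random fraction of nonzero entries; return masked matrix and index."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(expr)
    pick = rng.choice(len(rows), size=int(frac * len(rows)), replace=False)
    masked = expr.copy()
    masked[rows[pick], cols[pick]] = 0.0
    return masked, (rows[pick], cols[pick])

def score(expr, imputed, idx):
    """Pearson correlation between true and imputed values at masked entries."""
    return np.corrcoef(expr[idx], imputed[idx])[0, 1]

# Toy spots-by-genes matrix; a real benchmark would load platform datasets.
expr = np.random.default_rng(1).gamma(2.0, 1.0, size=(500, 200))
masked, idx = mask_entries(expr)
baseline = np.broadcast_to(masked.mean(axis=0), masked.shape)  # mean-impute
print(f"mean-imputation Pearson r at masked entries: {score(expr, baseline, idx):.3f}")
&lt;/code>&lt;/pre>
&lt;p>Running many imputation methods through one scoring interface like this, across platforms, species, and organs, is what would make the comparison objective.&lt;/p>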
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A comprehensive, preprocessed benchmark dataset that spans multiple sampling platforms, species, and organs, aimed at standardizing gene imputation tasks in spatial transcriptomics.&lt;/li>
&lt;li>An objective comparison of state-of-the-art gene imputation methodologies, enhancing the understanding of their performance and applicability across diverse biological contexts.&lt;/li>
&lt;li>A user-friendly Python package offering a suite of gene imputation tools, designed to fulfill the research needs of the spatial transcriptomics community by improving data completeness and reproducibility.&lt;/li>
&lt;/ul></description></item><item><title>GPEC: An Open Emulation Platform to Evaluate GPU/ML Workloads on Erasure Coding Storage</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lanl/gpec/</link><pubDate>Thu, 08 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lanl/gpec/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage Systems, Machine Learning, Erasure Coding&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python, PyTorch, Bash scripting, Linux, Erasure Coding, Machine Learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/">Meng Wang&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-bent/">John Bent&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Large-scale data centers store immense amounts of user data across a multitude of disks, necessitating redundancy strategies like erasure coding (EC) to safeguard against disk failures. Numerous research efforts have sought to assess the performance and durability of various erasure coding approaches, including single-level erasure coding, locally recoverable coding, and multi-level erasure coding.&lt;/p>
&lt;p>Despite its widespread adoption, a significant research gap exists regarding the performance of large-scale erasure-coded storage systems when exposed to machine learning (ML) workloads. While conventional practice often leans towards replication for enhanced performance, this project seeks to explore whether cost-effective erasure coding can deliver comparable performance. In this context, several fundamental questions remain unanswered, including:&lt;/p>
&lt;ul>
&lt;li>Can a typical erasure-coded storage system deliver sufficient throughput for ML training tasks?&lt;/li>
&lt;li>Can an erasure-coded storage system maintain low-latency performance for ML training and inference workloads?&lt;/li>
&lt;li>How do disk failures and subsequent repairs impact the throughput and latency of ML workloads?&lt;/li>
&lt;li>What influence do various erasure coding design choices, such as chunk placement strategies and repair methods, have on these performance metrics?&lt;/li>
&lt;/ul>
&lt;p>To address these questions, the most straightforward approach would involve running ML workloads on large-scale erasure coded storage systems within HPC data centers. However, this presents challenges for researchers and students due to limited access to expensive GPUs and distributed storage systems, especially when dealing with large-scale evaluations. Consequently, there is a need for a cost-effective evaluation platform.&lt;/p>
&lt;p>The objective of this project is to develop an open-source platform that facilitates cheap and reproducible evaluations of erasure-coded storage systems under ML workloads. This platform consists of two key components:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>GPU Emulator:&lt;/strong> simulates GPU performance for ML workloads. Development of the GPU emulator is near completion.&lt;/li>
&lt;li>&lt;strong>EC Emulator:&lt;/strong> simulates the performance characteristics of erasure-coded storage systems. It is still in the exploratory phase and requires further development.&lt;/li>
&lt;/ul>
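&lt;p>For intuition about what an EC emulator has to capture, here is a minimal, illustrative latency model (all parameters are assumptions, and this is a sketch rather than the project&amp;rsquo;s emulator): a healthy read fetches a single chunk, while a degraded read must fetch k surviving chunks in parallel and pay a decode cost.&lt;/p>
&lt;pre>&lt;code class="language-python">import random

def read_latency_ms(k=10, chunk_mb=4.0, disk_mbps=150.0,
                    decode_mbps=2000.0, failed=False):
    """Toy latency model for one read from a stripe with k data chunks."""
    fetch = chunk_mb / disk_mbps * 1000            # time to read one chunk
    if not failed:
        return fetch
    # Degraded read: k parallel chunk fetches (the slowest disk dominates)
    # plus the CPU cost of decoding k chunks.
    fetches = [fetch * random.uniform(1.0, 1.5) for _ in range(k)]
    decode = k * chunk_mb / decode_mbps * 1000
    return max(fetches) + decode

normal = sum(read_latency_ms() for _ in range(1000)) / 1000
degraded = sum(read_latency_ms(failed=True) for _ in range(1000)) / 1000
print(f"normal read: {normal:.1f} ms, degraded read: {degraded:.1f} ms")
&lt;/code>&lt;/pre>
&lt;p>Even this toy model suggests why disk failure and repair can dominate the tail latency that ML training and inference workloads observe; a real emulator would add queuing, chunk placement, and repair traffic.&lt;/p>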
&lt;p>The student&amp;rsquo;s responsibilities will include documenting the GPU emulator, progressing the development of the EC emulator, and packaging the experiments to ensure easy reproducibility. It is anticipated that this platform will empower researchers and students to conduct cost-effective and reproducible evaluations of large-scale erasure-coded storage systems in the context of ML workloads.&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Build an EC emulator to emulate the performance characteristics of large-scale erasure-coded storage systems&lt;/li>
&lt;li>Incorporate the EC emulator into ML workloads and GPU emulator&lt;/li>
&lt;li>Conduct reproducible experiments to evaluate the performance of erasure-coded storage systems in the context of ML workloads&lt;/li>
&lt;li>Publish a Trovi artifact shared on Chameleon Cloud and a GitHub repository with open-source code&lt;/li>
&lt;/ul></description></item><item><title>LAST: Let’s Adapt to System Drift</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last/</link><pubDate>Wed, 07 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Computer systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Data Science and Machine Learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The performance of computer systems is constantly evolving, a natural outcome of updating hardware, improving software, and encountering hardware quirks over time. At the same time, machine learning (ML) models are becoming increasingly popular. They are being used widely to address various challenges in computer systems, notably in speeding up decision-making. This speed is vital for a quick and flexible response, essential for meeting service-level agreements (SLAs). Yet, an interesting twist has emerged: like the computer systems they aid, ML models also experience a kind of &amp;ldquo;aging.&amp;rdquo; This results in a gradual decline in their effectiveness, a consequence of changes in their operating environment.&lt;/p>
&lt;p>The phenomenon of model &amp;ldquo;aging&amp;rdquo; is a ubiquitous occurrence across various domains, not limited merely to computer systems. This process of aging can significantly impact the performance of a model, emphasizing the critical importance of early detection mechanisms to maintain optimal functionality. In light of this, numerous strategies have been formulated to mitigate the aging of models. However, the generalizability and effectiveness of these strategies across diverse domains, particularly in computer systems, remain largely unexplored. This research aims to bridge this gap by designing and implementing a comprehensive data analysis pipeline. The primary objective is to evaluate the efficacy of various strategies through a comparative analysis, focusing on their performance in detecting and addressing model aging. To achieve a better understanding of this issue, the research will address the following pivotal questions:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Data-Induced Model Aging&lt;/strong>: What specific variations within the data can precipitate the aging of a model? Understanding the nature and characteristics of data changes that lead to model deterioration is crucial for developing effective prevention and mitigation strategies.&lt;/li>
&lt;li>&lt;strong>Efficacy of Aging Detection Algorithms&lt;/strong>: How proficient are the current algorithms in identifying the signs of model aging? Assessing the accuracy and reliability of these algorithms will provide insights into their practical utility in real-world scenarios.&lt;/li>
&lt;li>&lt;strong>Failure Points in Detection&lt;/strong>: In what scenarios or under what data conditions do the aging detection mechanisms fail? Identifying the limitations and vulnerabilities of these algorithms is vital for refining their robustness and ensuring comprehensive coverage.&lt;/li>
&lt;li>&lt;strong>Scalability and Responsiveness&lt;/strong>: How do these algorithms perform in terms of robustness and speed, particularly when subjected to larger datasets? Evaluating the scalability and responsiveness of the algorithms will determine their feasibility and effectiveness in handling extensive and complex datasets, a common characteristic in computer systems.&lt;/li>
&lt;/ul>
&lt;p>To better understand and prevent issues related to model performance, our approach involves analyzing various datasets, both system and non-system, that have shown notable changes over time. We aim to apply machine learning (ML) models to these datasets to assess the effects of these changes on model performance. Our goal is to leverage more advanced ML techniques to create new algorithms that address these challenges effectively. This effort is expected to contribute significantly to the community, enhancing the detection of model aging and improving model performance in computer systems.&lt;/p>
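&lt;p>As one example of the kind of baseline such a pipeline could include, here is a minimal sketch of distribution-based drift detection using a two-sample Kolmogorov&amp;ndash;Smirnov test (the window sizes, feature, and significance level are illustrative assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from scipy.stats import ks_2samp

def drifted(reference, live, alpha=0.01):
    """Flag drift when the KS test rejects 'same distribution' at level alpha."""
    _, p_value = ks_2samp(reference, live)
    return bool(alpha > p_value)

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)   # feature at deployment time
stable = rng.normal(0.0, 1.0, size=1000)      # same environment: no drift
aged = rng.normal(0.4, 1.2, size=1000)        # shifted environment: drift
print(drifted(reference, stable))   # False (almost always)
print(drifted(reference, aged))     # True
&lt;/code>&lt;/pre>
&lt;p>Comparing detectors like this one against accuracy-based and model-aware alternatives, on both system and non-system datasets, is exactly the comparative analysis the project proposes.&lt;/p>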
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Run the pipeline on several computer-system and non-computer-system datasets&lt;/li>
&lt;li>A Trovi artifact for data preprocessing and model training shared on Chameleon Cloud&lt;/li>
&lt;li>A GitHub repository containing the pipeline source code&lt;/li>
&lt;/ul></description></item><item><title>FSA: Benchmarking Fail-Slow Algorithms</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/failslowalgorithms/</link><pubDate>Tue, 06 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/failslowalgorithms/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ruidan-li/">Ruidan Li&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kexin-pei/">Kexin Pei&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In the realm of modern applications, achieving not only low but also predictable response times is a critical requirement. Performance instability, even when it amounts to just a few milliseconds of delay, can result in violations of Service Level Objectives (SLOs). Redundancy at the RAID group level provides a layer of protection; however, the early identification of potential slowdowns or failures is paramount in minimizing their impact on overall system latency.&lt;/p>
&lt;p>Fail-Slow represents a unique type of fault within storage systems, characterized by the system&amp;rsquo;s ability to continue functioning while progressively deteriorating – its performance significantly drops below expected levels. Notably, fail-slow conditions are responsible for a considerable share of latency tails. Detecting fail-slow faults is particularly challenging, as they can be easily masked by the normal fluctuations in performance. Consequently, the identification of fail-slow faults is a critical area of research, demanding meticulous attention.&lt;/p>
&lt;p>Several strategies have been developed to address the fail-slow issue, yet the question of their broad applicability remains. We plan to implement and assess various existing fail-slow detection algorithms, examining their strengths and weaknesses. Our analysis will concentrate on key questions:&lt;/p>
&lt;ul>
&lt;li>How promptly can the algorithm identify a fail-slow symptom?&lt;/li>
&lt;li>What methods does the algorithm employ to accurately distinguish fail-slow incidents, thereby minimizing false negatives?&lt;/li>
&lt;li>Through what approach does the algorithm achieve the right sensitivity level to keep false positives in check?&lt;/li>
&lt;/ul>
&lt;p>This evaluation aims to shed light on the effectiveness of current methodologies in detecting fail-slow faults, crucial for enhancing system reliability and performance.&lt;/p>
&lt;p>Building upon our evaluation of several fail-slow detection algorithms, our objective is to harness advanced machine learning (ML) models to develop a novel algorithm. This initiative seeks to address and potentially compensate for the identified weaknesses in existing methodologies. By focusing on the critical aspects of early detection, accurate differentiation, and optimal sensitivity, we aim to create a solution that reduces both false negatives and false positives, thereby enhancing overall system reliability. This approach represents a strategic effort to not only advance the current state of fail-slow detection but also to contribute significantly to the resilience and performance of storage systems.&lt;/p>
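&lt;p>To illustrate the trade-off between sensitivity and detection delay that the questions above probe, here is a minimal sketch of a peer-comparison heuristic (an illustration in the spirit of existing detectors, not a specific published algorithm): a drive is flagged only after its latency exceeds a multiple of the peer median for several consecutive windows.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def flag_fail_slow(latencies, slowdown=2.0, patience=5):
    """latencies: (windows, drives) array of per-window mean latency."""
    peer_median = np.median(latencies, axis=1, keepdims=True)
    suspect = latencies > slowdown * peer_median      # per-window suspicion
    streak = np.zeros(latencies.shape[1], dtype=int)
    flagged = set()
    for row in suspect:
        streak = np.where(row, streak + 1, 0)         # consecutive windows
        flagged.update(np.flatnonzero(streak >= patience).tolist())
    return sorted(flagged)

rng = np.random.default_rng(0)
lat = rng.gamma(2.0, 0.5, size=(50, 8))   # 50 windows, 8 healthy drives
lat[20:, 3] *= 4.0                        # drive 3 degrades after window 20
print(flag_fail_slow(lat))                # expected: [3]
&lt;/code>&lt;/pre>
&lt;p>Raising &lt;code>slowdown&lt;/code> or &lt;code>patience&lt;/code> suppresses false positives at the cost of slower detection, which is precisely the sensitivity question the evaluation must quantify.&lt;/p>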
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A Trovi artifact for the existing Fail-Slow detection algorithms on Chameleon Cloud&lt;/li>
&lt;li>A GitHub repository containing the full evaluation result&lt;/li>
&lt;li>A Google Colab notebook for quick replay&lt;/li>
&lt;/ul></description></item><item><title>EdgeRep: Reproducing and benchmarking edge analytic systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/edgerep/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/edgerep/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> video analytics, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> (contact person), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/junchen-jiang/">Junchen Jiang&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>With the flourishing of ideas like smart cities and smart manufacturing, a
massive number of edge devices (e.g., traffic or security cameras,
thermometers, flood sensors, etc.) are deployed and connected to the network.
These devices collect and analyze data across space and time, aiding
stakeholders like city governments and manufacturers in optimizing their plans
and operations. However, the sheer number of edge devices and the large amount
of communication among the devices and central servers raise significant
challenges in managing and scheduling resources. This includes network
bandwidth between the devices and computing power on both edge devices and bare
metal servers, all to maintain the reliable service capability of running
applications.&lt;/p>
&lt;p>Moreover, given the limited resources available to edge devices, there&amp;rsquo;s an
emerging trend to reduce average compute and/or bandwidth usage. This is
achieved by leveraging the uneven distribution of interesting events with
respect to both time and space in the input data. This, in turn, introduces
further challenges in provisioning and managing the amount of resources
available to edge devices. The resource demands of running applications can
greatly depend on the input data, which is both dynamic and unpredictable.&lt;/p>
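&lt;p>A minimal sketch of this input-dependent compute pattern (an illustration only, not the team&amp;rsquo;s resource manager): frames are forwarded to an expensive detector only when a cheap frame-difference test suggests an interesting event, so resource demand rises and falls with the content of the video.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def gate_frames(frames, threshold=12.0):
    """Yield only frames that differ enough from the last forwarded frame."""
    previous = None
    for frame in frames:
        if previous is None or np.abs(frame - previous).mean() > threshold:
            previous = frame
            yield frame  # would be sent to the heavy analytics model

# Synthetic 100-frame grayscale clip: mostly static, one burst of activity.
rng = np.random.default_rng(0)
clip = [np.full((240, 320), 100.0) + rng.normal(0, 1, (240, 320))
        for _ in range(100)]
for i in range(40, 50):
    clip[i] = clip[i] + rng.uniform(0, 60, (240, 320))  # "event" frames
kept = list(gate_frames(clip))
print(f"forwarded {len(kept)} of {len(clip)} frames to the detector")
&lt;/code>&lt;/pre>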
&lt;p>Keeping these challenges in mind, the team previously designed and implemented
a dynamic resource manager capable of understanding the applications and making
decisions based on this understanding at runtime. However, such a resource
manager has only been tested with a limited number and types of video analytic
applications. Thus, through the OSRE24 project, we aim to:&lt;/p>
&lt;ul>
&lt;li>Collect a wide range of videos to form a comprehensive video dataset&lt;/li>
&lt;li>Reproduce other state-of-the-art self-adaptive video analytic applications&lt;/li>
&lt;li>Package the dataset as well as the application to publish them on Chameleon
Trovi site&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Collect a wide range of videos to form a comprehensive video dataset&lt;/li>
&lt;li>Reproduce other state-of-the-art self-adaptive video analytic applications&lt;/li>
&lt;li>Package the dataset as well as the application to publish them on Chameleon
Trovi site&lt;/li>
&lt;/ul></description></item><item><title>FEP-Bench: Benchmarks for understanding feature engineering and preprocessing bottlenecks</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ibm/fep-bench/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ibm/fep-bench/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> storage system, scheduling, distributed system, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> (contact person), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/swami-sundararaman/">Swami Sundararaman&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>In the realm of machine learning (ML), preprocessing of data is a critical yet
often underappreciated phase, consuming approximately 80% of the time in common
ML tasks. This extensive time consumption can be attributed to various
challenges encountered from both data and computation perspectives.&lt;/p>
&lt;p>From the data side, one significant challenge is the slow retrieval of data
from data lakes, which are storage repositories that hold a vast amount of raw
data in its native format. Extracting this data can be
slow, causing computation cycles to wait for data arrival and leading to delays
in the entire preprocessing phase. Furthermore, the size of the data often
exceeds the memory capacity of standard computing systems. This is a frequent
occurrence in ML, as datasets are typically large and complex. Handling such
large datasets requires sophisticated memory management techniques to ensure
efficient preprocessing without overwhelming the system&amp;rsquo;s memory.&lt;/p>
&lt;p>On the computation side, a naive solution to data operations, especially
aggregation, often leads to inefficiencies. These operations may require
grouping a large chunk of data as a prerequisite before performing any actual
computation. This grouping, without careful configuration and management, can
trigger serious data shuffling, leading to extensive remote data movement when
the data is distributed across various storage systems. Such data movement is
not only time-consuming but also resource-intensive.&lt;/p>
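&lt;p>As a flavor of the micro-benchmarks the project might collect (the dataset size and operation here are illustrative assumptions), the sketch below times a groupby aggregation, the class of operation that triggers expensive shuffling once the data is distributed:&lt;/p>
&lt;pre>&lt;code class="language-python">import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 5_000_000
df = pd.DataFrame({
    "key": rng.integers(0, 10_000, size=n_rows),   # grouping column
    "value": rng.normal(size=n_rows),
})

start = time.perf_counter()
agg = df.groupby("key")["value"].mean()            # aggregation under test
elapsed = time.perf_counter() - start
print(f"groupby-mean over {n_rows:,} rows: {elapsed:.3f} s")
&lt;/code>&lt;/pre>
&lt;p>Repeating such measurements across frameworks (e.g., pandas on a single node versus Spark over distributed storage) and across data placements is what turns isolated timings into a benchmark.&lt;/p>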
&lt;p>To mitigate these challenges, there is a pressing need to design better
caching, prefetching, and heuristic strategies for data preprocessing. The team
aims to significantly reduce the time and resources required for preprocessing
by optimizing data retrieval and computational processes.&lt;/p>
&lt;p>However, prior to the design and implementation of such a system, a systematic
understanding of the preprocessing workflow is essential. Hence, throughout the
program, the students will need to:&lt;/p>
&lt;ul>
&lt;li>Understand the current system used to preprocess data for ML training, for
example, Hadoop or Spark.&lt;/li>
&lt;li>Collect the common datasets used for different types of ML models.&lt;/li>
&lt;li>Collect the typical operations used for preprocessing these datasets.&lt;/li>
&lt;li>Benchmark the performance in these operations under the existing frameworks
under various experimental settings.&lt;/li>
&lt;li>Package the benchmark such that the team can later use it for reproduction or
evaluation.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the current system used to preprocess data for ML training, for
example, Hadoop or Spark.&lt;/li>
&lt;li>Collect the common datasets used for different types of ML models.&lt;/li>
&lt;li>Collect the typical operations used for preprocessing these datasets.&lt;/li>
&lt;li>Benchmark the performance in these operations under the existing frameworks
under various experimental settings.&lt;/li>
&lt;li>Package the benchmark such that the team can later use it for reproduction or evaluation.&lt;/li>
&lt;/ul></description></item><item><title>FetchPipe: Data Science Pipeline for ML-based Prefetching</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fetchpipe/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fetchpipe/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniar-h.-kurniawan/">Daniar H. Kurniawan&lt;/a> (primary contact), Haryadi Gunawi&lt;/li>
&lt;/ul>
&lt;p>The contemporary landscape of high-performance servers, particularly those designed for data centers and AI/ML training, prominently features solid-state drives (SSDs) and spinning disks (HDDs) as primary storage devices. These components play a crucial role in shaping overall system performance, underscoring the importance of addressing and minimizing Input/Output (I/O) latency. This is particularly crucial given the widespread adoption of hybrid storage systems, where caching and prefetching strategies are instrumental in optimizing storage performance. Caching involves using faster but less dense memory to store frequently accessed data, while prefetching aims to reduce latency by fetching data from slower memory to cache before it is needed. Although both caching and prefetching present valid challenges, our primary emphasis is on the prefetching problem due to the inherent difficulty in predicting future access.&lt;/p>
&lt;p>Traditional prefetchers, dating back 1-2 decades, rely heavily on predefined rules for prefetching based on LBA access sequences, limiting their adaptability to complex scenarios. For instance, the read-ahead prefetcher is confined to prefetching the next data item within a file for faster sequential access. Addressing this limitation, recent advancements include learning-based methods, such as Long Short-Term Memory (LSTM) techniques like DeepPrefetcher and Delta LSTM, which model the LBA delta to cover a broader range of LBAs. However, these still struggle to achieve high accuracy when the workload pattern changes drastically. Although some sophisticated prefetchers can learn complex I/O access patterns using graph structures, their deployment remains challenging due to computational cost.&lt;/p>
&lt;p>In this project, our goal is to provide an end-to-end data science pipeline to empower the research on ML-based prefetchers. We believe that this pipeline is crucial for fostering active collaboration between the ML community and storage systems researchers. This collaboration aims to optimize existing ML-based prefetching solutions. Specifically, we will provide the dataset for training/testing and some samples of ML-based models that can further be developed by the community. Furthermore, we will also provide a setup for evaluating the ML model when deployed in storage systems.&lt;/p>
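&lt;p>As a starting point for the model samples the pipeline would provide, here is a minimal PyTorch sketch of the LBA-delta formulation mentioned above (the architecture and sizes are illustrative assumptions, not a reproduction of DeepPrefetcher or Delta LSTM): the model predicts the next quantized delta from a window of previous deltas, so it learns relative jumps rather than raw addresses.&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import torch.nn as nn

class DeltaLSTM(nn.Module):
    def __init__(self, vocab=4096, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, 64)   # each quantized delta is a token
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)   # classify the next delta

    def forward(self, deltas):
        out, _ = self.lstm(self.embed(deltas))
        return self.head(out[:, -1])           # logits for the next delta

# Turn an LBA trace into clipped delta tokens (toy example).
lbas = torch.tensor([100, 108, 116, 124, 200, 208])
deltas = torch.clamp(lbas[1:] - lbas[:-1], 0, 4095)
model = DeltaLSTM()
logits = model(deltas.unsqueeze(0))            # shape: (1, 4096)
print(logits.shape)
&lt;/code>&lt;/pre>
&lt;p>Training such a model on the compiled traces and then replaying its predictions against a real hybrid storage setup is exactly the train/evaluate loop the pipeline is meant to standardize.&lt;/p>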
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Compile I/O traces from various open traces and open systems.&lt;/li>
&lt;li>Develop a pipeline for building ML-based prefetching solutions.&lt;/li>
&lt;li>Build a setup to evaluate the model in a real hybrid storage system.&lt;/li>
&lt;li>Publish a Trovi artifact shared on Chameleon Cloud and a GitHub repository&lt;/li>
&lt;/ul></description></item><item><title>[Mid-term] Capturing provenance into Data Science/Machine Learning workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230731-jesselima/</link><pubDate>Mon, 31 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230731-jesselima/</guid><description>&lt;p>This post describes our midterm work status and some achievements we have done so far in &lt;a href="https://docs.google.com/document/d/1YMtPjZXcgt5eplyxIgQE8IBpQIiRlB9eqVSQiIPhXNU/edit#heading=h.nnxl1g16trg0" target="_blank" rel="noopener">the project&lt;/a> for the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/noworkflow/">noWorkflow&lt;/a> package.&lt;/p>
&lt;h4 id="the-initial-weeks">The initial weeks&lt;/h4>
&lt;p>I started doing a bibliographical review on reproducibility in the Data Science (DS) and Machine Learning (ML) realms. It was a new subject to me, and I aimed to build a more robust theoretical background in the field. Meanwhile, I took notes in &lt;a href="https://jaglima.github.io/" target="_blank" rel="noopener">this series of posts&lt;/a>.&lt;/p>
&lt;p>Then, as planned, I connected with the current noWorkflow supporters in order to get a broader view of the project and their contributions. Additionally, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a>, and I set up a weekly one-hour schedule to keep track of my activities.&lt;/p>
&lt;h3 id="brainstormed-opportunities">Brainstormed opportunities&lt;/h3>
&lt;p>At the beginning of June, we also met with other project supporters to brainstorm about our initial proposal. From this meeting, we came up with a plan for how to technically approach a new noWorkflow feature for Data Science and Machine Learning experiment management.&lt;/p>
&lt;p>In this brainstorm, we agreed that &lt;em>Jupyter Notebooks are, by far, the most frequent setup in DS/ML computational experiments. They have established themselves as the fundamental artifact by embedding code and text and enabling execution and visualization. Entire experiments are created and kept in Jupyter notebooks until they are sent to production. The opportunity at hand is to integrate noWorkflow with Jupyter Notebooks&lt;/em>.
Our mid-term goal was therefore adapted from the original plan of only selecting and executing a prototypical ML experiment: we added the goal of paving the way for a tagging feature for Notebook cells.&lt;/p>
&lt;p>More specifically, DS/ML experimental workflows usually have well-defined stages composed of &lt;em>data reading&lt;/em>, &lt;em>feature engineering&lt;/em>, &lt;em>model scoring&lt;/em>, and &lt;em>metrics evaluation&lt;/em>. In our dream space, the user would tag a cell in their experiment, enabling the capture of the tagged metadata into a database. This step integrates the ultimate goal of facilitating comparisons, management, and even causal inference across different trials of a DS/ML experiment.&lt;/p>
&lt;h3 id="current-deliverables">Current deliverables&lt;/h3>
&lt;p>So, based on our plans, we created a separate table to store the metadata from cell tagging. This table stores the cell hash codes and information to match the code executed within a cell. As a result, we can store tags and the activation ids of the cells, enabling us to identify the cell containing a given stage of a DS/ML experiment.&lt;/p>
&lt;p>The second feature implemented was tagging a specific variable. Just as with cells, it is now possible to stamp a given variable with a tag, keeping its name, id, and received value in this separate table.&lt;/p>
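&lt;p>To give a rough idea of what this separate table holds, here is a minimal, hypothetical sketch (the schema and values are illustrative, not noWorkflow&amp;rsquo;s actual schema):&lt;/p>
&lt;pre>&lt;code class="language-python">import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stage_tags (
        trial_id INTEGER,        -- which run of the experiment
        activation_id INTEGER,   -- which cell activation produced the value
        cell_hash TEXT,          -- matches the code executed in the cell
        tag TEXT,                -- e.g. 'feature_engineering'
        variable_name TEXT,
        value TEXT
    )
""")
conn.execute(
    "INSERT INTO stage_tags VALUES (?, ?, ?, ?, ?, ?)",
    (1, 42, "ab3f9c", "feature_engineering", "n_components", "16"),
)
print(conn.execute("SELECT trial_id, tag, variable_name, value FROM stage_tags").fetchall())
&lt;/code>&lt;/pre>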
&lt;p>Finally, we worked on displaying the dependencies of a given variable. In this case, by tagging a given variable, we can display the other variables, values, and cells activated in its construction. Then, we can visualize the dependencies that contributed to its final value.&lt;/p>
&lt;p>For an overview of current developments, please refer to my &lt;a href="https://github.com/jaglima/noworkflow/tree/stage_tagging" target="_blank" rel="noopener">fork of the main project&lt;/a>.&lt;/p>
&lt;h3 id="challenges">Challenges&lt;/h3>
&lt;p>During this period, we had to make choices along the way. For instance, capturing the provenance of cells through tags is a different solution than tagging code chunks in scripts. In this case, we decided to stick with tagging Notebook cells at this moment. We also opted to start storing the metadata to enable comparisons between trials rather than focus on a sophisticated graphic and user-friendly cell tagging system. We also opted to keep this metadata info stored in a separate table in the database.&lt;/p>
&lt;h3 id="next-steps">Next steps&lt;/h3>
&lt;p>In the second half of the summer, our goal is to integrate these features in order to proceed with comparisons among experiments. Such comparisons would use the tagged variables as the hyperparameters of DS/ML experiments or key variables to assess the experiments, such as errors or scores. As a result, we will be able to compare the results of two trials in a more accurate, and easily reproducible experiment.&lt;/p></description></item><item><title>FlashNet: Towards Reproducible Data Science for Storage System</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet/</link><pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet/</guid><description>&lt;p>The Data Storage Research Vision 2025, organized in an NSF workshop, calls for more “AI for storage” research. However, performing ML-for-storage research can be a daunting task for new storage researchers. The person must know both the storage side as well the ML side as if studying two different fields at the same time. This project aims to answer these questions:&lt;/p>
&lt;ol>
&lt;li>How can we encourage data scientists to look into storage problems?&lt;/li>
&lt;li>How can we create a transparent platform that allows such decoupling?&lt;/li>
&lt;li>Within the storage/ML community, can we create two collaborative communities: the storage engineers and the storage data scientists?&lt;/li>
&lt;/ol>
&lt;p>In the ML/Deep Learning community, the large ImageNet benchmarks have spurred research in image recognition. Similarly, we would like to provide benchmarks for fostering storage research in ML-based per-I/O latency prediction. Therefore, we present FlashNet, a reproducible data science platform for storage systems. To start this big task, we use I/O latency prediction as a case study; FlashNet has been built for I/O latency prediction tasks. With FlashNet, data engineers can collect the I/O traces of various devices. Data scientists can then train ML models to predict I/O latency based on those traces. All traces, results, and code will be shared on the FlashNet training ground platform, which uses Chameleon Trovi for better reproducibility.&lt;/p>
&lt;p>In this project, we plan to improve the modularity of the FlashNet pipeline and develop the Chameleon Trovi packages. We will also continue to improve the performance of our binary and multiclass classifiers and test them on the new production traces that we collected from the SNIA IOTA public trace repository. Finally, we will optimize the deployment of our continual-learning mechanism and test it in a cloud system environment. To the best of our knowledge, we are building the world&amp;rsquo;s first end-to-end data science platform for storage systems.&lt;/p>
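&lt;p>For readers newer to the problem, here is a minimal sketch of the per-I/O latency prediction task (the features and the fast/slow threshold are illustrative assumptions, not FlashNet&amp;rsquo;s actual feature set): a binary classifier labels each I/O as fast or slow from trace-derived features.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20_000
size_kb = rng.choice([4, 8, 64, 256], size=n)         # I/O size
queue_len = rng.poisson(4, size=n)                    # device queue depth
latency_us = 80 + 0.4 * size_kb + 60 * queue_len + rng.gamma(2, 30, n)
slow = (latency_us > np.percentile(latency_us, 90)).astype(int)  # top 10% = slow

X = np.column_stack([size_kb, queue_len])
X_tr, X_te, y_tr, y_te = train_test_split(X, slow, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100).fit(X_tr, y_tr)
print(f"slow-I/O prediction accuracy: {clf.score(X_te, y_te):.3f}")
&lt;/code>&lt;/pre>
&lt;p>In the platform, the trace-collection side (storage engineers) and the modeling side (data scientists) meet at exactly this interface: a table of per-I/O features and a latency label.&lt;/p>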
&lt;h3 id="building-flashnet-platform">Building FlashNet Platform&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, reproducibility, machine learning, continual learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C++, Python, PyTorch, Experienced with Machine Learning pipeline&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/justin-shin/">Justin Shin&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/maharani-ayu-putri-irawan/">Maharani Ayu Putri Irawan&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Build an open-source platform to enable collaboration between storage and ML communities, specifically to provide a common platform for advancing data science research for storage systems. The platform will be able to reproduce and evaluate different ML models/architectures, dataset patterns, data preprocessing techniques, and various feature engineering strategies.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Reproduce the FlashNet evaluation results from prior works.&lt;/li>
&lt;li>Build and improve FlashNet components based on the existing blueprint.&lt;/li>
&lt;li>Collect and analyze the FlashNet evaluation results.&lt;/li>
&lt;/ul></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/</link><pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/</guid><description>&lt;p>A high-throughput workflow execution system is needed to continuously gain insights from the increasingly abundant genomics data. However, genomics workflows often have long execution times (e.g., hours to days) due to their large input files. This characteristic presents many complexities when managing systems for genomics workflow execution. Furthermore, based on our observation of a large-scale genomics data processing platform, ~2% of genomics workflows exhibit tail behavior that multiplies their execution time by up to 15x the median, resulting in weeks of execution.&lt;/p>
&lt;p>On the other hand, input files for genomic workflows often vary in quality due to differences in how they are collected. Prior works suggested that these quality differences can affect genomics workflow execution time. Yet, to the best of our knowledge, input quality has never been accounted for in the design of a high-throughput workflow execution system. Even worse, there does not appear to be a consensus on what constitutes ‘input quality,’ at least from a computer systems perspective.&lt;/p>
&lt;p>In this project, we seek to analyze a huge dataset from a large-scale genomics processing platform in order to gain insights on how ‘input quality’ affects genomic workflows’ execution times. Following that, we will build machine learning (ML) models for predicting workflow execution time, in particular those which exhibit tail behavior. We believe these insights and models can become the foundation for designing a novel tail-resilient genomics workflow execution system. Along the way, we will ensure that each step of our analysis is reproducible (e.g., in the form of Jupyter notebooks) and make all our ML models open-source (e.g., in the form of pre-trained models). We sincerely hope our work can offload some burdens commonly faced by operators of systems for genomics and, at the same time, benefit future researchers who work on the intersection of computer systems and genomics.&lt;/p>
&lt;h3 id="analyze-genomics-data-quality--build-exec-time-prediction-models">Analyze genomics data quality &amp;amp; build exec. time prediction models&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> genomics, data analysis, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Linux, Python, Matplotlib, Pandas/Numpy, any ML library&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/">Charis Christopher Hulu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Analyze a large-scale trace of genomics workflow execution along with metrics from various genomics alignment tools (e.g., FastQC, Picard, and GATK metrics) and find features that
correlate the most with workflow execution time and its tail behavior. Then, based on the results, we will build ML models that accurately predict genomic workflows’ execution times.&lt;/p>
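&lt;p>A minimal sketch of the execution-time modeling step (the features below are illustrative stand-ins for FastQC/Picard/GATK-style metrics, not the project&amp;rsquo;s chosen predictors):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
input_gb = rng.gamma(3, 2, n)             # input file size
dup_rate = rng.beta(2, 8, n)              # read duplication rate (quality)
gc_dev = np.abs(rng.normal(0, 0.05, n))   # deviation from expected GC content
hours = 2 + 1.5 * input_gb * (1 + 6 * dup_rate + 20 * gc_dev) + rng.gamma(2, 1, n)

X = np.column_stack([input_gb, dup_rate, gc_dev])
X_tr, X_te, y_tr, y_te = train_test_split(X, hours, test_size=0.2, random_state=0)
model = GradientBoostingRegressor().fit(X_tr, y_tr)
print(f"R^2 on held-out workflows: {model.score(X_te, y_te):.3f}")
&lt;/code>&lt;/pre>
&lt;p>On real traces, the interesting question is how much the quality metrics improve prediction of the ~2% tail workflows over file size alone.&lt;/p>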
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Acquire basic understanding of genomics data processing &amp;amp; workflow execution (will be guided by the mentor)&lt;/li>
&lt;li>Reproduce past analysis &amp;amp; models built by prior members of the project&lt;/li>
&lt;li>Propose features from FastQC/Picard/GATK metrics that can be used as a predictor for execution time and tail behavior&lt;/li>
&lt;li>Write a brief analysis as to why those features might work&lt;/li>
&lt;li>Build ML models for predicting execution time&lt;/li>
&lt;li>Package the analysis in the form of Jupyter notebooks&lt;/li>
&lt;li>Package the models in a reloadable format (e.g., pickle)&lt;/li>
&lt;/ul></description></item></channel></rss>